Abstract
We consider high-dimensional Bayesian inverse problems with arbitrary likelihood and product-form Laplace prior for which we provide a certified approximation of the posterior in the Hellinger distance. The approximate posterior differs from the prior only in a small number of relevant coordinates that contribute the most to the update from the prior to the posterior. We propose and analyze a gradient-based diagnostic to identify these relevant coordinates. Although this diagnostic requires computing an expectation with respect to the posterior, we propose tractable methods for the classical case of a linear forward model with Gaussian likelihood. Our methods can be employed to estimate the diagnostic before solving the Bayesian inverse problem via, e.g., Markov chain Monte Carlo (MCMC) methods. After selecting the coordinates, the approximate posterior can be efficiently inferred since most of its coordinates are only informed by the prior. Moreover, specialized MCMC methods, such as the pseudo-marginal MCMC algorithm, can be used to obtain less correlated samples when sampling the exact posterior. We show the applicability of our method using a 1D signal deblurring problem and a high-dimensional 2D super-resolution problem.
1 Introduction
Bayesian inverse problems arise in many applications in science and engineering. When performing Bayesian inversion, one tries to characterize a probability distribution on the unknown parameters of a model given some observed data. These data are typically subject to noise, and are often modeled as
$$\begin{aligned} y = A(x) + \epsilon , \end{aligned}$$
where \(A: \mathbb {R}^d \rightarrow \mathbb {R}^m\) is the forward operator, \(x\in \mathbb {R}^d\) is the unknown parameter, and \(\epsilon \in \mathbb {R}^m\) is the noise. In this work, we are concerned with high-dimensional Bayesian inverse problems with the posterior density given by
$$\begin{aligned} \pi (x) \propto \mathcal {L}(x)\, \pi _0(x). \end{aligned}$$
Here, \(\mathcal {L}:\mathbb {R}^d\rightarrow \mathbb {R}\) denotes the likelihood function of observing the data y given x, and \(\pi _0(x)\) is the prior density. The likelihood function models the relationship between forward operator, error model, and data, e.g., (1). We note that our framework applies in general to any likelihood function.
This work is motivated by, but not limited to, Bayesian inference for image reconstruction. In these applications, heavy-tailed priors are especially popular to preserve sharp edges. To this end, heavy-tailed priors can be imposed directly Hosseini (2017); Markkanen et al. (2019), or via a hierarchical framework Uribe et al. (2022), on the differences between elements of the unknown parameter, i.e., pixels. Classical choices for heavy-tailed priors include \(\alpha \)-stable distributions, such as the Cauchy distribution Suuronen et al. (2023).
Instead of applying a prior formulation on the differences between pixels, we use the fact that natural signals, and therefore images, can be effectively represented in a sparse manner using adapted bases, such as point source bases and wavelet bases Cai et al. (2018). Then, one can use the so-called synthesis formulation \(s=W x\) to expand a signal s in a suitable basis W for which x is sparse Elad et al. (2007). In this case, the heavy-tailed Laplace distribution is a typical prior choice to enforce sparsity in x. Indeed, in Simoncelli (1999) it was found that the marginals of wavelet coefficients of photographic images are well approximated by the Laplace distribution.
Following the above arguments, we use a product-form Laplace prior with density equal to
$$\begin{aligned} \pi _0(x) = \prod _{i=1}^{d} \frac{\delta _i}{2} \exp \left( -\delta _i |x_i| \right) , \end{aligned}$$
where \(\delta _i >0\) for all \(i\in \{1,2,\dots ,d\}\) are the rate parameters. Consequently, we express the posterior density as
$$\begin{aligned} \pi (x) \propto \mathcal {L}(x) \prod _{i=1}^{d} \frac{\delta _i}{2} \exp \left( -\delta _i |x_i| \right) . \end{aligned}$$
We note that for the case where the forward operator is linear and the likelihood function is Gaussian (linear-Gaussian likelihood), the posterior density (3) can be characterized via the Bayesian LASSO Park and Casella (2008).
In real-world applications, Bayesian inference is often performed in a high-dimensional parameter space. For instance, in imaging science, the number of pixels is very large, resulting in parameter spaces with dimensions of order \(d=\mathcal {O}(10^4)\) or higher. When sampling from non-smooth densities such as (3), the proximal unadjusted Langevin algorithm (p-ULA) or proximal Metropolis-adjusted Langevin algorithm (p-MALA) Pereyra (2016); Lau et al. (2023) can be used, but their performance deteriorates significantly with the dimensionality of the problem.
Inspired by the certified dimension reduction (CDR) methodology Zahm et al. (2022); Cui and Tong (2021); Li et al. (2023), we propose a new method, the certified coordinate selection, to select the components in x that contribute most to the update from the prior to the posterior. Hence, the efficiency of the aforementioned sampling methods can be improved substantially by restricting them to perform inference on the selected components only.
In principle, the CDR method consists in replacing the likelihood function with a ridge approximation \(x\mapsto \widetilde{\mathcal {L}}(U_r^\intercal x)\) for some matrix \(U_r\in \mathbb {R}^{d\times r}\) with \(r\ll d\) orthogonal columns, and some function \(\widetilde{\mathcal {L}}:\mathbb {R}^r\rightarrow \mathbb {R}\). The matrix \(U_r\) is determined by minimizing an error bound on the Kullback-Leibler (KL) divergence, which can be obtained via logarithmic Sobolev inequalities.
The CDR method has been successfully applied within a number of Bayesian updating strategies, such as the cross-entropy method Ehre et al. (2023), Stein-variational gradient descent Chen and Ghattas (xxxx), transport maps Brennan et al. (xxxx), and in Bayesian inference applied to rare event estimation Uribe et al. (2020). However, it has only been employed in cases where the prior is either Gaussian, or it is normalized by computing a map that pushes forward the original random variable to a standard Gaussian random variable Cui et al. (2022).
Applying the CDR method to a posterior as in (3) is not straightforward, because the Laplace prior does not satisfy the logarithmic Sobolev inequality (see Herbst’s argument in section 5.4.1 in Bakry et al. (2014)). However, the Laplace prior satisfies the Poincaré inequality Bakry et al. (2014), which can be used to bound the Hellinger distance instead of the KL divergence of the low-dimensional posterior approximation Cui and Tong (2021); Li et al. (2023).
In order to fully control the Poincaré constant, we restrict our analysis to coordinate selection matrices \(U_r\), for which \(U_r^\intercal x = (x_{\mathcal {I}_1},\ldots ,x_{\mathcal {I}_r})\), where \({\mathcal {I}}\subset \{1,\ldots ,d\}\) is the set of selected coordinates. In this way, we also preserve the interpretability of the dimension reduction, since the reduced variables are coordinates of the original vector x.
Our main contributions are as follows:
-
We propose to select the relevant coordinates based on the following diagnostic
$$\begin{aligned} h_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \pi (x) \textrm{d}x, \end{aligned}$$and we show that the Hellinger distance between the exact and the approximated posterior can be explicitly bounded using \(h_i\).
-
We prove that in the case of a linear-Gaussian likelihood, we only need to estimate the posterior mean and the posterior covariance to compute the diagnostic.
-
We show for the above case how a smoothing approximation to the Laplace prior can be used to compute a diagnostic and define an efficient proposal for the preconditioned Metropolis-adjusted Langevin algorithm (MALA).
-
We test our methods on a 1D signal deblurring task, which is given in the synthesis formulation, and on a high-dimensional 2D super-resolution example.
The remainder of this paper is structured as follows. In Sect. 2, we present the theoretical part of our method, which comprises the posterior approximation and its certification. In Sect. 3, we outline the general approach to sample the approximate posterior and recall the pseudo-marginal Markov chain Monte Carlo (MCMC) algorithm, which can be used to sample the exact posterior. In Sect. 4, we present detailed methodology for the case of a linear-Gaussian likelihood. In Sect. 5, we test our methods on two numerical examples: a 1D deblurring example and a 2D super-resolution example. We draw the conclusions in Sect. 6.
2 Certified coordinate selection
In this section, we first show how we approximate the posterior and how this approximation can be controlled by an upper bound in the Hellinger distance. The result can then be used to compute a diagnostic \(h\in \mathbb {R}^d\), which ranks the coordinates based on their contribution to the update from the prior to the posterior.
2.1 Posterior approximation
We aim at identifying the set of components in the parameter vector \(x\in \mathbb {R}^d\) that are most informed by the data relative to prior information. To this end, we define the coordinate splitting
$$\begin{aligned} x = (x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}), \end{aligned}$$
where the set \({\mathcal {I}}\subset \{1,\ldots ,d\}\) contains the indices of the informed coordinates, and \({\mathcal {I}^c}=\{1,\dots , d\}{\setminus } \mathcal {I}\) includes the complementary indices. We refer to \(x_{_{\mathcal {I}}}\) as selected coordinates. Notice that if the likelihood is almost constant in \(x_{_{\mathcal {I}^c}}\), the update from prior to posterior happens mainly on \(x_{_{\mathcal {I}}}\).
To formalize this idea, let us introduce the posterior approximation
$$\begin{aligned} \widetilde{\pi }(x) = \pi (x_{_{\mathcal {I}}})\, \pi _0(x_{_{\mathcal {I}^c}} | x_{_{\mathcal {I}}}), \end{aligned}$$
where \(\pi (x_{_{\mathcal {I}}})\) is the posterior marginal and \(\pi _0(x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\) is the conditional prior. Compared to the exact posterior, which can be factorized as \(\pi (x) = \pi (x_{_{\mathcal {I}}}) \pi (x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\), the approximation \(\widetilde{\pi }(x)\) essentially consists in replacing the conditional posterior \(\pi (x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\) with the conditional prior \(\pi _0(x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\). Combining (5) and (2), we can define the following quasi-optimal approximate posterior
$$\begin{aligned} \widetilde{\pi }^\dagger (x) \propto \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}})\, \pi _0(x), \end{aligned}$$
where we call
$$\begin{aligned} \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}) = \int _{\mathbb {R}^{|\mathcal {I}^c|}} \mathcal {L}(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}})\, \pi _0(x_{_{\mathcal {I}^c}} | x_{_{\mathcal {I}}})\, \textrm{d}x_{_{\mathcal {I}^c}} \end{aligned}$$
the quasi-optimal reduced likelihood. As we show in the following proposition, the posterior approximation (6) is a quasi-optimal choice when certifying the approximation using the Hellinger distance
$$\begin{aligned} H\!\left( {\pi }, {\widetilde{\pi }} \right) ^2 = \frac{1}{2} \int _{\mathbb {R}^d} \left( \sqrt{\pi (x)} - \sqrt{\widetilde{\pi }(x)} \right) ^2 \textrm{d}x. \end{aligned}$$
Proposition 1
Let \(\pi (x) \propto \mathcal {L}(x) \pi _0(x)\) be a probability density on \(\mathbb {R}^d\) where \(\pi _0(x)\) is a product-form density, and let \(x=(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})\) be any coordinate splitting. Then, the reduced likelihood \(\widetilde{\mathcal {L}}:\mathbb {R}^{|\mathcal {I}|} \rightarrow \mathbb {R}_{+}\) which minimizes \(H\!\left( {\pi }, {\widetilde{\pi }} \right) \), where \(\widetilde{\pi }(x) \propto \widetilde{\mathcal {L}}(x_{_{\mathcal {I}}}) \pi _0(x)\), is given by
$$\begin{aligned} \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) = \left( \int _{\mathbb {R}^{|\mathcal {I}^c|}} \sqrt{\mathcal {L}(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}})}\, \pi _0(x_{_{\mathcal {I}^c}} | x_{_{\mathcal {I}}})\, \textrm{d}x_{_{\mathcal {I}^c}} \right) ^2 , \end{aligned}$$
which we call the optimal reduced likelihood. Correspondingly, we call
$$\begin{aligned} \widetilde{\pi }^*(x) \propto \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}})\, \pi _0(x) \end{aligned}$$
the optimal approximate posterior.
Moreover, \(\widetilde{\mathcal {L}}^\dagger \) as defined in (7), yields a quasi-optimal posterior approximation \(\widetilde{\pi }^\dagger \) with respect to the Hellinger distance in the sense that
Proof
See section A.1. \(\square \)
2.2 Certifying the approximation
We now provide an upper bound on the Hellinger distance \(H\!\left( {\pi }, { \widetilde{\pi }^\dagger } \right) \) between the posterior defined in (3) and its quasi-optimal approximation (6).
Proposition 2
Consider the probability density defined in (3). Given a coordinate splitting \(x=(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})\), the quasi-optimal approximate posterior \(\widetilde{\pi }^\dagger (x)\) given in (6) satisfies
where the diagnostic \(h \in \mathbb {R}^d\) is given by
$$\begin{aligned} h_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \pi (x) \textrm{d}x. \end{aligned}$$
Proof
See section A.2. \(\square \)
Remark 1
In case of non-negativity constraints on x, the analogue to the Laplace prior is the exponential prior. The upper bound on the Hellinger distance is the same for this case, since both distributions share the same Poincaré constant.
With the diagnostic h, the coordinate splitting can be performed by finding \(\mathcal {I}^c\) such that
where \(\tau \) is a given desired precision on the Hellinger distance. That is, the set \(\mathcal {I}\) contains the indices i associated with the \(r(\tau )\) largest components in h.
Notice that the number of selected coordinates \(r(\tau )\) can be undesirably large, especially if the bound (12) is loose. In this case, we set \(r=\min (r(\tau ), r_{\max })\) for a prescribed \(r_{\max }\), and we let \(\mathcal {I}\) contain the indices of the r largest components in h.
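In code, this selection rule might look as follows. This is a sketch with our own names; in particular, the interpretation of the threshold as a budget on the accumulated sum of \(h_i\) over \(\mathcal {I}^c\) is an assumption:

```python
import numpy as np

def select_coordinates(h, budget, r_max=None):
    """Split {0,...,d-1} into selected (I) and complementary (I^c) indices.

    h: (d,) diagnostic values; budget: allowed total of sum_{i in I^c} h_i,
    i.e., the desired precision on the Hellinger bound.
    """
    order = np.argsort(h)                    # ascending: least informed first
    csum = np.cumsum(h[order])
    g = int(np.searchsorted(csum, budget, side='right'))  # size of I^c
    r = h.size - g                           # number of selected coordinates
    if r_max is not None and r > r_max:
        r = r_max                            # cap at the prescribed maximum
    selected = order[h.size - r:]            # indices of the r largest h_i
    not_selected = order[:h.size - r]
    return np.sort(selected), np.sort(not_selected)
```

Noise in the smallest entries of a Monte Carlo estimate of h matters little here, since only the largest entries determine \(\mathcal {I}\).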
2.3 Estimating the diagnostic
The definition of the diagnostic \(h_i\) in (13) involves an expectation over the posterior, which is expensive to compute. In practice, an approximate posterior expectation can still yield a useful estimate with which efficient dimension reduction via coordinate selection can be achieved. In particular, for posteriors with linear-Gaussian likelihood, we present detailed methodology for estimating h in section 4. To motivate the estimation of the diagnostic (13) in the general case as well, we provide the following straightforward proposition.
Proposition 3
Let \(\widetilde{\pi }\) be any approximation to \(\pi \) for which there exist \(0<\alpha \le \beta <\infty \) such that
$$\begin{aligned} \alpha \le \frac{\pi (x)}{\widetilde{\pi }(x)} \le \beta \end{aligned}$$
for all x. Then, the approximate diagnostic
$$\begin{aligned} \tilde{h}_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \widetilde{\pi }(x) \textrm{d}x \end{aligned}$$
satisfies \(\alpha \tilde{h}_i \le h_i \le \beta \tilde{h}_i\). In particular, we have \( h_i = 0 ~\Leftrightarrow ~ \tilde{h}_i = 0. \)
In the general case of a nonlinear non-Gaussian likelihood, importance sampling may be a feasible way to estimate h. Accordingly, we point to Cui and Tong (2021), theorem 4.2, for statistical estimates of the error in the upper bound (12), and to Cui and Tong (2021), proposition 4.5, for bounds on the sampling variance of importance sampling estimates. Finally, Cui and Tong (2021) also discusses adaptive sampling schemes where h is constructed iteratively.
We conclude this section with algorithm 1, which outlines the general procedure of our certified coordinate selection method.
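As a minimal illustration of such sample-based estimation (a sketch with our own naming, not the paper's implementation), the diagnostic can be approximated by averaging the squared partial derivatives of \(\log \mathcal {L}\) over any available set of (approximate) posterior samples:

```python
import numpy as np

def diagnostic(samples, grad_log_likelihood, delta):
    """Monte Carlo estimate of h_i = (1/delta_i^2) * E_pi[(d_i log L(x))^2].

    samples: (N, d) array of (approximate) posterior samples.
    grad_log_likelihood: maps an (N, d) array of states to the (N, d)
        array of gradients of log L evaluated at each state.
    delta: (d,) array of Laplace rate parameters.
    """
    g = grad_log_likelihood(samples)          # (N, d) gradient evaluations
    return np.mean(g**2, axis=0) / delta**2   # average of squared partials
```

Feeding in samples from an approximation \(\widetilde{\pi }\) instead of \(\pi \) yields the approximate diagnostic \(\tilde{h}\) of Proposition 3.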
3 Sampling algorithms
In this section, we propose two algorithms for drawing samples from the optimal approximate and the exact posterior, respectively.
3.1 Sampling the approximate posterior
The product-form of the prior allows us to write the optimal approximate posterior as
$$\begin{aligned} \widetilde{\pi }^*(x) = \widetilde{\pi }^*(x_{_{\mathcal {I}}})\, \pi _0(x_{_{\mathcal {I}^c}}), \qquad \widetilde{\pi }^*(x_{_{\mathcal {I}}}) \propto \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}})\, \pi _0(x_{_{\mathcal {I}}}), \end{aligned}$$
and thus naturally suggests a simple sampling scheme where the main sampling effort is concentrated on the selected coordinates \(x_{_{\mathcal {I}}}\) Cui and Zahm (2021). The sampling method consists in first drawing samples \(\{x_{_{\mathcal {I}}}^{(i)}\}_{i=1}^{N}\) from the low-dimensional density \(\widetilde{\pi }^*(x_{_{\mathcal {I}}})\) using an MCMC method. Then, for each sample \(x_{_{\mathcal {I}}}^{(i)}\), we draw a sample \(x_{_{\mathcal {I}^c}}^{(i)}\) from the marginal prior \(\pi _0(x_{_{\mathcal {I}^c}})\). In the end, reassembling \(x^{(i)}=(x_{_{\mathcal {I}}}^{(i)}, x_{_{\mathcal {I}^c}}^{(i)})\) yields samples from the optimal approximate posterior \(\widetilde{\pi }^*(x)\). We summarize this procedure in algorithm 2.
In practice, the optimal reduced likelihood \(\widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}})\) (9) must be approximated to enable sampling from \(\widetilde{\pi }^*(x_{_{\mathcal {I}}})\). Since we expect the likelihood to be mostly flat in the directions of the non-selected coordinates, a natural approach is to fix \(x_{_{\mathcal {I}^c}}\) in (9) to the prior mean \(\mu _0=0\). Then, we obtain the approximation \(\widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \approx \mathcal {L}(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}=0)\), which is computationally cheap while giving satisfactory results, as shown in the numerical examples of Zahm et al. (2022).
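The reassembly step of algorithm 2 can be sketched as follows, assuming reduced-posterior samples for \(x_{_{\mathcal {I}}}\) are already available from an MCMC run (all names are illustrative):

```python
import numpy as np

def assemble_samples(samples_sel, idx_sel, idx_comp, delta, rng):
    """Combine reduced-posterior samples with Laplace prior draws (sketch).

    samples_sel: (N, r) samples of x_I from the reduced posterior.
    idx_sel / idx_comp: selected / complementary coordinate indices.
    delta: (d,) Laplace rates; a Laplace(rate delta) draw has scale 1/delta.
    """
    N = samples_sel.shape[0]
    d = len(idx_sel) + len(idx_comp)
    X = np.empty((N, d))
    X[:, idx_sel] = samples_sel                       # informed coordinates
    X[:, idx_comp] = rng.laplace(                     # prior-only coordinates
        loc=0.0, scale=1.0 / delta[idx_comp], size=(N, len(idx_comp)))
    return X
```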
3.2 Sampling the exact posterior via the pseudo-marginal MCMC algorithm
The marginal of the quasi-optimal posterior \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}}) = \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}}})\) satisfies \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})=\pi (x_{_{\mathcal {I}}})\). Using this fact, a pseudo-marginal MCMC algorithm Cui and Zahm (2021); Andrieu and Roberts (2009) can be employed, which in combination with a so-called recycling step, samples the exact posterior. Note that the bound in (12) allows us to control the error of this quasi-optimal posterior approximation and therefore provides theoretical justification for using this sampling algorithm.
We outline the pseudo-marginal MCMC algorithm for the i-th iteration in the following. Given the state \(x^{(i-1)} = (x_{_{\mathcal {I}}}^{(i-1)}, x_{_{\mathcal {I}^c}}^{(i-1)})\), a candidate \(z_{_{\mathcal {I}}}^{(i)}\) is drawn from a proposal distribution \(q(\cdot |x_{_{\mathcal {I}}}^{(i-1)})\), which targets \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})\). Then, the quasi-optimal reduced likelihood \(\widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)})\) is approximated with M freshly drawn samples \(\{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M \sim \pi _0(z_{_{\mathcal {I}^c}})\) as
$$\begin{aligned} \widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)}) \approx \frac{1}{M} \sum _{j=1}^{M} \mathcal {L}(z_{_{\mathcal {I}}}^{(i)}, z_{_{\mathcal {I}^c}}^{(i,j)}). \end{aligned}$$
Thus, we obtain a set of candidate samples \(\{z_{_{\mathcal {I}}}^{(i)}, \{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M \}\), which is accepted with probability
$$\begin{aligned} \min \left\{ 1,\, \frac{\widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)})\, \pi _0(z_{_{\mathcal {I}}}^{(i)})\, q(x_{_{\mathcal {I}}}^{(i-1)} | z_{_{\mathcal {I}}}^{(i)})}{\widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}^{(i-1)})\, \pi _0(x_{_{\mathcal {I}}}^{(i-1)})\, q(z_{_{\mathcal {I}}}^{(i)} | x_{_{\mathcal {I}}}^{(i-1)})} \right\} , \end{aligned}$$
where the likelihood estimate for the current state is retained from the iteration in which it was accepted.
At this point, the samples \(x_{_{\mathcal {I}}}^{(i)}\) follow \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})=\pi (x_{_{\mathcal {I}}})\). Now to obtain samples \(x_{_{\mathcal {I}^c}}^{(i)}\) from \(\pi (x_{_{\mathcal {I}^c}})\), we can use the following recycling step. We select \(x_{_{\mathcal {I}^c}}^{(i)}\) from \(\{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M\) according to the discrete probability
$$\begin{aligned} \mathbb {P}\left( x_{_{\mathcal {I}^c}}^{(i)} = z_{_{\mathcal {I}^c}}^{(i,j)} \right) = \frac{\mathcal {L}(z_{_{\mathcal {I}}}^{(i)}, z_{_{\mathcal {I}^c}}^{(i,j)})}{\sum _{k=1}^{M} \mathcal {L}(z_{_{\mathcal {I}}}^{(i)}, z_{_{\mathcal {I}^c}}^{(i,k)})}. \end{aligned}$$
We summarize the pseudo-marginal MCMC algorithm in algorithm 3.
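A single iteration of this scheme might be sketched as below. This is a sketch under our own assumptions: a symmetric proposal is assumed, so the \(q\)-ratio cancels in the acceptance probability, and all names are illustrative:

```python
import numpy as np

def pm_step(x_sel, x_comp, Lhat_curr, propose, likelihood, prior_comp_sampler,
            prior_sel_logpdf, M, rng):
    """One pseudo-marginal iteration with recycling (symmetric proposal)."""
    z_sel = propose(x_sel, rng)                      # candidate for x_I
    z_comp = prior_comp_sampler(M, rng)              # (M, |I^c|) prior draws
    L_vals = np.array([likelihood(z_sel, zc) for zc in z_comp])
    Lhat_prop = L_vals.mean()                        # estimate of reduced lik.
    log_alpha = (np.log(Lhat_prop) + prior_sel_logpdf(z_sel)
                 - np.log(Lhat_curr) - prior_sel_logpdf(x_sel))
    if np.log(rng.uniform()) < log_alpha:            # accept candidate set
        weights = L_vals / L_vals.sum()              # recycling probabilities
        j = rng.choice(M, p=weights)                 # pick one prior draw
        return z_sel, z_comp[j], Lhat_prop
    return x_sel, x_comp, Lhat_curr                  # reject: keep state
```

Retaining `Lhat_curr` across iterations (rather than re-estimating it) is what makes the scheme a proper pseudo-marginal method.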
4 Methodology for linear-Gaussian likelihood
In this section we describe a detailed application of the certified coordinate selection method for a posterior density in the form
$$\begin{aligned} \pi (x) \propto \exp \left( -\frac{1}{2} \Vert y - A x \Vert _{\Sigma _{\textrm{obs}}^{-1}}^2 \right) \prod _{i=1}^{d} \frac{\delta _i}{2} \exp \left( -\delta _i |x_i| \right) , \end{aligned}$$
where the noise follows the Gaussian distribution \(\mathcal {N}(0,\Sigma _{\textrm{obs}})\).
Note that to obtain h in (13) we need to compute an expectation over the posterior density. The next lemma shows that, for a linear-Gaussian likelihood, the diagnostic h admits a closed-form expression involving only the posterior mean and the posterior covariance.
Lemma 4
Let \(\mathcal {L}(x) \propto \exp (-\frac{1}{2} \Vert y - A x \Vert _{\Sigma _{\textrm{obs}}^{-1}}^2 )\) where \(\Sigma _{\textrm{obs}}\in \mathbb {R}^{m\times m}\) is positive definite and assume the mean \(\mu \) and the covariance \(\Sigma \) of the probability density \(\pi (x) \propto \mathcal {L}(x) \pi _0(x)\) exist. Then we can compute the diagnostic h as
$$\begin{aligned} h = \Lambda \left( \left( A^\intercal \Sigma _{\textrm{obs}}^{-1} (y - A\mu ) \right) ^{\circ 2} + {\text {diag}}\left( A^\intercal \Sigma _{\textrm{obs}}^{-1} A\, \Sigma \, A^\intercal \Sigma _{\textrm{obs}}^{-1} A \right) \right) , \end{aligned}$$
where \((\cdot )^{\circ 2}\) denotes entry-wise square and \(\Lambda ={\text {diag}}\left( 1/\delta _1^2,\dots ,1/\delta _d^2\right) \).
Proof
See section A.2.2. \(\square \)
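To illustrate the lemma, the sketch below evaluates one consistent reading of its closed form, \(h = \Lambda \big ( (A^\intercal \Sigma _{\textrm{obs}}^{-1}(y-A\mu ))^{\circ 2} + {\text {diag}}(M \Sigma M) \big )\) with \(M = A^\intercal \Sigma _{\textrm{obs}}^{-1} A\); this expression is our reconstruction from the surrounding definitions, and, as the lemma states, it uses only the posterior mean and covariance:

```python
import numpy as np

def diagnostic_linear_gaussian(A, Sigma_obs, y, mu, Sigma, delta):
    """Closed-form diagnostic for a linear-Gaussian likelihood (sketch).

    h = Lambda * ((A^T S^-1 (y - A mu))**2 + diag(M Sigma M)),
    with M = A^T S^-1 A and Lambda = diag(1/delta_i^2).
    """
    Sinv = np.linalg.inv(Sigma_obs)
    M = A.T @ Sinv @ A
    r = A.T @ Sinv @ (y - A @ mu)            # mean part of the gradient
    return (r**2 + np.diag(M @ Sigma @ M)) / delta**2
```

The formula follows because the gradient \(g(x) = A^\intercal \Sigma _{\textrm{obs}}^{-1}(y-Ax)\) is affine in x, so its second moment depends on \(\pi \) only through \(\mu \) and \(\Sigma \).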
If the posterior mean and the posterior covariance are unknown, a first and intuitive choice is to replace them respectively by the prior mean \(\mu _0=0\) and the prior covariance \(\Sigma _0 = 2\Lambda \). This yields
$$\begin{aligned} \tilde{h}_{\textrm{prior}} = \Lambda \left( \left( A^\intercal \Sigma _{\textrm{obs}}^{-1} y \right) ^{\circ 2} + 2\, {\text {diag}}\left( A^\intercal \Sigma _{\textrm{obs}}^{-1} A\, \Lambda \, A^\intercal \Sigma _{\textrm{obs}}^{-1} A \right) \right) . \end{aligned}$$
In the following, we show how to obtain a more precise estimate of the diagnostic than the prior-informed estimate (22). To this end, we employ a Gaussian approximation at the maximum-a-posteriori (MAP) estimate. Note that the negative logarithm of (20) is strictly convex and that its minimizer, i.e., the MAP-estimate, can be computed efficiently via convex optimization toolboxes even for high-dimensional problems.
4.1 MAP-approximated diagnostic
The density in (20) is unimodal and differs from a Gaussian only in that the norm in the prior is \(l^1\) instead of \(l^2\). This motivates us to employ a common strategy in Bayesian inversion where the posterior density is approximated by a Gaussian centered at the MAP-estimate \(x_\textrm{MAP}\) (e.g., Murphy (2012)). That is, we estimate the mean as \(\mu \approx x_\textrm{MAP}\), and the covariance matrix as
$$\begin{aligned} \Sigma \approx H^{-1}, \end{aligned}$$
where H denotes the Hessian of the negative log-posterior evaluated at \(x_\textrm{MAP}\).
Plugging in \(x_\textrm{MAP}\) for \(\mu \) and \(H^{-1}\) for \(\Sigma \) in (21) we obtain
$$\begin{aligned} \tilde{h}_{\textrm{MAP}} = \Lambda \left( \left( A^\intercal \Sigma _{\textrm{obs}}^{-1} (y - A x_\textrm{MAP}) \right) ^{\circ 2} + {\text {diag}}\left( A^\intercal \Sigma _{\textrm{obs}}^{-1} A\, H^{-1} A^\intercal \Sigma _{\textrm{obs}}^{-1} A \right) \right) . \end{aligned}$$
The non-differentiability of \(|\cdot |\) poses an obstacle in computing (23). Inspired by Vogel (2002), we use the approximation
$$\begin{aligned} |x_i| \approx \sqrt{x_i^2 + \varepsilon }, \end{aligned}$$
where we can control the amount of smoothing around 0 with \(0<\varepsilon \ll 1\). With this, we obtain
$$\begin{aligned} H = A^\intercal \Sigma _{\textrm{obs}}^{-1} A + {\text {diag}}\left( \frac{\delta _1 \varepsilon }{(x_{\textrm{MAP},1}^2 + \varepsilon )^{3/2}}, \dots , \frac{\delta _d \varepsilon }{(x_{\textrm{MAP},d}^2 + \varepsilon )^{3/2}} \right) , \end{aligned}$$
where we now use \({\text {diag}}\left( \cdot \right) \) to describe a diagonal matrix with diagonal given by the vector argument.
The question remains how to choose \(\varepsilon \). It appears natural to choose \(\varepsilon \) very small to obtain a good approximation to the absolute value. However, we have
$$\begin{aligned} \frac{\delta _i \varepsilon }{(x_{\textrm{MAP},i}^2 + \varepsilon )^{3/2}} \longrightarrow {\left\{ \begin{array}{ll} 0, &{} x_{\textrm{MAP},i} \ne 0, \\ \infty , &{} x_{\textrm{MAP},i} = 0, \end{array}\right. } \quad \text {as } \varepsilon \rightarrow 0. \end{aligned}$$
Since we expect \(x_{\textrm{MAP},i}\approx 0\) for many coordinates, \(\varepsilon \) must not be chosen too small, to avoid fast, nearly non-smooth changes among the elements of H, which lead to numerical instabilities when computing its inverse, needed in (24). Hence, we set \(\varepsilon \) according to the following heuristic.
Observe that (26) resembles the inverse of the covariance matrix of a Gaussian posterior density constructed by a linear-Gaussian likelihood and a Gaussian prior. In this light \(\delta _i^{-1} \varepsilon ^{-1} (\sqrt{{x_\textrm{MAP,i}}^2 +\varepsilon } )^{3}\) represents the variance of the i-th component. Now our heuristic rule is that these variances should be at least as large as the smallest variance of the chosen Laplace prior. Therefore, we require
We can assume that \(\min _{i=1,\dots ,d} |x_{\textrm{MAP},i}| = 0\), such that we obtain
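With the smoothing in place, the approximate Hessian can be assembled as in the following sketch (names are illustrative); the diagonal prior contribution \(\delta _i \varepsilon (x_{\textrm{MAP},i}^2+\varepsilon )^{-3/2}\) follows from differentiating \(\delta _i \sqrt{x_i^2+\varepsilon }\) twice, which the test below also checks numerically:

```python
import numpy as np

def smoothed_hessian(A, Sigma_obs, x_map, delta, eps):
    """Hessian of the negative log-posterior with |x_i| ~ sqrt(x_i^2 + eps).

    The smoothed prior contributes the diagonal
      d^2/dx^2 [delta * sqrt(x^2 + eps)] = delta * eps / (x^2 + eps)**1.5.
    """
    Sinv = np.linalg.inv(Sigma_obs)
    prior_diag = delta * eps / (x_map**2 + eps) ** 1.5
    return A.T @ Sinv @ A + np.diag(prior_diag)
```

Since the likelihood part is positive semi-definite and the prior part strictly positive, H is symmetric positive definite and safe to invert for \(\varepsilon > 0\).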
4.2 Preconditioned MALA
The approximation in (25) enables employing MALA, since it allows for the computation of approximate gradients. MALA is derived by discretizing a Langevin diffusion equation and steers the sampling process by using gradient information of the log-target density Robert and Casella (2004). While different algorithms have been developed to sample non-smooth log-densities (see, e.g., Lau et al. (2023)), the proposed smoothing is simple to implement and computationally cheap. Moreover, combined with the Metropolis step, we obtain convergence to the target density.
We note that for the approximation of the prior gradient in the MALA implementation,
$$\begin{aligned} \nabla \log \pi _0(x) \approx - \left( \frac{\delta _1 x_1}{\sqrt{x_1^2+\varepsilon }}, \dots , \frac{\delta _d x_d}{\sqrt{x_d^2+\varepsilon }} \right) ^\intercal , \end{aligned}$$
the smoothing parameter \(\varepsilon \) can be different from the one in the heuristic rule (27). In fact, for the MALA-proposal, we should have \(0<\varepsilon \ll 1\) such that proposals computed from a state near 0 are close to the exact discretized Langevin diffusion. We note that in our experiments, even for relatively large choices of \(\varepsilon \), e.g., using (27), the MCMC chains appear to converge. However, the larger \(\varepsilon \) is in (28), the more the samples concentrate around 0. Hence, we choose \(\varepsilon =10^{-8}\) in our numerical experiments.
Finally, we remark that if H (26) is invertible, the inverse \(H^{-1}\) can be used as preconditioner to make the MALA-proposal more efficient in high dimensions Martin et al. (2012); Petra et al. (2014). Furthermore, in the pseudo-marginal MCMC algorithm 3, a preconditioner for the local MALA-proposal for the update of \(x_{_{\mathcal {I}}}\) (line 2) can be obtained by projecting \(H^{-1}\) onto the selected coordinates.
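A sketch of the resulting preconditioned MALA proposal with the smoothed prior gradient \(-\delta _i x_i / \sqrt{x_i^2+\varepsilon }\); the noise vector is passed in explicitly to keep the map deterministic, and all names are our own:

```python
import numpy as np

def grad_log_post(x, A, Sinv, y, delta, eps):
    """Gradient of the smoothed log-posterior: Gaussian likelihood part
    plus the smoothed prior gradient -delta_i * x_i / sqrt(x_i^2 + eps)."""
    return A.T @ (Sinv @ (y - A @ x)) - delta * x / np.sqrt(x**2 + eps)

def mala_propose(x, grad, step, P, xi):
    """Preconditioned MALA proposal with diagonal preconditioner P
    (e.g., the diagonal of H^{-1}); xi is a standard normal draw."""
    return x + 0.5 * step**2 * P * grad + step * np.sqrt(P) * xi
```

The usual Metropolis correction with the corresponding forward/backward proposal densities is applied on top of this map.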
5 Numerical experiments
In this section, we illustrate the performance of our methods in two different applications: a 1D deblurring problem and a 2D super-resolution problem. Algorithm 2 and algorithm 3 are used to obtain samples from the optimal approximate and exact posterior, respectively.
We compute 10 independent MCMC chains in each sampling experiment. This allows us to assess the variability of the sample statistics and to check the convergence of the MCMC chains by means of the estimated potential scale reduction (EPSR) statistic Gelman and Rubin (1992), denoted by \(\hat{R}\). In the following, we obtain \(\hat{R}<1.1\) in each sampling experiment, and thus assume that all our MCMC chains have converged.
We use the Python package arviz Kumar et al. (2019) to compute \(\hat{R}\). The same package is also used to obtain effective sample size (ESS) and credibility interval (CI) (see, e.g., Murphy (2012) for definitions).
Following the arguments outlined in section 4.1 and section 4.2, we use the heuristic rule (27) to choose \(\varepsilon \) whenever computing \(\tilde{h}_\textrm{MAP}\), and set \(\varepsilon =10^{-8}\) in the approximated gradient of our MALA-proposals.
5.1 1D signal deblurring
The main purpose of this experiment is to demonstrate the applicability of our diagnostic when performing the coordinate splitting. Additionally, we show results when using the pseudo-marginal MCMC algorithm 3 to sample the exact posterior, and when sampling from the optimal approximate posterior (algorithm 2).
5.1.1 Problem description
The data y is obtained artificially via
$$\begin{aligned} y = G s_{\textrm{true}} + e, \end{aligned}$$
where \(s_{\textrm{true}} \in \mathbb {R}^{1024}\) denotes the piece-wise constant ground truth, G is a Gaussian blur operator with kernel width 27 and standard deviation 3, and \(e \in \mathbb {R}^{1024}\) is a realization from \(\mathcal {N}(0, \sigma _\textrm{obs}^2 I)\) with \(\sigma _\textrm{obs}=0.03\). The true signal and the data are shown in Fig. 1.
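For illustration, data of this kind could be generated as follows. This is a sketch under the stated parameters; the circular convolution (periodic boundary conditions) is our assumption, and the ground-truth signal itself is a placeholder:

```python
import numpy as np

def gaussian_kernel(width=27, std=3.0):
    """Normalized Gaussian blur kernel with the stated width and std."""
    t = np.arange(width) - (width - 1) / 2
    k = np.exp(-t**2 / (2 * std**2))
    return k / k.sum()

def blur_and_observe(s_true, sigma_obs=0.03, rng=None):
    """Circularly blur the signal and add i.i.d. Gaussian noise."""
    k = gaussian_kernel()
    kpad = np.zeros_like(s_true)
    kpad[:k.size] = k
    kpad = np.roll(kpad, -(k.size // 2))              # center the kernel
    Gs = np.real(np.fft.ifft(np.fft.fft(s_true) * np.fft.fft(kpad)))
    rng = np.random.default_rng() if rng is None else rng
    return Gs + sigma_obs * rng.standard_normal(s_true.size)
```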
We employ a 10-level Haar wavelet transform with periodic boundary conditions and formulate the Bayesian inverse problem in the coefficient domain. Let W denote the wavelet synthesis operator from the synthesis formulation \(s=Wx\), and let \(W^\dagger \) denote its inverse, the forward discrete wavelet transform. The true coefficients \(x_\textrm{true}=W^\dagger s_\textrm{true}\) are sparse with \(\Vert x_\textrm{true}\Vert _0=60\), see Fig. 2. The posterior density formulated with respect to the coefficients reads
$$\begin{aligned} \pi (x) \propto \exp \left( -\frac{1}{2\sigma _\textrm{obs}^2} \Vert y - GWx \Vert _2^2 \right) \prod _{i=1}^{d} \frac{\delta _i}{2} \exp \left( -\delta _i |x_i| \right) . \end{aligned}$$
Thus, following our previous notation for the forward operator, we have \(A=GW\). We use the Python package pywt Lee et al. (2019) to compute the discrete wavelet transforms.
To take the different scales of the wavelet coefficients into account, we choose a different \(\delta _i\) for each level of the wavelet basis. We define
$$\begin{aligned} \delta _i = c\, 2^{\ell (i)/2} \end{aligned}$$
for some \(c>0\), where \(\ell (i) \in \{1,\dots ,10\}\) denotes the level of the i-th wavelet coefficient. We plot \(\delta _i\) for \(c=1\) in Fig. 2. Note that for \(c=1\), the prior in (29) on x corresponds to a Besov-\(\mathcal {B}_{11}^1\) prior Kolehmainen et al. (2012); Lassas and Siltanen (2009) on the signal s.
In the following, we investigate the estimation of the diagnostic and the performance of the pseudo-marginal MCMC algorithm 3 while varying the global parameter c. Note that the prior becomes tighter with larger c. We consider the cases \(c\in \{1,5,25\}\).
5.1.2 Bound on the Hellinger distance
In this section, we proceed as follows for each \(c\in \{1,5,25\}\). We compute a MAP-approximation \(\tilde{h}_\textrm{MAP}\) via (24) and a prior-approximation \(\tilde{h}_\textrm{prior}\) via (22). Then, we compare the bound (12) obtained via \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) to a reference bound, that we compute via a reference diagnostic \(\tilde{h}_\textrm{ref}\). Moreover, we compare the order of coordinates suggested by \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) to the reference order given by \(\tilde{h}_\textrm{ref}\).
We obtain \(\tilde{h}_\textrm{ref}\) via the Monte Carlo approximation
$$\begin{aligned} \tilde{h}_{\textrm{ref},i} = \frac{1}{\delta _i^2} \frac{1}{N} \sum _{k=1}^{N} \left( \partial _i \log \mathcal {L}(x^{(k)}) \right) ^2 , \end{aligned}$$
where \(\{x^{(k)}\}_{k=1}^{N}\) are posterior samples computed with MALA on the full dimension. The samples are taken from all 10 MCMC chains, each comprising \(2\times 10^6\) samples after discarding a burn-in period of \(10^5\) samples. Further, we keep only every 100-th sample to reduce correlation, resulting in a mean ESS of 500 over all chains and coordinates.
To compute \(\tilde{h}_\textrm{MAP}\), we require \(x_\textrm{MAP}\), which we obtain by using the convex optimization Python package cvxpy Diamond and Boyd (2016); Agrawal et al. (2018). We show \(W x_\textrm{MAP}\) in Fig. 3 and see that all estimates are close to \(s_\textrm{true}\).
The bounds on the Hellinger distance computed via \(\tilde{h}_\textrm{ref}\), \(\tilde{h}_\textrm{MAP}\), and \(\tilde{h}_\textrm{prior}\) for \(c\in \{1,5,25\}\) are shown in the left column of Fig. 4. The curves are generated by first sorting the diagnostics in ascending order and then plotting their cumulative sums. The vertical lines indicate the indices \(\{i|x_{\textrm{true},i}\ne 0\}\).
In the right column of Fig. 4, we illustrate the differences in ordering of coordinates suggested by \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) in comparison to \(\tilde{h}_\textrm{ref}\) as follows. For \(1\le g \le d\), consider \(\mathcal {S}_a(g) = (s_1, s_2, \dots , s_g)\), \(s_i \in \{1,2,\dots ,d\}\), to be a tuple defining the ascending ordering of the g most important coordinates implied by some diagnostic a. That is, the least and most important coordinates according to \(\mathcal {S}_a(g)\) are the coordinates with indices \(s_1\) and \(s_g\), respectively. Then, given an approximate diagnostic \(\tilde{h}_a\), we can plot, for each number of non-selected coordinates, i.e., \(g=|\mathcal {I}^c|\), the fraction of correctly chosen coordinates with respect to the reference ordering, i.e., \(|\mathcal {S}_\textrm{ref}(g) \cap \mathcal {S}_a(g)|\, / g\).
The left column of Fig. 4 indicates that as the prior becomes tighter with larger c, the same bound can be retained while more coordinates are included in \(\mathcal {I}^c\), and consequently, a more efficient dimension reduction is possible. Moreover, the upper bounds show that \(\tilde{h}_\textrm{MAP}\) provides, in general, a bound over \(\mathcal {I}^c\) similar to that obtained via \(\tilde{h}_\textrm{ref}\). However, for very tight priors, e.g., \(c=25\), all bounds are visually close to each other. The concentration of vertical lines on the right in all rows suggests that most of the indices \(\{i|x_{\textrm{true},i}\ne 0\} \) tend to be included in \(\mathcal {I}\) no matter which approximation to h is chosen. Recall that the diagnostic reveals the coordinates where the update from prior to posterior information is most evident. Consequently, the remaining vertical lines scattered across the graph correspond to coordinates where the data cannot be distinguished easily from prior information.
From the right column of Fig. 4, it can be seen that approximately \(70\%\) of the coordinates are always correctly selected for all \(c\in \{1,5,25\}\) and both approximate diagnostics. In line with the good visual agreement of the bounds obtained via \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{ref}\), their orderings are also similar: for all \(c\in \{1,5,25\}\), roughly \(95{-}100\%\) of the coordinates are selected correctly. The oscillatory behavior of the graphs on the right is due to the large influence that a single correctly or incorrectly selected coordinate has when the subset of selected coordinates is small.
5.1.3 Sampling the exact posterior
We sample the exact posterior (29) for \(c\in \{1,5,25\}\) with MALA on the full dimensional space, as well as with the pseudo-marginal MCMC algorithm 3 with MALA-proposals on the selected coordinates. In the following, we refer to these methods as ‘MALA’ and ‘PM-MALA’, respectively.
To compute \(\nabla \log \pi \) for the MALA-proposals, we use the approximated prior gradient (28). Furthermore, we require the adjoints of W and G for the likelihood gradient. We compute \(W^*\) with the technique from Folberth and Becker (2016), which involves handling the padding of the boundary conditions manually. Regarding the blurring operator, we have \(G^*=G\).
We obtain \(H^{-1}\) via (26) and use only its diagonal as a preconditioner for the MALA-proposals to save computational cost. For PM-MALA, where the MALA-proposals are employed to update \(x_{_{\mathcal {I}}}\), we project \(H^{-1}\) onto the selected coordinates and also use only its diagonal. We sample \(M=5\) vectors of \(x_{_{\mathcal {I}^c}}\) in each iteration, inspired by the numerical experiments in Cui and Zahm (2021).
We sample 10 independent chains for each configuration as follows. In total, we compute \(10^6\) samples for each chain and save every 50-th sample to decrease correlation. We use a burn-in period of \(10^5\) samples during which we adapt the step size to achieve a fixed acceptance rate. Following Roberts and Rosenthal (2001), we target an acceptance rate of 0.574 for x in the MALA runs and for \(x_{_{\mathcal {I}}}\) in the PM-MALA runs.
Note that we need to select enough coordinates to achieve a stable acceptance rate during the PM-MALA iterations. Based on some pilot runs and on the MAP-approximated bounds in Fig. 4, we select \(n_\mathcal {I}\in \{726, 311, 127 \}\) for \(c\in \{1,5,25\}\), respectively. With these choices of \(n_\mathcal {I}\), we expect \(H^2\!\left( {\pi }, {\widetilde{\pi }^\dagger } \right) < 0.2\) for each \(c\in \{1,5,25\}\). In section 5.1.4, we sample the quasi-optimal approximate posterior \(\widetilde{\pi }^\dagger \) and check the expected bound numerically.
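The selection of \(n_\mathcal {I}\) from a target tolerance can be sketched as follows. For illustration we assume that the squared-Hellinger bound equals the sum of the per-coordinate diagnostic values over the unselected coordinates; the paper's bound (12) fixes the exact constant, so the function below is a hypothetical simplification.

```python
import numpy as np

def select_coordinates(h, tol):
    """Pick the smallest index set I such that the tail sum of the
    diagnostic over I^c drops below tol.

    Illustrative assumption: the squared-Hellinger bound is the sum of
    diagnostic values over the unselected coordinates.
    """
    order = np.argsort(h)[::-1]  # most relevant coordinates first
    # tail[n] = bound after selecting the n largest coordinates
    tail = np.concatenate(([h.sum()], h.sum() - np.cumsum(h[order])))
    n_sel = next(n for n in range(len(h) + 1) if tail[n] <= tol)
    return order[:n_sel], tail[n_sel]

h = np.array([0.5, 0.05, 0.3, 0.02, 0.1])  # toy diagnostic vector
selected, residual = select_coordinates(h, tol=0.1)
```

Sorting the diagnostic once and cutting at the tolerance makes the trade-off between \(n_\mathcal {I}\) and the expected bound explicit.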
In Fig. 5 we show the \(99\%\) CI for \(c\in \{1,5,25\}\) in the signal space and observe that the CI becomes narrower with increasing c. At the same time, the reduced uncertainty for large c enables more efficient pseudo-marginal MCMC sampling, since \(\mathcal {I}\) can be chosen smaller with increasing c whilst still obtaining good mixing. This can be seen in Table 1, where mixing in terms of ESS is significantly better for PM-MALA than for MALA on the full dimension. In particular, the ESS for coordinate indices in \(\mathcal {I}^c\) is close to the total number of samples since the proposals for \(x_{_{\mathcal {I}^c}}\) are drawn independently. Table 1 also shows that running MALA on the reduced-dimensional space tends to allow for larger step sizes, which contributes to improved mixing in \(x_{_{\mathcal {I}}}\).
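The ESS values can be estimated per coordinate from the autocorrelation of the chain. A minimal sketch of one common variant (the initial-positive-sequence rule; ArviZ, which the paper's ecosystem references, provides a more refined estimator) is:

```python
import numpy as np

def ess(chain):
    """Effective sample size n / (1 + 2 * sum of positive autocorrelations),
    truncating the sum at the first nonpositive lag (a common heuristic)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    # autocovariance via FFT with zero padding
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(2)
iid = rng.standard_normal(5000)   # uncorrelated chain: ESS close to n
ar = np.zeros(5000)               # AR(1) chain with strong correlation
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

The strongly correlated AR(1) chain should report a much smaller ESS than the independent one, mirroring the MALA versus PM-MALA comparison in Table 1.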
We note that depending on the complexity of the forward operator and the choice of M, the pseudo-marginal algorithm 3 may require significantly more runtime than a Metropolis-Hastings algorithm with the same proposal density in full dimensions. Recall that M corresponds to the number of forward model evaluations in each sampling step (algorithm 3, line 4). In our experiments, we set \(M=5\), and PM-MALA requires approximately the same runtime as MALA.
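The M forward-model evaluations per step correspond to an unbiased Monte Carlo estimate of the reduced likelihood, obtained by averaging the likelihood over M prior draws of the unselected coordinates. A hypothetical one-dimensional illustration (the toy likelihood and prior scale are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)

def likelihood(x):
    # toy Gaussian likelihood in 2D (illustration only)
    return np.exp(-0.5 * np.sum((x - 1.0) ** 2))

def pm_estimate(x_I, M, rng, delta=1.0):
    """Unbiased estimate of the reduced likelihood at x_I, averaging M
    likelihood evaluations with the unselected coordinate drawn from a
    Laplace prior with rate delta. This is the role of the M forward-model
    evaluations per sampling step in algorithm 3."""
    x_Ic = rng.laplace(scale=1.0 / delta, size=M)
    vals = [likelihood(np.array([x_I, xc])) for xc in x_Ic]
    return np.mean(vals)

est = pm_estimate(0.5, M=5, rng=rng)
```

Because the estimate is unbiased, the pseudo-marginal chain still targets the exact posterior; larger M reduces the variance of the estimate at the cost of more forward evaluations.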
5.1.4 Sampling the approximate posterior
In this section we use algorithm 2 to sample the optimal approximate posterior (6) for \(c\in \{1,5,25\}\). To this end, we employ a MALA-proposal to sample \(x_{_{\mathcal {I}}}\) in line 1. As outlined in section 3.1, we approximate the optimal reduced likelihood (9) by fixing the non-selected coordinates to the prior mean, such that the optimal approximate posterior reads
We select \(n_\mathcal {I}\in \{726, 311, 127 \}\) for \(c\in \{1,5,25\}\), respectively. With these choices of \(n_\mathcal {I}\), we expect \(H\!\left( {\pi }, {\widetilde{\pi }} \right) ^2 \le 0.2\) according to \(\tilde{h}_{\textrm{MAP}}\) in Fig. 4. An estimation of the Hellinger distance based on samples from \(\widetilde{\pi }\) would allow for assessing the quality of the optimal approximate posterior and for checking the tightness of our bound (12). However, a numerical estimate of the Hellinger distance based on samples tends to be unstable because the normalizing constants of \(\pi \) and \(\widetilde{\pi }\) are unknown.
Instead, we can obtain a numerical estimate of another bound on the Hellinger distance based on samples from any approximation \(\widetilde{\pi }\), which is independent of the normalizing constants as
where \(x^{(i)} \sim \widetilde{\pi }\), and \(\rho \) and \(\tilde{\rho }\) are the exact and the approximate unnormalized posterior densities, respectively. See section A.3 for the derivation of (32). While we can use this bound to assess the quality of the approximate posterior (31), it does not allow for any conclusions on the tightness of our bound (12), which we estimate through the MAP-approximated diagnostic.
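As an aside, the Hellinger affinity itself admits a normalization-free sample representation, \(1-H^2(\pi ,\widetilde{\pi })=\operatorname {E}_{\widetilde{\pi }}[\sqrt{\rho /\tilde{\rho }}]/\sqrt{\operatorname {E}_{\widetilde{\pi }}[\rho /\tilde{\rho }]}\). This is not the bound (32) from the text, but it illustrates why only the unnormalized ratio \(\rho /\tilde{\rho }\) is needed. A quick numerical check on two unit-variance Gaussians with deliberately mismatched (unknown) constants:

```python
import numpy as np

rng = np.random.default_rng(4)

# Unnormalized densities with different arbitrary constants.
rho = lambda x: 7.0 * np.exp(-0.5 * x**2)            # target, N(0,1) shape
rho_t = lambda x: 0.3 * np.exp(-0.5 * (x - 0.5)**2)  # approximation, N(0.5,1)

x = rng.normal(0.5, 1.0, size=200_000)   # samples from the approximation
r = rho(x) / rho_t(x)
# Normalization-free plug-in estimate of the squared Hellinger distance:
H2 = 1.0 - np.mean(np.sqrt(r)) / np.sqrt(np.mean(r))

# Closed form for two unit-variance Gaussians with mean gap 0.5.
H2_true = 1.0 - np.exp(-0.5**2 / 8.0)
```

The arbitrary constants 7.0 and 0.3 cancel, so the estimate matches the closed-form value; in high dimensions, however, the ratio \(\rho /\tilde{\rho }\) can have heavy tails, which is the instability referred to in the text.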
As in the previous section, we compute a preconditioner for the MALA-proposals by projecting the diagonal of \(H^{-1}\) onto the selected coordinates. We again sample 10 independent chains of \(2 \times 10^6\) samples with an additional burn-in period of \(10^5\) samples, during which the step size is adapted to target an acceptance rate of 0.574. We thin the chains to decrease auto-correlation by keeping only every 100th sample.
We show the sampling results in Table 2. For each \(c\in \{1,5,25\}\), the sample-approximated upper bound is about half of the expected upper bound of 0.2, which we obtain via \(\tilde{h}_\textrm{MAP}\). This suggests that the optimal approximate posterior is relatively close to the exact posterior. Moreover, the standard deviation across the chains of the sample-approximated upper bound is small, indicating that the estimate (33) is indeed stable.
5.2 2D super-resolution microscopy
The purpose of this experiment is to show that our coordinate selection method works well in high dimensions and that the optimal approximate posterior can be used to perform efficient inference. The test problem is inspired by the application of stochastic optical reconstruction microscopy (STORM) from Zhu et al. (2012). A similar example was considered in the Bayesian context in Durmus et al. (2018). STORM is a super-resolution microscopy method based on single-molecule stochastic switching, where the goal is to detect molecule positions in live cell imaging. The images are obtained by a microscope detecting the photon count of the photoactivated fluorescent molecules.
5.2.1 Problem description
We consider a microscopic image \(y\in \mathbb {R}^m\), which is obtained from a 2D pixel-array by concatenation in the usual column-wise fashion. Here, we set \(m=32^2=1024\). In STORM, we want to estimate precise molecule positions by computing a super-resolution image \(x\in \mathbb {R}^{d}\). In this example, we set the oversampling ratio \(k=4\), which leads to \(d=mk^2=16 384\). Based on the kernel from the optical measurement instrument given in Zhu et al. (2012), we generate the forward operator \(A\in \mathbb {R}^{m\times d}\). The data y is obtained via
where \(e \in \mathbb {R}^m\) is simulated from \(\mathcal {N}(0, \sigma _\textrm{obs}^2)\).
Similarly to Zhu et al. (2012), we generate the ground-truth image \(x_\textrm{true}\) for the high photon count case with 50 uniformly distributed molecules on a field of size \(4\,\upmu \textrm{m}\times 4\,\upmu \textrm{m}\). The intensity of each molecule is simulated from a lognormal distribution with mode 3000 and standard deviation 1700. In Fig. 6 we show the ground-truth image and the data, which is obtained according to (34) with \(\sigma _\textrm{obs}=30\).
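The data generation described above can be sketched as follows. The separable Gaussian point-spread function, its width, and the lognormal parameters below are illustrative placeholders; the paper uses the measured kernel from Zhu et al. (2012) and intensities matched to mode 3000 and standard deviation 1700.

```python
import numpy as np

rng = np.random.default_rng(5)

m_side, k = 32, 4            # data grid side and oversampling ratio
d_side = m_side * k          # super-resolution grid side: d = 128^2 = 16384

def gaussian_blur(img, sigma):
    # circular blur via the FFT; g is the Fourier transform of a Gaussian
    n = img.shape[0]
    freq = np.fft.fftfreq(n)
    g = np.exp(-2.0 * (np.pi * freq * sigma) ** 2)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.outer(g, g)))

def forward(x_img, k, sigma_psf=2.0):
    """Simplified stand-in for the STORM operator A: PSF blur on the fine
    grid followed by k-fold downsampling (sum over k x k pixel blocks)."""
    blurred = gaussian_blur(x_img, sigma_psf)
    n = x_img.shape[0]
    return blurred.reshape(n // k, k, n // k, k).sum(axis=(1, 3))

# Sparse ground truth: 50 random molecule positions with lognormal
# intensities (parameters 8.0 and 0.5 are hypothetical placeholders).
x_true = np.zeros((d_side, d_side))
idx = rng.integers(0, d_side, size=(50, 2))
x_true[idx[:, 0], idx[:, 1]] = rng.lognormal(8.0, 0.5, size=50)

# Data y = A x + e with additive Gaussian noise, as in (34).
y = forward(x_true, k) + rng.normal(0.0, 30.0, size=(m_side, m_side))
```

Block-summing after the blur preserves the total photon count, which is the physically natural choice for a photon-counting detector.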
We use a Laplace prior due to the sparse behavior of \(x_\textrm{true}\), which leads to the posterior density
In Fig. 6 we also show the MAP-estimate with \(\delta =1.275\), where \(\delta \) is chosen based on the visual quality after some pilot runs.
5.2.2 Bound on the Hellinger distance
We use the MAP-approximation (24) to estimate the diagnostic and to compute the bound on the Hellinger distance, which we show in the left panel in Fig. 7. The MAP-approximation detects the coordinates of interest very accurately, which may be due to the good quality of the MAP-estimate. Although we obtain very large bounds on the Hellinger distance for this example, we can still employ the diagnostic to detect the most relevant coordinates to perform uncertainty quantification on the molecule positions.
5.2.3 Sampling the approximate posterior and uncertainty quantification
We use the MAP-approximated diagnostic to select 1000 coordinates, which we show in the center of Fig. 7. The posterior density is again approximated as in (31). We sample the posterior by using the No-U-Turn sampler (NUTS) (Hoffman and Gelman 2014) implemented in the Python package pyro (Bingham et al. 2018). After sampling 10 independent chains with 30,000 burn-in samples and 10,000 posterior samples for each chain, we obtain converged chains with \(\max _i \hat{R}_i = 1.01\), and an averaged ESS (over all chains and components) equal to 168.
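The convergence statistic \(\hat{R}\) compares between-chain and within-chain variance. A minimal sketch of the classical Gelman–Rubin version is below; the value reported above comes from the refined rank-normalized estimator implemented in ArviZ, so this is a simplified illustration.

```python
import numpy as np

def rhat(chains):
    """Classical Gelman-Rubin statistic for an array of shape
    (n_chains, n_samples): sqrt of the ratio of the pooled variance
    estimate to the mean within-chain variance."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(6)
mixed = rng.standard_normal((10, 1000))      # well-mixed chains: rhat near 1
stuck = mixed + np.arange(10)[:, None]       # chains stuck at different levels
```

Values close to 1 (such as the 1.01 reported above) indicate that the chains are sampling the same distribution.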
In this example, we cannot estimate the bound on the Hellinger distance via (32), since the ratio \(\frac{\rho ( x^{(i)} )}{\tilde{\rho }(x^{(i)})}\) computed with our samples \(\{x^{(i)}\}_{i=1}^N \sim \widetilde{\pi }(x)\) is unstable. However, as can be observed from the center plot in Fig. 7, our diagnostic is able to select the correct molecule coordinates and the relevant neighbourhoods around them. Therefore, we can still use the samples from \(\widetilde{\pi }(x)\) to perform uncertainty quantification on the intensity of the photons and on the true molecule positions as follows.
To illustrate the uncertainty in the intensity, we plot the \(99\%\) CI for the selected molecules in the right panel of Fig. 7. The wide CIs can be attributed to the large range of photon intensities. Further, we observe that the approximate posterior tends to have larger CI at the true molecule positions, which are marked in red.
Now we estimate the uncertainty in the true molecule positions in the super-resolution grid by applying the following procedure. We select the pixels of the 50 largest posterior means as the detected molecule positions. Then, we move each of these posterior means pixel-wise vertically and horizontally until they leave the \(99\%\) CI of their neighbouring pixels. The corresponding distances reflect the uncertainty in vertical and horizontal direction. The average, taken over both directions, amounts to 3.82 pixels, or \(118.4\,\textrm{nm}\). We note that this result is in agreement with the results in Zhu et al. (2012).
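One way to read the pixel-walk procedure above is sketched below on synthetic inputs; the Gaussian bump, the CI half-width, and the stopping rule (walk until the detected pixel's posterior mean leaves the neighbour's CI) are our illustrative interpretation, not the paper's exact implementation.

```python
import numpy as np

def position_uncertainty(mean_img, lo, hi, pos):
    """Walk from a detected pixel up/down/left/right until the posterior
    mean at the detected pixel leaves the CI [lo, hi] of the neighbouring
    pixel; return the average distance over the four directions."""
    val = mean_img[pos]
    dists = []
    for step in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
        r, c = pos
        dist = 0
        while True:
            r, c = r + step[0], c + step[1]
            if not (0 <= r < mean_img.shape[0] and 0 <= c < mean_img.shape[1]):
                break
            if not (lo[r, c] <= val <= hi[r, c]):
                break
            dist += 1
        dists.append(dist)
    return np.mean(dists)

# Synthetic posterior mean: a Gaussian bump with a constant-width CI band.
n = 21
yy, xx = np.mgrid[0:n, 0:n]
mean_img = 100.0 * np.exp(-((yy - 10) ** 2 + (xx - 10) ** 2) / 18.0)
lo, hi = mean_img - 10.0, mean_img + 10.0
u = position_uncertainty(mean_img, lo, hi, (10, 10))
```

Multiplying the resulting pixel distance by the physical pixel size converts it to a length, as done for the \(118.4\,\textrm{nm}\) figure in the text.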
6 Conclusions
We outlined a coordinate selection method for high-dimensional Bayesian inverse problems with product-form Laplace prior. Inspired by the CDR methodology, we defined an approximate posterior density by replacing the likelihood with a ridge approximation. The ridge approximation is constructed such that it varies mainly on the coordinates that contribute the most to the update from the prior to the posterior. Based on a bound in the Hellinger distance between the exact and a quasi-optimal approximate posterior, we then derived a diagnostic vector \(h \in \mathbb {R}^d\), which can be used to select the important coordinates.
After performing the coordinate selection, it is relatively easy to sample the approximate posterior. An additional advantage of our coordinate splitting is that advanced MCMC algorithms, such as delayed acceptance MCMC or pseudo-marginal MCMC, can be employed to sample the exact posterior.
The computation of h involves, however, integrating over the posterior density. For the case of a linear forward operator with additive Gaussian error, we presented a tractable methodology for estimating the diagnostic h before performing Bayesian inference via, e.g., MCMC methods.
The numerical results indicate that our methodology, which estimates the diagnostic based on a MAP-estimate, succeeds in revealing the most important coordinates. This enabled us to sample the approximate posterior very efficiently. Furthermore, the coordinate splitting allowed us to employ the pseudo-marginal MCMC algorithm to sample the exact posterior. Here our results show that the pseudo-marginal MCMC algorithm with MALA-proposals on the selected coordinates performs significantly better in terms of mixing of the sample chains when compared to MALA on the full-dimensional space.
Our methodology for estimating the diagnostic based on a MAP-estimate hinges on a smoothing approximation of the prior. This introduces an additional parameter \(\varepsilon \) which we fix following a heuristic rule. However, other ways for estimating the diagnostic not only in the case of a linear-Gaussian likelihood, but also for more general problems with non-linear forward operator and/or non-Gaussian likelihood should be investigated.
Moreover, we approximate the optimal reduced likelihood by setting the non-selected coordinates to zero (the prior mean). While this approximation yields good results in the first example, the approximation deteriorates in the second high-dimensional example. Therefore, better approximations to the optimal reduced likelihood should be explored as well.
References
Agrawal, A., Verschueren, R., Diamond, S., Boyd, S.: A rewriting system for convex optimization problems. J. Control Decis. 5(1), 42–60 (2018)
Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37(2), 697–725 (2009)
Bakry, D., Gentil, I., Ledoux, M.: Analysis and Geometry of Markov Diffusion Operators, vol. 348. Springer International Publishing, Cham (2014). https://doi.org/10.1007/978-3-319-00227-9
Bingham, E., Chen, J.P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., Goodman, N.D.: Pyro: deep universal probabilistic programming. J. Mach. Learn. Res. 20(28), 1–6 (2019)
Brennan, M.C., Bigoni, D., Zahm, O., Spantini, A., Marzouk, Y.: Greedy Inference with Structure-Exploiting Lazy Maps. Adv. Neural. Inf. Process. Syst. 33, 8330–8342 (2020)
Cai, X., Pereyra, M., McEwen, J.D.: Uncertainty quantification for radio interferometric imaging–I. Proximal MCMC methods. Mon. Not. R. Astron. Soc. 480(3), 4154–4169 (2018). https://doi.org/10.1093/mnras/sty2004
Chen, P., Ghattas, O.: Projected Stein Variational Gradient Descent. Adv. Neural. Inf. Process. Syst. 33, 1947–1958 (2020)
Cui, T., Tong, X.T.: A unified performance analysis of likelihood-informed subspace methods. Bernoulli 28, 2788–2815 (2021)
Cui, T., Zahm, O.: Data-free likelihood-informed dimension reduction of Bayesian inverse problems. Inverse Prob. 37(4), 045009 (2021). https://doi.org/10.1088/1361-6420/abeafb
Cui, T., Tong, X.T., Zahm, O.: Prior normalization for certified likelihood-informed subspace detection of Bayesian inverse problems. Inverse Prob. 38(12), 124002 (2022). https://doi.org/10.1088/1361-6420/ac9582
Diamond, S., Boyd, S.: CVXPY: a Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17(83), 1–5 (2016)
Durmus, A., Moulines, É., Pereyra, M.: Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sci. 11(1), 473–506 (2018). https://doi.org/10.1137/16M1108340
Ehre, M., Flock, R., Fußeder, M., Papaioannou, I., Straub, D.: Certified dimension reduction for Bayesian updating with the cross-entropy method. SIAM ASA J. Uncertain. Quantif. 11(1), 358–388 (2023)
Elad, M., Milanfar, P., Rubinstein, R.: Analysis versus synthesis in signal priors. Inverse Prob. 23(3), 947–968 (2007). https://doi.org/10.1088/0266-5611/23/3/007
Folberth, J., Becker, S.: Efficient adjoint computation for wavelet and convolution operators [lecture notes]. IEEE Signal Process. Mag. 33(6), 135–147 (2016). https://doi.org/10.1109/MSP.2016.2594277
Gelman, A., Rubin, D.B.: Inference from iterative simulation using multiple sequences. Stat. Sci. 7(4), 457–472 (1992)
Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
Hosseini, B.: Well-posed Bayesian inverse problems with infinitely divisible and heavy-tailed prior measures. SIAM ASA J. Uncertain. Quantif. 5(1), 1024–1060 (2017)
Kolehmainen, V., Lassas, M., Niinimäki, K., Siltanen, S.: Sparsity-promoting Bayesian inversion. Inverse Prob. 28, 025005 (2012). https://doi.org/10.1088/0266-5611/28/2/025005
Kumar, R., Carroll, C., Hartikainen, A., Martin, O.: ArviZ a unified library for exploratory analysis of Bayesian models in Python. J. Open Source Softw. 4(33), 1143 (2019). https://doi.org/10.21105/joss.01143
Lassas, M., Siltanen, S.: Discretization-invariant Bayesian inversion and Besov space priors. Inverse Probl. Imaging 3(1), 87–122 (2009)
Lau, T.T.-K., Liu, H., Pock, T.: Non-Log-Concave and Nonsmooth Sampling via Langevin Monte Carlo Algorithms (2023). arXiv preprint arXiv:2305.15988
Lee, G.R., Gommers, R., Waselewski, F., Wohlfahrt, K., O’Leary, A.: PyWavelets: a Python package for wavelet analysis. J. Open Source Softw. 4(36), 1237 (2019). https://doi.org/10.21105/joss.01237
Li, M.T., Marzouk, Y., Zahm, O.: Principal Feature Detection via \(\phi \)-Sobolev Inequalities. arXiv preprint arXiv:2305.06172 (2023)
Markkanen, M., Roininen, L., Huttunen, J.M., Lasanen, S.: Cauchy difference priors for edge-preserving Bayesian inversion. J. Inverse Ill Posed Probl. 27(2), 225–240 (2019)
Martin, J., Wilcox, L.C., Burstedde, C., Ghattas, O.: A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 34(3), 1460–1487 (2012). https://doi.org/10.1137/110845598
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series, MIT Press, Cambridge (2012)
Park, T., Casella, G.: The Bayesian Lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008). https://doi.org/10.1198/016214508000000337
Pereyra, M.: Proximal Markov chain Monte Carlo algorithms. Stat. Comput. 26(4), 745–760 (2016). https://doi.org/10.1007/s11222-015-9567-4
Petra, N., Martin, J., Stadler, G., Ghattas, O.: A computational framework for infinite-dimensional Bayesian inverse problems, part II: stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM J. Sci. Comput. 36(4), 1525–1555 (2014). https://doi.org/10.1137/130934805
Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Texts in Statistics, Springer New York, New York (2004). https://doi.org/10.1007/978-1-4757-4145-2
Roberts, G.O., Rosenthal, J.S.: Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001). https://doi.org/10.1214/ss/1015346320
Simoncelli, E.P.: Modeling the joint statistics of images in the wavelet domain. In: Unser, M.A., Aldroubi, A., Laine, A.F. (eds.) SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation, Denver, CO, pp. 188–195 (1999). https://doi.org/10.1117/12.366779
Suuronen, J., Soto, T., Chada, N.K., Roininen, L.: Bayesian inversion with \(\alpha \)-stable priors. Inverse Prob. 39(10), 105007 (2023)
Uribe, F., Papaioannou, I., Marzouk, Y.M., Straub, D.: Cross-entropy-based importance sampling with failure-informed dimension reduction for rare event simulation. SIAM ASA J. Uncertain. Quantif. 9, 818–847 (2020)
Uribe, F., Dong, Y., Hansen, P.C.: Horseshoe priors for edge-preserving linear Bayesian inversion. SIAM J. Sci. Comput. 45, B337–B365 (2022)
Vogel, C.R.: Computational Methods for Inverse Problems. SIAM, Philadelphia (2002). https://doi.org/10.1137/1.9780898717570
Zahm, O., Cui, T., Law, K., Spantini, A., Marzouk, Y.: Certified dimension reduction in nonlinear Bayesian inverse problems. Math. Comput. 91(336), 1789–1835 (2022). https://doi.org/10.1090/mcom/3737
Zhu, L., Zhang, W., Elnatan, D., Huang, B.: Faster STORM using compressed sensing. Nat. Methods 9(7), 721–723 (2012). https://doi.org/10.1038/nmeth.1978
Acknowledgements
RF and YD were supported by a Villum Investigator Grant (No. 25893) from The Villum Foundation. FU was supported by the Research Council of Finland through the Flagship of Advanced Mathematics for Sensing, Imaging and Modelling, and the Centre of Excellence of Inverse Modelling and Imaging (decision numbers 359183 and 353095, respectively).
Funding
Open access funding provided by Technical University of Denmark
Author information
Contributions
All authors contributed to the study conception and design. Material preparation and numerical experiments were performed by R.F. The first draft of the manuscript was written by R.F. and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Proofs
A.1 Proposition 1
We begin by introducing the normalizing constants for \(\pi \), \(\widetilde{\pi }\) and \(\widetilde{\pi }^*\):
where we have used that the prior has product-form. Recall that for a coordinate splitting \(x=(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})\) we want to control the approximation \(\widetilde{\pi }(x) \propto \widetilde{\mathcal {L}}(x_{_{\mathcal {I}}}) \pi _0(x)\) for \(\pi (x) \propto \mathcal {L}(x) \pi _0(x)\) in the Hellinger distance.
We split the remaining proof into two parts. In the first part we will show that the choice
for \(\widetilde{\mathcal {L}}\) minimizes the Hellinger distance and is thus the optimal reduced likelihood function. In the second part we will show that \(\widetilde{\mathcal {L}}^\dagger \) is quasi-optimal with respect to the Hellinger distance.
A.1.1 Optimal reduced likelihood function
For an approximate posterior defined by any reduced likelihood function, we can write
On the contrary, if we use the optimal reduced likelihood, the Hellinger distance reads
Combining the two results, we obtain
which concludes the first part of the proof.
A.1.2 Quasi-optimal reduced likelihood function
For the reduced likelihood \(\widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}})\) we have \(\widetilde{Z}^\dagger =Z\). Further, we can write
which implies \(\widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \le \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}})\) and hence \(\widetilde{Z}^*\le \widetilde{Z}^\dagger \). The latter also follows from \(\widetilde{Z}^\dagger =Z\) together with \(0 \le H\!\left( {\pi }, {\widetilde{\pi }^*} \right) ^2 = 1 - \frac{\sqrt{\widetilde{Z}^*}}{\sqrt{Z}}\) from section A.1.
With this we can write
which concludes the proof.
A.2 Proposition 2
Resuming from (A1) we have
According to Corollary 7, \(\pi _0(x_{_{\mathcal {I}^c}})\) satisfies the Poincaré inequality so that
where \(\Lambda ={\text {diag}}\left( 1/\delta _1^2, 1/\delta _2^2, \dots , 1/\delta _d^2\right) \). We let \({ g(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}) } = \sqrt{\mathcal {L}(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})}\), so that
Hence, we obtain
A.2.1 Poincaré inequality
The following proposition is a restatement of proposition 4.4.1 in Bakry et al. (2014).
Proposition 5
For the exponential probability density function \(\nu (x)=\delta \exp {\left( -\delta x\right) }\) on \(\mathbb {R}_+\) with rate parameter \(\delta >0\) and any differentiable function \(f:\mathbb {R}\rightarrow \mathbb {R}\) with \(f(0)=0\), the Poincaré inequality reads
Proof
Using \({\text {Var}}_{\nu }\left( f\right) = {\text {E}}_{\nu }\left[ f^2\right] -{\text {E}}_{\nu }\left[ f\right] ^2\) we can write
To go from line 2 to 3, observe that for \(h(t)=f(t) f'(t)\) we have
After applying the Cauchy–Schwarz inequality, we obtain the desired result. \(\square \)
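As a sanity check of Proposition 5, one can verify the inequality \({\text {Var}}_{\nu }(f) \le (4/\delta ^2)\, {\text {E}}_{\nu }[(f')^2]\) numerically, using the Poincaré constant \(4/\delta ^2\) from Proposition 4.4.1 in Bakry et al. (2014); the test function \(f(x)=\sin (x)\) with \(f(0)=0\) and the rate \(\delta =1.5\) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# Monte Carlo check of Var_nu(f) <= (4 / delta^2) * E_nu[(f')^2]
# for nu = Exponential(delta) and f(x) = sin(x), which satisfies f(0) = 0.
delta = 1.5
x = rng.exponential(scale=1.0 / delta, size=500_000)
f, fprime = np.sin(x), np.cos(x)
lhs = f.var()                                   # Var_nu(f)
rhs = (4.0 / delta**2) * np.mean(fprime**2)     # (4/delta^2) E_nu[(f')^2]
```

Here the left-hand side is well below the right-hand side, consistent with the proposition.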
Corollary 6
For the Laplace probability density function \(\mu (x)=\tfrac{\delta }{2} \exp {\left( -\delta |x|\right) }\) with rate parameter \(\delta >0\) on \(\mathbb {R}\) and any differentiable function \(f:\mathbb {R}\rightarrow \mathbb {R}\), the Poincaré inequality reads
Proof
In the following we assume without loss of generality \(f(0)=0\). For any such f we have
where \(\nu =\mathrm {Exponential}(\delta )\).
Applying the Poincaré inequality for \(\nu \) yields
\(\square \)
Corollary 7
For the product-form Laplace probability density function \(\mu (x)=\prod _{i=1}^d \delta _i\exp (-\delta _i|x_i|)\) on \(\mathbb {R}^d\) with rate parameters \(\delta _1, \delta _2, \dots , \delta _d >0\), the Poincaré inequality
holds for any differentiable function \(f:\mathbb {R}^d \rightarrow \mathbb {R}\).
Proof
The result follows directly from the stability under products of Poincaré inequalities, see Bakry et al. (2014). For the sake of completeness, we give the proof for \(d=2\) (a simple recursion extends the result to any \(d>2\)).
Let \(f:x\mapsto f(x_1,x_2)\) be a differentiable function. Without loss of generality, we assume f is centered, i.e., \(\int f \,\textrm{d}\mu =0\). For \(F(x_1)=\int f(x_1,x_2)\,\textrm{d}\mu _2(x_2)\), the Poincaré inequality yields
where we used Jensen’s inequality in the last step. This gives the result for \(d=2\).\(\square \)
A.2.2 Lemma 4
We consider the case of a Gaussian likelihood (for fixed data y) of the form
Then, continuing from (A2) we can write
Given the covariance \(\Sigma _{XX}\) and mean \(\mu _{X}\) of the probability density \(\pi (x)\) we substitute \(z=(y - A x)\). Then, the mean and covariance of Z read \(\mu _{Z} = y - A \mu _{X}\) and \(\Sigma _{ZZ} = A \Sigma _{XX} A^\intercal \), respectively. Therefore,
A.3 Numerical estimation of a bound on the Hellinger distance
For two probability densities \(\pi =\frac{\rho }{Z}\) and \(\widetilde{\pi }=\frac{\tilde{\rho }}{\widetilde{Z}}\), where \(Z\) and \(\tilde{Z}\) are the normalization constants, we can write
Furthermore, we have
so that we obtain
In addition,
Because \(|\tilde{Z}-Z|= |\sqrt{\tilde{Z}}-\sqrt{Z}||\sqrt{\tilde{Z}}+\sqrt{Z}|\) we get
So, in the end we have
Note that due to symmetry in the Hellinger distance, we also have
Therefore,
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Flock, R., Dong, Y., Uribe, F. et al. Certified coordinate selection for high-dimensional Bayesian inversion with Laplace prior. Stat Comput 34, 134 (2024). https://doi.org/10.1007/s11222-024-10445-1