1 Introduction

In Bayesian inverse problems, one is interested in approximating the posterior distribution of a set of unobservable quantities, x, conditioned on indirect measurements, y (Stuart 2010). Often the posterior distribution depends on a scalar hyper-parameter, \(\theta \in \Theta \subseteq {\mathbb {R}}\), e.g. the noise variance. One can either perform hyper-parameter selection with an Empirical Bayes (EB) approach, targeting the conditional posterior \(p^{\theta ^\star }(x\mid y)\) with the hyper-parameter set to the value that maximizes the marginal likelihood, \(\theta ^\star :={\hbox {arg max}}_{\theta \in \Theta }\{p^{\theta }(y)\}\), sometimes termed type-II maximum likelihood (Good 1965); or else marginalize out the hyper-parameter through a Fully Bayesian (FB) approach, targeting the posterior \(p(x\mid y)\). However, both approaches often result in costly procedures.

One relatively common tool for approximating posterior distributions arising in Bayesian inverse problems is the Sequential Monte Carlo (SMC) sampler (Del Moral et al. 2006). SMC samplers construct an artificial sequence of distributions such that the first can be readily sampled from and the last coincides with the distribution of interest; a set of particles is drawn from the first density and evolves gradually to approximate each distribution in the sequence.

Compared to standard alternatives, SMC samplers provide distinct advantages that make them a suitable choice when analytical solutions are not available. In contrast to Variational Bayesian inference (Fox and Roberts 2012), they need not approximate the posterior distribution with a member of a specific class of functions, often chosen for computational convenience, thus avoiding the problem of misspecification. In comparison to MCMC (Gilks et al. 1995), they feature simultaneous exploration of multiple areas of the state space through the multiple particles, thus improving the chances of fully exploring multi-modal posterior distributions and have often been found to give better performance at given computational cost—see, for example, (Del Moral et al. 2006, Section 4.2) and references therein.

In most implementations of SMC samplers for Bayesian inverse problems, the samples obtained at intermediate iterations are discarded, because intermediate iterations serve only to facilitate the approximation of the target distribution. Not directly using these samples, except perhaps to estimate a normalizing constant, seemingly results in a substantial waste of computational resources. Indeed, we have recently witnessed a growing number of studies that attempt to exploit/recycle particles from previous iterations in the final estimates (Gramacy et al. 2010; Drovandi et al. 2019; Le Thu et al. 2016; Dau and Chopin 2022). Gramacy et al. (2010) propose to recycle particles at different iterations by considering a weighted sum of all the approximated distributions in order to maximize the Effective Sample Size (ESS). Alternatively, Le Thu et al. (2016) propose to combine particles from past SMC sampler iterations using the so-called Deterministic Mixture Weight estimator, a solution derived to combine weighted particles drawn from different proposal distributions. Recently, Drovandi et al. (2019) developed a method which allows the samples from each generation of the algorithm to be used to approximate integrals over a part of the state space.

In this work we show that, for a large class of hierarchical Bayesian inverse problems featuring scalar hyper-parameters, the intermediate iterations of properly designed SMC samplers can be used to perform selection of the hyper-parameter and/or averaging with respect to it, making EB/FB approaches feasible. All of this has only a negligible additional computational cost and, in the case of averaging, it also entails recycling of the particles at intermediate iterations, thus reducing the typical waste of computational resources.

The key idea underlying the proposed method is to define the auxiliary sequence of distributions in such a way that each distribution is a posterior distribution conditioned on a different value of the hyper-parameter. Such construction turns out to be extremely simple under certain conditions, for instance when the hyper-parameter appears only in the likelihood and the likelihood belongs to the natural exponential family; under other circumstances, finding the right sequence can be more challenging. Given the sequence, the estimate of the normalizing constant, naturally produced by SMC samplers, corresponds to an estimate of the evidence for the specific value of the hyper-parameter, which then allows maximum likelihood or Bayesian inference on the hyper-parameter.

We provide the right tempering sequence for two different models largely used in inverse problems:

  • when the likelihood belongs to the Natural Exponential Family (NEF): here the tempering sequence obtained by raising the likelihood to a growing power between zero and one results in a proper sequence of densities that can be interpreted as posterior distributions;

  • when the conditional posterior for a subset of variables \(x_1\) can be analytically computed, and an SMC sampler is used only to approximate the posterior on the remaining variables \(x_2\). For this class of models, which includes among others Conditionally Linear Gaussian (CLG) models, the auxiliary distribution sequence devised for the first case does not, in general, have the desired properties, therefore we devise alternative sequences that can be used fruitfully in two special sub-cases.

The most straightforward application of the proposed method is in the context of inverse problems with additive Gaussian noise; here the interest is either in the estimation of the joint posterior distribution of the state variables and the noise variance, or in the posterior distribution of the state variables conditioned on the estimated value of the noise variance.

As a first example, we consider the problem of recovering the mean of a Gaussian distribution from noisy observations, showing that the proposed approach performs as well as alternative approaches but with significant advantages in computational time. We then show numerical results for a real-world problem encountered in source analysis of Magneto/Electro-Encephalography data; in this case the proposed approach provides reliable results and a substantial reduction of computational cost with respect to alternative approaches.

2 Motivating example: source estimation in magneto/ electro-encephalography

Magneto-/Electro-EncephaloGraphy (M/E-EG) are two non-invasive medical imaging techniques that record the magnetic/electric field on the scalp; from these recordings, it is possible to estimate the underlying neural currents (Hämäläinen et al. 1993). Using the dipolar assumption, this problem consists of estimating an unknown number of point sources, called dipoles, each one defined by two quantities:

  • a location in the brain volume, conveniently represented as the index r of a cell of a discretized brain (or voxel); dipole location is assumed to be fixed in time;

  • a 3-D vector q representing orientation and intensity of the neural current at the specified voxel, and changing dynamically in time.

The inference problem can be formalized as

$$\begin{aligned} y (t)= & {} \sum _{i=1}^d G(r_i)q_i(t)+\varepsilon (t) \end{aligned}$$
(1a)
$$\begin{aligned} \varepsilon (t)\sim & {} {\mathcal {N}}\left( 0, \theta ^2 \Sigma \right) \end{aligned}$$
(1b)

where: \(t = 1,\dots , T\) is a time index; y(t) is an array containing the data recorded by all M/E-EG sensors at time t; d is the (unknown) number of dipoles; \(G(r_i)\) is the so-called lead-field matrix, representing the magnetic/electric field generated by a unitary dipole located at \(r_i\); \(\varepsilon (t)\) is additive Gaussian noise whose (spatial) covariance matrix \(\Sigma \) is known up to a scale factor \(\theta \). We observe that the lead-field matrix entries are non-linear functions of the source locations \(r_i\); therefore, linear inversion models are not directly applicable to this inverse problem.

This model was originally adopted in Sorrentino et al. (2013, 2014), where all unknown parameters were sampled with an SMC sampler, leading to high computational cost for long time series; in Sommariva and Sorrentino (2014) a Rao-Blackwellized version was presented that imposed a Gaussian prior on the q variables and exploited the CLG structure, allowing long time series to be treated at reduced computational cost. Finally, in Viani et al. (2021) a hierarchical model was presented that overcomes the limitations of the Gaussian prior by using a hyper-prior on the prior variance, thus substantially reducing the dependence on this hyper-parameter. Defining \({\textbf{y}}:=(y(1), \dots , y(T))\) and \({\textbf{q}}_{1:d}:=(q_{1:d}(1), \dots , q_{1:d}(T))\), the posterior distribution decomposes as:

$$\begin{aligned} \begin{aligned}&p^{\theta }(d, r_{1:d}, {\textbf{q}}_{1:d},\lambda \mid {\textbf{y}}) \\ {}&\quad = p^{\theta }({\textbf{q}}_{1:d} \mid {\textbf{y}}, d, r_{1:d}, \lambda )p^{\theta }(d, r_{1:d},\lambda \mid {\textbf{y}}) \end{aligned} \end{aligned}$$
(2)

where the conditional posterior \(p^{\theta }({\textbf{q}}_{1:d} \mid {\textbf{y}}, d, r_{1:d}, \lambda )\) can be computed analytically, and only the second factor on the right hand side of (2) has to be approximated via Monte Carlo. Importantly, there remains a dependence on the hyper-parameter \(\theta \), namely the overall noise level, whose value has to be estimated.

3 SMC samplers for Bayesian inverse problems

In this section we provide a brief summary of a class of SMC samplers that are often used for the approximation of posterior distributions in Bayesian inference problems. Notice that SMC samplers can be applied in more general situations, not analyzed in this paper; for further details on general SMC sampler algorithms the reader is referred to Del Moral et al. (2006, 2007).

Consider a Bayesian inference problem where the aim is to approximate the posterior distribution

$$\begin{aligned} p(x \mid y) = \frac{p(x) p(y \mid x)}{p(y)} \end{aligned}$$
(3)

where y represents the data and x the unknown parameters. The posterior distribution is often a complex distribution in a possibly high-dimensional space and is typically difficult to sample from directly.

SMC samplers provide an effective way to sample such complex distributions, and can be briefly summarized as follows.

The first step is to define a sequence of intermediate densities:

$$\begin{aligned}{} & {} \bigl \{p_t(x \mid y)\bigl \}_{t=0}^T, \end{aligned}$$
(4a)
$$\begin{aligned}{} & {} p_T(x \mid y) = p(x \mid y), \end{aligned}$$
(4b)
$$\begin{aligned}{} & {} p_t(x \mid y) \simeq p_{t+1}(x \mid y), \end{aligned}$$
(4c)

that “smoothly” transition from an easy-to-sample initial density \(p_0\) to the posterior density \(p_T\). Condition (4c) is required in order to guarantee a smooth transition toward the target density and hence to allow a good approximation of \(p_{t+1}\) to be obtained from the corresponding approximation of \(p_t\).

A natural, but not mandatory, choice in Bayesian inference is to reach the posterior density by starting from the prior and increasing the power of the likelihood using the so called geometric bridge, or tempering path (Syed et al. 2021; Chopin and Papaspiliopoulos 2020; Bernton et al. 2019; Neal 2001):

$$\begin{aligned}{} & {} p_t(x \mid y) \propto p(x)p(y \mid x)^{\alpha _t}, \end{aligned}$$
(5a)
$$\begin{aligned}{} & {} 0 = \alpha _0< \alpha _1< \dots < \alpha _T = 1. \end{aligned}$$
(5b)

Once the sequence of distributions has been selected, SMC samplers work as follows (for convenience we assume that the prior can be sampled from directly although this is not necessary):

  • sample a set of N weighted particles \(\{{\textbf{x}}^{(0)}; {\textbf{W}}^{(0)}\}\) from the initial distribution \(p_0\) with \({\textbf{W}}^{(0)} \equiv 1/N\).

  • for \(t=1,\ldots ,T\):

    1. 1.

perform one or more Markov Chain Monte Carlo (MCMC) steps, such as Metropolis-Hastings steps

    2. 2.

      perform an Importance Sampling (IS) step from the current distribution \(p_{t-1}\) to the next distribution \(p_t\) updating the un-normalized importance weights and normalizing them using the relations

      $$\begin{aligned} {\textbf{w}}^{(t)}_i&= {\textbf{w}}^{(t-1)}_i \frac{\gamma _{t}({\textbf{x}}^{(t-1)}_i \mid y)}{\gamma _{t-1}({\textbf{x}}^{(t-1)}_i\mid y)}, \end{aligned}$$
      (6a)
      $$\begin{aligned} {\textbf{W}}^{(t)}_i&:= \frac{{\textbf{w}}^{(t)}_i}{\sum _{n=1}^N {\textbf{w}}^{(t)}_{n}}, \end{aligned}$$
      (6b)

      for \(i=1,\ldots ,N\); where \(\gamma _t\) represents the un-normalized t-th distribution of the sequence (4a); in the case of (5a), for example, \(\gamma _t(x\mid y) = p(x) p(y\mid x)^{\alpha _t}\).

      At this point one obtains an approximation of the t-th distribution of the sequence as:

      $$\begin{aligned} {{\hat{p}}}_t(x\mid y) = \sum _{n=1}^N {\textbf{W}}_n^{(t)} \delta _{x_n^{(t)}}(x). \end{aligned}$$
      (7)

In this step one also obtains an estimator of the normalizing constant of the distribution \(p_t\), which is crucial for model selection in general and for the proposed method in particular. Assuming for simplicity that resampling occurs at every step, it can be evaluated as the product over time of the averages of the un-normalized importance weights:

      $$\begin{aligned} {\hat{p}}_t(y) = \prod _{s=1}^t\frac{1}{N}\sum _{n=1}^{N}{\textbf{w}}^{(s)}_{n}. \end{aligned}$$
      (8)

The corresponding expression when resampling is conducted adaptively is the product, over resampling times, of the average of the weights accumulated since the last resampling time (see, e.g., (Guarniero et al. 2017, p. 1641) for an explicit expression).

    3. 3.

perform a resampling step to avoid degeneracy of the importance weights (Douc et al. 2005; Gerber et al. 2019). A widely used strategy is to perform resampling whenever the Effective Sample Size (ESS) (see, e.g., Liu 2008) falls below a fixed threshold. If resampling is performed, replace \(\{{\textbf{x}}^{(t)}\}\) with the collection of resampled particles and set \({\textbf{w}}_i^{(t)}=1\) and \({\textbf{W}}_i^{(t)}=1/N\) for each i.
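The loop above can be sketched for a scalar unknown as follows; this is a minimal illustrative implementation, not the authors' code (all names are ours), using a fixed exponent schedule, one random-walk Metropolis mutation per iteration, and resampling at every step, so that the evidence estimator takes the simple product form (8):

```python
import numpy as np

def smc_sampler(log_prior_pdf, log_lik, sample_prior, alphas, N=1000, step=0.5, rng=None):
    """Tempered SMC sampler targeting gamma_t(x) = p(x) * p(y|x)^alpha_t.

    Returns the final particle set and the evidence estimate of Eq. (8)."""
    rng = np.random.default_rng(rng)
    x = sample_prior(N, rng)                       # particles from p_0 = prior
    log_evidence = 0.0
    for a_prev, a in zip(alphas[:-1], alphas[1:]):
        # step 1: one random-walk Metropolis mutation targeting p_{t-1}
        prop = x + step * rng.standard_normal(x.shape)
        log_acc = (log_prior_pdf(prop) + a_prev * log_lik(prop)
                   - log_prior_pdf(x) - a_prev * log_lik(x))
        accept = np.log(rng.random(N)) < log_acc
        x = np.where(accept, prop, x)
        # step 2: importance-sampling update, Eq. (6a) (weights were reset
        # by the previous resampling, so only the increment remains)
        logw = (a - a_prev) * log_lik(x)
        m = logw.max()
        log_evidence += m + np.log(np.mean(np.exp(logw - m)))   # Eq. (8)
        W = np.exp(logw - m); W /= W.sum()                      # Eq. (6b)
        # step 3: multinomial resampling at every iteration
        x = x[rng.choice(N, size=N, p=W)]
    return x, np.exp(log_evidence)
```

In a conjugate check (standard Gaussian prior, Gaussian likelihood for a single observation) the returned evidence can be compared with the closed-form marginal likelihood.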

One important property of SMC samplers comes from Eqs. (6a)–(6b), which allow the evaluation of the importance weights at time t using only the particles at the previous step. This allows the sequence of steps described in the algorithm to be carried out in a different order, with steps 2 and 3 being carried out pre-emptively. By conducting resampling in this way before the mutation step, a more diverse collection of particle values is obtained at no additional cost. This modification further allows an adaptive choice of the actual sequence of densities, as defined in (4a), through an online selection of the next exponent (Del Moral et al. 2012; Sorrentino et al. 2014).
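The online selection of the next exponent can be sketched as a bisection on the ESS of the incremental weights, in the spirit of Del Moral et al. (2012); the function below is an illustrative sketch under this standard scheme, not the authors' implementation:

```python
import numpy as np

def next_exponent(log_lik_vals, alpha_prev, target_frac=0.9, tol=1e-6):
    """Pick alpha_{t+1} >= alpha_prev so that the ESS of the incremental
    weights w_i = exp((alpha - alpha_prev) * log_lik_vals[i]) is close to
    target_frac * N, via bisection (cf. Del Moral et al. 2012)."""
    N = len(log_lik_vals)

    def ess(alpha):
        lw = (alpha - alpha_prev) * log_lik_vals
        lw -= lw.max()
        w = np.exp(lw)
        return w.sum() ** 2 / np.sum(w ** 2)

    if ess(1.0) >= target_frac * N:   # can jump straight to the posterior
        return 1.0
    lo, hi = alpha_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid) >= target_frac * N:
            lo = mid                  # ESS still above target: move further
        else:
            hi = mid
    return lo
```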

4 Selection/averaging of the hyper-parameter

Let \(\Theta \subseteq {\mathbb {R}}\) and consider a Bayesian inverse problem depending on a hyper-parameter \(\theta \in \Theta \). We are now going to show how an SMC sampler can be used both to select a specific value for the hyper-parameter and/or to approximate the joint posterior distribution \(p(x,\theta \mid y)\) at no additional cost with respect to the SMC sampler that approximates the conditional posterior \(p^{\theta }(x\mid y)\).

The key idea underlying the proposed method is to construct an SMC sampler whose target distribution is \(p^{\theta ^\star }(x\mid y)\) for some value \(\theta ^\star \in \Theta \), and whose intermediate distributions are posterior distributions corresponding to different values of the hyper-parameter, for a set of values \(\Theta _{0:T}:=\{\theta \in \Theta \;:\; \theta = \theta (t);\; t=0, \dots , T\}\):

$$\begin{aligned} p_t^{\theta ^*}(x\mid y) = p^{\theta (t)}(x\mid y) = \frac{p^{\theta (t)}(y\mid x)p^{\theta (t)}(x)}{p^{\theta (t)}(y)}. \end{aligned}$$
(9)

Given the sequence above, one can estimate pointwise the evidence for the hyper-parameter, \(p^{\theta }(y)\) for \(\theta \in \Theta _{0:T}\), through the Importance Sampling step within the SMC sampler, via (8). Under regularity assumptions on \(p^{\theta }(y)\) as a function of \(\theta \), one can interpolate this finite set of values to obtain a smooth approximation of the evidence and, given a hyper-prior \(p(\theta )\), which we assume to be negligible outside a compact set \([\theta _{\text {min}}, \theta _{\text {max}}]\), an approximation of the marginal posterior \({\hat{p}}(\theta \mid y)\).

For an EB approach, one can first find the mode of the interpolating function properly weighted

$$\begin{aligned} {\bar{\theta }}={\hbox {arg max}}_{\theta \in [\theta _{\text {min}}, \theta _{\text {max}}]}\{{\hat{p}}(\theta \mid y)\}, \end{aligned}$$
(10)

where we assume that the range of \(\Theta _{0:T}\) contains \(\theta ^*,\theta _{\text {min}}\) and \(\theta _{\text {max}}\). This can be done numerically by binary search, using importance sampling to estimate the marginal likelihood of values of \(\theta \) between those in \(\Theta _{0:T}\). We can then apply importance sampling to obtain an approximation of \(p^{{\bar{\theta }}}(x\mid y)\).
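A minimal sketch of the EB step (10), assuming the per-iteration log-evidence estimates have been stored: interpolate the log-evidence over the grid of visited hyper-parameter values and maximize the log hyper-posterior on a dense grid (function and variable names are ours, not from the paper):

```python
import numpy as np

def eb_select(thetas, log_evidences, log_prior, n_dense=2000):
    """Empirical-Bayes selection, Eq. (10): piecewise-linear interpolation
    of the log-evidence over the visited grid, then argmax of
    log p(y|theta) + log p(theta) on a dense grid."""
    order = np.argsort(thetas)
    th = np.asarray(thetas, dtype=float)[order]
    le = np.asarray(log_evidences, dtype=float)[order]
    dense = np.linspace(th[0], th[-1], n_dense)
    log_post = np.interp(dense, th, le) + log_prior(dense)
    return dense[np.argmax(log_post)]
```

A finer local search (e.g. the binary search mentioned above, with extra importance-sampling evaluations of the marginal likelihood) can then refine the maximizer between grid points.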

In order to avoid degeneration of the importance weights, one should do importance sampling from \(p^{\theta ({\bar{t}})}(x\mid y)\), where \(\theta ({{\bar{t}}})\) is the closest value to \({{\bar{\theta }}}\) such that the support and tails of \(p^{\theta ({{\bar{t}}})}(x\mid y)\) are larger and heavier, respectively, than those of \(p^{{{\bar{\theta }}}} (x\mid y)\); for instance, assuming that \(\{\theta (t)\}_{t=0,\dots ,T}\) is a decreasing sequence, and that the distributions' tails become lighter as \(\theta \) becomes smaller, we shall select the iteration

$$\begin{aligned} {{\bar{t}}} = \max \{t :\; \theta (t) > {{\bar{\theta }}}\}. \end{aligned}$$
(11)

For a FB approach one obtains an approximation of the posterior

$$\begin{aligned} p(\theta \mid y) \propto p^{\theta }(y) p(\theta ) \end{aligned}$$
(12)

for \(\theta \in \Theta _{0:T}\), making it possible to compute estimates such as the posterior mean or mode of the hyper-parameter.

In addition, it is possible to approximate the marginal posterior of the parameters

$$\begin{aligned} \begin{aligned} p(x\mid y)&= \int p(x,\theta \mid y)d\theta \\ {}&= \int p^{\theta }(x\mid y)p(\theta \mid y)d\theta \\ {}&\propto \int p^{\theta }(x\mid y)p^{\theta }(y)p(\theta )d\theta . \end{aligned} \end{aligned}$$
(13)

This estimate takes into account the uncertainty on the parameters deriving from uncertainty on the hyper-parameter, and can be computed by considering all particles at all iterations and re-weighting them:

$$\begin{aligned} \begin{aligned} {{\hat{p}}}(x\mid y)&= \sum _{t=0}^T\sum _{n=1}^N \bigg \{ {\textbf{W}}^{(t)}_n \delta _{x}(x^{(t)}_n) {{\hat{p}}}^{\theta (t)}(y) \\&\quad \times p(\theta (t)) g^{(t)}(\Theta _{0:T})\bigg \} \end{aligned} \end{aligned}$$
(14)

where \(g^{(t)}\) is a function representing the interpolation weights.

For example, in the case of a standard quadrature method such as the trapezoidal rule we get

$$\begin{aligned} g^{(t)}(\Theta _{0:T}) = \left\{ \begin{array}{ll} \Vert \theta _1 - \theta _0\Vert /2 &{} t = 0\\ \Vert \theta _{t+1} - \theta _{t-1}\Vert /2 &{} 0< t< T \\ \Vert \theta _T - \theta _{T-1}\Vert /2 &{} t = T\\ \end{array} \right. \end{aligned}$$
(15)

but of course more sophisticated options are available (Zhou et al. 2016).
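Putting (14)–(15) together, the recycled estimate of, say, a posterior mean can be sketched as follows (illustrative names; `particles[t]` and `weights[t]` denote the particle set and normalized weights of generation t):

```python
import numpy as np

def recycle_posterior_mean(particles, weights, thetas, log_evidences, log_prior):
    """Fully-Bayesian recycling, Eqs. (14)-(15): combine particles from all
    generations t = 0..T, each generation weighted by
    evidence * hyper-prior * trapezoidal width, to estimate E[x | y]."""
    thetas = np.asarray(thetas, dtype=float)
    T = len(thetas) - 1
    # trapezoidal interpolation weights g^{(t)}, Eq. (15)
    g = np.empty(T + 1)
    g[0] = abs(thetas[1] - thetas[0]) / 2
    g[-1] = abs(thetas[-1] - thetas[-2]) / 2
    g[1:-1] = np.abs(thetas[2:] - thetas[:-2]) / 2
    # per-generation mass: \hat p^{theta(t)}(y) * p(theta(t)) * g^{(t)}
    log_mass = np.asarray(log_evidences) + log_prior(thetas) + np.log(g)
    mass = np.exp(log_mass - log_mass.max())
    mass /= mass.sum()
    # combine the per-generation weighted means, Eq. (14)
    means = np.array([np.sum(W * x) for x, W in zip(particles, weights)])
    return np.sum(mass * means)
```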

The additional computational cost required for calculating (12)–(14) is negligible compared to that of the approximation of \(p^{\theta ^\star }(x\mid y)\) directly with an SMC sampler employing likelihood tempering.

Moreover, the proposed FB approach has the advantage of making use of particles at all iterations, thus avoiding the usual waste of computational resources.

As a last point, we remark that, in the FB case, it is possible to modify the hyper-prior without re-running the SMC sampler, provided that its support does not increase, by simply re-weighting the contributions of each generation of samples: this allows prior sensitivity analysis, an important aspect to consider in applied Bayesian analyses, at a very small computational cost.
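As a sketch of this re-weighting (illustrative names), the hyper-posterior (12) on the visited grid can be re-evaluated under any new hyper-prior from the stored evidence estimates alone, with no new SMC run:

```python
import numpy as np

def hyper_posterior(thetas, log_evidences, log_prior):
    """Normalized hyper-posterior p(theta | y) on the visited grid,
    Eq. (12); swapping log_prior for a new hyper-prior only re-weights
    the stored evidences, enabling cheap prior sensitivity analysis."""
    lp = np.asarray(log_evidences) + log_prior(np.asarray(thetas, dtype=float))
    w = np.exp(lp - lp.max())
    return w / w.sum()
```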

The construction of sequence (9) is not always straightforward. In the following, we consider an inverse problem whose likelihood belongs to the NEF and the prior does not depend on the hyper-parameter, deriving sequence (9) for two distinct cases:

  1. 1.

    the case where SMC samplers are used to approximate the full posterior distribution;

  2. 2.

    the case where the conditional posterior for a subset of variables \(x_1\) can be analytically computed, and a Rao-Blackwellized SMC sampler is used to approximate the posterior on the remaining variables \(x_2\).

4.1 Case 1: vanilla SMC samplers for the full posterior distribution

As the likelihood belongs to the Natural Exponential Family (NEF) with natural scalar hyper-parameter \(\theta \in \Theta \subseteq {\mathbb {R}}\), it has the following density

$$\begin{aligned} p^{\theta }(y\mid x) = \exp (\theta T(y\mid x)-A_{\theta }) \end{aligned}$$
(16)

where \(T(y\mid x)\) is a sufficient statistic and \(A_{\theta }\) represents the log-normalizing constant.

Proposition 1

Let \(p^{\theta }\) belong to the NEF with sufficient statistic T and canonical parameter \(\theta \), s.t. \(p^{\theta }(x) = \exp (\theta T(x)-A_{\theta })\), and let \(\alpha \ne 0\); then:

$$\begin{aligned}{}[p^{\theta }(x)]^{\alpha } = \exp (A_{\alpha \theta }-\alpha A_{\theta }) p^{\alpha \theta }(x) \end{aligned}$$

By the previous proposition, whose simple proof is provided in Appendix A, it is straightforward to show that the sequence (5a) naturally provides an evaluation of the joint posterior distribution \(p(x,\theta \mid y)\) for the set of values \(\Theta _{1:T}=\{\theta ^\star \alpha _t;\; t=1, \dots , T\}\).

As an example, in the case of an inverse problem with additive Gaussian noise of unknown variance, the distributions of the sequence are posterior distributions corresponding to a decreasing variance, \(\sigma (t) = \sigma (T)/\sqrt{\alpha _t}\), where \(\sigma (T)\) represents the noise standard deviation at the very last iteration of the SMC sampler.
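This Gaussian case can be checked numerically: raising a Gaussian log-likelihood to the power \(\alpha \) differs from the log-density of a Gaussian with standard deviation \(\sigma /\sqrt{\alpha }\) only by a constant in y, so the tempered likelihood is again a Gaussian likelihood with a shrunken noise scale:

```python
import numpy as np

def log_gauss(y, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# alpha * log N(y; mu, sigma^2) and log N(y; mu, sigma^2 / alpha) differ
# only by an additive constant (independent of y): the difference below
# should therefore be flat across all evaluation points.
y = np.linspace(-3.0, 3.0, 7)
alpha, mu, sigma = 0.3, 0.5, 1.7
diff = alpha * log_gauss(y, mu, sigma) - log_gauss(y, mu, sigma / np.sqrt(alpha))
```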

The Gaussian distribution with unknown standard deviation is one of the most common likelihood distributions in the models for which we envisage this method being most useful. Not every member of the NEF exhibits decreasing dispersion as \(\alpha \) increases, as the Gaussian does. This underlies the difficulties with so-called fractionated priors in the Consensus Monte Carlo algorithm (Scott et al. 2016), as discussed in (Rendell et al. 2021, Section 2.4). However, there are many distributions in the NEF for which this diminishing dispersion property holds and which are therefore suitable for the tempering sequence (5a), for example:

  • Exponential distribution with natural parameter \(\theta =\lambda \), with density

    $$\begin{aligned} p(x) = \lambda \exp (-\lambda x) \end{aligned}$$

    for \(x \ge 0\);

  • Laplace distribution with natural parameter \(\theta =\lambda \), with density:

    $$\begin{aligned} p(x) = \frac{\lambda }{2} \exp \left( -\lambda \vert x-\mu \vert \right) , \end{aligned}$$

    for known location \(\mu \);

  • Lognormal distribution with natural parameter \(\theta =1/(2\sigma ^2)\) and density

    $$\begin{aligned} p(x) = \frac{1}{x \sigma \sqrt{2 \pi }} \exp \left( - \frac{(\log (x))^2}{2\sigma ^2} \right) .\end{aligned}$$
  • Pareto distribution with known minimum value, \(x_m\), and shape parameter \(\lambda \), for which the natural parameter is \(-(\lambda +1)\) and the density is

    $$\begin{aligned}p(x) = \frac{ \lambda x_m^\lambda }{x^{\lambda +1}},\end{aligned}$$

    for \(x \ge x_m\).
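Proposition 1 can be checked numerically, e.g. in the exponential case: the powered density is proportional to an exponential density with rate \(\alpha \lambda \), with a constant of proportionality independent of x:

```python
import numpy as np

# Powering an exponential density: [lam * exp(-lam * x)]^alpha equals
# lam^alpha / (alpha * lam) times the exponential density with rate
# alpha * lam, i.e. Proposition 1 with constant exp(A_{alpha theta} - alpha A_theta).
lam, alpha = 2.0, 0.6
x = np.linspace(0.1, 5.0, 9)
powered = (lam * np.exp(-lam * x)) ** alpha           # [p^theta(x)]^alpha
target = (alpha * lam) * np.exp(-alpha * lam * x)     # p^{alpha theta}(x)
ratio = powered / target                              # constant in x
```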

4.2 Case 2: Rao-Blackwellized SMC samplers

We now consider the case where the unknown variable x can be decomposed into a pair of components \(x = (x_1,x_2)\), and:

  • the prior on \(x_1\) belongs to the NEF with respect to a hyper-parameter \(\lambda \)

    $$\begin{aligned} p(x_1\mid \lambda )=\exp (\lambda S(x_1)-A_{\lambda }) \end{aligned}$$
    (17)

    where \(S(x_1)\) is a sufficient statistic and \(A_{\lambda }\) is the log-normalization constant;

  • the conditional posterior \(p^{\theta }(x_1\mid x_2,\lambda ,y)\) can be computed analytically.

Under these assumptions, in the natural decomposition of the joint posterior density

$$\begin{aligned} p^{\theta }(x_1,x_2,\lambda \mid y) = p^{\theta }(x_1\mid x_2,\lambda ,y)p^{\theta }(x_2, \lambda \mid y), \end{aligned}$$
(18)

only the second factor of the right hand side needs to be approximated by an SMC sampler, thus reducing the variance of the importance weights and improving the quality of the approximation. This class of models is widely used and appreciated in applications; in particular, an SMC sampler targeting the marginal posterior \(p^{\theta }(x_2, \lambda \mid y)\) typically leads to more accurate estimates than an SMC sampler targeting the full posterior and using the same computational resources (Murphy and Russell 2001).

As a consequence of the hypothesis that both the likelihood (16) and the prior on the Rao-Blackwellized variable (17) belong to the NEF, the marginal likelihood turns out to be

$$\begin{aligned} \begin{aligned} p^{\theta }(y\mid x_2,\lambda )&= \int p^{\theta }(y\mid x_1,x_2,\lambda )p^{\lambda }(x_1) dx_1 \\&= \int \exp \biggl ( \biggl \langle \begin{bmatrix} \theta \\ \lambda \end{bmatrix}, \begin{bmatrix} T(y\mid x_1,x_2)\\ S(x_1) \end{bmatrix} \biggl \rangle \\ {}&\quad - (A_{\theta }+A_{\lambda })\biggl ) dx_1. \end{aligned} \end{aligned}$$
(19)

In most cases, the marginal likelihood in equation (19) does not have a closed form solution; below we show two special cases in which it does.

4.2.1 Additive statistic for the likelihood

If the statistic \(T(y\mid x_1,x_2)\) of the full likelihood (16) is the sum of two statistics \(T(y\mid x_1)\) and \(T(y\mid x_2)\), then the marginal likelihood also belongs to the NEF with respect to the same parameter

$$\begin{aligned} \begin{aligned} p^{\theta }(y \mid x_2,\lambda ) =&\int \exp \biggl (\theta \bigl (T(y\mid x_1)+ T(y\mid x_2)\bigl )\\ {}&+\lambda S(x_1)-(A^{(1)}_{\theta }+A^{(2)}_{\theta }+A_{\lambda })\biggl ) dx_1 \\ {}&\propto \exp \left( \theta T(y\mid x_2) - A^{(2)}_{\theta }\right) . \end{aligned} \end{aligned}$$
(20)

For this particular subclass of models, the natural sequence (5a) is still valid, as the marginal likelihood is still in the NEF.

From the practical viewpoint, this case corresponds to a Bayesian inverse problem with multiple unknown parameters, where both the prior and the posterior for two distinct subsets of parameters factorize. One can perform inference analytically for one collection of parameters and resort to Monte Carlo only for the remainder.

4.2.2 Conditionally linear gaussian model

If both the full likelihood and the prior on \(x_1\) are Gaussian,

$$\begin{aligned} p^{\theta }(y\mid x_1,x_2) \sim&{\mathcal {N}}(\mu (x_2)x_1, \theta ^2\Sigma ) \end{aligned}$$
(21a)
$$\begin{aligned} p(x_1\mid \lambda ) \sim&{\mathcal {N}}(\eta , \Gamma _{\lambda }). \end{aligned}$$
(21b)

then it is well known that both the marginal likelihood and the conditional posterior (Sommariva and Sorrentino 2014) are Gaussian with known mean and covariance

$$\begin{aligned}{} & {} p^{\theta }(y\mid x_2,\lambda ) \nonumber \\{} & {} \quad \sim {\mathcal {N}}\left( \mu (x_2)\eta ,\mu (x_2)^t \Gamma _{\lambda } \mu (x_2) + \theta ^2\Sigma \right) \end{aligned}$$
(22a)
$$\begin{aligned}{} & {} p^{\theta }(x_1 \mid x_2,\lambda ,y) \sim {\mathcal {N}}\bigl ({{\bar{\mu }}}, {{\bar{\Sigma }}}\bigl ); \end{aligned}$$
(22b)

where

$$\begin{aligned} {{\bar{\mu }}}&:= \eta + \Gamma _{\lambda } \mu (x_2)^t (\mu (x_2) \Gamma _{\lambda } \mu (x_2)^t + \theta ^2\Sigma )^{-1}(y - \mu (x_2)\eta );\\ {{\bar{\Sigma }}}&:= \Gamma _{\lambda } - \Gamma _{\lambda } \mu (x_2)^t (\mu (x_2) \Gamma _{\lambda } \mu (x_2)^t + \theta ^2\Sigma )^{-1} \mu (x_2) \Gamma _{\lambda }. \end{aligned}$$
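These are the standard gain-form Gaussian conditioning formulas. As a sanity check (with illustrative names, and a generic matrix M playing the role of \(\mu (x_2)\) for a fixed \(x_2\)), they can be verified against the equivalent information-form expressions obtained via the matrix inversion lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, ny, theta = 3, 4, 0.8
M = rng.standard_normal((ny, n1))                    # plays the role of mu(x_2)
A = rng.standard_normal((n1, n1))
Gamma = A @ A.T + n1 * np.eye(n1)                    # prior covariance Gamma_lambda
B = rng.standard_normal((ny, ny))
Sigma = B @ B.T + ny * np.eye(ny)                    # noise covariance shape
eta = rng.standard_normal(n1)                        # prior mean
y = rng.standard_normal(ny)                          # an arbitrary observation

# gain form: marginal covariance of y as in Eq. (22a), then conditioning
S = M @ Gamma @ M.T + theta**2 * Sigma
K = Gamma @ M.T @ np.linalg.inv(S)
mu_bar = eta + K @ (y - M @ eta)
Sigma_bar = Gamma - K @ M @ Gamma

# information form: an independent route via the matrix inversion lemma
R_inv = np.linalg.inv(theta**2 * Sigma)
Prec = np.linalg.inv(Gamma) + M.T @ R_inv @ M
Sigma_info = np.linalg.inv(Prec)
mu_info = Sigma_info @ (np.linalg.inv(Gamma) @ eta + M.T @ R_inv @ y)
```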

In this case, the marginal likelihood is not in the NEF with respect to the parameter \(\theta \) and the natural sequence (5a) does not work. Indeed, by applying to the CLG model the same sequence constructed in the general case, one would get

$$\begin{aligned}{} & {} p_t^{\theta }(x_2,\lambda \mid y) \propto p(x_2,\lambda ) p^{\theta }(y\mid x_2,\lambda )^{\alpha _t} \end{aligned}$$
(23a)
$$\begin{aligned}{} & {} p^{\theta }(y \mid x_2,\lambda )^{\alpha _t} \nonumber \\{} & {} \quad \propto {\mathcal {N}}\left( y;\; \eta \mu (x_2), \frac{1}{\alpha _t}\left( \mu (x_2)^t \Gamma _{\lambda } \mu (x_2) + \theta ^2\Sigma \right) \right) , \nonumber \\ \end{aligned}$$
(23b)

Since the marginal likelihood also embodies the prior on the marginalized variable \(x_1\), the exponent affects the prior for \(x_1\) as well; therefore, as already observed in Sommariva and Sorrentino (2014), the distributions of this sequence cannot be interpreted as (marginals of) posterior distributions under the same prior.

Alternatively, one could consider the sequence of marginals of the natural sequence for the approximation of the complete posterior density:

$$\begin{aligned} \begin{aligned} p_t^{\theta }(x_2,\lambda \mid y)&:= \int p_t^{\theta }(x_1,x_2,\lambda \mid y) dx_1 \end{aligned} \end{aligned}$$
(24)

However, this choice also leads to a sequence of distributions that cannot be interpreted as posterior distributions under different values of \(\theta \); this happens because, as shown in Appendix A (Corollary 1 and Proposition 2), the integral in (24) is

$$\begin{aligned} \begin{aligned} p_t^{\theta }(x_2, \lambda \mid y)&\propto \int p(x_1,x_2,\lambda ) p^{\theta }(y\mid x_1,x_2,\lambda )^{\alpha _t} dx_1 \\&= p(x_2,\lambda )\int p(x_1\mid x_2,\lambda ) p^{\theta }(y\mid x_1,x_2,\lambda )^{\alpha _t} dx_1 \\ {}&= p(x_2,\lambda ) \ell _t(\lambda ){\mathcal {N}}\left( y;\; \eta \mu (x_2), \mu (x_2)^t \Gamma _{\lambda } \mu (x_2) + \frac{\theta ^2}{\alpha _t} \Sigma \right) , \end{aligned} \end{aligned}$$
(25)

where the Gaussian distribution can be interpreted as the marginal likelihood of the CLG model, with a different value of \(\theta \), but the normalization constant \(\ell _t(\lambda )\), defined as in Corollary 1 in Appendix A, depends on the hyper-parameter \(\lambda \) and thus actually modifies the distribution.

However, it is not difficult to devise a proper sequence of intermediate distributions for the case of a CLG model. In fact, it is sufficient to explicitly remove the \(\lambda \)-dependent normalization factor from (25) and construct the sequence as:

$$\begin{aligned} \begin{aligned}&p_t^{\theta }(x_2,\lambda \mid y) \propto p(x_2,\lambda ) \\&\quad \times {\mathcal {N}}\left( y;\; \eta \mu (x_2), \mu (x_2)^t \Gamma _{\lambda } \mu (x_2) + \frac{\theta ^2}{\alpha _t}\Sigma \right) . \end{aligned} \end{aligned}$$
(26)

With this definition we can apply the proposed approach to a CLG model while also exploiting Rao-Blackwellization.

5 Toy example

We proceed with a numerical validation of the proposed approach, first using a toy example; following the arguments in Sect. 4, we compare the results with natural alternatives for Fully Bayesian (FB) and Empirical Bayes (EB) approaches. In the following, particularly in the figures, we denote by PropEB and PropFB the results obtained by the proposed method under the Empirical Bayes and Fully Bayesian approaches, respectively.

5.1 Setup

Consider an inverse problem where the aim is to reconstruct the mean of a Gaussian waveform of known variance \(\sigma ^2\), given noisy measurements y(t), i.e.

$$\begin{aligned} y(t)&= {\mathcal {N}}(t;\; \mu , \sigma ^2) + \varepsilon (t) \end{aligned}$$
(27a)
$$\begin{aligned} \varepsilon (t)&\sim {\mathcal {N}}(0, \theta ^2), \end{aligned}$$
(27b)

where \({\mathcal {N}}(t;\; \mu , \sigma ^2)\) is the probability density function of a Gaussian of mean \(\mu \) and standard deviation \(\sigma \), evaluated at t.

We assume observations are available at I evenly spaced points \(\{ t_i\}_{i=1}^I\) and we want to make inference on the Gaussian mean.

Fig. 1

The first row shows an example of noise-free data; the second row shows the same data with noise added

5.2 Data generation

Data \({\textbf{y}}=\left( y(t_1), \dots , y(t_I)\right) \) are generated considering \(I=100\) measurements in the interval \([-5,\;5]\) obtained by perturbing the Gaussian density at each observation time independently with additive Gaussian noise of zero mean and standard deviation \(\theta _{\text {true}} \sim {\mathcal {U}}[0.1,\; 0.2]\).

With these settings, we generate 100 independent realizations of the dataset in order to test the proposed algorithm.
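The data-generation recipe above can be sketched in a few lines; the grid of observation times is taken to be evenly spaced in \([-5, 5]\), and the values of \(\mu \) and \(\sigma \) below are illustrative choices, not values fixed by the paper:

```python
import numpy as np
from scipy.stats import norm

def generate_dataset(rng, I=100, mu_true=0.0, sigma=1.0):
    """One synthetic dataset following (27a)-(27b): the Gaussian density
    evaluated at I points in [-5, 5], perturbed by i.i.d. Gaussian noise
    whose standard deviation is itself drawn uniformly in [0.1, 0.2]."""
    t = np.linspace(-5.0, 5.0, I)
    theta_true = rng.uniform(0.1, 0.2)
    y = norm.pdf(t, loc=mu_true, scale=sigma) + rng.normal(0.0, theta_true, size=I)
    return t, y, theta_true

# 100 independent realizations, as in the experiments
datasets = [generate_dataset(np.random.default_rng(seed)) for seed in range(100)]
```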

5.3 Prior and likelihood

  • We assume \(p(\mu ) \sim {\mathcal {U}}\left( [-5,\;5]\right) \), a truncation of the Jeffreys prior to the convex hull of the measurements;

  • we assume \(p(\theta ) \sim \Gamma (2, 4\theta ^\star )\), where \(\theta ^\star \) is an estimated value for the hyper-parameter;

  • we assume conditional independence between observations given the parameter, obtaining a simple factorization for the likelihood

    $$\begin{aligned} p^{\theta }({\textbf{y}}\mid \mu ) = \prod _{i=1}^I p^{\theta }(y(t_i) \mid \mu ). \end{aligned}$$
    (28)

5.4 Algorithm settings

For each of the 100 generated datasets, we compare the results obtained with the proposed method with those obtained with a FB approach and an EB approach.

Each SMC sampler used has the following settings:

  • number of particles set to 100, as a compromise between computational cost and quality of the approximation;

  • \(\theta ^\star =\min \{\theta _{\text {true}}\}/2\); this allows the true value \(\theta _{\text {true}}\) to be within the range of values explored by the proposed method during SMC sampler iterations;

  • number of iterations set to 500, with the sequence of exponents from 0 to 1 evenly spaced on a logarithmic scale in order to guarantee a smooth transition between intermediate distributions;

  • resampling step performed by means of systematic resampling (Douc et al. 2005) whenever the effective sample size is lower than half of the number of particles;

  • Gaussian proposal kernel for the MCMC step.
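The settings above translate into a few lines of generic SMC machinery, sketched below in plain NumPy. Since 0 has no logarithm, the first tempering exponent is pinned to 0 by hand; the lower cutoff `alpha_min` is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def tempering_exponents(n_iter=500, alpha_min=1e-3):
    """Exponents from 0 to 1, (almost) evenly spaced on a log scale."""
    alphas = np.logspace(np.log10(alpha_min), 0.0, n_iter)
    alphas[0] = 0.0  # start exactly at the prior
    return alphas

def ess(weights):
    """Effective sample size of normalized importance weights."""
    return 1.0 / np.sum(weights ** 2)

def systematic_resample(rng, weights):
    """Systematic resampling (Douc et al. 2005): a single uniform draw,
    shifted to N evenly spaced positions through the cumulative weights."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(weights), positions)
```

In the sampler loop, resampling is triggered only when `ess(w) < N / 2`, matching the criterion listed above.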

5.5 Comparison with alternative approaches

We compare the performance of the proposed method with that of two alternatives: one performing an Empirical Bayes approach and the other performing a Fully Bayesian approach.

5.5.1 Empirical Bayes approach

For the EB approach we first obtain a maximum a posteriori estimate for the hyper-parameter:

$$\begin{aligned} {{\hat{\theta }}}_{\text {MAP}} = {\hbox {arg max}}_{\theta }\{{\hat{p}}(\theta \mid {\textbf{y}})\}; \end{aligned}$$
(29)

where \({\hat{p}}(\theta \mid {\textbf{y}}) \) is obtained by considering \(M=100\) evenly spaced samples in the interval \([-5,\;5]\) for the mean \(\mu \):

$$\begin{aligned}{\hat{p}}(\theta \mid {\textbf{y}})= \frac{1}{M}\sum _{i=1}^M p(\mu _i,\theta \mid {\textbf{y}})\end{aligned}$$

and then selecting the maximum value obtained over an evenly spaced grid of 500 points for \(\theta \in [\theta ^\star , \;50 \cdot \theta _{\text {true}}]\).

Once an estimate for the hyper-parameter is obtained, we consider an SMC sampler targeting the posterior distribution \(p^{{{\hat{\theta }}}_{\text {MAP}}}(\mu \mid {\textbf{y}})\).
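The grid construction above can be sketched as follows; `log_joint` is a hypothetical callable returning \(\log p(\mu ,\theta \mid {\textbf{y}})\) up to an additive constant:

```python
import numpy as np

def eb_map_theta(log_joint, theta_star, theta_true, M=100, n_theta=500):
    """Grid approximation of (29): average the joint posterior over M
    evenly spaced values of mu in [-5, 5], then take the maximizing
    theta over 500 evenly spaced points in [theta_star, 50 * theta_true]."""
    mus = np.linspace(-5.0, 5.0, M)
    thetas = np.linspace(theta_star, 50.0 * theta_true, n_theta)
    log_p = np.array([[log_joint(m, th) for m in mus] for th in thetas])
    # p_hat(theta | y) is proportional to (1/M) sum_i p(mu_i, theta | y);
    # subtract the max log-value before exponentiating for stability
    p_hat = np.exp(log_p - log_p.max()).sum(axis=1)
    return thetas[np.argmax(p_hat)]
```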

5.5.2 Fully Bayesian approach

For the FB approach we consider an SMC sampler targeting the posterior distribution \(p(\mu ,\theta \mid {\textbf{y}})\), i.e. the hyper-parameter is sampled by the SMC sampler like all other parameters; the posterior distribution for the hyper-parameter is then obtained by marginalizing the joint distribution. The SMC sampler sequence chosen for the implementation is the tempering sequence given by Eqs. (5a)–(5b).

Fig. 2

Estimation error for the hyper-parameter \(\theta \) (top row) and for the parameter \(\mu \) (second row). Fully Bayesian, Proposed Fully Bayesian, Empirical Bayes and Proposed Empirical Bayes are shown in red, blue, yellow and green, respectively

5.6 Results

We analyze performance in terms of selection of the parameter and hyper-parameter, considering the Posterior Mean (PM) and maximum a posteriori (MAP) estimators, and compute the estimation error as the Euclidean distance between the true and estimated values of the hyper-parameter and parameter. In Fig. 2 we report the corresponding boxplots.

We notice that the proposed approach achieves estimation errors similar to those of both the EB and the FB alternatives, while incurring a substantially lower computational cost (Fig. 3). Compared to the FB approach, the proposed method also features a larger ESS (Fig. 4).

Fig. 3

Computational time for the different methods. Times refer to a MacBook Pro (13-inch, M1, 2020) with 8 GB of memory

Fig. 4

Effective sample size for the four different approaches

Fig. 5

Illustrative example of the posterior for the hyper-parameter (first row), the marginal of the joint posterior for the parameter (second row) and the conditional posterior for the parameter (third row); the dotted green line indicates the true value

5.6.1 Sample result

For illustrative purposes, in this section we show results from one specific dataset taken from the 100 simulations used in the previous section.

In Fig. 5 we show the output obtained by the proposed method and by the two alternative approaches, specifically by showing:

  • the approximated posterior distribution for the hyper-parameter;

  • the approximated posterior distribution for the parameter obtained in a FB approach;

  • the approximated posterior distribution for the parameter in an EB approach.

As far as the approximation of the marginal posterior of the hyper-parameter is concerned, both approximations peak around the correct value, i.e. \(\theta _{\text {true}}=0.24\). Regarding the approximations of the posterior of the parameter, we observe that all the approximated distributions peak at a value close to the true value (zero), well within their support.

6 Application to source imaging in magneto/electro-encephalography

In this Section we present the results obtained by applying the Rao-Blackwellized SMC samplers with the proposed method described in Sect. 4.2 to the resolution of the M/EEG inverse problem (Sommariva and Sorrentino 2014) introduced as a motivating example in Sect. 2.

Fig. 6

Example of noise-free (top panel) and noisy (bottom panel) simulated data. Red vertical lines delimit the portion of data actually used for inference

Fig. 7

Estimation error for the hyper-parameter \(\theta \) (top row); estimation error for the parameter \(\mu \), computational time and number of likelihood evaluations (second row). Fully Bayesian, Proposed Fully Bayesian and Proposed Empirical Bayes are shown in red, blue and green, respectively. Times refer to a MacBook Pro (13-inch, M1, 2020) with 8 GB of memory

6.1 Data generation

Data \({\textbf{y}}=(y(1),\dots ,y(T))\) are generated with the following configuration:

  • brain discretization \(\Omega \) with 8193 voxels;

  • number of EEG channels: 59;

  • number of dipoles: \(d=4\);

  • dipole positions \(r_i\): drawn uniformly at random among the voxels, with the constraint that the distance between any two dipoles is larger than 3 cm; the constraint was set in order to allow identifiability of the four dipoles;

  • dipole moment \(q_i\): orientation chosen among the three orthogonal directions, as the one that maximizes signal strength; unit dipole strength;

  • noise standard deviation: \(\theta _{\text {true}}\sim {\mathcal {U}}[1, 10]\).

With these settings, we generate 100 independent realizations of the dataset in order to test the proposed algorithm; Fig. 6 shows one example of the obtained data.

6.2 Prior and likelihood

We assume that all parameters are a priori independent; with \(x = (d, \lambda , r_{1:d})\), the prior density is therefore

$$\begin{aligned} p(x) = p(d)p(\lambda ) \prod _{i=1}^{d} p(r_i), \end{aligned}$$
(30)

where we specify:

$$\begin{aligned}&p(d) = \text {Poisson}(d;1);\\&p\left( \log (\lambda )\right) = {\mathcal {U}}\left( \log (\lambda );\left[ -8,\;-5\right] \right) ;\\&p(r_i)={\mathcal {U}}\left( r_i;\Omega \right) . \end{aligned}$$

We recall that the variables \(q_i\) can be marginalized out thanks to a Gaussian prior assumption, and therefore are not considered in the SMC algorithm.
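Under (30), drawing from the prior is straightforward; the sketch below represents dipole positions simply as voxel indices into the discretized \(\Omega \), an assumption made here for illustration:

```python
import numpy as np

def sample_prior(rng, n_voxels=8193):
    """Draw x = (d, lambda, r_{1:d}) from the prior (30): Poisson(1)
    number of dipoles, log-uniform lambda on [-8, -5], and positions
    uniform over the voxels."""
    d = rng.poisson(1)
    lam = np.exp(rng.uniform(-8.0, -5.0))
    r = rng.integers(0, n_voxels, size=d)
    return d, lam, r
```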

We assume that noise is not correlated in time, corresponding to conditional independence between data recorded at different time points; the likelihood thus factorizes

$$\begin{aligned} p^{\theta }({\textbf{y}}\mid x) = \prod _{t=1}^T p^{\theta }(y(t) \mid d, \lambda , r_{1:d}). \end{aligned}$$
(31)

6.3 Algorithm settings

Each SMC sampler was applied with the following settings:

  • analysis window corresponding to the interval [40, 60], as shown in Fig. 6, i.e. an analysis window centered at the peak of the signal;

  • number of particles set to 200, as a compromise between computational cost and quality of the approximation;

  • \(\theta ^\star =\min \{\theta _{\text {true}}\}/2\); this allows the true value \(\theta _{\text {true}}\) to be within the range of values explored by the proposed method during SMC sampler iterations; since the order of magnitude of the noise is typically known for this kind of data, it would not be difficult to apply similar reasoning to experimental data;

  • number of iterations set to 200, with the sequence of exponents from 0 to 1 evenly spaced on a logarithmic scale;

  • resampling step performed by means of systematic resampling (Douc et al. 2005) whenever the effective sample size is lower than half of the number of particles;

  • MCMC kernels as described in Sommariva and Sorrentino (2014).

Fig. 8

Posterior probability maps for source localization obtained with the Fully Bayesian (left panel), Proposed Fully Bayesian (central panel) and Proposed Empirical Bayes (right panel) approaches. Results are visualized on a discretized brain shown as black dots; the blue dots represent regions with posterior probability higher than 10%, the purple stars are the estimated dipoles, and the green crosses are the true ones

6.4 Performance metrics

We assess performance in terms of selection of the hyper-parameter and localization of the current dipoles.

The estimates considered for the hyper-parameter are the MAP and the PM of the marginal posterior \(p(\theta \mid {\textbf{y}})\), while the estimates for the number and the localization are defined as:

  • estimator for number of dipoles: \({\hat{d}} = {\hbox {arg max}}_{d \in {\mathbb {N}}} (p(d \mid {\textbf{y}}))\)

  • estimator for dipole location: we construct \({\hat{d}}\) clusters and then obtain \({\hat{r}}_{i}\), for \(i=1,\dots ,{\hat{d}}\), as the peak of the marginal posterior \(p(r \mid {\textbf{y}}, {\hat{d}})\) in the i-th cluster.

We note that the location estimates are somewhat nonstandard in the statistics literature, but this strategy is widespread in the multiple-object tracking literature (see, e.g., Sorrentino et al. 2013) as a natural solution to the label-switching problem in this context.
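One way to sketch the cluster-then-peak estimator is with an off-the-shelf k-means step; the particular clustering routine below is our choice for illustration, not necessarily the one used in the paper. Particles are assumed to carry positions and normalized posterior weights:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def estimate_locations(positions, weights, d_hat, seed=0):
    """Cluster the particle positions into d_hat groups, then return the
    highest-weight position within each group as the location estimate
    for that dipole."""
    _, labels = kmeans2(positions, d_hat, minit="++", seed=seed)
    return np.array([
        positions[labels == k][np.argmax(weights[labels == k])]
        for k in range(d_hat)
    ])
```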

As the number of dipoles is estimated from the data, the true and estimated number of dipoles might differ; for this reason, in order to evaluate the localization error we consider the Optimal Sub-Pattern Assignment (OSPA) metric (Ristic et al. 2011), defined as follows:

$$\begin{aligned} \text {OSPA}({\hat{r}}_{1:{\hat{d}}}, r_{1:d}) = \min _{\phi }\sum _{i=1}^{\min \{{\hat{d}}, d\}} \Vert {\hat{r}}_{i} - r_{\phi (i)}\Vert \end{aligned}$$
(32)

where the minimum is taken over all possible permutations, \(\phi \), of \(\{1,\ldots ,d\}\).
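For the small numbers of dipoles considered here, (32) can be computed by brute force over permutations; larger problems would call for the Hungarian algorithm instead:

```python
import numpy as np
from itertools import permutations

def ospa(est, true):
    """OSPA localization error (32): best matching of estimated to true
    positions, summing Euclidean distances over min(d_hat, d) pairs."""
    k = min(len(est), len(true))
    return min(
        sum(np.linalg.norm(est[i] - true[phi[i]]) for i in range(k))
        for phi in permutations(range(len(true)))
    )
```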

6.5 Results

In Fig. 7 we report the boxplots for the performance metrics and the computational costs of the tested algorithms.

Our results indicate that the proposed approach performs slightly better than the alternative in terms of parameter and hyper-parameter estimation. The computational cost of the proposed approach is considerably lower than that of the alternative, although the difference is less marked than in the toy example. This can be explained by the combined effect of the variable-dimension model, i.e. the SMC sampler exploring spaces with different numbers of sources, and the sampling of the hyper-parameter: when the sampled hyper-parameter is large, the SMC sampler tends to prefer configurations with fewer sources, whose likelihood calculation is less expensive.

Fig. 9

Posterior for the hyper-parameter \(\theta \) approximated with the Fully Bayesian approach (red) and with the Proposed Fully Bayesian approach (blue). The green dotted line represents the true value of the hyper-parameter, \(\theta =10\)

6.5.1 Sample result

For illustrative purposes, in this Section we show results from one specific dataset taken from the 100 simulations used in the previous Section.

In Fig. 8 we show the posterior distribution for the source location \(p(r \mid {\textbf{y}}, {\hat{d}})\) approximated by the FB, PropFB and PropEB approaches, respectively. All three posteriors are similar to one another, and all three methods estimate four sources in the brain with similar supports.

In Fig. 9 we show the approximated posterior distributions for the hyper-parameter provided by the two algorithms. Again we observe that the two approximations are similar to each other and peaked around the correct value \(\theta _{\text {true}}=10\). We remark that the set of possible values for the hyper-parameter visited during the SMC iterations of the proposed method ranges from 153 down to 0.5; the plot focuses on a subset of this range in order to better illustrate the posterior density.

7 Conclusions

We presented a method that allows us to simultaneously conduct Fully Bayesian and Empirical Bayes estimation for a class of problems with a scalar hyperparameter at negligible additional cost.

Experiments show that the method performs slightly better than the natural alternatives, with some important differences. The proposed approach is more versatile in several ways: it allows one to compute maximum likelihood/maximum a posteriori estimates of the hyper-parameter; it allows the SMC samples to be recycled for a different hyper-prior; and it allows hyper-parameter selection via marginal maximum likelihood, together with estimates of the unknown parameters for a specific value of the hyper-parameter. In addition, when it comes to averaging across different values of the hyper-parameter, it provides substantially more Monte Carlo samples, potentially allowing better approximations of the posterior and resulting in better estimates of the unknowns.

Importantly, all these advantages are obtained essentially for free, i.e. at no additional computational cost; in addition, the proposed approach exploits samples at all iterations, thus simultaneously overcoming one of the known limitations of SMC samplers, i.e. the fact that intermediate samples are usually discarded.

Finally, although this article is dedicated to exploiting the particular structure present in a class of problems with scalar hyper-parameters in a way which yields both standard and empirical Bayesian estimates simultaneously with little overhead, it also suggests a path to efficiently performing empirical Bayesian estimation in a broader class of models. Estimating the gradient of the marginal likelihood with respect to the hyper-parameter using the current particle set would in principle allow the adaptive specification of a sequence of hyper-parameter values (and hence posterior distributions) which converges towards that which maximises the marginal likelihood. Such an approach is in the spirit of the SOUL (De Bortoli et al. 2021) and PGD-type (Kuntz et al. 2023) algorithms but would employ sequential Monte Carlo in order to provide sample approximations rather than Langevin-type dynamics. There are two ways one could view such an algorithm: as a Monte Carlo approximation of a gradient-based optimizer for the hyperparameter; or as an adaptive SMC sampler in which the sequence of distributions is specified by following an approximate gradient direction in the space of parameters. Such an approach would provide a natural strategy for performing empirical Bayes in settings with multivariate hyper-parameters; however, outside the scalar setting explored in this article one would lose the option to extract Fully Bayesian results from the same simulation. Exploring this rather different approach is beyond the scope of this manuscript but provides an interesting avenue for future exploration.