Cost free hyper-parameter selection/averaging for Bayesian inverse problems with vanilla and Rao-Blackwellized SMC samplers

In Bayesian inverse problems, one aims at characterizing the posterior distribution of a set of unknowns, given indirect measurements. For non-linear/non-Gaussian problems, analytic solutions are seldom available: Sequential Monte Carlo samplers offer a powerful tool for approximating complex posteriors, by constructing an auxiliary sequence of densities that smoothly reaches the posterior. Often the posterior depends on a scalar hyper-parameter, for which limited prior information is available. In this work, we show that properly designed Sequential Monte Carlo (SMC) samplers naturally provide an approximation of the marginal likelihood associated with this hyper-parameter for free, i.e. at a negligible additional computational cost. The proposed method proceeds by constructing the auxiliary sequence of distributions in such a way that each of them can be interpreted as a posterior distribution corresponding to a different value of the hyper-parameter. This can be exploited to perform selection of the hyper-parameter in Empirical Bayes (EB) approaches, as well as averaging across values of the hyper-parameter according to some hyper-prior distribution in Fully Bayesian (FB) approaches. For FB approaches, the proposed method has the further benefit of allowing prior sensitivity analysis at a negligible computational cost. In addition, the proposed method exploits particles at all the (relevant) iterations, thus alleviating one of the known limitations of SMC samplers, i.e. the fact that all samples at intermediate iterations are typically discarded. We show numerical results for two distinct cases where the hyper-parameter affects only the likelihood: a toy example, where an SMC sampler is used to approximate the full posterior distribution; and a brain imaging example, where a Rao-Blackwellized SMC sampler is used to approximate the posterior distribution of a subset of parameters in a conditionally linear Gaussian model.


Introduction
In Bayesian inverse problems, one is interested in approximating the posterior distribution of a set of unobservable quantities, x, conditioned on indirect measurements, y [21].Often the posterior distribution depends on a scalar hyper-parameter, θ ∈ Θ ⊆ R, e.g. the noise variance: one can either perform hyper-parameter selection with an Empirical Bayes (EB) approach targeting the conditional posterior p θ (x | y) with the hyper-parameter set to the value which maximizes the marginal likelihood, θ := arg max θ∈Θ {p θ (y)}, sometimes termed type-II maximum likelihood [8], or else marginalize out the hyper-parameter through a Fully Bayesian (FB) approach, targeting the posterior p(x | y).However, both approaches often result in costly procedures.
One relatively common tool for approximating posterior distributions arising in Bayesian inverse problems are Sequential Monte Carlo (SMC) samplers [3].SMC samplers construct an artificial sequence of distributions such that the first one can be readily sampled from and the last one coincides with the distribution of interest; a set of particles is drawn from the first density, and evolves gradually to approximate each distribution in the sequence.
In most implementations of SMC samplers for Bayesian inverse problems, the samples obtained at intermediate iterations are discarded, because intermediate iterations are only used to facilitate the approximation of the target distribution.Not directly using these samples, except perhaps to estimate a normalizing constant, seemingly results in a substantial waste of computational resources.Indeed, we have recently witnessed a growing number of studies that attempt to exploit/recycle particles from previous iterations in the final estimates [9,12,16].Gramacy et al. [9] propose to recycle particles at different iterations by considering a weighted sum of all the approximated distributions in order to maximise the Effective Sample Size (ESS).Alternatively, Nguyen et al. [16] propose to combine particles from past SMC samplers iterations considering the so called Deterministic Mixture Weight estimator ; a solution derived to combine weighted particles drawn from different proposal distributions.Recently South et al. [12] developed a method which allows the samples from each generation of the algorithm to be used to approximate integrals over a part of the state space.
In this work we show that, for a large class of hierarchical Bayesian inverse problems, the intermediate iterations of properly designed SMC samplers can be used to perform selection of the hyper-parameter and/or averaging with respect to it, making EB/FB approaches feasible.All of this has only a negligible additional computational cost and, in the case of averaging, it also entails recycling of the particles at intermediate iterations, thus reducing the typical waste of computational resources.
The key idea underlying the proposed method is to define the auxiliary sequence of distributions in such a way that each distribution is a posterior distribution conditioned on a different value for the hyper-parameter.Such construction turns out to be extremely simple under certain conditions, for instance when the hyper-parameter appears only in the likelihood and the likelihood belongs to the natural exponential family; under other circumstances, finding the right sequence can be more challenging.Given the sequence, the estimate of the normalizing constant, naturally produced by SMC samplers, corresponds to an estimate of the evidence for the specific value of the hyper-parameter, which then allows maximum likelihood or Bayesian inference on the hyper-parameter.
We provide the right tempering sequence for two different models largely used in inverse problems: • when the likelihood belongs to the Natural Exponential Family (NEF): here the tempering sequence obtained by raising the likelihood to a growing power between zero and one results in a proper sequence of densities that can be interpreted as posterior distributions; • when the conditional posterior for a subset of variables x 1 can be analytically computed, and an SMC sampler is used only to approximate the posterior on the remaining variables x 2 .For this class of models, which includes among others Conditionally Linear Gaussian (CLG) models, the auxiliary distribution sequence devised for the first case does not, in general, work fine, therefore we devise alternative sequences that can be used fruitfully in two special subcases.
The most straightforward application of the proposed method is the context of additive Gaussian noise inverse problems; here the interest is in the estimation of the joint posterior distribution for the state variables and the noise variance or the posterior distribution for the state variables conditioned on the estimated value for the noise variance.
As a first examples we consider the problem of recovering the mean of a Gaussian distribution from noisy observations, showing that the proposed approach performs as well as alternative approaches but with significant advantages in computational time.Then we show numerical results for a real world problem encountered in source analysis of Magneto/Electro-Encephalography data, in this case we show that the proposed approach provides reliable results and a substantial reduction of computational cost with respect to alternative approaches.

Motivating example: source estimation in magneto/ electro-encephalography
Magneto-/Electro-EncephaloGraphy (M/E-EG) are two non-invasive medical imaging techniques that record the magnetic/electric field on the scalp; from these recordings, it is possible to estimate the underlying neural currents [11].Using the dipolar assumption, this problem consists of estimating an unknown number of point sources, called dipoles, each one defined by two quantities: • a location in the brain volume, conveniently represented as the index r of a cell of a discretized brain (or voxel ); dipole location is assumed to be fixed in time; • a 3-D vector q representing orientation and intensity of the neural current at the specified voxel, and changing dynamically in time.
The inference problem can be formalized as where: t = 1, . . ., T is a time index; y(t) is an array containing the data recorded by all M/E-EG sensors at time t; d is the (unknown) number of dipoles; G(r i ) is the so called lead-field matrix, representing the magnetic/electric field generated by a unitary dipole located at r i ; ε(t) is additive Gaussian noise whose (spatial) covariance matrix Σ is known up to a scale factor θ.
This model was originally adopted in [19,20], where all unknown parameters were sampled with an SMC sampler, leading to high computational cost for long time series; in [18] a Rao-Blackwellized version was presented that imposed a Gaussian prior on the q variables and exploited the CLG structure, allowing to treat long time series with reduced computational cost.Finally, in [23] a hierarchical model was presented that overcomes the limitations of the Gaussian prior by using a hyper-prior on the prior variance, thus substantially reducing the dependence on this hyper-parameter.Defining y := (y(1), . . ., y(T )) and q 1:d := (q 1:d (1), . . ., q 1:d (T )), the posterior distribution decomposes as: where the conditional posterior p θ (q 1:d | y, d, r 1:d , λ) can be computed analytically, and only the second factor on the right hand side of (2) has to be approximated via Monte Carlo.Importantly, there remains a dependence on the hyper-parameter θ, namely the overall noise level, whose value has to be estimated.

SMC Samplers for Bayesian inverse problems
In this Section we provide a brief summary of a class of SMC samplers that are often used for the approximation of posterior distributions in Bayesian inference problems.Notice that SMC samplers can be applied in more general situations, not analyzed in this paper; for further details on general SMC samplers algorithms the reader is referred to [3,4].Consider a Bayesian inference problem where the aim is to approximate the posterior distribution where y represents the data and x the unknown parameters.The posterior distribution is often a complex distribution in a possibly highdimensional space and is typically difficult to sample from directly.SMC samplers provide an effective way to sample such complex distributions, and can be briefly summarized as follows.The fist step is to define a sequence of intermediate densities: that "smoothly" transition from an easy-tosample initial density p 0 to the posterior density p T .Condition (4c) is required in order to guarantee a smooth transition toward the target density and hence to allow a good approximation of p t+1 to be obtained from the corresponding approximation of p t .A natural, but not mandatory, choice in Bayesian inference is to reach the posterior density by starting from the prior and increasing the power of the likelihood using the so called geometric bridge, or tempering path [1,2,15,22]: Once the sequence of distributions has been selected, SMC samplers work as follows: • sample a set of N weighted particles {x (0) ; W (0) } from the initial distribution p 0 • for t = 1, . . ., T : 1. perform one, or more, Markov Chain Monte Carlo (MCMC) step/s; such as Metropolis Hastings step/s 2. perform an Importance Sampling (IS) step from the current distribution p t−1 to the next distribution p t updating the un-normalized importance weights and normalizing them using the relations At this point one obtains an approximation of the t-th distribution of the sequence as: In this step one also obtains an estimator of the normalizing constant of the distribution p t , crucial for model selection in general and for the proposed method in particular.It can be easily evaluated; for simplicity, assuming that resampling occurs at every step, as the product over time of the average of the unnormalized importance weights at eah time: the expression in the case that resampling is conducted adaptively can be found, is the corresponding product over resampling times of the average of the weights accumulated since the last resampling time (see, e.g., [10, p. 1641] for an explicit expression).3. perform a resampling step to avoid degeneracy of the importance weights [6,7].A widely used strategy is to perform resampling whenever the Effective Sample Size (ESS) (see, e.g, [13]) is under a fixed threshold.
One important property of SMC samplers comes from equation (6b) which allows the evaluation of the importance weights at time t using only the particles at the previous step.This allows the order of the first two steps of the algorithm to be reversed, which further allows an adaptive choice of the actual sequence of densities, as defined in (4a), through an online selection of the next exponent [5,20].

Selection/averaging of the hyper-parameter
Let Θ ⊆ R and consider a Bayesian inverse problem depending on a hyper-parameter θ ∈ Θ.We are now going to show how an SMC sampler can be used both to select a specific value for the hyper-parameter and/or to approximate the joint posterior distribution p(x, θ | y) at no additional cost with respect to the SMC sampler that approximates the conditional posterior p θ (x | y).
The key idea underlying the proposed method is to construct an SMC sampler whose target distribution is p θ (x | y) for some value θ ∈ Θ, and whose intermediate distributions are posterior distributions corresponding to different values of the hyper-parameter for a set of values Θ 0: Given the sequence above, one can estimate pointwise the evidence for the hyper-parameter p θ (y) for θ ∈ Θ 0:T through Importance Sampling (8).Under regularity assumptions for p θ (y) w.r.t.θ one can interpolate this finite set of values to obtain a smooth approximation of the evidence and, assuming the availability of a hyper-prior p(θ), that we assume to be negligible outside a compact set [θ min , θ max ], an approximation of the marginal posterior p(θ | y).
For an EB approach, one can first find the mode of the interpolating function properly weighted θ = arg max θ∈[θmin,θmax] {p(θ | y)}, (10) where we assume that the range of Θ 0:T contains θ * , θ min and θ max .This can be done numerically by binary search, using importance sampling to estimate the marginal likelihood of values of θ between those in Θ 0:T .We can then apply importance sampling to obtain an approximation of p θ (x | y).
In order to avoid degeneration of importance weights, one should do importance sampling from p θ( t) (x | y), where θ( t) is the closest value to θ such that the support and tails of p θ( t) (x | y) are larger and heavier, respectively, than those of p θ (x | y); for instance, assuming that {θ(t)} t=0,...,T is a decreasing sequence, and that the distributions tails become lighter as θ becomes smaller, we shall select the iteration t = arg min{t : θ(t) > θ}.(11) For a FB approach one obtains an approximation of the posterior p(θ | y) ∝ p θ (y)p(θ) (12) for θ ∈ Θ 0:T , allowing to compute estimates such as the posterior mean or the maximum a posteriori for the hyper-parameter.In addition, it is possible to approximate the marginal posterior of the parameters that takes into account uncertainty on parameters deriving from uncertainty on the hyperparameter.This can be done by considering all particles at all iterations and re-weighting them where g (t) is a function representing the interpolation weights.For example, in the case of a standard quadrature method such as the trapezoidal rule we get but of course more sophisticated options are available [24].
The additional computational cost required for calculating ( 12) -( 14) is negligible compared to the one needed for the approximation of p θ (x | y) directly with an SMC sampler employing likelihood tempering.
Moreover, the proposed FB approach has the advantage of making usage of particles at all iterations, thus avoiding the usual waste of computational resources.
As a last point we remark that, in the FB case, it is possible to modify the hyper-prior without re-running the SMC sampler: this allows cheap prior sensitivity analysis, an important aspect to consider in applied Bayesian analyses, at a very small computational cost.
The construction of sequence ( 9) is not always straightforward.In the following, we consider an inverse problem whose likelihood belongs to the NEF and the prior does not depend on the hyperparameter, deriving sequence (9) for two distinct cases: 1. the case where SMC samplers are used to approximate the full posterior distribution; 2. the case where the conditional posterior for a subset of variables x 1 can be analytically computed, and a Rao-Blackwellized SMC sampler is used to approximate the posterior on the remaining variables x 2 .

Case 1: vanilla SMC samplers for the full posterior distribution
As the likelihood belongs to the Natural Exponential Family (NEF) with natural scalar hyperparameter θ ∈ Θ ⊆ R, it has the following density where T (y | x) is a sufficient statistic and A θ represents the log-normalizing constant.
Proposition 1 Let p θ ∈ N EF with sufficient statistic T and canonical parameter θ s.t.p θ (x) = exp(θT (x) − A θ ) and α = 0, then: By the previous proposition, whose trivial proof is provided in Appendix, it is straightforward to show that the sequence (5a) naturally provides an evaluation of the joint posterior distribution p(x, θ | y) for the set of values Θ As an example, in the case of an inverse problem with additive Gaussian noise of unknown variance, the distributions of the sequence are posterior distributions corresponding to a decreasing variance σ(t) = σ / √ α t where σ represents the noise standard deviation at the very last iteration of the SMC samplers.

Case 2: Rao-Blackwellized SMC samplers
We now consider the case where the unknown variable x can be decomposed into a pair of components x = (x 1 , x 2 ), and: • the prior on x 1 belongs to the NEF with respect to a hyper-parameter where S(x 1 ) is a sufficient statistic and A λ is the log-normalization constant; • the conditional posterior p θ (x 1 | x 2 , λ, y) can be computed analytically.
Under these assumptions, in the natural decomposition of the joint posterior density only the second factor of the right hand side needs to be approximated by an SMC sampler, thus reducing the variance of the importance weights and improving the quality of the approximation.This class of models is widely used and appreciated in applications; in particular, an SMC sampler targeting the marginal posterior p θ (x 2 , λ | y) typically leads to more accurate estimates than an SMC sampler targeting the full posterior and using the same computational resources [14].
As a consequence of the hypothesis that both the likelihood (16) and the prior on the Rao-Blackwellized variable (17) belong to the NEF, the marginal likelihood turns out to be In most cases, the marginal likelihood in equation (19) does not have a closed form solution; below we show two special cases in which it does.

Additive statistic for the Likelihood
If the statistic T (y | x 1 , x 2 ) of the full likelihood (16) is the sum of two statistics T (y | x 1 ) and T (y | x 2 ), then the marginal likelihood also belongs to the NEF with respect to the same parameter For this particular subclass of models, the natural sequence (5a) is still valid, as the marginal likelihood is still in the NEF.

Conditionally Linear Gaussian Model
If both the full likelihood and the prior on x 1 have normal distribution it is well known that both the marginal likelihood (2) and the conditional posterior [18] are Gaussian with known mean and variance where: In this case, the marginal likelihood is not in the NEF with respect to the parameter θ and the natural sequence (5a) does not work.Indeed, by applying to the CLG model the same sequence constructed in the general case, one would get since the marginal likelihood also embodies the prior on the marginalized variable x 1 , the exponent also affects the prior for x 1 ; therefore, as already observed in [18], the distributions of this sequence cannot be considered as (marginals of) posterior distributions under the same prior.
Alternatively, one could consider the sequence of marginals of the natural sequence for the approximation of the complete posterior density: However, also this choice leads to a sequence of distributions that cannot be interpreted as posterior distributions under different values of θ; this happens because, as shown in Appendix (Corollary 1 and Proposition 2), the integral in ( 24) is where the Gaussian distribution can be interpreted as the marginal likelihood of the CLG model, with a different value of θ, but the normalization constant t (λ), defined as in Proposition 1 in Appendix, depends on the hyper-parameter λ and thus actually modifies the distribution.However, it is not difficult to devise a proper sequence of intermediate distributions for the case of a CLG model.In fact, it is sufficient to explicitly remove the λ-dependent normalization factor from (25) and construct the sequence as: With this definition we can apply the proposed approach to a CLG model while also exploiting Rao-Blakwellization.

Toy Example
We proceed with a numerical validation of the proposed approach by first using a toy example1 ; following the arguments in Section 4, we compare the results with natural alternatives for Fully Bayesian (FB) and Empirical Bayes (EB) approaches.

Setup
Consider an inverse problem where the aim is to reconstruct the mean of a Gaussian waveform of known variance σ 2 , given noisy measurements y(t), i.e.
where N (t; µ, σ 2 ) is the probability density function of a Gaussian of mean µ and standard deviation σ, evaluated at t.
We assume observations are available at I points separated by unit intervals {t i } I i=1 and we want to make inference on the Gaussian mean.

Data Generation
Data y = (y(t 1 ), .With these settings, we generate 100 independent realizations of the dataset in order to test the proposed algorithm.
Fig. 1 The figure shows in the first row an example of data without noise while in the second row the same data with the addition of noise.

Prior and likelihood
• We assume p(µ) ∼ U ([−5, 5]) as a truncation of the Jeffrey's prior to the convex hull of the measurements; • we assume p(θ) ∼ Γ(2, 4θ ), where θ is an estimated value for the hyper-parameter; • we assume conditional independence between observations given the parameter, obtaining a simple factorization for the likelihood

Algorithm settings
For each of the 100 generated datasets, we compare the results obtained with the proposed method with the one obtained with a FB approach and an EB approach.Each SMC sampler used has the following settings: • number of particles set to 100 as a compromise between performances and quality of the approximation; • θ = min{θ true }/2; this allows the true value θ true to be within the range of values explored by the Proposed method during SMC sampler iterations; • number of iterations set to 500, with the sequence of exponents growing exponentially in order to guarantee a smooth transit between intermediate distributions; • resampling step performed by means of systematic resampling [6] whenever the effective sample size is lower than half of the number of particles; • Gaussian kernel for the MCMC step.

Comparison with alternative approaches
We compare the performances of the proposed method with those of two alternatives, one performing an Empirical Bayes approach and the other one performing a Fully Bayesian approach.
In the following, particularly in the pictures, we denote by PropEB and PropFB the results obtained by the proposed method performing Empirical Bayes and Fully Bayesian approaches, respectively.

Empirical Bayes Approach
For the EB approach we first obtain an estimate for the maximum a posteriori for the hyperparameter: where p(θ | y) is obtained by considering M = 100 evenly spaced samples in the interval [−5, 5] : and then selecting the maximum value obtained over an evenly spaced grid of 500 points for θ ∈ [θ , 50 • θ true ] Once an estimate for the hyper-parameter is obtained, we consider an SMC sampler targeting the posterior distribution p θMAP (µ | y).

Fully Bayesian Approach
For the FB approach we consider an SMC sampler targeting the posterior distribution p(µ, θ | y), i.e. the hyper-parameter is sampled by the SMC sampler like all other parameters; the posterior distribution for the hyper-parameter is then obtained by marginalizing the joint distribution.

Results
We analyze the performances in terms of selection of the parameter and hyper-parameter considering the Posterior Mean (PM) and the maximum a posteriori (MAP) estimators, and compute the Euclidean distance between the true and the estimated value of the (hyper-)parameter.
In Figure 2 we report the boxplots obtained for the (hyper-)parameter estimation error over the 100 datasets: • in the first row (left panel) we show the error in the estimation of the hyper-parameter considering as estimate the MAP and the PM respectively for the FB approach and the PropFB one.• in the first row (right panel) we show the error in the estimation of the hyper-parameter considering as estimate the MAP for the EB and PropEB approach.Due to the structure of the proposed algorithm, the MAP estimate for the PropEB is the same obtained with the PropFB.• in the second row (left panel) we show the error in the estimation of the parameter committed respectively by the FB and PropFB approaches considering as estimate the MAP and the PM.• in the second row (right panel) we show the error in the estimation of the parameter committed respectively by the EB and PropEB approaches.
We notice that the proposed approach features similar performances as the alternative approaches, either the EB and the FB, in terms of estimation error, while keeping a substantially lower computational cost (Fig. 3).In the case of the FB approach, the proposed method also features a larger ESS (Fig. 4).

Sample result
For illustrative purposes, in this Section we show results from one specific dataset taken from the 100 simulations used in the previous Section.
In Figure 5 we show the output obtained by the proposed method and by the two alternative approaches, specifically by showing: • in the first row, the approximated posterior distribution for the hyper-parameter; • in the second row, the approximated posterior distribution for the parameter obtained in a FB approach; • in the third row, the approximated posterior distribution for the parameter in an EB approach.
As far as the approximation of the marginal posterior of the hyper-parameter is concerned, both approximations peak around the correct value, i.e.  θ true = 0.20.Regarding the approximations for the posterior of the parameter, we observe that all the approximated distributions peak at a positive value but contain the true value (zero) well within their support. .

Application to source imaging in Magneto/ Electro-EncephaloGraphy
In this Section we present the results obtained with the application of the Rao-Blackwellized SMC samplers with the proposed method described in Section 4.2 for the resolution of the M/E-EG inverse problem [18] introduced as motivating example in Section 2.

Data Generation
Data y = (y(1), . . ., y(T )) are generated with the following configuration: • brain discretization Ω with 8193 voxels; • number of M/E-EG channels: 59; • number of dipoles: d = 2; • dipole position r i : randomly drawn, with uniform distribution among the voxels, with the constraint that the distance between the two dipoles is larger than 3cm; the constraint was Fig. 5 Illustrative example of the posterior for the hyper-parameter (first row), the marginal of the joint posterior for the parameter (second row) and the conditional posterior for the parameter (third row).
Fig. 6 Example of the simulated data used for the analysis as recordings performed by 59 EEG channels each second from 0 to 100.In the first row of the plot we show the synthetic data without noise while in the second row we show the same data with the addition of additive Gaussian noise.The region between the red vertical lines is that which is observed.
set in order to allow for identifiability of the two dipoles; • dipole moment q i : orientation chosen among the three orthogonal directions, as the one that maximizes signal strength; unit dipole strength; • noise standard deviation: θ true ∼ U [1,100].
With these settings, we generate 50 independent realizations of the dataset in order to test the proposed algorithm; Figure 6 shows one example of the obtained data.

Prior and likelihood
We assume that all parameters are a priori independent, being x = (d, λ, r 1:d ), the prior density is therefore where we specify: We assume that noise is not correlated in time, corresponding to conditional independence between data recorded at different time points; the likelihood thus factorizes (31)

Algorithm settings
Each SMC sampler was applied with the following settings: • analysis window corresponding to the interval [40, 60], as shown in figure 6, i.e. analysis windows centered in the peak of the signal; • number of particles set to 100 as a compromise between performances and quality of the approximation; • θ = min{θ true }/2; this allows the true value θ true to be within the range of values explored by the Proposed method during SMC sampler iterations; the order of magnitude of noise is typically known in this kind of data, therefore it would not be difficult to apply a similar reasoning to experimental data; • number of iterations set to 100, with the sequence of exponents growing exponentially in order to guarantee a smooth transit between intermediate distributions; • resampling step performed by means of systematic resampling [6] whenever the effective sample size is lower than half of the number of particles; • MCMC kernels as described in [18].

Performance metrics
We consider the performances in terms of selection of the hyper-parameter and in terms of localization of current dipoles.
The estimates considered for the hyperparameter are the MAP and the PM of the We note that the location estimates are a littlenonstandard in the statistics literature, but this strategy is widespread in the mutiple-object tracking literature (see, e.g., [19]) as a natural solution to the label-switching problem in this context.
As the number of dipoles is estimated from the data, the true and estimated number of dipoles might differ; for this reason, in order to evaluate the localization error we consider the Optimal Sub-Pattern Assignment (OSPA) metric [17], defined as follows: where the minimum is taken over all possible permutations, φ, of {1, . . ., d}.

Results
In Figure 7 we report the boxplots obtained for the hyper-parameter estimation and the localization error over the 50 datasets: • in the first row we show the error in the estimation of the hyper-parameter considering as estimate the MAP and the PM respectively for the FB approach and the PropFB one; • in the left panel of the second row we show the OSPA metric respectively for the FB, PropFB and PropEB approaches; • in the right panel of the second row we report the CPU time.
Our results indicate that the proposed approach performs slightly better than the alternative in terms of hyper-parameter estimation, while localization error as measured by the OSPA metric is not significantly different.The computational cost of the proposed approach is considerably lower than the one of the alternative approach; the difference is much more evident Fig. 9 The figure shows the posterior for the parameter θ approximated with the Fully Bayesian approach (red) and with the Proposed Fully Bayesian (blue).
than in the case of the toy example.This difference can be explained by the combined effect of the variable dimension model, i.e. the SMC sampler exploring spaces with different number of sources, and the sampling of the hyper-parameter: when the sampled hyper-parameter gets small, the SMC sampler tends to prefer configurations with larger number of sources whose likelihood calculation is more expensive.

Sample result
For illustrative purposes, in this Section we show results from one specific dataset taken from the 50 simulations used in the previous Section.
In Figure 8 we show the posterior distribution for the source location p(r | y, d) approximated respectively by the FB (left panel) and the PropFB (right panel) approaches.Both posterior are reciprocally similar and both methods estimate two sources, one in the left and one in the right hemisphere.
In Figure 9 we show the approximated posterior distributions for the hyper-parameter provided by the two algorithms.Again we can observe that the two approximations are similar to each other and peaked around the correct value.θ true = 22.

Conclusions
We presented a method that allows to perform at the same time and with a limited computational cost Fully Bayesian and Empirical Bayes approaches.
Experiments show that the method performs slightly better than the natural alternatives, but with important differences.The proposed approach is more versatile in several ways: it allows to compute maximum likelihood/a posteriori estimates of the hyper-parameter; it allows to recycle the SMC samples for a different hyperprior; it allows hyper-parameter selection via marginal maximum likelihood, and to provide estimates of the unknown for a specific value of the hyper-parameter.In addition, when it comes to averaging across different values of the hyperparameter, it provides substantially more Monte Carlo samples, potentially allowing better approximations of the posterior and resulting in better estimates of the unknowns.
Importantly, all these advantages are obtained essentially for free, i.e. at no additional computational cost; in addition, the proposed approach exploits samples at all iterations, thus simultaneously overcoming one of the known limitations of SMC samplers, i.e. the fact that intermediate samples are usually discarded.
Finally, although this article is dedicated to exploiting the particular structure present in a class of problems with univariate hyperparameters in a way which yields both standard and empirical Bayesian estimates simultaneously with little overhead, it also suggests a path to efficiently performing empirical Bayesian estimation in a broader class of models.By estimating the gradient of the marginal likelihood with respect to the hyperparameter using the current particle set would in principle allow the adaptive specification of a sequence of hyper-parameter values (and hence posterior distributions) which converges towards that which maximises the marginal likelihood.This is beyond the scope of this manuscript but provides an interesting avenue for future exploration.
where the determinant of the covariance matrix is equal to Therefore, from a well known result on Gaussian densities, we obtain the thesis . ., y(t I )) are generated considering I = 100 measurements in the interval [−5, 5] obtained by perturbing the Gaussian density at each observation time independently with additive Gaussian noise of zero mean and standard deviation θ true ∼ U[0.1, 0.2].

Fig. 2
Fig.2Box-plots for the error in the estimation of the hyper-parameter θ (first row), the error in the estimation of the parameter µ (second row).There are shown respectively in red, blue, yellow and green the error committed considering the MAP and PM estimates for the Fully Bayesian, Proposed Fully Bayesian, Empirical Bayes and Proposed Empirical Bayes approaches.

Fig. 3
Fig. 3 Computational time for the utilized methods.Performances are referred to a MacBook Pro (13-inch, M1, 2020) with 8 GB of memory.

Fig. 4
Fig.4Effective Sample size respectively for the Fully Bayesian, Empirical Bayes, Proposed Fully Bayesian and Proposed Empirical Bayes approaches.

Fig. 7
Fig.7Relative error for the approximation of the hyper-parameter θ (first row), the boxplots for the OSPA metric in centimeters (second row, left panel) and the computational time in seconds (second row, right panel) considering the Fully Bayesian (red), the Proposed Fully Bayesian (blue) and the Proposed Empirical Bayes (green) approaches.The computational time is referred to a MacBook Pro (13-inch, M1, 2020) with 8 GB of memory.
marginal posterior p(θ | y), while the estimates for the number and the localization are defined as: • estimator for number of dipoles: d = arg max d∈N (p(d | y)) • estimator for dipole location: we construct d clusters and than obtain ri , for i = 1, . . ., d, as the peak of the marginal posterior p(r | y, d) in the i-th cluster.

Fig. 8
Fig.8Posterior probability maps for source localization obtained with the Fully Bayesian (left panel) and the Proposed Fully Bayesian (right panel) approaches.Results are visualized on a discretized brain as black dots, the blue dots represent the high probability regions while the purple cross are the estimated dipoles.