Introduction

Modern computational facilities and recent advances in computational techniques have made the use of Markov Chain Monte Carlo (MCMC) methods feasible for some large-scale Bayesian inverse problems (BIP), where the goal is to characterize the posterior distribution of a set of parameters \(\theta \) of a computational model describing some physical phenomenon, conditioned on some (usually indirectly) measured data y. However, computational difficulties are prone to arise when dealing with difficult-to-explore posteriors, i.e., posterior distributions that are multi-modal, or that concentrate around a non-linear, lower-dimensional manifold, since some of the more commonly-used Markov transition kernels in MCMC algorithms, such as random walk Metropolis (RWM) or preconditioned Crank-Nicolson (pCN), are not well-suited to such situations. This in turn can make the computational time needed to properly explore these complicated target distributions arbitrarily long. Some recent works address these issues by employing Markov transition kernels that use geometric information (Beskos et al. 2017); however, this requires efficient computation of the gradient of the posterior density, which might not always be feasible, particularly when the underlying computational model is a so-called “black-box”.

In recent years, there has been active development of computational techniques and algorithms to overcome these issues using a tempering strategy (Dia 2019; Earl and Deem 2005; Latz et al. 2018; Miasojedow et al. 2013; Vrugt et al. 2009). Of particular importance for the work presented here is the Parallel Tempering (PT) algorithm (Earl and Deem 2005; Łącki and Miasojedow 2016; Miasojedow et al. 2013) (also known as replica exchange), which finds its origins in the physics and molecular dynamics community. The general idea behind such methods is to simultaneously run K independent MCMC chains, where each chain is invariant with respect to a flattened (referred to as tempered) version of the posterior of interest \(\mu \), while, at the same time, proposing to swap states between any two chains every so often. Such a swap is then accepted using the standard Metropolis-Hastings (MH) acceptance-rejection rule. Intuitively, chains with a larger smoothing parameter (referred to as temperature) will be able to better explore the parameter space. Thus, by proposing to exchange states between chains that target posteriors at different temperatures, it is possible for the chain of interest (i.e., the one targeting \(\mu \)) to mix faster, and to avoid the undesirable behavior of some MCMC samplers of getting “stuck” in a mode. Moreover, the fact that such an exchange of states is accepted with the typical MH acceptance-rejection rule guarantees that the chain targeting \(\mu \) remains invariant with respect to this probability measure (Earl and Deem 2005).

Tempering ideas have been successfully used to sample from posterior distributions arising in different fields of science, ranging from astrophysics to machine learning (Desjardins et al. 2010; Earl and Deem 2005; Miasojedow et al. 2013; Van Der Sluys et al. 2008). The works (Madras and Randall 2002; Woodard et al. 2009) have studied the convergence of the PT algorithm from a theoretical perspective and provided minimal conditions for its rapid mixing. Moreover, the idea of tempered distributions has not only been applied in combination with parallel chains. For example, the simulated tempering method (Marinari and Parisi 1992) uses a single chain and varies the temperature within this chain. In addition, tempering forms the basis of efficient particle filtering methods for stationary model parameters in Sequential Monte Carlo settings (Beskos et al. 2016, 2015; Kahle et al. 2019; Kantas et al. 2014; Latz et al. 2018) and Ensemble Kalman Inversion (Schillings and Stuart 2017).

A generalization over the PT approach, originating from the molecular dynamics community, is the so-called Infinite Swapping (IS) algorithm (Dupuis et al. 2012; Plattner et al. 2011). As opposed to PT, this IS paradigm is a continuous-time Markov process and considers the limit where states between chains are swapped infinitely often. It is shown in Dupuis et al. (2012) that such an approach can in turn be understood as a swap of dynamics, i.e., kernel and temperature (as opposed to states) between chains. We remark that once such a change in dynamics is considered, it is not possible to distinguish particles belonging to different chains. However, since the stationary distribution of each chain is known, importance sampling can be employed to compute posterior expectations with respect to the target measure of interest. Infinite Swapping has been successfully applied in the context of computational molecular dynamics and rare event simulation (Doll et al. 2012; Lu and Vanden-Eijnden 2013; Plattner et al. 2011); however, to the best of our knowledge, such methods have not been implemented in the context of Bayesian inverse problems.

In light of this, the current work aims at importing such ideas to the BIP setting, by presenting them in a discrete-time Metropolis-Hastings Markov chain Monte Carlo context. We will refer to these algorithms as Generalized Parallel Tempering (GPT). We emphasize, however, that these methods are not a time discretization of the continuous-time Infinite Swapping presented in Dupuis et al. (2012), but, in fact, a discrete-time Markov process inspired by the ideas presented therein with suitably defined state-dependent probabilities of swapping states or dynamics. We now summarize the main contributions of this work.

We begin by presenting a generalized framework for discrete-time PT in the context of MCMC for BIPs, and then proceed to propose, analyze, and implement two novel state-dependent PT algorithms inspired by the ideas presented in Dupuis et al. (2012).

Furthermore, we prove that our GPT methods have the correct invariant measure, by showing reversibility of the generated Markov chains, and prove their ergodicity. Finally, we implement the proposed GPT algorithms for an array of Bayesian inverse problems, comparing their efficiency to that of an un-tempered (single-temperature) version of the underlying MCMC algorithm, and to standard PT. As the base method to sample at the cold temperature level, we use Random Walk Metropolis (RWM) (Sects. 5.3–5.6) or preconditioned Crank-Nicolson (Sect. 5.7); however, we emphasize that our methods can be used together with any other, more efficient base sampler. Experimental results show improvements in terms of computational efficiency of GPT over un-tempered RWM and standard PT, thus making the proposed methods attractive from a computational perspective. From an implementation perspective, the swapping component of our proposed methods is rejection-free, which effectively eliminates some tuning parameters of the PT algorithm, such as the swapping frequency.

We note that a PT algorithm with state-dependent swapping probabilities has been proposed in Łącki and Miasojedow (2016); however, that work only considers pairwise swapping of chains and a different construction of the swapping probabilities, resulting in a less efficient sampler, at least for the BIPs addressed in this work.

Our ergodicity result relies on an \(L_2\)-spectral gap analysis. It is known (Rudolf 2012) that when a Markov chain is both reversible and has a positive \(L_2\)-spectral gap, one can provide non-asymptotic bounds on the mean square error of an ergodic estimator of the chain. Our bounds on the \(L_2\)-spectral gap, however, are far from sharp and could possibly be improved using, e.g., domain decomposition ideas as in Woodard et al. (2009). Such an analysis is left for future work.

The rest of this paper is organized as follows. Section 2 is devoted to the introduction of the notation, Bayesian inverse problems, and Markov chain Monte Carlo methods. In Sect. 3, we provide a brief review of (standard) PT (Sect. 3.2), and introduce the two versions of the GPT algorithm in Sects. 3.3 and 3.4, respectively. In fact, we present a general framework that accommodates both the standard PT algorithm and our generalized versions. In Sect. 4, we recall some of the standard theory of Markov chains in Sect. 4.1 and state the main theoretical results of the current work (Theorems 1 and 2) in Sect. 4.2. The proofs of these theorems are given by a series of propositions in Sect. 4.2. We present some numerical experiments in Sect. 5, and draw some conclusions in Sect. 6.

Problem setting

Notation

Let \((W,\left\| \cdot \right\| )\) be a separable Banach space with associated Borel \(\sigma \)-algebra \({\mathcal {B}}(W)\), and let \(\nu _W\) be a \(\sigma \)-finite “reference” measure on W. For any measure \(\mu \) on \((W, {\mathcal {B}}(W))\) that is absolutely continuous with respect to \(\nu _W\) (in short \(\mu \ll \nu _W\)), we define the Radon-Nikodym derivative \(\pi _\mu {:}{=} \frac{\mathrm {d}\mu }{\mathrm {d}\nu _W}\). We denote by \(\overline{{\mathcal {M}}}(W)\) the set of real-valued signed measures on \((W,{\mathcal {B}(W)})\), and by \({\mathcal {M}}(W)\subset \overline{{\mathcal {M}}}(W)\) the set of probability measures on \((W,{\mathcal {B}(W)})\).

Let \(W_1, W_2\) be two separable Banach spaces with reference measures \(\nu _{W_1},\nu _{W_2}\), and let \(\mu _1 \ll \nu _{W_1}, \mu _{2} \ll \nu _{W_2}\) be two probability measures, with corresponding densities given by \(\pi _{\mu _1},\pi _{\mu _2}\). The product of these two measures is defined by

$$\begin{aligned} {\varvec{\mu }}(A)&=\left( \mu _1 \times \mu _{2}\right) (A)\\&= \iint _A \pi _{\mu _1}(\theta _1) \pi _{\mu _2}(\theta _2)\nu _{W_1}(\mathrm {d}\theta _1)\nu _{W_2}(\mathrm {d}\theta _2), \end{aligned}$$

for all \(A\in {\mathcal {B}}(W_1 \times W_2).\) Joint measures on \((W_1\times W_2, {\mathcal {B}}(W_1\times W_2))\) will always be written in boldface.

A Markov kernel on a Banach space W is a function \(p:W\times {\mathcal {B}}(W)\rightarrow [0,1]\) such that

  1.

    For each A in \({\mathcal {B}}(W)\), the mapping \( W\ni \theta \mapsto p(\theta ,A)\), is a \({\mathcal {B}}(W)\)-measurable real-valued function.

  2.

    For each \(\theta \) in W, the mapping \({\mathcal {B}}(W)\ni A\mapsto p(\theta ,A)\), is a probability measure on \((W,{\mathcal {B}}(W))\).

We denote by P the Markov transition operator associated to p, which acts on measures as \(\nu \mapsto \nu P\in {\mathcal {M}}(W),\) and on functions as \(f\mapsto Pf,\) with Pf measurable on \((W,{\mathcal {B}}(W)),\) such that

$$\begin{aligned} (\nu P) (A)&=\int _W p(\theta ,A)\nu (\mathrm {d}\theta ), \quad \forall A\in {\mathcal {B}(W)},\\ (Pf)(\theta )&=\int _W f(z)p(\theta ,\mathrm {d}z), \quad \forall \theta \in W . \end{aligned}$$

Let \(P_k,\ k=1,2,\) be Markov transition operators associated to kernels \(p_k:W_k\times {\mathcal {B}}(W_k)\mapsto [0,1].\) We define the tensor product Markov operator \({\mathbf {P}}{:}{=}P_1\otimes P_2\) as the Markov operator associated with the product measure \({\mathbf {p}}({\varvec{\theta }},\cdot )=p_1(\theta _1,\cdot )\times p_2(\theta _2,\cdot ), \ {\varvec{\theta }}=(\theta _1,\theta _2)\in W_1\times W_2\). In particular, \({\varvec{\nu }}{\mathbf {P}}\) is the measure on \((W_1 \times W_2, {\mathcal {B}}(W_1 \times W_2))\) that satisfies

$$\begin{aligned} ({\varvec{\nu }}{\mathbf {P}}) (A_1 \times A_2)&=\iint _{W_1 \times W_2} p_1(\theta _1,A_1)p_2(\theta _2,A_2){\varvec{\nu }}(\mathrm {d}\theta _1, \mathrm {d}\theta _2), \end{aligned}$$

for all \(A_1 \in {\mathcal {B}}(W_1)\) and \(A_2 \in {\mathcal {B}}(W_2)\). Moreover, \(({\mathbf {P}}f): W_1 \times W_2 \rightarrow {\mathbb {R}}\) is the function given by

$$\begin{aligned} ({\mathbf {P}}f)({\varvec{\theta }})&=\iint _{W_1 \times W_2}f(z_1,z_2)p_1(\theta _1,\mathrm {d}z_1)p_2(\theta _2,\mathrm {d}z_2) , \end{aligned}$$

for an appropriate \(f: W_1 \times W_2 \rightarrow {\mathbb {R}}\).

We say that a Markov operator P (resp. \({\mathbf {P}}\)) is invariant with respect to a measure \(\nu \) (resp. \({\varvec{\nu }}\)) if \(\nu P=\nu \) (resp. \({\varvec{\nu }}{\mathbf {P}}={\varvec{\nu }}\) ). A related concept to invariance is that of reversibility; a Markov kernel \(p:W\times {\mathcal {B}(W)}\mapsto [0,1]\) is said to be reversible (or \(\nu \)-reversible) with respect to a measure \(\nu \in {\mathcal {M}}(W)\) if

$$\begin{aligned} \int _B p(\theta ,A)\nu (\mathrm {d}\theta )=\int _A p(\theta ,B)\nu (\mathrm {d}\theta ), \quad \forall A,B\in {\mathcal {B}(W)}. \end{aligned}$$
(1)

For two given \(\nu \)-invariant Markov operators \(P_1,P_2\), we say that \(P_1P_2\) is a composition of Markov operators, not to be confused with the tensor product \(P_1\otimes P_2\). Furthermore, given a composition of K \(\nu \)-invariant Markov operators \(P_c{:}{=}P_1P_2\dots P_K\), we say that \(P_c\) is palindromic if \(P_k=P_{K-k+1}\) for \(k=1,2,\dots ,K\), i.e., \(P_1=P_K,\) \(P_2=P_{K-1},\) and so on. It is known (see, e.g., Brooks et al. 2011, Section 1.12.17) that a palindromic, \(\nu \)-invariant Markov operator \(P_c\) has an associated Markov transition kernel \(p_c\) which is \(\nu \)-reversible.
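On a finite state space, this fact is easy to verify numerically. The following minimal sketch (the two transition matrices and the uniform invariant measure are illustrative choices, not from the paper) checks that the palindromic composition \(P_1P_2P_1\) of two uniform-reversible kernels satisfies detailed balance (1):

```python
def matmul(A, B):
    # Multiply two 2x2 row-stochastic matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Two symmetric (hence uniform-reversible) Markov transition matrices.
P1 = [[0.7, 0.3], [0.3, 0.7]]
P2 = [[0.4, 0.6], [0.6, 0.4]]
nu = [0.5, 0.5]  # uniform invariant measure

# Palindromic composition P_c = P1 P2 P1.
Pc = matmul(matmul(P1, P2), P1)

# Detailed balance: nu_i p_c(i, j) == nu_j p_c(j, i) for all i, j.
detailed_balance = all(
    abs(nu[i] * Pc[i][j] - nu[j] * Pc[j][i]) < 1e-12
    for i in range(2) for j in range(2)
)
```

Note that the non-palindromic composition \(P_1P_2\) alone would generally not be reversible, even though it leaves \(\nu \) invariant.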

Bayesian inverse problems

Let \((\varTheta ,\left\| \cdot \right\| _\varTheta )\) and \((Y,\left\| \cdot \right\| _Y)\) be separable Banach spaces with associated Borel \(\sigma \)-algebras \({\mathcal {B}}(\varTheta ),\ {\mathcal {B}}(Y)\). In Bayesian Inverse Problems we aim at characterizing the probability distribution of a set of parameters \(\theta \in \varTheta \) conditioned on some measured data \(y\in Y\), where the relation between \( \theta \) and y is given by:

$$\begin{aligned} {y}={{\mathcal {F}}}(\theta )+{\varepsilon }, \quad \varepsilon \sim \mu _\text {noise}. \end{aligned}$$
(2)

Here \(\varepsilon \) is some random noise with known distribution \(\mu _\mathrm {noise}\) (assumed to have a density \(\pi _\text {noise}\) with respect to some reference measure \(\nu _Y\) on Y) and \({\mathcal {F}}:\varTheta \mapsto Y\) is the so-called forward mapping operator. Such an operator can model, e.g., the numerical solution of a possibly non-linear Partial Differential Equation (PDE) which takes \( \theta \) as a set of parameters. We will assume that the data y is finite dimensional, i.e., \(Y=\mathbb {R}^{d_y}\) for some \(d_y\ge 1\), and that \(\theta \sim \mu _\mathrm {pr}\). Furthermore, we define the potential function \(\varPhi (\theta ;y):\varTheta \times Y\mapsto \mathbb {R}\) as

$$\begin{aligned} \varPhi (\theta ;y)=-\log \left[ \pi _\text {noise}(y-{\mathcal {F}}(\theta ))\right] , \end{aligned}$$
(3)

where the function \(\varPhi (\theta ;y)\) is a measure of the misfit between the recorded data y and the predicted value \({\mathcal {F}}(\theta )\), and often depends only on \(\left\| y-{\mathcal {F}}(\theta )\right\| _Y\). The solution to the Bayesian inverse problem is given by (see, for example, Latz et al. (2019) Theorem 2.5)

$$\begin{aligned} {\pi (\theta )}{:}{=}\pi (\theta |y)=\frac{1}{Z}{e^{-\varPhi (\theta ;y)}}{\pi _\mathrm {pr}(\theta )}, \end{aligned}$$
(4)

where \(\mu \) (with corresponding \(\nu _\varTheta \)-density \(\pi \)) is referred to as the posterior measure and

$$\begin{aligned} Z{:}{=} \int _\varTheta \exp (-\varPhi (\theta ;y))\mu _\text {pr}(\mathrm {d}\theta ). \end{aligned}$$
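To make (3) and (4) concrete, here is a minimal sketch for a scalar toy problem; the forward map, noise level, data value, and standard normal prior below are all illustrative assumptions, not from the paper. With Gaussian noise \(\varepsilon \sim N(0,\sigma _\varepsilon ^2)\), the potential reduces to the familiar squared misfit (up to an additive constant):

```python
import math

SIGMA_EPS = 0.1   # noise standard deviation (assumed)
Y_DATA = 0.25     # observed datum (assumed)

def forward(theta):
    # Toy forward map F(theta); stands in for, e.g., a PDE solve.
    return theta ** 2

def potential(theta, y=Y_DATA):
    # Phi(theta; y) = -log pi_noise(y - F(theta)) for Gaussian noise,
    # dropping the theta-independent normalizing constant.
    return (y - forward(theta)) ** 2 / (2.0 * SIGMA_EPS ** 2)

def unnorm_posterior(theta, prior_density):
    # Unnormalized posterior density of Eq. (4), i.e., without 1/Z.
    return math.exp(-potential(theta)) * prior_density(theta)

# Assumed standard normal prior density.
std_normal = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
```

Parameter values matching the data (here \(\theta =0.5\), since \(F(0.5)=0.25=y\)) receive much higher unnormalized posterior density than values with a large misfit.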

The Bayesian approach to the inverse problem consists of updating our knowledge concerning the parameter \(\theta \), i.e., the prior, given the information observed in Eq. (2). One way of doing so is to generate samples from the posterior measure \(\mu \). A common method for performing such a task is the Metropolis-Hastings algorithm, as detailed in the next section. Once samples \(\{\theta ^{(n)}\}_{n=0}^N\) have been obtained, the posterior expectation \(\mathbb {E}_{\mu }[\mathcal {Q}]\) of some \(\mu \)-integrable quantity of interest \(\mathcal {Q}:\varTheta \mapsto \mathbb {R}\) can be approximated by the following ergodic estimator

$$\begin{aligned} \mathbb {E}_{\mu }[{\mathcal {Q}}]\approx \widehat{{\mathcal {Q}}}{:}{=}\frac{1}{N-b} \sum _{n=b+1}^N {\mathcal {Q}}(\theta ^{(n)}), \end{aligned}$$

where \(b<N\) is the so-called burn-in period, used to reduce the bias typically associated to MCMC algorithms.
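The estimator can be sketched as follows (the chain below is a hand-picked placeholder list, not the output of an actual MCMC run):

```python
def ergodic_estimator(samples, quantity, burn_in):
    # Discard the first `burn_in` samples, then average Q over the rest.
    kept = samples[burn_in:]
    return sum(quantity(s) for s in kept) / len(kept)

# Placeholder "chain": in practice these samples come from an MCMC run.
chain = [10.0, 5.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
estimate = ergodic_estimator(chain, lambda t: t, burn_in=2)
```

Dropping the first two (atypically large) states illustrates how the burn-in reduces the bias introduced by the chain's initialization.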

Metropolis-Hastings and tempering

We briefly recall the Metropolis-Hastings algorithm (Hastings 1970; Metropolis et al. 1953). Let \(q_\mathrm {prop}:\varTheta \times {\mathcal {B}}(\varTheta )\mapsto [0,1]\) be an auxiliary kernel. For \(n=1,2,\dots ,\) a candidate state \(\theta ^*\) is sampled from \(q_\mathrm {prop}(\theta ^n,\cdot )\) and proposed as the new state of the chain at step \({n+1}\). Such a state is then accepted (i.e., we set \(\theta ^{n+1}=\theta ^*\)) with probability \(\alpha _\text {MH}\),

$$\begin{aligned} \alpha _\text {MH}(\theta ^n,\theta ^*)=\min \left\{ 1, \frac{\pi _\mu (\theta ^*)q_\mathrm {prop}(\theta ^*,\theta ^n)}{\pi _\mu (\theta ^n)q_\mathrm {prop}(\theta ^n,\theta ^*)} \right\} , \end{aligned}$$

otherwise the current state is retained, i.e., \(\theta ^{n+1}=\theta ^n\). Notice that, with a slight abuse of notation, we are denoting kernel and density by the same symbol \(q_\mathrm {prop}\). The Metropolis-Hastings algorithm induces the Markov transition kernel \(p:\varTheta \times {\mathcal {B}}(\varTheta )\mapsto [0,1]\)

$$\begin{aligned} p(\theta ,A)&= \int _A\alpha _\mathrm {MH}(\theta ,\theta ^*)q_\mathrm {prop}(\theta ,\mathrm {d\theta ^*}) \\&\qquad + \delta _{\theta }(A)\int _\varTheta (1-\alpha _\mathrm {MH}(\theta ,\theta ^*))q_\mathrm {prop}(\theta ,\mathrm {d\theta ^*}), \end{aligned}$$

for every \(\theta \in \varTheta \) and \(A\in {\mathcal {B}}(\varTheta )\). In most practical algorithms, the proposal state \(\theta ^*\) is sampled from a state-dependent auxiliary kernel \(q_\mathrm {prop}(\theta ^n,\cdot )\). Such is the case for random walk Metropolis and preconditioned Crank-Nicolson, where \(q_\mathrm {prop}(\theta ^n,\cdot )={\mathcal {N}}(\theta ^n,\Sigma )\) or \(q_\mathrm {prop}(\theta ^n,\cdot )={\mathcal {N}}(\sqrt{1-\rho ^2}\theta ^n,\rho ^2\Sigma ),\) \(0<\rho <1\), respectively. However, these types of localized proposals tend to exhibit undesirable behavior when sampling from certain difficult measures, e.g., measures that are concentrated around a manifold or are multi-modal (Earl and Deem 2005). In the first case, in order to avoid a large rejection rate, the “step-size” \(\left\| \Sigma ^{1/2}\right\| \) of the proposal kernel must be quite small, which in turn produces highly-correlated samples. In the second case, chains generated by these localized kernels tend to get stuck in one of the modes. In either case, very long chains are required to properly explore the parameter space.
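As an illustration, here is a minimal random walk Metropolis sketch for a generic one-dimensional target (the target density, starting point, step size, and seed are illustrative assumptions); since the Gaussian proposal is symmetric, the \(q_\mathrm {prop}\) terms in \(\alpha _\text {MH}\) cancel:

```python
import math
import random

def rwm_chain(log_density, theta0, step, n_steps, seed=0):
    # Random walk Metropolis with Gaussian proposal N(theta_n, step^2).
    rng = random.Random(seed)
    theta, samples = theta0, [theta0]
    for _ in range(n_steps):
        prop = theta + step * rng.gauss(0.0, 1.0)
        # Symmetric proposal: alpha_MH = min{1, pi(prop) / pi(theta)}.
        log_alpha = log_density(prop) - log_density(theta)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            theta = prop          # accept the candidate
        samples.append(theta)     # on rejection the current state is retained
    return samples

# Example target: a standard normal density, known up to its constant.
samples = rwm_chain(lambda t: -0.5 * t * t, theta0=3.0, step=1.0, n_steps=5000)
```

Working with log-densities, as above, avoids numerical underflow when the potential is large, which is common in BIPs with small noise levels.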

One way of overcoming such difficulties is to introduce tempering. Let \(\mu _k, \mu _\text {pr}\) be probability measures on \((\varTheta ,{\mathcal {B}}(\varTheta )),\) \(k=1,\dots ,K,\) such that all \(\mu _k\) are absolutely continuous with respect to \(\mu _\text {pr}\), and let \(\{T_k\}_{k=1}^K\) be a set of K temperatures such that \(1=T_1<T_2<\dots <T_K\le \infty \). In a Bayesian setting, \(\mu _\text {pr}\) corresponds to the prior measure and \(\mu _k, k=1,\dots ,K\) correspond to posterior measures associated to different temperatures. Denoting by \(\pi _k\) the \(\mu _\mathrm {pr}\)-density of \(\mu _k\), we set

$$\begin{aligned} \pi _k(\theta )&{:}{=}\frac{e^{-\varPhi (\theta ;y)/T_k}}{Z_k},\quad \theta \in \varTheta , \end{aligned}$$
(5)

where \(Z_k{:}{=}\int _\varTheta e^{-\varPhi (\theta ;y)/T_k} \mu _\text {pr}(\mathrm {d}\theta )\), with \(\varPhi (\theta ;y)\) the potential function defined in (3). In the case where \(T_K=\infty \), we set \(\mu _K=\mu _\text {pr}\). Notice that \(\mu _1\) corresponds to the target posterior measure.
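The flattening effect of (5) can be illustrated with a short sketch (the two potential values and the temperature ladder are placeholders): the ratio of tempered density values between a high-potential and a low-potential state approaches 1 as \(T_k\) grows, which is why hotter chains are easier to explore.

```python
import math

def tempered_unnorm(phi, T):
    # Unnormalized tempered density value exp(-Phi(theta; y) / T_k).
    return math.exp(-phi / T)

phi_a, phi_b = 1.0, 9.0   # potential at two hypothetical states
ratios = [tempered_unnorm(phi_b, T) / tempered_unnorm(phi_a, T)
          for T in (1.0, 4.0, 64.0)]
# The ratios increase toward 1 as the temperature increases,
# i.e., the tempered measure becomes progressively flatter.
```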

We say that for \(k=2,\dots ,K,\) each measure \(\mu _k\) is a tempered version of \(\mu _1\). In general, the \({1/T_k}\) term in (5) serves as a “smoothing” factor, which in turn makes \(\mu _k\) easier to explore as \(T_k\rightarrow \infty \). In PT MCMC algorithms, we sample from all posterior measures \(\mu _k\) simultaneously. Here, we first apply a \(\mu _k\)-reversible Markov transition kernel \(p_k\) on each chain, and then propose to exchange states between chains at two consecutive temperatures, i.e., chains targeting \(\mu _k,\mu _{k+1}, \ k\in \{1,\dots ,K-1\}\). Such a proposed swap is then accepted or rejected with a standard Metropolis-Hastings acceptance-rejection step. This procedure is presented in Algorithm 1. We remark that such an algorithm can be modified to, for example, propose to swap states every \(N_s\) steps of the chain, or to swap states between two chains \(\mu _i,\mu _j\), with \(i,j\) chosen randomly and uniformly from the index set \(\{1,2,\dots ,K\}\). In the next section we present the generalized PT algorithms, which swap states according to a random permutation of the indices drawn from a state-dependent probability.

[Algorithm 1: standard Parallel Tempering]
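The swap move of Algorithm 1 can be sketched as follows (the function and variable names are illustrative); for the tempered densities (5), the Metropolis-Hastings swap acceptance probability between levels k and k+1 simplifies to an exponential of the product of the potential difference and the inverse-temperature difference:

```python
import math
import random

def swap_accept_prob(phi_k, phi_k1, T_k, T_k1):
    # alpha = min{1, [pi_k(theta_{k+1}) pi_{k+1}(theta_k)]
    #               / [pi_k(theta_k) pi_{k+1}(theta_{k+1})]},
    # which for pi_k proportional to exp(-Phi / T_k) reduces to:
    log_alpha = (phi_k - phi_k1) * (1.0 / T_k - 1.0 / T_k1)
    return math.exp(min(0.0, log_alpha))

def pt_swap_step(thetas, phis, temps, k, rng=random.Random(1)):
    # Propose exchanging the states of the chains at levels k and k+1,
    # accepting with the Metropolis-Hastings probability above.
    alpha = swap_accept_prob(phis[k], phis[k + 1], temps[k], temps[k + 1])
    if rng.random() < alpha:
        thetas[k], thetas[k + 1] = thetas[k + 1], thetas[k]
        phis[k], phis[k + 1] = phis[k + 1], phis[k]
    return thetas, phis
```

Note that a swap moving the lower-potential state to the colder chain is always accepted, since the log-acceptance ratio is then non-negative.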

Generalizing parallel tempering

Infinite Swapping was initially developed in the context of continuous-time MCMC algorithms, which were used for molecular dynamics simulations. In continuous-time PT, the swapping of the states is controlled by a Poisson process on the set \(\{1,\ldots ,K\}\). Infinite Swapping is the limiting algorithm obtained by letting the waiting times of this Poisson process go to zero. Hence, the states of the chain are swapped infinitely often over any finite time interval. We refer to Dupuis et al. (2012) for a thorough introduction and review of Infinite Swapping in continuous time. In Section 5 of the same article, the idea of using Infinite Swapping in discrete-time Markov chains was briefly discussed. Inspired by this discussion, we present two generalizations of the (discrete-time) Parallel Tempering strategy. To that end, we propose to either (i) swap states between the chains at every iteration of the algorithm in such a way that the swap is accepted with probability one, which we refer to as Unweighted Generalized Parallel Tempering (UGPT), or (ii) swap dynamics (i.e., swap kernels and temperatures between chains) at every step of the algorithm. In this case, the resulting Markov chain has an invariant measure different from the desired one, and importance sampling must therefore be used when computing posterior expectations. We refer to this approach as Weighted Generalized Parallel Tempering (WGPT). We begin by introducing a framework common to both PT and the two versions of GPT.

Let \((\varTheta ,\left\| \cdot \right\| _{\varTheta })\) be a separable Banach space with associated Borel \(\sigma \)-algebra \({\mathcal {B}}(\varTheta )\). Let us define the K-fold product space \(\varTheta ^K {:}{=} {\times }^{K}_{k=1} \varTheta \), with associated product \(\sigma \)-algebra \({\mathcal {B}}^K{:}{=} \bigotimes ^K_{k=1} {\mathcal {B}}(\varTheta )\), as well as the product measures on \((\varTheta ^K,{\mathcal {B}}^K)\)

$$\begin{aligned} {\varvec{\mu }}{:}{=}{\times }^{K}_{k=1}\mu _k, \end{aligned}$$
(6)

where \(\mu _k\), \(k=1,\dots , K\), are the tempered measures with temperatures \(1= T_1<T_2<T_3<\dots < T_K\le \infty \) introduced in the previous section. Similarly, we define the product prior measure \({\varvec{\mu }}_\mathrm {pr}{:}{=}\times _{k=1}^K\mu _\mathrm {pr}.\) Notice that \({\varvec{\mu }}\) has a density \({\varvec{\pi }}({\varvec{\theta }})\) with respect to \({\varvec{\mu }}_\mathrm {pr}\) given by

$$\begin{aligned}&{\varvec{\pi }}({\varvec{\theta }})=\prod _{k=1}^{K}\pi _k(\theta _k), \quad {\varvec{\theta }}{:}{=}(\theta _1,\dots ,\theta _K)\in \varTheta ^K, \end{aligned}$$

with \(\pi _k(\theta )\) given as in (5). The idea behind the tempering methods presented herein is to sample from \({\varvec{\mu }}\) (as opposed to solely sampling from \(\mu _1\)) by creating a Markov chain obtained from the successive application of two \({\varvec{\mu }}\)-invariant Markov kernels \({\mathbf {p}}\) and \({\mathbf {q}}\) to some initial distribution \({\varvec{\nu }}\), usually chosen to be the prior \({\varvec{\mu }}_\mathrm {pr}\). Each kernel acts as follows. Given the current state \({\varvec{\theta }}^n=(\theta ^n_1,\dots ,\theta ^n_K),\) the kernel \({\mathbf {p}}\), which we will call the standard MCMC kernel, proposes a new, intermediate state \({\tilde{{\varvec{\theta }}}}^{n+1}=({\tilde{\theta }}^{n+1}_1,\dots ,{\tilde{\theta }}^{n+1}_K),\) possibly following the Metropolis-Hastings algorithm (or any other algorithm that generates a \({\varvec{\mu }}\)-invariant Markov operator). The Markov kernel \({\mathbf {p}}\) is a product kernel, meaning that each component \({\tilde{\theta }}_k^{n+1}\), \(k=1,\dots ,K,\) is generated independently of the others. Then, the swapping kernel \({\mathbf {q}}\) proposes a new state \({\varvec{\theta }}^{n+1}=(\theta _1^{n+1},\dots ,\theta _K^{n+1})\) by introducing an “interaction” between the components of \({\tilde{{\varvec{\theta }}}}^{n+1}.\) This interaction step can be achieved, e.g., in the case of PT, by proposing to swap two components at two consecutive temperatures, i.e., components k and \(k+1\), and accepting this swap with a certain probability given by the usual Metropolis-Hastings acceptance-rejection rule. In general, the swapping kernel is applied every \(N_s\) steps of the chain, for some \(N_s\ge 1\). We devote the following subsection to the construction of the swapping kernel \({\mathbf {q}}\).

The swapping kernel \({\mathbf {q}}\)

Define \({\mathscr {S}}_K\) as the collection of all the bijective maps from \(\{1,2,\dots ,K \}\) to itself, i.e., the set of all K! possible permutations of \(\mathrm {id}{:}{=}\{1,\dots ,K\}\). Let \(\sigma \in {\mathscr {S}}_K\) be a permutation, and define the swapped state \({\varvec{\theta }}_\sigma {:}{=}(\theta _{\sigma (1)},\dots ,\theta _{\sigma (K)}),\) and the inverse permutation \(\sigma ^{-1}\in {\mathscr {S}}_K\) such that \(\sigma \circ \sigma ^{-1}=\sigma ^{-1}\circ \sigma =\mathrm {id}\). In addition, let \(S_K\subseteq {\mathscr {S}}_K\) be any subset of \({\mathscr {S}}_K\) closed with respect to inversion, i.e., \(\sigma \in S_K\implies \sigma ^{-1}\in S_K\). We denote the cardinality of \(S_K\) by \(|S_K|.\)

Example 1

As a simple example, consider standard PT as in Algorithm 1 with \(K=4\). In this case, we attempt to swap two contiguous temperatures \(T_i\) and \(T_{i+1}\), \(i=1,2,3\). Thus, \(S_K\) is the set of permutations \(\{\sigma _{1,2},\sigma _{2,3},\sigma _{3,4}\}\) with:

$$\begin{aligned} \sigma _{1,2}&=(2,1,3,4), \\ \sigma _{2,3}&=(1,3,2,4), \\ \sigma _{3,4}&=(1,2,4,3). \end{aligned}$$

Notice that each permutation is its own inverse; for example:

$$\begin{aligned} \sigma _{1,2}(\sigma _{1,2})=\sigma _{1,2}((2,1,3,4))=(1,2,3,4)=\mathrm {id}. \end{aligned}$$
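These composition rules are easy to check programmatically; in the following minimal sketch (the tuple representation is an implementation choice), a permutation \(\sigma \) is stored as the tuple of its images \((\sigma (1),\dots ,\sigma (4))\):

```python
def compose(sigma, tau):
    # (sigma o tau)(i) = sigma(tau(i)), with 1-based images stored in tuples.
    return tuple(sigma[tau[i] - 1] for i in range(len(tau)))

IDENTITY = (1, 2, 3, 4)
S_K = {"s12": (2, 1, 3, 4), "s23": (1, 3, 2, 4), "s34": (1, 2, 4, 3)}

# Each pairwise swap is its own inverse, so this S_K is closed
# under inversion, as required.
self_inverse = all(compose(s, s) == IDENTITY for s in S_K.values())
```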

To define the swapping kernel \({\mathbf {q}}\), we first need to define the swapping ratio and swapping acceptance probability.

Definition 1

(Swapping ratio) We say that a function \(r:\varTheta ^K\times S_K \mapsto [0,1]\) is a swapping ratio if it satisfies the following two conditions:

  1.

    \(\forall {\varvec{\theta }}\in \varTheta ^K\), \(r({\varvec{\theta }},\cdot )\) is a probability mass function on \(S_K\).

  2.

    \(\forall \sigma \in S_K\), \(r(\cdot ,\sigma )\) is measurable on \((\varTheta ^K,{\mathcal {B}}^K)\).

Definition 2

(Swapping acceptance probability)

Let \({\varvec{\theta }}\in \varTheta ^K\) and \(\sigma ,\ \sigma ^{-1}\in S_K\). We call swapping acceptance probability the function \(\alpha _{\mathrm {{swap}}}:\varTheta ^K\times S_K\mapsto [0,1] \) defined as

$$\begin{aligned} \alpha _{\mathrm {{swap}}}({\varvec{\theta }},\sigma )={\left\{ \begin{array}{ll} \min \left\{ 1,\frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )r({\varvec{\theta }}_\sigma ,\sigma ^{-1})}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma )} \right\} , &{} \text {if }r({\varvec{\theta }},\sigma )>0,\\ 0 &{} \text {if }r({\varvec{\theta }},\sigma )=0. \end{array}\right. } \end{aligned}$$

We can now define the swapping kernel \({\mathbf {q}}\).

Definition 3

(Swapping kernel) Given a swapping ratio \(r:\varTheta ^K\times S_K\mapsto [0,1]\) and its associated swapping acceptance probability \(\alpha _\mathrm {swap}:\varTheta ^K\times S_K\mapsto [0,1]\), we define the swapping Markov kernel \({\mathbf {q}}:\varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1]\) as

$$\begin{aligned}&{\mathbf {q}} ({\varvec{\theta }}, B)=\sum _{\sigma \in S_K}r({\varvec{\theta }},\sigma )\left[ (1-\alpha _{\mathrm {{swap}}}({\varvec{\theta }},\sigma ))\delta _{{\varvec{\theta }}} (B)\right. \nonumber \\&\left. +\,\alpha _{\mathrm {{swap}}}({\varvec{\theta }},\sigma )\delta _{{\varvec{\theta }}_\sigma }(B) \right] , \qquad {\varvec{\theta }}\in \varTheta ^K,\ B\in {\mathcal {B}}^K, \end{aligned}$$
(7)

where \(\delta _{{\varvec{\theta }}}(B)\) denotes the Dirac measure in \({\varvec{\theta }}\), i.e., \(\delta _{{\varvec{\theta }}}(B)=1\) if \({\varvec{\theta }}\in B\) and 0 otherwise.

The swapping mechanism should be understood in the following way: given a current state of the chain \({\varvec{\theta }}\in \varTheta ^K,\) the swapping kernel samples a permutation \(\sigma \) from \(S_K\) with probability \(r({\varvec{\theta }},\sigma )\) and generates \({\varvec{\theta }}_\sigma .\) This permuted state is then accepted as the new state of the chain with probability \(\alpha _{\mathrm {swap}}({\varvec{\theta }},\sigma )\). Notice that the swapping kernel follows a Metropolis-Hastings-like procedure with “proposal” distribution \(r({\varvec{\theta }},\sigma )\) and acceptance probability \(\alpha _{\mathrm {swap}}({\varvec{\theta }},\sigma )\).
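One application of \({\mathbf {q}}\) can be sketched as follows; the two-chain setup, the placeholder product density, and the uniform swapping ratio below are illustrative assumptions (in the generalized algorithms, r is state-dependent):

```python
import math
import random

def permute(theta, sigma):
    # theta_sigma = (theta_{sigma(1)}, ..., theta_{sigma(K)}), sigma 1-based.
    return tuple(theta[s - 1] for s in sigma)

def swap_kernel_step(theta, pi, r, perms, inverse, rng=random.Random(2)):
    # Sample a permutation sigma from S_K with probability r(theta, sigma) ...
    sigma = rng.choices(perms, weights=[r(theta, s) for s in perms])[0]
    theta_s = permute(theta, sigma)
    # ... then accept theta_sigma with probability alpha_swap (Definition 2).
    den = pi(theta) * r(theta, sigma)
    if den > 0:
        alpha = min(1.0, pi(theta_s) * r(theta_s, inverse[sigma]) / den)
    else:
        alpha = 0.0
    return theta_s if rng.random() < alpha else theta

# Toy setup: K = 2 chains, both permutations of {1, 2}, uniform ratio r.
S2 = [(1, 2), (2, 1)]
INV = {s: s for s in S2}                    # both elements are self-inverse
pi = lambda t: math.exp(-t[0] ** 2 - t[1] ** 2 / 4.0)  # placeholder density
r_unif = lambda t, s: 0.5
new_state = swap_kernel_step((2.0, 0.0), pi, r_unif, S2, INV)
```

With a uniform r, the ratio terms cancel in \(\alpha _\mathrm {swap}\) and the move reduces to a plain Metropolis swap; the state-dependent choices of r used later are what make the proposed swaps rejection-free.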

Moreover, as detailed in the next proposition, such a kernel is reversible with respect to \({\varvec{\mu }}\), since it is a Metropolis-Hastings type kernel.

Proposition 1

The Markov kernel \({\mathbf {q}} \) defined in (7) is reversible with respect to the product measure \({\varvec{\mu }}\) defined in (6).

Proof

Let \(A,B\in {\mathcal {B}}^K\). We want to show that

$$\begin{aligned} \int _A {\mathbf {q}}({\varvec{\theta }},B){\varvec{\mu }}(\mathrm {d}{\varvec{\theta }})=\int _B {\mathbf {q}}({\varvec{\theta }},A){\varvec{\mu }}(\mathrm {d}{\varvec{\theta }}). \end{aligned}$$

Thus,

$$\begin{aligned}&\int _A {\mathbf {q}}({\varvec{\theta }},B){\varvec{\mu }}(\mathrm {d}{\varvec{\theta }})= \underbrace{\sum _{\sigma \in S_K}\int _A r({\varvec{\theta }},\sigma )\alpha _\text {swap}({\varvec{\theta }},\sigma )\delta _{{\varvec{\theta }}_\sigma }(B){\varvec{\pi }}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})}_{I} \\ {}&+ \underbrace{\sum _{\sigma \in S_K}\int _A r({\varvec{\theta }},\sigma )\left( 1-\alpha _\text {swap}({\varvec{\theta }},\sigma )\right) \delta _{{\varvec{\theta }}}(B){\varvec{\pi }}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}) }_{II}. \end{aligned}$$

Let \(A_\sigma {:}{=}\{{\varvec{z}} \in \varTheta ^K : {\varvec{z}}_{\sigma ^{-1}}\in A\}\), and, for notational simplicity, write \(\min \{a,b\}=a\wedge b\), \(a,b\in \mathbb {R}\). From I, we have:

$$\begin{aligned}&I=\sum _{\sigma \in S_K}\int _A \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )r({\varvec{\theta }}_\sigma ,\sigma ^{-1})}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma )}\right\} r({\varvec{\theta }},\sigma ){\varvec{\pi }}({\varvec{\theta }})\delta _{{\varvec{\theta }}_\sigma }(B){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _A \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma )}{{\varvec{\pi }}({\varvec{\theta }}_\sigma )r({\varvec{\theta }}_{\sigma },\sigma ^{-1})}\right\} r({\varvec{\theta }}_\sigma ,\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }}_\sigma )\delta _{{\varvec{\theta }}_\sigma }(B){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}). \end{aligned}$$

Then, applying the change of variables \({\varvec{\theta }}\mapsto {\varvec{\theta }}_\sigma \) and noticing that \({\varvec{\mu }}_\mathrm {pr}\) is permutation invariant, we get

$$\begin{aligned}&I=\sum _{\sigma \in S_K}\int _{A_\sigma } \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma ^{-1}})r({\varvec{\theta }}_{\sigma ^{-1}},\sigma )}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma ^{-1})}\right\} \\&\qquad \qquad \times r({\varvec{\theta }},\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }})\delta _{{\varvec{\theta }}}(B){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _{A_\sigma \cap B} \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma ^{-1}})r({\varvec{\theta }}_{\sigma ^{-1}},\sigma )}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma ^{-1})}\right\} \\&\qquad \qquad \times r({\varvec{\theta }},\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }})\delta _{{\varvec{\theta }}}(B){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}) \\&=\sum _{\sigma \in S_K}\int _{B} \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma ^{-1}})r({\varvec{\theta }}_{\sigma ^{-1}},\sigma )}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma ^{-1})}\right\} \\&\qquad \qquad \times r({\varvec{\theta }},\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }})\delta _{{\varvec{\theta }}}(A_\sigma ){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _{B} \left\{ 1\wedge \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma ^{-1}})r({\varvec{\theta }}_{\sigma ^{-1}},\sigma )}{{\varvec{\pi }}({\varvec{\theta }})r({\varvec{\theta }},\sigma ^{-1})}\right\} \\&\qquad \qquad \times r({\varvec{\theta }},\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }})\delta _{{\varvec{\theta }}_{\sigma ^{-1}}}(A){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _{B} r({\varvec{\theta }},\sigma ^{-1}){\varvec{\pi }}({\varvec{\theta }})\alpha _\text {swap}({\varvec{\theta }},\sigma ^{-1})\delta _{{\varvec{\theta }}_{\sigma ^{-1}}}(A){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _B r({\varvec{\theta }},\sigma ){\varvec{\pi }}({\varvec{\theta }})\alpha _\mathrm {swap}({\varvec{\theta }},\sigma )\delta _{{\varvec{\theta }}_\sigma }(A){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}). \end{aligned}$$

For the second term II we simply have

$$\begin{aligned}&II=\sum _{\sigma \in S_K}\int _A r({\varvec{\theta }},\sigma )(1-\alpha _\mathrm {swap}({\varvec{\theta }},\sigma ))\delta _{{\varvec{\theta }}}(B){\varvec{\pi }}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _{A\cap B} r({\varvec{\theta }},\sigma )(1-\alpha _\mathrm {swap}({\varvec{\theta }},\sigma ))\delta _{{\varvec{\theta }}}(B){\varvec{\pi }}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\sum _{\sigma \in S_K}\int _{B} r({\varvec{\theta }},\sigma )(1-\alpha _\mathrm {swap}({\varvec{\theta }},\sigma ))\delta _{{\varvec{\theta }}}(A){\varvec{\pi }}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}). \end{aligned}$$

\(\square \)

This generic form of the swapping kernel provides the foundation for both PT and GPT. We describe these algorithms in the following subsections.

The parallel tempering case

We first show how a PT algorithm that only swaps states between the \(i^\mathrm {th}\) and \(j^\mathrm {th}\) components of the chain can be cast in the general framework presented above. To that end, let \(\sigma _{i, j}\) be the permutation of \((1,2,\dots ,K),\) which only permutes the \(i^\mathrm {th}\) and \(j^\mathrm {th}\) components, while leaving the other components invariant (i.e., such that \(\sigma (i)=j\), \(\sigma (j)=i,\) and \(\sigma (k)=k\), \(k\ne i,k\ne j\)). We can take \(S_K=\{\sigma _{i, j},\ i,j=1,\dots ,K\}\) and define the PT swapping ratio between components i and j by \(r_{i,j}^\mathrm {(PT)}:\varTheta ^K\times S_K\mapsto [0,1]\) as

$$\begin{aligned} r_{i,j}^\mathrm {(PT)}({\varvec{\theta }},\sigma ){:}{=}{\left\{ \begin{array}{ll} 1 &{} \text { if }\sigma =\sigma _{i,j},\\ 0 &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

Notice that this implies that \(r_{i,j}^\mathrm {(PT)}({\varvec{\theta }}_\sigma ,\sigma ^{-1})=r_{i,j}^\mathrm {(PT)}({\varvec{\theta }},\sigma )\) since \(\sigma ^{-1}_{i,j}=\sigma _{i,j}\) and \(r_{i,j}^\mathrm {(PT)}\) does not depend on \({\varvec{\theta }}\), which in turn leads to the swapping acceptance probability \(\alpha _{\mathrm {swap}}^\mathrm {(PT)}:\varTheta ^K\times S_K\mapsto [0,1]\) defined as:

$$\begin{aligned} \alpha _{\mathrm {swap}}^\mathrm {{(PT)}}({\varvec{\theta }},\sigma _{i, j})&{:}{=}\min \left\{ 1, \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma _{i, j}})}{{\varvec{\pi }}({\varvec{\theta }})} \right\} , \\ \alpha _{\mathrm {swap}}^\mathrm {(PT)}({\varvec{\theta }},\sigma )&=0, \ \sigma \ne \sigma _{i,j}. \end{aligned}$$

Thus, we can define the swapping kernel for the Parallel Tempering algorithm that swaps components i and j as follows:

Definition 4

(Pairwise Parallel Tempering swapping kernel) Let \({\varvec{\theta }}\in \varTheta ^K\), \(\sigma _{i, j} \in S_K\). We define the Parallel Tempering swapping kernel, which proposes to swap states between the \(i^\mathrm {th}\) and \(j^\mathrm {th}\) chains as \({\mathbf {q}}_{i, j}^\mathrm {{(PT)}}:\varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1] \) given by

$$\begin{aligned} {\mathbf {q}}_{i, j}^\mathrm {{(PT)}} ({\varvec{\theta }}, B)=&\sum _{\sigma \in S_K}r_{i,j}^\mathrm {{(PT)}}({\varvec{\theta }},\sigma )\left( (1-\alpha _{\mathrm {{swap}}}^\mathrm {{{(PT)}}}({\varvec{\theta }},\sigma ))\delta _{{\varvec{\theta }}} (B)\right. \\&\left. +\,\alpha _{\mathrm {{swap}}}^\mathrm {{{(PT)}}}({\varvec{\theta }},\sigma )\delta _{{\varvec{\theta }}_\sigma }(B) \right) \\ =&\left( 1-\min \left\{ 1,\frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma _{i,j}})}{{\varvec{\pi }}({\varvec{\theta }})} \right\} \right) \delta _{{\varvec{\theta }}}(B) \\&+\,\min \left\{ 1,\frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma _{i,j}})}{{\varvec{\pi }}({\varvec{\theta }})} \right\} \delta _{{\varvec{\theta }}_{\sigma _{i,j}}}(B), \quad \forall B\in {\mathcal {B}}^K. \end{aligned}$$

In practice, however, the PT algorithm performs several sequential swaps between chains, which can be understood as applying the composition of kernels \({\mathbf {q}}^\mathrm {(PT)}_{i, j}{\mathbf {q}}^\mathrm {(PT)}_{k, \ell }\dots \) at every swapping step. In its most common form (see Brooks et al. 2011; Earl and Deem 2005; Miasojedow et al. 2013), the PT algorithm, hereafter referred to as standard PT (which, with a slight abuse of notation, we still denote by PT), proposes to swap states between chains at consecutive temperatures. Its swapping kernel \({\mathbf {q}}^\mathrm {(PT)}: \varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1]\) is given by

$$\begin{aligned} {\mathbf {q}}^\mathrm {(PT)}{:}{=}{\mathbf {q}}^\mathrm {(PT)}_{1, 2}{\mathbf {q}}^\mathrm {(PT)}_{2, 3} ... {\mathbf {q}}^\mathrm {(PT)}_{K-1, K}. \end{aligned}$$

Moreover, the algorithm described in Earl and Deem (2005) proposes to swap states only every \(N_s\ge 1\) MCMC steps. The complete PT kernel is then given by (see Brooks et al. 2011; Earl and Deem 2005; Miasojedow et al. 2013)

$$\begin{aligned} {\mathbf {p}}^\mathrm {(PT)}{:}{=}{\mathbf {q}}^\mathrm {(PT)}_{1, 2}{\mathbf {q}}^\mathrm {(PT)}_{2, 3} ... {\mathbf {q}}^\mathrm {(PT)}_{K-1, K}{\mathbf {p}}^{N_s}, \end{aligned}$$
(8)

where \({\mathbf {p}}\) is a standard reversible Markov transition kernel used to evolve the individual chains independently.
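To make the construction concrete, the following minimal Python sketch implements one application of the kernel (8), using random walk Metropolis as the base kernel \({\mathbf {p}}\) and assuming tempered densities \(\pi _k\propto \pi ^{1/T_k}\). All names (`rwm_step`, `pt_sweep`, the untempered log-posterior `log_post`) are illustrative, not part of any particular library.

```python
import numpy as np

def rwm_step(theta, log_post, temp, step=0.5, rng=np.random):
    """One random walk Metropolis step targeting pi(theta)^(1/temp),
    where `log_post` returns the untempered log-posterior density."""
    prop = theta + step * rng.standard_normal(theta.shape)
    log_alpha = (log_post(prop) - log_post(theta)) / temp
    return prop if np.log(rng.uniform()) < log_alpha else theta

def pt_sweep(thetas, log_post, temps, n_s=5, rng=np.random):
    """One application of the standard PT kernel (8): the composition of
    adjacent-pair swap kernels q_{1,2} q_{2,3} ... q_{K-1,K}, followed by
    N_s independent MCMC steps per chain (the factor p^{N_s})."""
    K = len(thetas)
    for i in range(K - 1):
        j = i + 1
        # alpha_swap = min{1, pi(theta_sigma)/pi(theta)} for sigma = sigma_{i,j}
        log_alpha = (log_post(thetas[j]) - log_post(thetas[i])) \
            * (1.0 / temps[i] - 1.0 / temps[j])
        if np.log(rng.uniform()) < log_alpha:
            thetas[i], thetas[j] = thetas[j], thetas[i]
    for _ in range(n_s):
        thetas = [rwm_step(t, log_post, T, rng=rng) for t, T in zip(thetas, temps)]
    return thetas
```

Since only log-density differences enter, the (typically unknown) normalizing constant of the posterior never needs to be evaluated.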

Remark 1

Although the kernel \({\mathbf {p}}\) and each of the \({\mathbf {q}}^\mathrm {(PT)}_{i,i+1}\) are \({\varvec{\mu }}\)-reversible, notice that the composition (8) does not have a palindromic structure, and as such it is not necessarily \({\varvec{\mu }}\)-reversible. One way of making the PT algorithm reversible with respect to \({\varvec{\mu }}\) is to consider the palindromic form

$$\begin{aligned}&{\mathbf {p}}^\mathrm {(RPT)}{:}{=} \!\left( {\mathbf {q}}^\mathrm {(PT)}_{1, 2}{\mathbf {q}}^\mathrm {(PT)}_{2, 3} ... {\mathbf {q}}^\mathrm {(PT)}_{K-1, K}\right) {\mathbf {p}}^{N_s}\!\left( {\mathbf {q}}^\mathrm {(PT)}_{K, K-1}...{\mathbf {q}}^\mathrm {(PT)}_{3, 2}{\mathbf {q}}^\mathrm {(PT)}_{2, 1}\right) \!, \end{aligned}$$

where RPT stands for Reversible Parallel Tempering. In practice, there is little difference between \({\mathbf {p}}^\mathrm {(RPT)}\) and \({\mathbf {p}}^\mathrm {(PT)}\); however, under the additional assumption of geometric ergodicity of the chain (cf. Sect. 4), having a reversible kernel is useful to compute explicit bounds on the non-asymptotic mean square error of an ergodic estimator (Rudolf 2012).

Unweighted generalized parallel tempering

The idea behind the Unweighted Generalized Parallel Tempering (UGPT) algorithm is to generalize PT so that (i) \(N_s=1\) already provides proper mixing of the chains, (ii) the algorithm is reversible with respect to \({\varvec{\mu }}\), and (iii) the algorithm allows for arbitrary sets \(S_K\) of swaps (always closed w.r.t. inversion), instead of only pairwise swaps. We begin by constructing a kernel of the form (7). Let \(r^\mathrm {{(UW)}}:\varTheta ^K\times S_K\mapsto [0,1]\) be the function defined as

$$\begin{aligned}&r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma ) \,{:}{=}\, \frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}{\sum _{\sigma ' \in S_K}{\varvec{\pi }}({\varvec{\theta }}_{\sigma '})}, \quad {\varvec{\theta }}\in \varTheta ^K,\sigma \in S_K. \end{aligned}$$
(9)

Clearly, (9) is a swapping ratio according to Definition 1. As such, given some state \({\varvec{\theta }}\in \varTheta ^K\), \(r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\) assigns a state-dependent probability to each of the \({|S_K|}\) possible permutations in \(S_K\). A permutation \(\sigma \in S_K\) is then accepted with probability \(\alpha _{\mathrm {swap}}^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\), given by

$$\begin{aligned} \alpha _{\mathrm {swap}}^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\,{:}{=}\,\min \left\{ 1,\frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )r^\mathrm {{(UW)}}({\varvec{\theta }}_\sigma ,\sigma ^{-1})}{{\varvec{\pi }}({\varvec{\theta }})r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )} \right\} . \end{aligned}$$
(10)

Thus, we can define the swapping kernel for the UGPT algorithm, which takes the form of (7), with the particular choice of \(r({\varvec{\theta }},\sigma )=r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\) and

$$\begin{aligned} \alpha _{\mathrm {swap}}({\varvec{\theta }},\sigma )=\alpha _{\mathrm {swap}}^\mathrm {{(UW)}}({\varvec{\theta }},\sigma ). \end{aligned}$$

Notice that \(\alpha ^\mathrm {{(UW)}}_\text {swap}({\varvec{\theta }},\sigma )=1, \forall \sigma \in S_K\). Indeed, if we further examine Eq.  (10), we see that

$$\begin{aligned} \frac{{\varvec{\pi }}({\varvec{\theta }}_{\sigma }) r^\mathrm {{(UW)}}({\varvec{\theta }}_\sigma ,\sigma ^{-1})}{{\varvec{\pi }}({\varvec{\theta }}) r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )}&= \frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}{{\varvec{\pi }}({\varvec{\theta }})}\cdot \frac{{\varvec{\pi }}({\varvec{\theta }})}{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}\cdot \frac{\sum _{\sigma '}{\varvec{\pi }}({\varvec{\theta }}_{\sigma '})}{\sum _{{\hat{\sigma }}}{\varvec{\pi }}({\varvec{\theta }}_{{\hat{\sigma }}})} \\ {}&=\frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}{{\varvec{\pi }}({\varvec{\theta }})}\cdot \frac{{\varvec{\pi }}({\varvec{\theta }})}{{\varvec{\pi }}({\varvec{\theta }}_\sigma )} = 1. \end{aligned}$$

In practice, this means that the proposed permuted state is always accepted with probability 1. The expression of the UGPT kernel then simplifies as follows.
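Since the acceptance probability is identically one, a UGPT swap reduces to sampling a permutation \(\sigma \) with probability \(r^\mathrm {(UW)}({\varvec{\theta }},\sigma )\propto {\varvec{\pi }}({\varvec{\theta }}_\sigma )\) and applying it. A minimal Python sketch, under the illustrative assumptions that the tempered densities satisfy \(\pi _k\propto \pi ^{1/T_k}\) and that \(S_K\) is the full symmetric group (computationally feasible only for small K):

```python
import itertools
import numpy as np

def ugpt_swap(thetas, log_post, temps, rng=np.random):
    """Rejection-free UGPT swap: sample sigma with probability
    r^(UW)(theta, sigma) proportional to pi(theta_sigma), then apply it."""
    K = len(thetas)
    ell = np.array([log_post(t) for t in thetas])   # K untempered log-densities
    perms = list(itertools.permutations(range(K)))
    # log pi(theta_sigma) = sum_k log pi(theta_{sigma(k)}) / T_k,
    # computed from just K posterior evaluations
    log_w = np.array([sum(ell[s[k]] / temps[k] for k in range(K)) for s in perms])
    w = np.exp(log_w - log_w.max())                 # stabilized softmax
    w /= w.sum()
    sigma = perms[rng.choice(len(perms), p=w)]
    return [thetas[sigma[k]] for k in range(K)]
```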

Definition 5

(Unweighted swapping kernel) The unweighted swapping kernel \({\mathbf {q}}^\mathrm {{(UW)}}:\varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1] \) is defined as

$$\begin{aligned} {\mathbf {q}}^\mathrm {{(UW)}} ({\varvec{\theta }}, B)=\sum _{\sigma \in S_K}r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\delta _{{\varvec{\theta }}_\sigma }(B), \ \end{aligned}$$

\( \forall {\varvec{\theta }}\in \varTheta ^K,\ B\in {\mathcal {B}}^K. \) Applying this swapping kernel before and after the kernel \({\mathbf {p}}=p_1\times p_2\times \dots \times p_K \), in the order \({\mathbf {q}}^\mathrm {{(UW)}} {\mathbf {p}} {\mathbf {q}}^\mathrm {{(UW)}}=:{\mathbf {p}}^\mathrm {{(UW)}}\), gives what we call the Unweighted Generalized Parallel Tempering kernel \({\mathbf {p}}^\mathrm {{(UW)}}\). Lastly, we write the UGPT kernel in operator form as

$$\begin{aligned} {\mathbf {P}}^\mathrm {{(UW)}}\,{:}{=}\,{\mathbf {Q}}^\mathrm {{(UW)}} {\mathbf {P}} {\mathbf {Q}}^\mathrm {{(UW)}}, \end{aligned}$$

where \({\mathbf {P}}\) and \({\mathbf {Q}}^\mathrm {{(UW)}}\) are the Markov operators corresponding to the kernels \({\mathbf {p}}\) and \({\mathbf {q}}^\mathrm {{(UW)}}\), respectively. We now investigate the reversibility of the UGPT kernel. We start with a rather straightforward result.

Proposition 2

Suppose that, for any \(k=1,2,\ldots , K,\) \(p_k\) is \(\mu _k\)-reversible. Then, \({\mathbf {p}}=p_1\times \dots \times p_K\) is reversible with respect to \({\varvec{\mu }}\).

Proof

We prove reversibility by confirming that Eq. (1) holds true. To that end, let \({\varvec{\theta }}\in \varTheta ^K, \ A,B\in {\mathcal {B}}^K\), where A and B tensorize, i.e., \(A\,{:}{=} \prod _{k=1}^K A_k\) and \(B\,{:}{=} \prod _{k=1}^K B_k\), with \(A_1,\ldots , A_K, B_1,\ldots ,B_K \in {\mathcal {B}}(\varTheta )\). Then,

$$\begin{aligned} \int _A {\varvec{\pi }}({\varvec{\theta }}) {\mathbf {p}}({\varvec{\theta }},B)\mathrm {d}{\varvec{\theta }}&=\prod _{k=1}^K\int _{A_k} \pi (\theta _k)p(\theta _k,B_k)\mathrm {d}\theta _k\\&=\prod _{k=1}^K\int _{B_k}\pi (\theta _k)p(\theta _k,A_k)\mathrm {d}\theta _k\\&=\int _B {\varvec{\pi }}({\varvec{\theta }}) {\mathbf {p}}({\varvec{\theta }},A)\mathrm {d}{\varvec{\theta }}. \end{aligned}$$

Showing that the previous equality holds for sets A, B that tensorize is indeed sufficient to show that the claim holds for any \(A, B \in {\mathcal {B}}^K\). This follows from Carathéodory’s Extension Theorem applied as in the proof of uniqueness of product measures; see Ash (2000), Sections 1.3.10 and 2.6.3, for details. \(\square \)

We can now prove the reversibility of the chain generated by \({\mathbf {p}} ^\mathrm {(UW)}\).

Proposition 3

(Reversibility of the UGPT chain) Suppose that, for any \(k=1,2,\dots , K,\) \(p_k\) is \(\mu _k\)-reversible. Then, the Markov chain generated by \({\mathbf {p}}^\mathrm {{(UW)}} \) is \({\varvec{\mu }}\)-reversible.

Proof

It follows from Propositions 1 and 2 that the kernels \({\mathbf {q}}^\mathrm {{(UW)}}\) and \({\mathbf {p}}\) are \({\varvec{\mu }}\)-reversible. Furthermore, since \({\mathbf {p}}^\mathrm {{(UW)}}\) is a palindromic composition of kernels, each of which is reversible with respect to \({\varvec{\mu }}\), \({\mathbf {p}}^\mathrm {{(UW)}}\) is itself reversible with respect to \({\varvec{\mu }}\) (Brooks et al. 2011). \(\square \)

The UGPT algorithm proceeds by iteratively applying the kernel \({\mathbf {p}}^\mathrm {{(UW)}}\) to a predefined initial state. In particular, states are updated using the procedure outlined in Algorithm 2.


Remark 2

In practice, one does not need to perform \(|S_K|\) posterior evaluations when computing \(r^\mathrm {(UW)}({\varvec{\theta }}^{n},\cdot )\), but rather “just” K of them. Indeed, since \(\pi _j(\theta ^n_k)\propto \pi (\theta ^n_k)^{1/T_j},\) \(k,j=1,2,\dots ,K\), we just need to store the values of \(\pi (\theta ^{n}_k),k=1,2,\dots ,K\), for a fixed n, and then permute over the temperature indices.
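A small numerical illustration of this trick (with hypothetical values): all \(|S_K|\) products \({\varvec{\pi }}({\varvec{\theta }}_\sigma )\) are obtained in log space by permuting a stored vector of K log-posterior values over the temperature indices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 4
temps = np.array([1.0, 2.0, 4.0, 8.0])
ell = rng.standard_normal(K)     # stored values log pi(theta_k^n), one per chain

# all |S_K| = K! tempered products from just K stored evaluations:
perms = np.array(list(itertools.permutations(range(K))))
log_pi_sigma = (ell[perms] / temps).sum(axis=1)   # log pi(theta_sigma), per sigma
```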

Let now \({\mathcal {Q}}:\varTheta \mapsto {\mathbb {R}}\) be a quantity of interest. The posterior mean of \({\mathcal {Q}}\), \(\mu ({\mathcal {Q}}) {:}{=} \mu _1({\mathcal {Q}})\), is approximated using \(N \in {\mathbb {N}}\) samples and a burn-in of length b by the following ergodic estimator \(\widehat{{\mathcal {Q}}}_{(\mathrm {UW})}\):

$$\begin{aligned} \mu ({\mathcal {Q}})\approx \widehat{{\mathcal {Q}}}_{(\mathrm {UW})} = \frac{1}{N-b}\sum _{n=b}^{N} {\mathcal {Q}}(\theta ^{(n)}_1) . \end{aligned}$$

A comment on the pairwise state-dependent PT method of Łącki and Miasojedow (2016)

The work Łącki and Miasojedow (2016) presents a similar state-dependent swapping approach. We will refer to the method presented therein as Pairwise State-Dependent Parallel Tempering (PSDPT). This method, however, differs from UGPT in that (i) only pairwise swaps are considered and (ii) it is not rejection-free. We summarize it here for the sake of completeness. Let \(S_{K,\mathrm {pairwise}}\) denote the set of pairwise permutations of \((1,2,\dots ,K).\) Given a current state \({\varvec{\theta }}\in \varTheta ^K\), the PSDPT algorithm samples a pairwise permutation \(\sigma _{i, j}\in S_{K,\mathrm {pairwise}}\) with probability \(r_{i,j}^\mathrm {(PSDPT)}({\varvec{\theta }},\sigma _{i,j})\) given by

$$\begin{aligned} r_{i,j}^\mathrm {(PSDPT)}({\varvec{\theta }},\sigma _{i,j})\,{:}{=}\, \frac{\exp (-|\varPhi (\theta _i;y)-\varPhi (\theta _j;y)|)}{\sum _{k,l}\exp (-|\varPhi (\theta _k;y)-\varPhi (\theta _l;y)|)}, \end{aligned}$$

and then accepts this swap with probability

$$\begin{aligned} \alpha _\mathrm {swap}^\mathrm {(PSDPT)}({\varvec{\theta }},\sigma _{i,j})\,{:}{=}\,\min \left\{ 1,\left( \frac{\pi _1(\theta _i)}{\pi _1(\theta _j)}\right) ^{\frac{1}{T_j}-\frac{1}{T_i}} \right\} . \end{aligned}$$

This method is attractive from an implementation point of view in the sense that it promotes pairwise swaps between states with similar energies, which are therefore likely (yet not guaranteed) to be accepted. In contrast, UGPT always accepts the proposed permuted state, which in turn leads to a larger number of global moves, thus providing a more efficient algorithm. This is verified in the numerical experiments.
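For concreteness, a Python sketch of one PSDPT swap, under the same illustrative conventions as before (tempered densities \(\pi _k\propto \pi _1^{1/T_k}\)); as a simplification, the untempered negative log-posterior is used as a stand-in for the potential \(\varPhi \):

```python
import numpy as np

def psdpt_swap(thetas, log_post, temps, rng=np.random):
    """One PSDPT swap: propose the pair (i, j) with probability
    proportional to exp(-|Phi_i - Phi_j|), then accept it with the
    Metropolis-Hastings ratio of the paper."""
    K = len(thetas)
    phi = np.array([-log_post(t) for t in thetas])   # stand-in potentials
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    w = np.array([np.exp(-abs(phi[i] - phi[j])) for i, j in pairs])
    w /= w.sum()
    i, j = pairs[rng.choice(len(pairs), p=w)]
    # alpha = min{1, (pi_1(theta_i)/pi_1(theta_j))^(1/T_j - 1/T_i)}
    log_alpha = (1.0 / temps[j] - 1.0 / temps[i]) \
        * (log_post(thetas[i]) - log_post(thetas[j]))
    if np.log(rng.uniform()) < log_alpha:            # swap may be rejected
        thetas[i], thetas[j] = thetas[j], thetas[i]
    return thetas
```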

Weighted generalized parallel tempering

Following the intuition of the continuous-time Infinite Swapping approach of Dupuis et al. (2012) and Plattner et al. (2011), we propose a second discrete-time algorithm, which we will refer to as Weighted Generalized Parallel Tempering (WGPT). The idea behind this method is to swap the dynamics of the process, that is, the Markov kernels and temperatures, instead of swapping the states, so that every swap is, in effect, accepted with probability 1. We will see that the Markov kernel obtained when swapping the dynamics is not invariant with respect to the product measure of interest \({\varvec{\mu }}\); therefore, an importance sampling step is needed when computing posterior expectations.

For a given permutation \(\sigma \in S_K\), we define the swapped Markov kernel \({\mathbf {p}}_\sigma : \varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1]\) and the swapped product posterior measure \({\varvec{\mu }}_\sigma \) (on the measurable space \((\varTheta ^K,{\mathcal {B}}^K)\)) as:

$$\begin{aligned}&{\mathbf {p}}_\sigma ({\varvec{\theta }},\cdot ) = {p}_{\sigma (1)}(\theta _1,\cdot ) \times \dots \times {p}_{\sigma (K)}(\theta _K,\cdot ),\\&{\varvec{{\varvec{\mu }}}}_\sigma \,{:}{=}\,\mu _{\sigma (1)}\times \dots \times \mu _{\sigma (K)}, \end{aligned}$$

where the swapped posterior measure has a density with respect to \({\varvec{\mu }}_\text {prior}\) given by

$$\begin{aligned}&{\varvec{{\varvec{\pi }}}}_\sigma ({\varvec{\theta }})\,{:}{=}\,\pi _{\sigma (1)}(\theta _1)\times \dots \times \pi _{\sigma (K)}(\theta _K), \quad {\varvec{\theta }}\in \varTheta ^K,\ \sigma \in S_K. \end{aligned}$$
(11)

Moreover, we define the swapping weights

$$\begin{aligned} w_\sigma ({\varvec{\theta }})\,{:}{=}\,\frac{{\varvec{\pi }}_\sigma ({\varvec{\theta }})}{\sum _{\sigma '\in S_K}{\varvec{\pi }}_{\sigma '}({\varvec{\theta }})}, \quad {\varvec{\theta }}\in \varTheta ^K, \ \sigma \in S_K. \end{aligned}$$
(12)

Note that, in general, \({\varvec{\pi }}_\sigma ({\varvec{\theta }})\ne {\varvec{\pi }}({\varvec{\theta }}_\sigma )\), and as such, \(w_\sigma ({\varvec{\theta }})\ne r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\), with \(w_\sigma \) defined as in (12).

Definition 6

We define the Weighted Generalized Parallel Tempering kernel \({\mathbf {p}}^\mathrm {(W)} : \varTheta ^K\times {\mathcal {B}}^K\mapsto [0,1]\) as the following state-dependent convex combination of kernels:

$$\begin{aligned} {\mathbf {p}}^{\mathrm {(W)}}({\varvec{\theta }}, \cdot ) {:}{=} \sum _{\sigma \in S_K} w_\sigma ({\varvec{\theta }}) {\mathbf {p}}_\sigma ({\varvec{\theta }},\cdot ), \quad {\varvec{\theta }}\in \varTheta ^K. \end{aligned}$$

Thus, the WGPT chain is obtained by iteratively applying \({\mathbf {p}}^{\mathrm {(W)}}\). We show in Proposition 4 that the resulting Markov chain has invariant measure

$$\begin{aligned} {\varvec{{\varvec{\mu }}}}_\text {W}&=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K} {\varvec{{\varvec{\mu }}}}_\sigma ={\tilde{\mu }} \times \dots \times {\tilde{\mu }}, \end{aligned}$$

with \({\tilde{\mu }}{:}{=}\frac{1}{K}\sum _{k=1}^K\mu _k\) when \(S_K\) is the full symmetric group, i.e., \({\varvec{\mu }}_\text {W}\) is the K-fold tensorization of the average of the tempered posteriors. Furthermore, \({\varvec{\mu }}_{\mathrm {W}}\) has a density (w.r.t. the prior \({\varvec{\mu }}_\mathrm {pr}\)) given by

$$\begin{aligned} {\varvec{{\varvec{\pi }}}}_\text {W}({\varvec{\theta }})&=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K} {\varvec{\pi }}_\sigma ({\varvec{\theta }}), \quad {\varvec{\theta }}\in \varTheta ^K, \end{aligned}$$

and a similar average-then-tensorize representation holds for \({\varvec{\pi }}_{\mathrm {W}}\). We now proceed to show that \({\mathbf {p}}^{(\mathrm {W})}\) is \({\varvec{\mu }}_{\mathrm {W}}\)-reversible (hence \({\varvec{\mu }}_{\mathrm {W}}\)-invariant).

Proposition 4

(Reversibility of the WGPT chain) Suppose that, for any \(k=1,2,\dots , K,\) \(p_k\) is \(\mu _k\)-reversible. Then, the Markov chain generated by \({\mathbf {p}}^\mathrm {(W)} \) is \({\varvec{\mu }}_\text {W}\)-reversible.

Proof

We prove reversibility by verifying that (1) holds. For \({\varvec{\theta }}\in \varTheta ^K,\ A,B\in {\mathcal {B}}^K\), with \(A{:}{=}A_1\times \dots \times A_K\), \(A_k\in {\mathcal {B}}(\varTheta )\), and with B defined analogously, we have that:

$$\begin{aligned}&\int _A {\mathbf {p}}^{\mathrm {(W)}}({\varvec{\theta }},B){\varvec{\pi }}_\text {W}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\int _A\left[ \sum _{\sigma \in S_K} w_\sigma ({\varvec{\theta }}){\mathbf {p}}_\sigma ({\varvec{\theta }}, B)\right] \frac{\sum _{\rho \in S_K}{\varvec{\pi }}_\rho ({\varvec{\theta }})}{{|S_K|}}{\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\int _A\left[ \sum _{\sigma \in S_K} \frac{{\varvec{\pi }}_\sigma ({\varvec{\theta }})}{\sum _{\sigma ' \in S_K}{\varvec{\pi }}_{\sigma '}({\varvec{\theta }})}{\mathbf {p}}_\sigma ({\varvec{\theta }}, B)\right] \\ {}&\qquad \times \frac{\sum _{\rho \in S_K}{\varvec{\pi }}_\rho ({\varvec{\theta }})}{{|S_K|}}{\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K}\int _A {\varvec{\pi }}_\sigma ({\varvec{\theta }}){\mathbf {p}}_\sigma ({\varvec{\theta }}, B){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})=I. \end{aligned}$$

Using Proposition 2, and multiplying and dividing by

$$\begin{aligned} \sum _{\rho \in S_K}{\varvec{\pi }}_{\rho }({\varvec{\theta }}) \end{aligned}$$

we obtain

$$\begin{aligned}&I=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K}\int _B {\varvec{\pi }}_\sigma ({\varvec{\theta }}){\mathbf {p}}_\sigma ({\varvec{\theta }}, A){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}) \quad \text {(by Prop. 2)}\\&=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K}\int _B \frac{{\varvec{\pi }}_\sigma ({\varvec{\theta }}){\mathbf {p}}_\sigma ({\varvec{\theta }},A)}{\sum _{\sigma ' \in S_K}{\varvec{\pi }}_{\sigma '}({\varvec{\theta }})}\sum _{\rho \in S_K}{\varvec{\pi }}_{\rho }({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}) \\&=\sum _{\sigma \in S_K}\int _B w_\sigma ({\varvec{\theta }}){\mathbf {p}}_\sigma ({\varvec{\theta }}, A){\varvec{\pi }}_\mathrm {W}({\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }})\\&=\int _B {\mathbf {p}}^{\mathrm {(W)}}({\varvec{\theta }},A){\varvec{\pi }}_\text {W}( {\varvec{\theta }}){\varvec{\mu }}_\mathrm {pr}(\mathrm {d}{\varvec{\theta }}). \end{aligned}$$

Once again, in light of Carathéodory’s Extension Theorem, it is sufficient to show that reversibility holds for sets that tensorize. \(\square \)

We remark that the measure \({\varvec{\mu }}_{\mathrm {W}}\) is not of interest per se. However, we can use importance sampling to compute posterior expectations. Let \({\mathcal {Q}}({\varvec{\theta }}){:}{=}{\mathcal {Q}}(\theta _1)\) be a \({\varvec{\mu }}\)-integrable quantity of interest. We can write

$$\begin{aligned} \mathbb {E}_{\mu _1}[{\mathcal {Q}}]&=\mathbb {E}_{{\varvec{\mu }}}[{\mathcal {Q}}(\theta _1)]=\mathbb {E}_{{\varvec{\mu }}_\text {W}}\left[ {\mathcal {Q}}(\theta _1)\frac{{\varvec{\pi }}({\varvec{\theta }})}{{\varvec{\pi }}_\text {W}({\varvec{\theta }})}\right] \\ {}&=\frac{1}{{|S_K|}}\sum _{\sigma \in S_K}\mathbb {E}_{{\varvec{\mu }}_\text {W}}\left[ {\mathcal {Q}}(\theta _{\sigma (1)})\frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}{{\varvec{\pi }}_\text {W}({\varvec{\theta }}_\sigma )}\right] . \end{aligned}$$

The last equality holds since \({\varvec{\mu }}_{\mathrm {W}}\) is invariant under permutation of coordinates. Thus, we can define the following (weighted) ergodic estimator \(\widehat{{\mathcal {Q}}}_{(\mathrm {W})} \) of the posterior mean of a quantity of interest \({\mathcal {Q}} \) by

$$\begin{aligned} \mu ({\mathcal {Q}})&\approx \nonumber \\&\widehat{{\mathcal {Q}}}_{(\mathrm {W})} = \frac{1}{{|S_K|}}\frac{1}{N} \sum _{\sigma \in S_K} \sum _{n=1}^N \frac{{\varvec{\pi }}({\varvec{\theta }}^{(n)}_\sigma )}{{\varvec{\pi }}_\text {W}({\varvec{\theta }}^{(n)}_\sigma )}{\mathcal {Q}}(\theta ^{(n)}_{\sigma (1)}) \nonumber \\&= \frac{1}{{|S_K|}}\frac{1}{N} \sum _{\sigma \in S_K} \sum _{n=1}^N {\widehat{w}}({\varvec{\theta }}^{(n)},\sigma ){\mathcal {Q}}(\theta ^{(n)}_{\sigma (1)}), \end{aligned}$$
(13)

where we have denoted the importance sampling weights by \({\widehat{w}}({\varvec{\theta }},\sigma ){:}{=}\frac{{\varvec{\pi }}({\varvec{\theta }}_\sigma )}{{\varvec{\pi }}_\text {W}({\varvec{\theta }}_\sigma )}=\frac{\mathrm {d}{\varvec{\mu }}}{\mathrm {d}{\varvec{\mu }}_{\mathrm {W}}}({\varvec{\theta }}_\sigma )\), and where N is the number of samples in the chain. Notice that \(w_\sigma ({\varvec{\theta }})\propto {\widehat{w}}({\varvec{\theta }},\sigma ^{-1}).\) As a result, the WGPT algorithm produces an estimator based on NK weighted samples, rather than “just” N, at the same computational cost as UGPT. Indeed, the estimator evaluates the quantity of interest \({\mathcal {Q}}\) not only at the points \(\theta _1^{(n)}\), but at the states of all parallel chains, \(\theta ^{(n)}_{\sigma (1)}\) for all \(\sigma \in S_K\), namely \(\theta ^{(n)}_k, \ k=1,2,\dots ,K\).
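A Python sketch of the estimator (13), under the illustrative convention \(\pi _k\propto \pi ^{1/T_k}\) and with \(S_K\) the full symmetric group; since \({\varvec{\pi }}_\mathrm {W}\) is permutation invariant, it is computed once per sample via a log-sum-exp (burn-in is omitted for brevity):

```python
import itertools
import numpy as np

def wgpt_estimate(chain, log_post, temps, qoi):
    """Weighted ergodic estimator (13): average the quantity of interest
    over all chains, with weights w_hat = pi(theta_sigma) / pi_W(theta_sigma).
    `chain` is a list of states, each a list of K per-temperature values."""
    perms = list(itertools.permutations(range(len(temps))))
    total = 0.0
    for thetas in chain:
        ell = np.array([log_post(t) for t in thetas])
        # log pi(theta_sigma) for every sigma, from K stored evaluations
        log_pi = np.array([sum(ell[s[k]] / temps[k] for k in range(len(temps)))
                           for s in perms])
        # pi_W is permutation invariant: one value serves all theta_sigma
        log_pi_w = np.logaddexp.reduce(log_pi) - np.log(len(perms))
        w_hat = np.exp(log_pi - log_pi_w)            # bounded above by |S_K|
        total += sum(w * qoi(thetas[s[0]]) for w, s in zip(w_hat, perms))
    return total / (len(perms) * len(chain))
```

For \(K=1\) the weights are identically one and the estimator reduces to the plain ergodic average.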

Remark 3

Although it is known that, in some cases, an importance sampling estimator can be negatively affected by the dimensionality of the parameter space \(\varTheta \) (see, e.g., Asmussen and Glynn (2007), Remark 1.17, or Owen (AB), Examples 9.1–9.3), we argue that this is not the case for our estimator. Indeed, notice that the importance sampling weights \({\widehat{w}}({\varvec{\theta }},\sigma )\) are always bounded above by \(|S_K|\) and do not blow up as the dimension goes to infinity. In Sect. 5.7 we present a numerical example on a high-dimensional problem; the results in that section evidence the robustness of WGPT with respect to the dimension of \(\theta \).

The Weighted Generalized Parallel Tempering procedure is shown in Algorithm 3. To reiterate, we remark that sampling from \({\mathbf {p}}_\sigma ({\varvec{\theta }}^{(n)},\cdot )\) involves a swap of dynamics, i.e., kernels and temperatures.


Just as in Remark 2, one only needs to evaluate the posterior K times (instead of \(|S_K|\) times) to compute the weights \(w_{\sigma }({\varvec{\theta }}^{n})\), \(\sigma \in S_K\).
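One WGPT transition can thus be sketched as follows (random walk Metropolis as the illustrative base kernel, \(S_K\) the full symmetric group, tempered densities \(\pi _k\propto \pi ^{1/T_k}\)): sample \(\sigma \) with probability \(w_\sigma ({\varvec{\theta }})\) as in (12), then evolve the \(k^\mathrm {th}\) chain at the swapped temperature \(T_{\sigma (k)}\).

```python
import itertools
import numpy as np

def wgpt_step(thetas, log_post, temps, step=0.5, rng=np.random):
    """One WGPT transition: sample sigma ~ w_sigma(theta), then move each
    chain under the swapped dynamics (kernels/temperatures, not states)."""
    K = len(thetas)
    ell = np.array([log_post(t) for t in thetas])
    perms = list(itertools.permutations(range(K)))
    # log pi_sigma(theta) = sum_k log pi(theta_k) / T_{sigma(k)}
    log_w = np.array([sum(ell[k] / temps[s[k]] for k in range(K)) for s in perms])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    sigma = perms[rng.choice(len(perms), p=w)]
    new = []
    for k in range(K):
        T = temps[sigma[k]]                       # swapped temperature
        prop = thetas[k] + step * rng.standard_normal(np.shape(thetas[k]))
        log_alpha = (log_post(prop) - log_post(thetas[k])) / T
        new.append(prop if np.log(rng.uniform()) < log_alpha else thetas[k])
    return new
```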

Ergodicity of generalized parallel tempering

Preliminaries

We assume that the chains generated by the MCMC kernels \(p_k\), \(k=1,\dots ,K\), are aperiodic and \(\mu _k\)-irreducible (Asmussen and Glynn 2007), with invariant measure \(\mu _k\) on the measurable space \((\varTheta ,{\mathcal {B}}(\varTheta ))\). Let \(r\in [1,\infty )\) and let \(\mu \in {\mathcal {M}}(\varTheta )\) be a “reference” probability measure. In a BIP setting, this reference measure is taken to be the posterior. We define the following spaces

$$\begin{aligned}&L_r=L_r(\varTheta ,\mu )=\left\{ f:\varTheta \mapsto \mathbb {R}, \ \mu \text {-measurable, } \right. \\&\left. \quad \quad \quad \quad \text{ s.t } \left\| f\right\| _r^r {:}{=}\int |f(\theta )|^r\mu (\mathrm {d}\theta )<\infty \right\} ,\\&L_r^0=L^0_r(\varTheta ,\mu )=\left\{ f\in L_r(\varTheta ,\mu ), \right. \\&\left. \quad \quad \quad \quad \text { s.t } \mu (f){:}{=}\int _\varTheta f(\theta )\mu (\mathrm {d}\theta )=0 \right\} . \end{aligned}$$

Moreover, when \(r=\infty \), we define

$$\begin{aligned} L_\infty (\varTheta ,\mu ){:}{=}\left\{ f:\varTheta \mapsto \mathbb {R}, \ \text { s.t } \inf _{\underset{B\in {\mathcal {B}}(\varTheta )}{\mu (B)=0}}\underset{y\in \varTheta \backslash B}{\sup } \left|f(y)\right|<\infty \right\} . \end{aligned}$$

Notice that, clearly, \({L^0_r(\varTheta ,\mu )}\subset {L_r(\varTheta ,\mu )}\). In addition we define the spaces of measures

$$\begin{aligned}&{\mathcal {M}}_r(\varTheta ,\mu ){:}{=}\{ \nu \in {\mathcal {M}}(\varTheta ) \text { s.t. } \nu \ll \mu , \ \left\| \nu \right\| _{L_r(\varTheta ,\mu )}< \infty \},\\&\text {where}\quad \left\| \nu \right\| _{L_r(\varTheta ,\mu )}{:}{=} \left\| \frac{\mathrm {d}\nu }{\mathrm {d}\mu } \right\| _{L_r(\varTheta ,\mu )}. \end{aligned}$$

Notice that the definition of the \(L_r\)-norm depends on the reference measure \(\mu \) and on \(\varTheta \).

A Markov operator \(P:L_r(\varTheta ,\mu )\mapsto L_r(\varTheta ,\mu )\) with invariant measure \(\mu \) is a bounded linear operator whose norm is given by

$$\begin{aligned} \left\| P \right\| _{L_r(\varTheta ,\mu )\mapsto L_r(\varTheta ,\mu )} {:}{=}\sup _{\left\| f \right\| _{L_r(\varTheta ,\mu )}=1} \left\| Pf \right\| _{L_r(\varTheta ,\mu )}, \end{aligned}$$

for \(\ f\in L_r(\varTheta ,\mu ).\) It is well known (see, e.g., Rudolf (2012)) that any Markov operator P on \(L_r(\varTheta ,\mu )\) with invariant measure \(\mu \) is a weak contraction on \(L_r(\varTheta ,\mu ),\) i.e., \(\left\| P\right\| _{L_r(\varTheta ,\mu )\mapsto L_r(\varTheta ,\mu )}\le 1\). To quantify the convergence of a Markov chain generated by a Markov operator P, we define the concept of geometric ergodicity. Let \(r\in [1,\infty ]\). A Markov operator P with invariant measure \(\mu \in {\mathcal {M}}(\varTheta )\) is said to be \(L_r(\varTheta ,\mu )\)-geometrically ergodic if, for every probability measure \(\nu \in {\mathcal {M}}_r(\varTheta ,\mu )\), there exist \(\alpha \in (0,1)\) and \(C_\nu <\infty \) such that

$$\begin{aligned} \Vert \nu P^n-\mu \Vert _{L_r(\varTheta ,\mu )}\le C_\nu \alpha ^n, \ \ n\in \ {\mathbb {N}}. \end{aligned}$$
(14)

A related concept to \(L_2\)-geometric ergodicity is that of \(L_2\)-spectral gap. A Markov operator \(P: {L_2(\varTheta ,\mu )}\mapsto L_2(\varTheta ,\mu )\) with invariant measure \(\mu \in {\mathcal {M}}(\varTheta )\) has an \(L_2(\varTheta ,\mu )\)-spectral gap \(1-\beta >0\), with \(\beta <1\), if the following holds

$$\begin{aligned} \left\| P \right\| _{L^0_2(\varTheta ,\mu )\mapsto L^0_2(\varTheta ,\mu )}\le \beta . \end{aligned}$$
(15)
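As a toy illustration of these two definitions (not tied to the tempering setting), consider a two-state reversible chain: the second-largest eigenvalue modulus \(\beta \) of the transition matrix gives the spectral gap \(1-\beta \), and the distance to stationarity decays like \(\beta ^n\), as in (14).

```python
import numpy as np

# A two-state mu-reversible chain: eigenvalues of P are 1 and beta = 0.7,
# so the L2 spectral gap is 1 - beta = 0.3.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([2.0 / 3.0, 1.0 / 3.0])                 # stationary: mu P = mu
assert np.allclose(mu @ P, mu)
assert np.isclose(mu[0] * P[0, 1], mu[1] * P[1, 0])   # detailed balance

beta = sorted(abs(np.linalg.eigvals(P)))[0]           # second eigenvalue, 0.7
nu = np.array([1.0, 0.0])                             # initial distribution
tv = [0.5 * np.abs(nu @ np.linalg.matrix_power(P, n) - mu).sum()
      for n in range(1, 6)]
# here nu - mu lies exactly on the beta-eigenspace, so the total variation
# distance shrinks by the factor beta at every step
ratios = [tv[i + 1] / tv[i] for i in range(len(tv) - 1)]
```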

The next Proposition, whose proof can be found e.g., in Rudolf (2012), relates the existence of an \(L_2\)-spectral gap to the geometric ergodicity of the chain (with \(\beta \le \alpha \), in general).

Proposition 5

Let \(P:L_2(\varTheta ,\mu )\mapsto L_2(\varTheta ,\mu )\) be a \(\mu \)-reversible Markov transition operator. The existence of an \(L_2(\varTheta ,\mu )\)-spectral gap implies \(L_r(\varTheta ,\mu )\)-geometric ergodicity for any \(r\in [1,\infty ]\).

Proof

The previous claim is shown in Rudolf (2012) Proposition 3.17 and Appendix A.4. It is also shown in Rudolf (2012) that, in general, \(\beta \le \alpha \), with \(\alpha ,\beta \) given as in Eqs. (14) and (15). \(\square \)

Our path to prove ergodicity of the GPT algorithms will be to show the existence of an \(L_2\)-spectral gap.

Geometric ergodicity and \(L_2\)-spectral gap for GPT

The main results of this section are presented in Theorems 1 and 2, which show the existence of an \(L_2\)-spectral gap for both the UGPT and WGPT algorithms, respectively.

We begin with the definition of overlap between two probability measures. Such a concept will later be used to bound the spectral gap of the GPT algorithms.

Definition 7

(Density overlap) Let \(\mu _k,\mu _j\) be two probability measures on the measurable space \((\varTheta ,{\mathcal {B}}(\varTheta ))\), each having respective densities \(\pi _k(\theta ),\pi _j(\theta ), \ \theta \in \varTheta ,\) with respect to some common reference measure \(\nu _\varTheta \) also on \((\varTheta ,{\mathcal {B}}(\varTheta ))\). We define the overlap between \( \pi _k(\theta )\) and \(\pi _j(\theta )\) as

$$\begin{aligned} \eta _{\nu _\varTheta }(\pi _k,\pi _j)&{:}{=}\int _\varTheta \min \{\pi _k(\theta ),\pi _j(\theta )\}\,\nu _\varTheta (\mathrm {d}\theta ) \\ {}&=1-\frac{1}{2}\left\| \mu _k-\mu _j\right\| _{L_1(\varTheta ,\nu _{\varTheta })}. \end{aligned}$$

An analogous definition holds for \({\varvec{\pi }}_\sigma ,{\varvec{\pi }}_\rho \), with \(\rho ,\sigma \in S_K\).
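For intuition, the overlap of Definition 7 can be computed numerically. The sketch below (function names and the choice of Gaussian densities are ours, purely illustrative) approximates \(\eta \) for two one-dimensional Gaussians with Lebesgue reference measure via a Riemann sum, alongside the equivalent \(L_1\)-distance form.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) with respect to the Lebesgue measure."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

grid = np.linspace(-10.0, 10.0, 200_001)  # quadrature grid, dx = 1e-4
dx = grid[1] - grid[0]
pi_k = gaussian_pdf(grid, 0.0, 1.0)
pi_j = gaussian_pdf(grid, 5.0, 1.0)

# eta(pi_k, pi_j) = int min(pi_k, pi_j) d(nu_Theta), here via a Riemann sum
eta = np.sum(np.minimum(pi_k, pi_j)) * dx
# equivalent form: 1 - (1/2) * L1 distance between the two densities
eta_l1 = 1.0 - 0.5 * np.sum(np.abs(pi_k - pi_j)) * dx
```

Both expressions agree up to quadrature error, and the overlap here is small but strictly positive, which is the kind of condition Assumption C3 below imposes on the permuted product densities.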

Assumption 1

For \(k=1,\dots ,K\), let \(\mu _k\in {\mathcal {M}}_1(\varTheta ,\mu _\mathrm {pr})\) be given as in (5), \(p_k:\varTheta \times {\mathcal {B}}(\varTheta )\mapsto [0,1]\) be the Markov kernel associated to the \(k^\text {th}\) dynamics and let \(P_k:L_r(\varTheta ,\mu _k)\mapsto L_r(\varTheta ,\mu _k)\) be its corresponding \(\mu _k\) invariant Markov operator. In addition, for \(\sigma ,\rho \in S_K\), define the measures \({\varvec{\mu }}_\sigma ,{\varvec{\mu }}_\rho \in {\mathcal {M}}(\varTheta ^K)\) as in Eq. (6). Throughout this work it is assumed that:

  C1. The Markov kernel \(p_k\) is \(\mu _k\)-reversible.

  C2. The Markov operator \(P_k\) has an \(L_2(\varTheta ,\mu _k)\)-spectral gap.

  C3. For any \(\sigma ,\rho \in S_K\), \(\varLambda _{\sigma ,\rho }{:}{=}\eta _{{\varvec{\mu }}_\mathrm {pr}}({\varvec{\pi }}_\sigma ,{\varvec{\pi }}_\rho )>0,\) with \({\varvec{\pi }}_\sigma ,{\varvec{\pi }}_\rho \) defined as in (11).

These assumptions are relatively mild. In particular, C1 and C2 are known to hold, under mild regularity conditions on \(\pi \), for many commonly-used Markov transition kernels, such as RWM, the Metropolis-adjusted Langevin algorithm, Hamiltonian Monte Carlo, and (generalized) preconditioned Crank-Nicolson, among others (Asmussen and Glynn 2007; Hairer et al. 2014). Assumption C3 holds true given the construction of the product measures in Sect. 3.

We now present an auxiliary result that we will use to bound the spectral gap of both the Weighted and Unweighted GPT algorithms.

Proposition 6

Suppose that Assumption 1 holds and let \({\mathbf {P}}{:}{=}\bigotimes _{k=1}^KP_k: L_2(\varTheta ^K,{\varvec{\mu }})\mapsto L_2(\varTheta ^K,{\varvec{\mu }}),\) with invariant measure \({\varvec{\mu }}=\mu _1\times \dots \times \mu _K\). Then, \({\mathbf {P}}\) has an \(L_2(\varTheta ^K,{\varvec{\mu }})\)-spectral gap, i.e., \(\left\| {\mathbf {P}} \right\| _{L^0_2(\varTheta ^K,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }})}<1\). Moreover, the Markov chain obtained from \({\mathbf {P}}\) is \(L_r\) geometrically ergodic, for any \(r\in [1,\infty ]\).

Proof

We limit ourselves to the case \(K=2\), since the case \(K>2\) follows by induction. Denote by \(I:L_2(\varTheta ,\mu _k)\mapsto L_2(\varTheta ,\mu _k), k=1,2,\) the identity Markov transition operator, and let \(f\in L_2(\varTheta ^2,{\varvec{\mu }}).\) Notice that f admits a spectral representation in \(L_2(\varTheta ^2,{\varvec{\mu }})\) given by \(f({\varvec{\theta }})=\sum _{k,j}\phi _k(\theta _1)\psi _j(\theta _2)c_{k,j}\), with \(c_{k,j}\in \mathbb {R}\), where \(\{ \phi _k \}_{k\in {\mathbb {N}}}\) is a complete orthonormal basis (CONB) of \(L_2(\varTheta ,\mu _1)\) and \(\{\psi _j\}_{j\in {\mathbb {N}}}\) is a CONB of \(L_2(\varTheta ,\mu _2)\), so that \(\{\phi _k\otimes \psi _j\}_{k,j\in {\mathbb {N}}}\) is a CONB of \(L_2(\varTheta ^2,{\varvec{\mu }})\). Moreover, we assume that \(\phi _0=\psi _0=1\), and write, for notational simplicity, \(\left\| P_1 \right\| =\left\| P_1\right\| _{L_2(\varTheta ,\mu _1)\mapsto L_2(\varTheta ,\mu _1)}\) and \(\left\| {P_2}\right\| =\left\| P_2\right\| _{L_2(\varTheta ,\mu _2)\mapsto L_2(\varTheta ,\mu _2)}\). Lastly, denote \(f_0=f-c_{0,0}\), so that \(f_0\in L_2^0(\varTheta ^2,{\varvec{\mu }})\). Notice that

$$\begin{aligned}&\left\| (P_1 \otimes I)f_0\right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})} =\left\| \sum _{(k,j)\ne (0,0)}\left( P_1\phi _k\right) \psi _jc_{k,j} \right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})}\nonumber \\&=\left\| \sum _{j=0}^{\infty }\left( \sum _{k=1}^{\infty }P_1\phi _kc_{k,j}\right) \psi _j+\sum _{j=1}^\infty c_{0,j}P_1\phi _0\psi _j\right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})}. \end{aligned}$$
(16)

Splitting the sum, we get from the orthonormality of the basis that:

$$\begin{aligned} (16)=&\sum _{j=1}^\infty \left\| \sum _{k=1}^\infty P_1\phi _kc_{k,j}+c_{0,j}P_1\phi _0\right\| ^2_{L_2(\varTheta ,\mu _1)}\\&+\left\| \sum _{k=1}^\infty P_1\phi _kc_{k,0}\right\| ^2_{L_2(\varTheta ,\mu _1)} \\ =&\sum _{j=1}^\infty \left\| P_1\left( \sum _{k=1}^\infty \phi _kc_{k,j}\right) \right\| ^2_{L_2(\varTheta ,\mu _1)} +\sum _{j=1}^\infty \left\| c_{0,j}\phi _0\right\| ^2_{L_2(\varTheta ,\mu _1)}\\&+\left\| P_1\left( \sum _{k=1}^\infty \phi _kc_{k,0}\right) \right\| ^2_{L_2(\varTheta ,\mu _1)}\\ \le&\sum _{j=1}^{\infty }\left( \left\| P_1\right\| ^2 \sum _{k=1}^\infty c_{k,j}^2+c_{0,j}^2\right) +\left\| P_1\right\| ^2 \sum _{k=1}^{\infty }c_{k,0}^2\\ =&\left\| P_1\right\| ^2 \left\| f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2 +(1-\left\| P_1\right\| ^2)\sum _{j=1}^\infty (c_{0,j})^2. \end{aligned}$$

Proceeding similarly, we can obtain an equivalent bound for \(\left\| (I\otimes P_2)f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})} ^2\). We are now ready to bound \(\left\| {\mathbf {P}}\right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})\mapsto L_2(\varTheta ^2,{\varvec{\mu }})}\):

$$\begin{aligned} \left\| {\mathbf {P}}f_0\right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})}=&\left\| (P_1\otimes P_2) f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})} ^2\\ =&\left\| (P_1\otimes I)(I\otimes P_2)f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2\\ \le&\left\| P_1\right\| ^2 \left\| (I\otimes P_2) f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2\\&+(1-\left\| P_1\right\| ^2)\\&\times \!\left( \sum _{j=1}^\infty \left( (\!I\otimes P_2)\sum _{(\ell ,k)\ne (0,0)}\!c_{\ell ,k}\phi _\ell \psi _k,\phi _0\psi _j\right) ^{\!2}\right) \\ =&\left\| P_1\right\| ^2 \left\| (I\otimes P_2) f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2\\&+(1-\left\| P_1\right\| ^2 )\left( \sum _{j=1}^\infty \left( \sum _{k=1}^\infty c_{0,k}(P_2\psi _k),\psi _j \right) ^2\right) \\ \le&\left\| P_1\right\| ^2 \left\| (I\otimes P_2) f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2\\&+(1-\left\| P_1\right\| ^2 )\left\| P_2\left( \sum ^\infty _{k=1}c_{0,k}\psi _k\right) \right\| _{L_2(\varTheta ,\mu _2)}^2\\ \le&\left\| P_1\right\| ^2 \left\| P_2\right\| ^2 \left\| f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2\\&+\left\| P_1\right\| ^2 (1-\left\| P_2\right\| ^2)\left( \sum _{j=1}^\infty c_{j,0}^2\right) \\&+(1-\left\| P_1\right\| ^2)\left\| P_2 \right\| ^2 \left( \sum _{k=1}^\infty c_{0,k}^2\right) \end{aligned}$$

Assuming without loss of generality that \(\left\| P_1\right\| \ge \left\| P_2\right\| \), we can use the inequality above to bound

$$\begin{aligned} \left\| {\mathbf {P}}f_0\right\| ^2_{L_2(\varTheta ^2,{\varvec{\mu }})}&\le \left\| P_1\right\| ^2 \left\| P_2\right\| ^2 \left\| f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2 \\&\quad + \left\| P_1\right\| ^2 (1-\left\| P_2\right\| ^2)\underbrace{\left( \sum _{j=1}^\infty c_{j,0}^2+\sum _{k=1}^\infty c_{0,k}^2\right) }_{\le \left\| f_0\right\| _{L_2(\varTheta ^2,{\varvec{\mu }})} ^2}\\ {}&\le \left\| P_1\right\| ^2 \left\| f_0 \right\| _{L_2(\varTheta ^2,{\varvec{\mu }})}^2. \end{aligned}$$

Thus, we have that

$$\begin{aligned} \left\| {\varvec{P}} \right\| _{L^0_2(\varTheta ^2,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^2,{\varvec{\mu }})} \le \max _{k=1,2}\{\left\| P_k\right\| _{L^0_2(\varTheta ,\mu _k)\mapsto L^0_2(\varTheta ,\mu _k)} \}\!<\!1. \end{aligned}$$

The previous result can easily be extended to \(K>2\). Lastly, \(L_r(\varTheta ^K,{\varvec{\mu }})\)-geometric ergodicity \(\forall r\in [1,\infty ]\) follows from Proposition 5. \(\square \)
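The mechanism behind Proposition 6 can be illustrated numerically in a finite-state setting (a toy construction of ours, not part of the paper): for symmetric transition matrices, which are reversible with respect to the uniform measure, the \(L_2^0\) operator norm is the second-largest absolute eigenvalue, and the norm of the tensorized chain is governed by the worse of the two component chains.

```python
import numpy as np

def second_largest_abs_eig(P):
    """|lambda_2| of P: its operator norm restricted to mean-zero functions,
    valid here because P is symmetric (reversible w.r.t. the uniform measure)."""
    vals = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return vals[1]

def lazy_ring(n, hold):
    """Symmetric 'lazy ring' chain: stay put with prob. hold, else hop to a neighbor."""
    P = hold * np.eye(n)
    for i in range(n):
        P[i, (i + 1) % n] += (1 - hold) / 2
        P[i, (i - 1) % n] += (1 - hold) / 2
    return P

P1, P2 = lazy_ring(6, 0.5), lazy_ring(6, 0.9)
beta1, beta2 = second_largest_abs_eig(P1), second_largest_abs_eig(P2)
# kron(P1, P2): both components move simultaneously, as in the product operator P
beta_prod = second_largest_abs_eig(np.kron(P1, P2))
print(beta_prod, max(beta1, beta2))  # the two numbers agree
```

The eigenvalues of the Kronecker product are the pairwise products \(\lambda _i\mu _j\); removing the constant mode \(1\cdot 1\) leaves \(\max \{\beta _1,\beta _2\}<1\) as the largest, matching the bound in the proof.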

We can use the previous result to prove the geometric ergodicity of the UGPT algorithm:

Theorem 1

(Ergodicity of UGPT) Suppose Assumption 1 holds and denote by \({\varvec{\mu }}\) the invariant measure of the UGPT Markov operator \({\mathbf {P}}^\mathrm {(UW)}\). Then, \({\mathbf {P}}^\mathrm {(UW)}\) has an \(L_2(\varTheta ^K,{\varvec{\mu }})\)-spectral gap. Moreover, the chain generated by \({\mathbf {P}}^\mathrm {(UW)}\) is \(L_r(\varTheta ^K,{\varvec{\mu }})\)-geometrically ergodic for any \(r\in [1,\infty ]\).

Proof

Recall that \({\mathbf {P}}^\mathrm {(UW)}{:}{=}{\mathbf {Q}}^\mathrm {(UW)} {\mathbf {P}} {\mathbf {Q}}^\mathrm {(UW)}\). From the definition of operator norm, we have that

$$\begin{aligned}&\left\| {\mathbf {P}}^{\mathrm {(UW)}}\right\| _{L^0_2(\varTheta ^K,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }})}\\ {}&\le \left\| {\mathbf {Q}}^{\mathrm {(UW)}} \right\| ^2_{L^0_2(\varTheta ^K,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }})}\left\| {\mathbf {P}} \right\| _{L^0_2(\varTheta ^K,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }})}\\&\le \left\| {\mathbf {P}} \right\| _{L^0_2(\varTheta ^K,{\varvec{\mu }})\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }})}<1, \end{aligned}$$

where the previous line follows from Proposition 6 and the fact that \({\mathbf {Q}}^{\mathrm {(UW)}}\) is a weak contraction in \(L_2(\varTheta ^K,{\varvec{\mu }})\) (see, e.g., Baxter and Rosenthal (1995) Proposition 1). Lastly, \(L_r(\varTheta ^K,{\varvec{\mu }})\)-geometric ergodicity \(\forall r\in [1,\infty ]\) follows from Proposition 5 and the fact that \({\mathbf {P}}^\mathrm {(UW)}\) is \({\varvec{\mu }}\)-reversible by Proposition 3. \(\square \)

We now turn to proving geometric ergodicity for the WGPT algorithm. We begin with an auxiliary result, lower-bounding the variance of a \({\varvec{\mu }}_{\mathrm {W}}\)-integrable functional \(f\in L_2(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\).

Proposition 7

Let \(f\in L^0_2(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\) be a \({\varvec{\mu }}_{\mathrm {W}}\)-integrable function such that \(\left\| f \right\| _{L_2(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})}=1,\) and denote by \(\mathbb {V}_{{\varvec{\mu }}_{\mathrm {W}}}[f]\), \(\mathbb {V}_{{\varvec{\mu }}_\sigma }[f]\) the variances of f with respect to \({\varvec{\mu }}_{\mathrm {W}}\) and \({\varvec{\mu }}_\sigma \), respectively, with \(\sigma \in S_K\). In addition, suppose Assumption 1 holds. Then

$$\begin{aligned} 0<\frac{\varLambda _m}{2-\varLambda _m}\le \frac{1}{{|S_K|}}\sum _{\sigma \in S_K}\mathbb {V}_{{\varvec{\mu }}_\sigma }[f]\le \mathbb {V}_{{\varvec{\mu }}_{\mathrm {W}}}[f]=1, \end{aligned}$$

with \(\varLambda _m=\underset{\sigma ,\rho \in S_K}{\min }\{\varLambda _{\sigma ,\rho }\}\) and \(\varLambda _{\sigma ,\rho }\) as in Assumption C3.

Proof

The proof is technical and tedious and is presented in Appendix A. \(\square \)

We are finally able to prove the ergodicity of the WGPT algorithm.

Theorem 2

(Ergodicity of WGPT) Suppose Assumption 1 holds and denote by \({\varvec{\mu }}_{\mathrm {W}}\) the invariant measure of the WGPT Markov operator \({\mathbf {P}}^\mathrm {(W)}\). Then, \({\mathbf {P}}^\mathrm {(W)}\) has an \(L_2(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\)-spectral gap. Moreover, the chain generated by \({\mathbf {P}}^\mathrm {(W)}\) is \(L_r(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\)-geometrically ergodic for any \(r\in [1,\infty ]\).

Proof

Let \({\mathcal {L}}{:}{=} \{f\in L_2^0(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}}) : \left\| f\right\| _{L_2^0(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})}=1\}\), and, for notational clarity, write

$$\begin{aligned} \left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}{:}{=}\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2(\varTheta ^K,{\varvec{\mu }}_\sigma )\mapsto L^0_2(\varTheta ^K,{\varvec{\mu }}_\sigma )}. \end{aligned}$$

Then, from the definition of operator norm,

$$\begin{aligned}&\left\| \mathbf {{{\varvec{P}}^\text {(w)}}} \right\| _{L^0_2(\varTheta ^K,{\varvec{{\varvec{\mu }}_{\mathrm {W}}}})\mapsto L^0_2(\varTheta ^K,{\varvec{{\varvec{\mu }}_{\mathrm {W}}}}) }^2 \nonumber \\&={\sup }_{f \in {\mathcal {L}}} \left\| \mathbf {{{\varvec{P}}^\text {(w)}}} f \right\| ^2_{L_2(\varTheta ^K,{\varvec{{\varvec{\mu }}_{\mathrm {W}}}})} \nonumber \\&=\underset{f \in {\mathcal {L}}}{\sup }\int _{\varTheta ^K} \left|\sum _{\sigma \in S_K} w_\sigma ({\varvec{\theta }}) \int _{\varTheta ^K} f({\varvec{y}}){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right|^2 {\varvec{\mu }}_{\mathrm {W}}(\mathrm {d}{\varvec{\theta }}) \nonumber \\&\le \underset{f \in {\mathcal {L}}}{\sup } \int _{\varTheta ^K} \sum _{\sigma \in S_K} w_\sigma ({\varvec{\theta }})\left|\int _{\varTheta ^K}f({\varvec{y}}) {\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}})\right|^2{\varvec{\mu }}_{\mathrm {W}}(\mathrm {d}{\varvec{\theta }}) \nonumber \\&=\underset{f \in {\mathcal {L}}}{\sup } \frac{1}{{|S_K|}} \sum _{\sigma \in S_K} \int _{\varTheta ^K} \left| \int _{\varTheta ^K} f({\varvec{y}}){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right|^2 {\varvec{\mu }}_\sigma (\mathrm {d} {\varvec{\theta }}), \end{aligned}$$
(17)

where the second to last line follows from the convexity of \((\cdot )^2\) and the last line follows from the definition of \(w_\sigma \) and \({\varvec{\mu }}_{\mathrm {W}}.\) Now, let \(\bar{f}_\sigma {:}{=}{\varvec{\mu }}_\sigma (f)\). Notice that we have

$$\begin{aligned}&\int _{\varTheta ^K} \left| \int _{\varTheta ^K} f({\varvec{y}}){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right|^2 {\varvec{\mu }}_\sigma (\mathrm {d} {\varvec{\theta }}) \nonumber \\ =&\int _{\varTheta ^K} \left| \int _{\varTheta ^K} (f({\varvec{y}})-\bar{f}_\sigma +\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right|^2 {\varvec{\mu }}_\sigma (\mathrm {d} {\varvec{\theta }}) \nonumber \\ =&\int _{\varTheta ^K} \left( \left| \int _{\varTheta ^K} (f({\varvec{y}})-\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}})\right|^2+\left|\int _{\varTheta ^K}\bar{f}_\sigma {\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right|^2\right. \nonumber \\&\left. +2\bar{f}_\sigma \int _{\varTheta ^K}(f({\varvec{y}})-\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}) \right) {\varvec{\mu }}_\sigma (\mathrm {d} {\varvec{\theta }}) \nonumber \\ =&\underbrace{\int _{\varTheta ^K} \left( \int _{\varTheta ^K} (f({\varvec{y}})-\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}})\right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})}_{\text {I}}+(\bar{f}_\sigma )^2 \nonumber \\&+ 2\bar{f}_\sigma \underbrace{\int _{\varTheta ^K}\int _{\varTheta ^K}(f({\varvec{y}})-\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}}){\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})}_{=0 \text { by stationarity}}. \end{aligned}$$
(18)

Thus, multiplying and dividing I by

$$\begin{aligned} \left( \int _{\varTheta ^K} \left( f({\varvec{\theta }})-\bar{f}_\sigma \right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) , \end{aligned}$$

we obtain from the definition of \(\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}^2\) that:

$$\begin{aligned} \begin{aligned} (18)=&\left( \frac{\int _{\varTheta ^K} \left( \int _{\varTheta ^K} (f({\varvec{y}})-\bar{f}_\sigma ){\mathbf {p}}_\sigma ({\varvec{\theta }},\mathrm {d}{\varvec{y}})\right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})}{ \int _{\varTheta ^K} \left( f({\varvec{\theta }})-\bar{f}_\sigma \right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})}\right) \\&\times \left( \int _{\varTheta ^K} \left( f({\varvec{\theta }})-\bar{f}_\sigma \right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) +(\bar{f}_\sigma )^2 \\ \le&\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}^2\left( \int _{\varTheta ^K} \left( f({\varvec{\theta }})-\bar{f}_\sigma \right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) +(\bar{f}_\sigma )^2\\ =&\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}^2\left( \int _{\varTheta ^K}f({\varvec{\theta }})^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) \\ {}&+\left( 1-\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}^2\right) (\bar{f}_\sigma )^2 \\ =&\left( \int _{\varTheta ^K}f({\varvec{\theta }})^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) \\&-\underbrace{\left( 1-\left\| {\mathbf {P}}_\sigma \right\| _{L^0_2}^2\right) }_{{:}{=}\gamma \in (0,1)} \left( \int _{\varTheta ^K} \left( f({\varvec{\theta }})-\bar{f}_\sigma \right) ^2{\varvec{\mu }}_\sigma (\mathrm {d}{\varvec{\theta }})\right) . \end{aligned} \end{aligned}$$
(19)

Replacing Eq. (19) into Eq. (17), we get

$$\begin{aligned}&\left\| {\mathbf {P}}^\mathrm {(W)}\right\| ^2_{L_2^0(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\mapsto L_2^0(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})} \\&\le \underset{f\in {\mathcal {L}}}{\sup }\left( \int _{\varTheta ^K}f({\varvec{\theta }})^2{\varvec{\mu }}_{\mathrm {W}}(\mathrm {d}{\varvec{\theta }})\right) -\frac{\gamma }{|S_K|}\sum _{\sigma \in S_K}\mathbb {V}_{{\varvec{\mu }}_\sigma }[f]\\&\le 1-\gamma \left( \frac{\varLambda _m}{2-\varLambda _m}\right) <1 \quad \text {(by Proposition 7)} . \end{aligned}$$

Thus, \(\mathbf {{{\varvec{P}}^\text {(w)}}}\) has an \(L_2(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\) spectral gap. Once again, \(L_r(\varTheta ^K,{\varvec{\mu }}_{\mathrm {W}})\)-geometric ergodicity (with \(r\in [1,\infty ])\) follows from Proposition 5 and the fact that \({\mathbf {P}}^\mathrm {(W)}\) is \({\varvec{\mu }}_{\mathrm {W}}\)-reversible by Proposition 4. \(\square \)

Discussion and comparison to similar theoretical results

Theorems 1 and 2 state the existence of an \(L_2\)-spectral gap, and hence \(L_r\)-geometric ergodicity, for both the UGPT and the WGPT algorithms. Their proofs also provide a quantification of the \(L_2\)-spectral gap in terms of the \(L_2\)-spectral gap of each individual Markov operator \(P_k\). Such a bound is, however, not satisfactory, as it does not use any information on the temperatures and only states that the \(L_2\)-spectral gap of the UGPT and WGPT chains is not worse than the smallest \(L_2\)-spectral gap among the individual chains (without swapping). This result is not sharp, as evidenced in the numerical section, where a substantial improvement in convergence is achieved by our methods.

Convergence results for the standard parallel tempering algorithm have been obtained in Miasojedow et al. (2013) and Woodard et al. (2009). In particular, Miasojedow et al. (2013) proved geometric ergodicity for the pairwise parallel tempering algorithm using the standard drift condition construction of Meyn and Tweedie (2012); it is unclear from that work which convergence rate is obtained for the whole algorithm. In comparison, our results are given in terms of spectral gaps. On the other hand, Woodard et al. (2009) presents conditions for rapid mixing of a particular type of parallel tempering algorithm, where the transition kernel is to be understood as a convex combination of such kernels, as opposed to our case, where it is to be understood as a tensorization. Their results provide, for their setting, a better convergence rate than the one we obtained for the UGPT. We believe that their result can be extended to the UGPT algorithm, and this will be the focus of future work. In contrast, the use of the ideas in Woodard et al. (2009) for the WGPT algorithm seems more problematic.

Numerical experiments

We now present four academic examples to illustrate the efficiency of both GPT algorithms discussed herein and compare them to the more traditional random walk Metropolis and standard PT algorithms. Notice that we compare the different algorithms in their simplest version, using random walk Metropolis as the base transition kernel. The only exception is Sect. 5.7, which presents a high-dimensional BIP for which the preconditioned Crank-Nicolson kernel (Cotter et al. 2013) is used as the base method in all algorithms instead of RWM. More advanced samplers, such as Adaptive Metropolis (Haario et al. 2006, 2001), or other transition kernels, could be used as well to replace RWM or pCN. Experiments 5.3, 5.4 and 5.5 were run on a Dell(R) Precision(TM) T3620 workstation with an Intel(R) Core(TM) i7-7700 CPU and 32 GB of RAM. Numerical simulations in Sects. 5.3 and 5.5 were run on a single thread, while those in Sect. 5.4 were run in an embarrassingly parallel fashion over 8 threads using the Message Passing Interface (MPI) and the Python package MPI4py (Dalcín et al. 2005). Lastly, Experiments 5.6 and 5.7 were run on the Fidis cluster of the EPFL. The scripts used to generate the results presented in this section were written in Python 3.6 and can be found at https://doi.org/10.5281/zenodo.3700048.

Implementation remarks

In most Bayesian inverse problems, particularly those dealing with large-scale computational models, the computational cost is dominated by the evaluation of the forward operator, which can be, for example, the numerical approximation of a possibly non-linear partial differential equation. In the case where all possible permutations are considered (i.e., \(S_K={\mathscr {S}}_K\)), there are K! possible permutations of the states, and the computation of the swapping ratio in the GPT algorithms can become prohibitively expensive if one were to evaluate K! forward models, even for moderate values of K. This problem can be circumvented by storing the values \(\pi (\theta ^{(n)}_k)\), \(k=1,\dots ,K,\) \(n=1,\dots ,N\), since the swapping ratio for GPT consists of permutations of these values, divided by the temperature parameters. Thus, “only” K forward model evaluations need to be computed at each step, and the swapping ratio can be computed at negligible cost for moderate values of K.
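The caching strategy just described can be sketched as follows. This is a simplified illustration with invented names, assuming the tempered product form \(\pi _\sigma ({\varvec{\theta }})\propto \prod _k \pi (\theta _{\sigma (k)})^{1/T_k}\); only the K cached log-posterior values are touched when forming all K! permuted products.

```python
import itertools
import numpy as np

def log_permuted_densities(log_posts, inv_temps):
    """log pi_sigma, up to a constant, for every permutation sigma.

    log_posts[k] : cached log pi(theta_k) for chain k (one forward solve each)
    inv_temps[k] : 1 / T_k
    Returns a dict mapping each sigma to sum_k log pi(theta_{sigma(k)}) / T_k.
    """
    K = len(log_posts)
    return {
        sigma: sum(log_posts[sigma[k]] * inv_temps[k] for k in range(K))
        for sigma in itertools.permutations(range(K))
    }

# Example: K = 3 chains on a geometric temperature ladder.
log_posts = np.array([-1.3, -4.2, -0.7])      # hypothetical cached values
inv_temps = 1.0 / np.array([1.0, 10.0, 100.0])
table = log_permuted_densities(log_posts, inv_temps)

# A swapping ratio between two permutations is a ratio of cached products:
# no new forward-model evaluations are required.
ratio = np.exp(table[(2, 1, 0)] - table[(0, 1, 2)])
```

Here the swap that places the highest-posterior state on the coldest chain yields a ratio above one, so it would always be accepted by an MH rule.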

There is, however, a clear trade-off between the choice of K (which has a direct impact on the efficiency of the method) and the computational cost associated with (G)PT. Intuitively, a large K provides better mixing; however, it requires a larger number of forward model evaluations, which tend to be costly. We remark that such a trade-off between efficiency and number of function evaluations is also present in some advanced MCMC methods, such as Hamiltonian Monte Carlo, where one needs to choose a number of time steps for the time integration (see, e.g., Beskos et al. (2017)). Furthermore, there is an additional constraint when choosing \(S_K={\mathscr {S}}_K\), namely the permutation cost associated with computing \(r^\mathrm {{(UW)}}({\varvec{\theta }},\sigma )\) and \(w_\sigma ({\varvec{\theta }})\). In particular, the computation of either of these quantities has a complexity of order K!; thus, this cost will eventually surpass the cost of evaluating the forward model K times. This is illustrated in Fig. 1, where we plot the cost per sample versus K for two different posteriors, taken from the numerical examples in Sects. 5.5 and 5.7. The posterior in Sect. 5.5 is rather inexpensive to evaluate, since one can compute the forward map \({\mathcal {F}}\) analytically (the difficulty associated with sampling from that posterior comes from its high multi-modality). On the contrary, evaluating the posterior in Sect. 5.7 requires numerically approximating the solution to a time-dependent, second-order PDE, and as such is costly. As we can see, for \(K\le 6\) the computational cost in both cases is dominated by the forward model evaluation. Notice that for \(K\le 9\), the cost per sample from posterior (27) is dominated by the evaluation of the forward model.

Fig. 1
figure 1

Cost per sample vs K for \(S_K={\mathscr {S}}_K\) for the forward model in Sect. 5.5 and the forward model in 5.7

Thus, for high values of K, it is advisable to only consider the union of properly chosen semi-groups A, B of \({\mathscr {S}}_K,\) with \( A\cap B\ne \emptyset ,\) such that A and B together generate \({\mathscr {S}}_K\) (i.e., the smallest semi-group that contains \(A\cup B\) is \({\mathscr {S}}_K\) itself) and \(|A\cup B|<|{\mathscr {S}}_K|=K!\); this is referred to as partial infinite swapping in the continuous-time case (Dupuis et al. 2012). One particular way of choosing A and B is to consider, for example, A to be the set of permutations that only permute the indices associated with relatively low temperatures while leaving the other indices unchanged, and B as the set of permutations for the indices of relatively high temperatures, while leaving the other indices unchanged. Intuitively, swaps between temperatures that are, in a sense, “close” to each other tend to be chosen with a higher probability. We refer the reader to Dupuis et al. (2012), Section 6.2, for a further discussion of this approach in the continuous-time setting. One additional idea would be to consider swapping schemes that, for example, only permute states between \(\mu _i\) and \(\mu _{i+1}, \mu _{i+2}, \dots , \mu _{i+\ell }\) for some user-defined \(\ell \ge 1\) and any given \(i=1,2,\dots , K-1\). The intuition behind this choice is also that swaps between posteriors at close temperatures are more likely to occur than swaps between posteriors with a high temperature difference. We intend to explore this in depth in future work.
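To see the size reduction such a choice affords, the sketch below (our own construction, following the cold-block/hot-block idea just described) builds A as the permutations moving only the three coldest indices and B as those moving only the hotter indices, with index 2 shared between the blocks; the two sets intersect (both contain the identity) and together contain all adjacent transpositions, hence generate \({\mathscr {S}}_6\).

```python
import itertools
import math

def block_permutations(K, block):
    """Permutations of {0,...,K-1} that move only the indices in `block`."""
    perms = set()
    block = list(block)
    for p in itertools.permutations(block):
        sigma = list(range(K))
        for slot, val in zip(block, p):
            sigma[slot] = val
        perms.add(tuple(sigma))
    return perms

K = 6
A = block_permutations(K, range(0, 3))   # permute cold chains 0,1,2 only
B = block_permutations(K, range(2, K))   # permute hot chains 2,...,5 only
print(len(A | B), math.factorial(K))     # 29 permutations instead of 720
```

So the swap-ratio computation touches tens of permuted products rather than K! of them, at the price of less aggressive state exchange.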

We reiterate that the total number of temperatures K depends heavily on the problem and on the available computational budget (Doll et al. 2012; Van Der Sluys et al. 2008; Yu et al. 2016). For the experiments considered in this work we chose \(K=4\) or \(K=5\), which provides an acceptable compromise between acceleration and cost.

Experimental setup

We now present an experimental setup common to all the numerical examples presented in the following subsections. In particular, all the experiments presented in this work use a base method given by either RWM (for Experiments 5.3 through 5.6) or pCN (for Experiment 5.7) for the Markov transition kernels \(p_k\). Furthermore, we take \(S_K={\mathscr {S}}_K\) for all experiments, with \(K=5\) for Experiment 5.5 and \(K=4\) for the other experiments. In addition, we follow the rule of thumb of Earl and Deem (2005) for the choice of temperatures, setting, for each experiment, \(T_{k}=a^{k-1},\) \(k=1,\ldots ,K\), for some constant \(a>1\). The particular choice of a is problem-dependent and is generally made so that \(\mu _K\) becomes sufficiently simple to explore. For each experiment we implement 5 MCMC algorithms to sample from a given posterior \(\mu =\mu _1\), namely, the base (untempered) method (either RWM or pCN), such a method combined with the standard PT algorithm (PT) with \(N_s=1,\) the PSDPT algorithm of Łącki and Miasojedow (2016), and both versions of GPT. For our setting, the tempered algorithms have a cost (in terms of number of likelihood evaluations) that is K times larger than the base method; thus, to obtain a fair comparison across all algorithms, we run the chain for the base method K times longer. Lastly, given some problem-dependent quantity of interest \({\mathcal {Q}}\), we assess the efficiency of our proposed algorithms at computing the posterior expectation of \({\mathcal {Q}}\) by comparing either the mean square error (Experiments 5.3–5.5), for which the exact value of \(\mathbb {E}_{\mu }[{\mathcal {Q}}]\) is known, or the variance (Experiments 5.6 and 5.7) of the ergodic estimator \(\hat{{\mathcal {Q}}}\) obtained over \(N_\text {runs}\) independent runs of each algorithm.
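For instance, fixing the top temperature \(T_K\) determines the ratio a. A minimal sketch (the helper name is ours) reproducing the ladder used in the manifold experiment below (\(K=4\), \(T_4=5000\)):

```python
def geometric_ladder(K, T_max):
    """Temperatures T_k = a**(k-1), k = 1..K, with a chosen so that T_K = T_max."""
    a = T_max ** (1.0 / (K - 1))
    return [a ** (k - 1) for k in range(1, K + 1)]

ladder = geometric_ladder(4, 5000.0)
print([round(T, 1) for T in ladder])  # [1.0, 17.1, 292.4, 5000.0]
```

Here \(a=5000^{1/3}\approx 17.1\), which recovers the values \(T_2=17.1\) and \(T_3=292.4\) quoted in Sect. 5.3.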

Density concentrated over a quarter circle-shaped manifold

Let \(\mu \) be a probability measure that has density \(\pi \) with respect to the uniform (Lebesgue) measure on the unit square, \(\mu _\mathrm {pr}={\mathcal {U}}([0,1]^2)\), given by

$$\begin{aligned} \pi (\theta )=\frac{1}{Z}\exp \left( -10000(\theta _1^2+\theta _2^2-0.8^2)^2 \right) {\mathbf {1}}_{[0,1]^2}, \end{aligned}$$

where \(\theta =(\theta _1,\theta _2),\) Z is the normalization constant, and \({\mathbf {1}}_{[0,1]^2}\) is the indicator function of the unit square. We remark that this example is not of particular interest per se; however, it illustrates some of the advantages of the algorithms discussed herein. The difficulty of sampling from such a distribution comes from the fact that its density is concentrated over a quarter circle-shaped manifold, as can be seen in the left-most plot in Fig. 2. This in turn implies that a single-level RWM chain would need to take very small steps in order to properly explore the density.
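A sketch of this target in code (helper names are ours; the tempered version assumes the standard choice \(\pi _T\propto \pi ^{1/T}\) and omits the normalization constant Z, which Metropolis-type acceptance ratios do not require):

```python
import numpy as np

def log_target(theta):
    """Unnormalized log-density of the quarter-circle example on [0,1]^2."""
    t1, t2 = theta
    if not (0.0 <= t1 <= 1.0 and 0.0 <= t2 <= 1.0):
        return -np.inf  # outside the support of the uniform prior
    return -10000.0 * (t1**2 + t2**2 - 0.8**2) ** 2

def log_tempered(theta, T):
    """Tempered version pi_T propto pi**(1/T): flatter for larger T."""
    return log_target(theta) / T

on_manifold  = np.exp(log_target((0.8, 0.0)))   # on the circle of radius 0.8
off_manifold = np.exp(log_target((0.5, 0.0)))   # away from the circle
# at T = 5000 the same off-manifold point is no longer negligible
off_hot = np.exp(log_tempered((0.5, 0.0), 5000.0))
```

The off-manifold density is numerically zero at \(T=1\) but of order one at \(T=5000\), which is precisely why the hot chains can traverse the gap between points on the manifold.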

Fig. 2
figure 2

Tempered densities (with \(T_1=1,\ T_2=17.1,\ T_3=292.4,\ T_4=5000\)) for the density concentrated around a quarter circle-shaped manifold example. As we can see, the density becomes less concentrated as the temperature increases, which allows us to use RWM proposals with larger step sizes

We aim at estimating \({\mathcal {Q}}_i=\mathbb {E}_{\mu }[\theta _i]\), \(i=1,2\), by the ergodic estimators \({\hat{\theta }}_i\). For the tempered algorithms (PT, PSDPT, UGPT, and WGPT), we consider \(K=4\) temperatures and choose \(T_4=5000\), so that the tempered density \(\pi _4\) becomes sufficiently simple to explore. This gives \(T_1=1,\ T_2=17.1,\ T_3=292.4,\ T_4=5000\).

We compare the quality of our algorithms by examining the variance of the estimators \({\hat{\theta }}_i,\) \(i=1,2,\) computed over \(N_\text {runs}=100\) independent MCMC runs of each algorithm. For the tempered algorithms, each estimator is obtained by running the inversion experiment for \(N=25{,}000\) samples per run, discarding the first 20% of the samples (5000) as burn-in. Accordingly, we run the single-chain random walk Metropolis algorithm for \(N_\mathrm {RWM}=KN=100{,}000\) iterations per run and discard the first 20% of the samples obtained with the RWM algorithm (20,000) as burn-in.

The untempered RWM algorithm uses Gaussian proposals with covariance matrix \(\Sigma _\mathrm {RWM}=\rho _1^2 I_{2\times 2},\) where \(I_{2\times 2}\) is the identity matrix in \(\mathbb {R}^{2\times 2}\), and \(\rho ^2_1=0.022\) is chosen in order to obtain an acceptance rate of around 0.23. For the tempered algorithms (i.e., PT, PSDPT, and both versions of GPT), we use \(K=4\) RWM kernels \(p_k\), \(k=1,2,3,4\), with proposal density \(q_{\mathrm {prop},k}(\theta ^{(n)}_k,\cdot )={\mathcal {N}}(\theta _k^{(n)},\rho _k^2I_{2\times 2})\), where \(\rho _k\) is shown in Table 1. This choice of \(\rho _k\) gives an acceptance rate for each chain of around 0.23. Notice that \(\rho _1\) corresponds to the “step-size” of the single-temperature RWM algorithm.
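The per-temperature kernels are plain RWM steps differing only in their step size. A generic sketch (names are ours; the \(\rho _k\) of Table 1 would be supplied as `step`):

```python
import numpy as np

rng = np.random.default_rng(0)

def rwm_step(theta, log_density, step):
    """One random walk Metropolis step with N(theta, step^2 I) proposal."""
    proposal = theta + step * rng.standard_normal(theta.shape)
    # accept with probability min(1, pi(proposal) / pi(theta))
    if np.log(rng.uniform()) < log_density(proposal) - log_density(theta):
        return proposal, True
    return theta, False

# toy usage on a standard-normal target; at higher temperatures the tempered
# density is flatter, so larger step sizes still give a reasonable acceptance rate
log_pi = lambda th: -0.5 * np.sum(th**2)
theta, accepted = rwm_step(np.array([0.1, 0.2]), log_pi, step=0.15)
```

In the tempered algorithms, one such kernel per temperature is applied between swap proposals, each with its own \(\rho _k\) tuned to an acceptance rate of about 0.23.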

Table 1 Step size of the RWM proposal distribution for the manifold experiment

Experimental results for the ergodic run are shown in Table 2. We can see how both GPT algorithms provide a gain over the RWM, PT and PSDPT algorithms, with the WGPT algorithm providing the largest gain. Scatter plots of the samples obtained with each method are presented in Fig. 3. Here, the subplot titled “WGPT” (bottom row, middle) corresponds to weighted samples from \({\varvec{\mu }}_{\text {W}}\), with weight \({\hat{w}}\) as in (13), while the one titled “WGPT (inv)” (bottom row, right) corresponds to samples from \({\varvec{\mu }}_{\mathrm {W}}\) without any post-processing. Notice how the samples from the latter concentrate over a thicker manifold, which in turn makes the target density easier to explore when using state-dependent Markov transition kernels.
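The re-weighting post-processing step can be illustrated by a self-normalized weighted average. This is only a sketch: the actual weight \({\hat{w}}\) of (13) is assumed to be given, and the helper name is ours.

```python
import numpy as np

def weighted_estimate(samples, weights):
    """Self-normalized average: estimates E_mu[theta] from samples of the
    auxiliary measure together with their (unnormalized) weights."""
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * np.asarray(samples, dtype=float)).sum(axis=0) / w.sum()
```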

Table 2 Results for the density concentrated around a circle-shaped manifold experiment. As we can see, both GPT algorithms provide an improvement over PT, PSDPT and RWM. The computational cost is comparable across all algorithms
Fig. 3
figure 3

Scatter-plots of the samples from \(\mu \) obtained with each algorithm on a single run. Top, from left to right: random walk Metropolis, PT and PSDPT. Bottom, from left to right: UGPT, WGPT (after re-weighting the samples), and WGPT, before re-weighting the samples

Multiple source elliptic BIP

We now consider a slightly more challenging problem, for which we try to recover the probability distribution of the location of a source term in a Poisson equation (Eq. 20), based on some noisy measured data. Let \((\varTheta ,{\mathcal {B}}(\varTheta ),\mu _\mathrm {pr})\) be a measure space, set \(\varTheta ={\bar{D}}{:}{=}[0,1]^2\), with Lebesgue (uniform) measure \(\mu _\mathrm {pr},\) and consider the following Poisson equation with homogeneous boundary conditions:

$$\begin{aligned} {\left\{ \begin{array}{ll} \varDelta u(x,\theta )=f( x,\theta ),&{} x\in D,\ \theta \in \varTheta ,\\ u(x,\theta )=0,&{} x\in \partial D. \end{array}\right. } \end{aligned}$$
(20)

Such an equation can model, for example, the electrostatic potential \(u{:}{=}u( x,\theta )\) generated by a charge density \(f( x,\theta )\) depending on an uncertain location parameter \(\theta \in \varTheta \). Data y is recorded on an array of \(64\times 64\) equally-spaced points in D by solving (20) with a forcing term given by

$$\begin{aligned} f(x)= \sum _{i=1}^{4}e^{-1000[(x_1-s ^{(i)}_1)^2 +(x_2-s^{(i)}_2)^2]}, \end{aligned}$$
(21)

where the true source locations \(s^{(i)}, \ i=1,2,3,4,\) are given by \(s^{(1)}=(0.2,0.2),\ s^{(2)}=(0.2,0.8),\ s^{(3)}=(0.8,0.2),\) and \(s^{(4)}=(0.8,0.8)\). Such data is assumed to be polluted by an additive Gaussian noise with distribution \({\mathcal {N}}(0,\eta ^2 I_{64\times 64})\), with \(\eta =3.2\times 10^{-6}\) (which corresponds to a 1% noise), where \(I_{64\times 64}\) is the \(64\times 64\) identity matrix. Thus, we set \((Y,\left\| \cdot \right\| _Y)=(\mathbb {R}^{64\times 64},\left\| \cdot \right\| )\), with \(\left\| A \right\|^2 =(64\eta )^{-2}\left\| A \right\| ^2_F \) for an arbitrary matrix \(A\in \mathbb {R}^{64\times 64},\) where \(\left\| \cdot \right\| _F\) is the Frobenius norm. We assume a misspecified model where we only consider a single source in Eq. (21). That is, we construct our forward operator \({\mathcal {F}}: \varTheta \mapsto Y\) by solving (20) with a source term given by

$$\begin{aligned} f(x,\theta )= e^{-1000[(x_1-\theta _1)^2 +(x_2-\theta _2)^2]}. \end{aligned}$$
(22)

In this particular setting, this leads to a posterior distribution with four modes since the prior density is uniform in the domain and the likelihood has a local maximum whenever \((\theta _1,\theta _2)=(s_1^{(i)},s_2^{(i)}),\ i=1,2,3,4\). The Bayesian inverse problem at hand can be understood as sampling from the posterior measure \(\mu \), which has a density with respect to the prior \(\mu _\mathrm {pr}={\mathcal {U}}({\bar{D}})\) given by

$$\begin{aligned} \pi (\theta )=\frac{1}{Z}\exp \left( -\frac{1}{2}\left\| y-{\mathcal {F}}(\theta )\right\| ^2_\Sigma \right) , \end{aligned}$$

for some (intractable) normalization constant Z as in (4). We remark that the solution to (20) with a forcing term of the form of (22) is approximated using a second-order accurate finite difference approximation with grid-size \(h=1/64\) on each spatial component.
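The second-order finite difference discretization mentioned above can be sketched as follows. This is an illustrative SciPy-based implementation, not the authors' code; it uses the standard 5-point stencil on the interior nodes with homogeneous Dirichlet boundary conditions.

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import spsolve

def poisson_solve(f_interior, h):
    """Solve Laplace(u) = f on the unit square with u = 0 on the boundary,
    using the second-order 5-point stencil on an n x n interior grid."""
    n = f_interior.shape[0]
    d2 = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h ** 2
    lap = kron(identity(n), d2) + kron(d2, identity(n))  # 2D Laplacian
    return spsolve(lap.tocsc(), f_interior.ravel()).reshape(n, n)
```

With \(h=1/64\) this gives the \(63\times 63\) interior-node system used for each forward solve.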

The difficulty in sampling from the current BIP arises from the fact that the resulting posterior \(\mu \) is multi-modal and the number of modes is not known a priori (see Fig. 4).

Fig. 4
figure 4

True tempered densities for the elliptic BIP example. Notice that the density is not symmetric, due to the additional random noise

We follow a similar experimental setup to the previous example, and aim at estimating \(\hat{{\mathcal {Q}}}_i=\mathbb {E}_{\mu }[\theta _i]\approx {\hat{\theta _i}}\), for \(i=1,2\), with \(N_\text {runs}=100\). For the PT, PSDPT and GPT algorithms, \(K=4\) temperatures are used, with \(T_1=1,\ T_2=7.36,\ T_3=54.28,\) and \(T_4=400\). For each run, we obtain \(N=25,000\) samples with the PT, PSDPT, and both GPT algorithms, and \(N=100,000\) samples with RWM, discarding the first 20% of the samples in both cases (5000 and 20,000, respectively) as a burn-in. On each of the tempered chains, we use RWM proposals, with step-sizes shown in Table 3. This choice of step size provides an acceptance rate of about 0.24 across all tempered chains and all tempered algorithms. For the single-temperature RWM run, we choose a larger step size (\(\rho _\mathrm {RWM}=0.16\)) so that the RWM algorithm is able to explore the whole distribution. Such a choice, however, yields a smaller acceptance rate of about 0.01 for the single-chain RWM.

Experimental results are shown in Table 4. Once again, we can see how both GPT algorithms provide a gain over RWM and both variations of the PT algorithm, with the WGPT algorithm providing the larger gain. Scatter-plots of the obtained samples are shown in Fig. 5.

Table 3 Step size of the RWM proposal distribution for the elliptic BIP experiment
Table 4 Results for the elliptic BIP problem. The computational cost is comparable across all algorithms, given that the cost of each iteration is dominated by the cost of solving the underlying PDE
Fig. 5
figure 5

Scatterplots of the samples from \(\mu \) obtained with different algorithms on a single run. Top, from left to right: random walk Metropolis, PT and PSDPT. Bottom, from left to right: UGPT, WGPT (after re-weighting the samples), and WGPT, before re-weighting the samples. As we can see, WGPT (before re-weighting) is able to “connect” the parameter space

1D wave source inversion

We consider a small variation of Example 5.1 in Motamed and Appelo (2019). Let \((\varTheta ,{\mathcal {B}}(\varTheta ),\mu _\mathrm {pr})\) be a measure space, with \(\varTheta =[-5,5]\) and uniform (Lebesgue) measure \(\mu _\mathrm {pr}\), and let \(I=(0,T]\) be a time interval. Consider the following Cauchy problem for the 1D wave equation:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_{tt}(x,t,\theta )-u_{xx}(x,t,\theta )=0,&{} (x,t,\theta ) \in \mathbb {R}\times I\times \varTheta ,\\ u(x,0,\theta )=h(x,\theta ), &{}(x,t,\theta ) \in \mathbb {R}\times \{0\}\times \varTheta ,\\ u_t(x,0,\theta )=0,&{} (x,t,\theta ) \in \mathbb {R}\times \{0\}\times \varTheta . \end{array}\right. } \end{aligned}$$
(23)

Here, \(h(x,\theta )\) acts as a source term generating an initial wave pulse. Notice that Eq. (23) can be easily solved using d’Alembert’s formula, namely

$$\begin{aligned} u(x,t,\theta )=\frac{1}{2}\left( h(x-t,\theta )+h(x+t,\theta )\right) . \end{aligned}$$

Synthetic data y is generated by solving Eq. (23) with initial data

$$\begin{aligned} h(x,\theta _1,\theta _2)&=\frac{1}{2}\left( e^{-100(x-\theta _1-0.5)^2} +e^{-100(x-\theta _1)^2}\right. \\&\left. +e^{-100(x-\theta _1+0.5)^2}+e^{-100(x-\theta _2-0.5)^2}\right. \\&\left. +e^{-100(x-\theta _2)^2}+e^{-100(x-\theta _2+0.5)^2}\right) , \end{aligned}$$

with \(\theta _1=-3,\theta _2=3\) and observed at \(N_R=11\) equally-spaced receiver locations between \(R_1=-5\) and \(R_2=5\) at \(N_T=1000\) time instants between \(t=0\) and \(T=5\). The signal recorded by each receiver is assumed to be polluted by additive Gaussian noise \({\mathcal {N}}(0,\eta ^2 I_{1000\times 1000})\), with \(\eta =0.01\), which corresponds to roughly 1% noise. We set \((Y,\left\| \cdot \right\| _{Y})=(\mathbb {R}^{11\times 1000}, \left\| \cdot \right\| _{\Sigma }), \) with

$$\begin{aligned} \left\| A \right\| _{\Sigma }^2=(\sqrt{N_R}\eta )^{-2}\sum _{i=1}^{N_R}\sum _{j=1}^{N_T}A_{i,j}^2, \end{aligned}$$

for an arbitrary \(A\in \mathbb {R}^{11\times 1000}\). Once again, we assume a misspecified model where we construct our forward operator \({\mathcal {F}}: \varTheta \mapsto Y\) by solving (23) with a source term given by

$$\begin{aligned} h(x,\theta )&=\left( e^{-100(x-\theta -0.5)^2}\right. \\&\quad \ \left. +\,e^{-100(x-\theta )^2}+e^{-100(x-\theta +0.5)^2}\right) . \end{aligned}$$
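Since the forward map only requires d'Alembert's formula, it can be sketched in a few lines. The function names and vectorization choices below are ours; the receiver and time grids match the experiment's description.

```python
import numpy as np

def pulse(x, c):
    """Triplet of Gaussian pulses centered at c - 0.5, c and c + 0.5,
    i.e., the single-source initial data h(x, theta)."""
    return sum(np.exp(-100.0 * (x - c - s) ** 2) for s in (-0.5, 0.0, 0.5))

def forward(theta, receivers, times):
    """d'Alembert solution u(x, t) = (h(x - t) + h(x + t)) / 2, sampled
    at the receiver locations and time instants."""
    X, T = np.meshgrid(receivers, times, indexing="ij")
    return 0.5 * (pulse(X - T, theta) + pulse(X + T, theta))

receivers = np.linspace(-5.0, 5.0, 11)  # N_R = 11 receiver locations
times = np.linspace(0.0, 5.0, 1000)     # N_T = 1000 time instants
G = forward(-3.0, receivers, times)     # shape (11, 1000)
```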

The Bayesian inverse problem at hand can be understood as sampling from the posterior measure \(\mu \), which has a density with respect to the prior \(\mu _\mathrm {pr}={\mathcal {U}}([-5,5])\) given by

$$\begin{aligned} \pi (\theta )&=\frac{1}{Z}\exp \left( -\frac{1}{2}\left\| y-{\mathcal {F}}(\theta )\right\| ^2_\Sigma \right) \nonumber \\&=\frac{1}{Z}\exp \left( -\varPhi (\theta ;y)\right) , \end{aligned}$$
(24)

for some (intractable) normalization constant Z as in (4). The difficulty in solving this BIP comes from the high multi-modality of the potential \(\varPhi (\theta ;y)\), as can be seen in Fig. 6. This shape of \(\varPhi (\theta ;y)\) makes the posterior difficult to explore using local proposals.

Fig. 6
figure 6

Multi-modal potential for the Cauchy problem. Notice the minima around \(\theta =-3\) and \(\theta =3\)

In this case, we consider \(K=5\), and set \(T_1=1,\ \ T_2=5,\ \ T_3=25,\) \(T_4=125\) and \(T_5=625\). Notice that from Fig. 1, the computational cost per sample is dominated by the evaluation of (24) for values of \(K\le 6\). Once again, we obtain \(N=25,000\) samples with the PT, PSDPT, and both GPT algorithms, and \(N=125,000\) samples with RWM, discarding the first 20% of the samples in both cases (5000 and 25,000, respectively) as a burn-in. On each of the tempered chains, we use RWM proposals, with step-sizes shown in Table 5. This choice of step size provides an acceptance rate of about 0.4 across all tempered chains and all tempered algorithms. The step-size for the RWM algorithm is chosen in such a way that it can “jump” between modes, which are at a distance of roughly 1/2 from each other.

We consider \({\mathcal {Q}}=\theta \) as a quantity of interest. Experimental results are shown in Table 6. Once again, we can see how both GPT algorithms provide a gain over RWM and both variations of the PT algorithm, with the WGPT algorithm providing the largest gain. Notice that, given the high multi-modality of the posterior at hand, the simple RWM algorithm is not well-suited for this type of distribution, as can be seen from its large variance; this suggests that the RWM algorithm usually gets “stuck” in one mode of the posterior. Notice that, intuitively, due to the symmetric nature of the potential, one would expect the true mean of \(\theta \) to be close to 0. This value was computed by means of numerical integration and is given by \(\mathbb {E}_\mu [\theta ]=0.08211\).
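Such a reference value can be reproduced by one-dimensional quadrature of the unnormalized posterior density. The sketch below is illustrative (the helper name and the simple Riemann-sum rule are our choices, not necessarily the quadrature the authors used):

```python
import numpy as np

def posterior_mean(log_post, a=-5.0, b=5.0, n=200001):
    """Reference value of E_mu[theta] on [a, b]: Riemann-sum quadrature
    of the unnormalized density; the normalization cancels in the ratio."""
    theta = np.linspace(a, b, n)
    lp = log_post(theta)
    w = np.exp(lp - lp.max())  # shift log-density to avoid underflow
    return float((theta * w).sum() / w.sum())
```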

Table 5 Step size of the RWM proposal distribution for the Cauchy BIP experiment
Table 6 Results for the 1D Cauchy BIP problem. The computational cost is comparable across all algorithms

Acoustic wave source inversion

We consider a more challenging problem, for which we try to recover the probability distribution of the spatial location of a (point-like) source term, together with the material properties of the medium, on an acoustic wave equation (Eq. 25), based on some noisy measured data. We begin by describing the mathematical model of such wave phenomena. Let \((\varTheta ,{\mathcal {B}}(\varTheta ),\mu _\mathrm {pr})\) be a measure space, with Lebesgue (uniform) measure \(\mu _\mathrm {pr}\), set \({\bar{D}}{:}{=}[0,3]\times [0,2],\) \(\partial D={\bar{\varGamma }}_{\mathrm {N}}\cup {\bar{\varGamma }}_{\mathrm {Abs}},\ \mathring{\varGamma _{\mathrm {N}}}\cap \mathring{\varGamma _{\mathrm {Abs}}}=\emptyset ,\) \(|{\varGamma _{\mathrm {N}}}|, |{\varGamma _{\mathrm {Abs}}}|>0,\) and define \(\varTheta =D\times \varTheta _\alpha \times \varTheta _\beta ,\) where \(\varTheta _\alpha =[6,14]\), \(\varTheta _\beta =[4500,5500]\). Here, we are considering a rectangular spatial domain D, with the top boundary denoted by \({\varGamma _{\mathrm {N}}}\) and the side and bottom boundaries denoted by \({\varGamma _{\mathrm {Abs}}}\). Lastly, let \(\theta {:}{=}(s_1,s_2,\alpha ,\beta )\in \varTheta \). Consider the following acoustic wave equation with absorbing boundary conditions:

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha ^2 u_{tt}-\nabla \cdot (\beta ^2\nabla u)=f,&{} \text {in } D\times (0,T)\times \varTheta ,\\ u=u_t=0,&{} \text {in } D\times \{0\}\times \varTheta ,\\ \beta ^2 \nabla u\cdot {{\hat{n}}}=0,&{} \text {on }{\varGamma _{\mathrm {N}}}\times (0,T)\times \varTheta ,\\ \beta ^2 \nabla u\cdot {{\hat{n}}}=-\alpha \beta u_t,&{} \text {on } {\varGamma _{\mathrm {Abs}}}\times (0,T)\times \varTheta , \end{array}\right. } \end{aligned}$$
(25)

where \(u=u(x,t,\theta )\), and \(f=f(x,t,\theta )\). Here the boundary condition on the top boundary \({\varGamma _{\mathrm {N}}}\) corresponds to a Neumann boundary condition, while the boundary condition on \({\varGamma _{\mathrm {Abs}}}\) corresponds to the so-called absorbing boundary condition, a type of artificial boundary condition used to minimize reflection of waves hitting the boundary. Data \(y\in Y\) is obtained by solving Eq. (25) with a force term given by

$$\begin{aligned} f(x,t,\theta )&=10^{11} e^{-\frac{1}{2\cdot 0.1^2}\left[ \left( x_1-s_1\right) ^2+\left( x_2-s_2\right) ^2\right] } \nonumber \\&\ \quad \times (1-2\cdot 1000\pi ^2t^2)e^{-2\cdot 1000^2\pi ^2t^2}, \end{aligned}$$
(26)

with a true set of parameters \(\varTheta \ni \theta ^*{:}{=}(s_1,s_2,\alpha ,\beta )\) given by \(s_1=1.5,s_2=1.0,\alpha =10,\beta =5000\), and observed on \(N_R=3\) different receiver locations \(R_1=(1.0,2.0),R_2=(1.5,2.0),R_3=(2.0,2.0)\) at \(N_T=117\) equally-spaced time instants between \(t=0\) and \(t=0.004\). In physical terms, the parameters \(s_1,s_2\) represent the source location, while the parameters \(\alpha ,\beta \) are related to the material properties of the medium. Notice that, in a slight abuse of notation, we have used the symbol \(\pi \) to represent the number \(3.14159\dots \) in Eq. (26); it should not be confused with the symbol for density. The data measured by each receiver is polluted by additive Gaussian noise \({\mathcal {N}}(0,\eta ^2 I_{117\times 117})\), with \(\eta =0.013\), which corresponds to roughly a 2% noise. Thus, we have that \((Y,\left\| \cdot \right\| _Y)=(\mathbb {R}^{3\times 117},\left\| \cdot \right\| _\Sigma ), \) where \(\left\| A\right\| _\Sigma ^2{:}{=}(\sqrt{N_R}\eta )^{-2}\sum _{i=1}^{N_R}\sum _{j=1}^{N_T}A_{i,j}^2\). The forward mapping operator \({\mathcal {F}}:\varTheta \mapsto Y\) can then be understood as the numerical solution of Eq. (25) evaluated at 117 discrete time instants at each of the 3 receiver locations. Such a numerical approximation is obtained by the finite element method using piece-wise linear elements, and the time stepping is done using a forward Euler scheme with sufficiently small time-steps to respect the so-called Courant-Friedrichs-Lewy condition (Quarteroni and Quarteroni 2009). This numerical solution is implemented using the Python library FEniCS (Logg et al. 2012), using \(40\times 40\) triangular elements. The Bayesian inverse problem at hand can thus be understood as sampling from the posterior measure \(\mu \), which has a density with respect to the prior \(\mu _\mathrm {pr}={\mathcal {U}}(\varTheta )\) given by
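For concreteness, the force term (26) can be evaluated directly: it is a Gaussian in space centered at \((s_1,s_2)\) times a Ricker-type wavelet in time. The helper below is illustrative and not part of the FEniCS implementation.

```python
import math

def source(x1, x2, t, s1, s2):
    """Point-like force term of Eq. (26): Gaussian in space around
    (s1, s2), Ricker-type wavelet in time."""
    spatial = 1e11 * math.exp(
        -((x1 - s1) ** 2 + (x2 - s2) ** 2) / (2 * 0.1 ** 2))
    temporal = (1.0 - 2.0 * 1000.0 * math.pi ** 2 * t ** 2) * math.exp(
        -2.0 * 1000.0 ** 2 * math.pi ** 2 * t ** 2)
    return spatial * temporal
```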

$$\begin{aligned} \pi (\theta )=\frac{1}{Z}\exp \left( -\frac{1}{2}\left\| y-{\mathcal {F}}(\theta )\right\| ^2_\Sigma \right) . \end{aligned}$$

The previous BIP presents two difficulties. On the one hand, Eq. (25) is typically expensive to solve, which in turn translates into expensive evaluations of the posterior density. On the other hand, the log-likelihood has an extremely complicated structure, which makes its exploration difficult. This can be seen in Fig. 7, where we plot the log-likelihood for different source locations \((s_1,s_2)\) and for fixed values of the material properties \(\alpha =10,\beta =5000. \) More precisely, we plot \({\tilde{\varPhi }}((s_1,s_2);y){:}{=}-\frac{1}{2}\left\| y-{\mathcal {F}}(s_1,s_2,10,5000)\right\| ^2_\Sigma \) on a grid of \(100\times 100\) equally spaced points \((s_1,s_2)\) in D. It can be seen that, even though the log-likelihood has a clear peak around the true value of \((s_1,s_2)\), there are also regions of relatively high concentration of log-probability, surrounded by regions with significantly smaller log-probability, making it a suitable problem for our setting.

Fig. 7
figure 7

Plot of the log-likelihood for different values of \(s_1\), \(s_2\) and fixed values of \(\alpha =10\) and \(\beta =5000\). The magenta points represent the receiver locations \(R_1,R_2,R_3\). The black point represents the true location of the source \((s_1,s_2)=(1.5,1.0)\)

Following the same set-up of previous experiments, we aim at estimating \(\hat{{\mathcal {Q}}}_i=\mathbb {E}_{\mu }[\theta _i]\approx {\hat{\theta _i}}\), for \(i=1,2\). Once again, we consider \(K=4\) temperatures for the tempered algorithms (PT, PSDPT, UGPT, and WGPT), and set the temperatures to \(T_1=1, T_2=7.36, T_3=54.28,\) \(T_4=400\). We compare the quality of our algorithms by examining the variance of the estimators \({\hat{\theta _i}},\) \(i=1,2\), computed over \(N_\text {runs}=50\) independent MCMC runs of each algorithm. For each run, we run the tempered algorithms obtaining \(N=7,000\) samples, discarding the first 20% of the samples (1400) as a burn-in. For the RWM algorithm, we run the inversion experiment for \(N_\mathrm {RWM}=K N=28,000\) iterations, and discard the first 20% of the samples obtained (5600) as a burn-in.

Each individual chain is constructed using Gaussian RWM proposals \(q_{\mathrm {prop},k}(\theta ^{n}_k,\cdot )={\mathcal {N}}(\theta ^{n}_k,{\mathcal {C}}_k)\), \(k=1,2,3,4\), with covariance \({\mathcal {C}}_k\) described in Table 7. The covariance is tuned in such a way that the acceptance rate of each chain is around 0.2. The variance of the estimators obtained with each method is presented in Table 8. Once again, our GPT algorithms outperform all other tested methods for this particular setting. In particular, our methods provide substantial computational gains when compared to RWM and the PSDPT algorithm of Łącki and Miasojedow (2016), as well as moderate computational gains when compared to the standard PT.

Table 7 Step size of the RWM proposal distribution for the acoustic BIP experiment. Here \(\mathrm {Diag}(d_1,d_2,\dots ,d_N)\) is to be understood as the \(N\times N\) diagonal matrix with entries \(d_1,d_2,\dots ,d_N\)
Table 8 Results for the acoustic BIP problem. Once again, we can see that both GPT algorithms provide an improvement over RWM, PT and PSDPT. The computational cost is comparable across all algorithms, given that the cost of each iteration is dominated by the cost of solving the underlying PDE

High-dimensional acoustic wave inversion

Lastly, we present a high-dimensional example for which we try to invert for the material properties \(\beta ^2\) in (25). For simplicity, we will consider fixed values of \(\alpha =1,\) \(s_1=1.5,\) and \(s_2=1\). In this case, we set \(\beta ^2=10+\hat{\beta }^2(x)\), where \(\hat{\beta }(x)\) is taken to be a realization of a random field discretized on a mesh of \(N_x\times N_y\) triangular elements. This modeling choice ensures that \(\beta ^2\) is bounded from below. In this case, we will invert for the nodal values of (the finite element discretization of) \(\hat{\beta }\), which naturally results in a high-dimensional problem. We remark that one is usually interested in including the randomness in \(\beta ^2\), instead of in \(\hat{\beta }\); however, we purposely invert for \(\hat{\beta }\) to induce an explicitly multi-modal posterior, and as such, to better illustrate the advantages of our proposed methods when sampling from these types of distributions.

We begin by formalizing the finite-element discretization of the parameter space (see e.g., Bui-Thanh and Nguyen (2016) for a more detailed discussion).

Let \({\bar{D}}=[0,3]\times [0,2]\) denote the physical space of the problem and let \(V_h\) be a finite-dimensional subspace of \(L_2(D)\) arising from a given finite element discretization. We write the finite element approximation \(\hat{\beta }_h\in V_h\) of \(\hat{\beta }\) as

$$\begin{aligned} \hat{\beta }(x)\approx \hat{\beta }_h(x)=\sum _{n=1}^{N_\text {v}}b_n\phi _n(x), \end{aligned}$$

where \(\{\phi _n\}_{n=1}^{N_\text {v}}\) are the Lagrange basis functions corresponding to the nodal points \(\{x_n\}_{n=1}^{N_\text {v}}\), \((b_1,\dots ,b_{N_\text {v}})^T=:\theta \in \mathbb {R}^{N_\text {v}}\) is the set of nodal parameters and \(N_\text {v}\) corresponds to the number of vertices used in the FE discretization. Thus, the problem of inferring the distribution of \(\beta \) given some data y can be understood as inferring the distribution of \(\theta \) given y. For our particular case, we will discretize D using \(28\times 28\) (non-overlapping) piece-wise linear finite elements, which results in \(N_\text {v}=841\) and as such \(\varTheta =\mathbb {R}^{841}\). We consider a Gaussian prior \(\mu _\mathrm {pr,\infty }={\mathcal {N}}(0,{\mathcal {A}}^{-2})\), where \({\mathcal {A}}\) is a differential operator acting on \(L_2(D)\) of the form

$$\begin{aligned} {\mathcal {A}}{:}{=}-a\nabla \cdot (H \nabla )+d I, \quad a,d>0, \end{aligned}$$

together with Robin boundary conditions \(\nabla (\cdot ) \cdot {{\hat{n}}} +\sqrt{ad}(\cdot )=0\), where, following Villa et al. (2021), H is taken of the form

$$\begin{aligned} H{:}{=}\begin{pmatrix} e_1\sin ^2(\ell )+e_2\cos ^2(\ell ) &{} (e_1-e_2)\sin (\ell )\cos (\ell )\\ (e_1-e_2)\sin (\ell )\cos (\ell ) &{} e_1\cos ^2(\ell )+e_2\sin ^2(\ell ) \end{pmatrix}. \end{aligned}$$
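This tensor is a rotation of \(\mathrm {diag}(e_1,e_2)\) by the angle \(\ell \), so its eigenvalues are exactly \(e_1\) and \(e_2\). A sketch (the helper name is ours):

```python
import numpy as np

def anisotropy(e1, e2, ell):
    """Anisotropy tensor H: a rotation by ell of diag(e1, e2), hence
    symmetric with eigenvalues e1 and e2."""
    s, c = np.sin(ell), np.cos(ell)
    return np.array([[e1 * s ** 2 + e2 * c ** 2, (e1 - e2) * s * c],
                     [(e1 - e2) * s * c, e1 * c ** 2 + e2 * s ** 2]])
```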

Here H models the spatial anisotropy of a Gaussian random field sampled from \(\mu _\mathrm {pr,\infty }\). It is known that for a two-dimensional (spatial) space, the covariance operator \({\mathcal {A}}^{-2}\) is symmetric and trace-class (Bui-Thanh and Nguyen 2016), and as such, the (infinite-dimensional) prior measure is well-defined. Thus, we set

$$\begin{aligned} \hat{\beta }(x)\sim {\mu }_\mathrm {pr,\infty }, \end{aligned}$$

which in turn induces the discretized prior:

$$\begin{aligned} {\hat{\beta }_h}(x)\sim \mu _\mathrm {pr}{:}{=}{\mathcal {N}}(0,{\mathcal {A}}_h^{-2}), \end{aligned}$$

where \({\mathcal {A}}_h\) is a finite-element approximation of \({\mathcal {A}}\) using \(28\times 28\) (non-overlapping) piece-wise linear finite elements. Samples from \(\mu _\mathrm {pr}\) are obtained using the FEniCS package (Logg et al. 2012) and the hIPPYlib library (Villa et al. 2021).

We follow an approach similar to our previous example. We collect data \(y\in Y\) by solving Eq. (25) with a force term given by (26) and a true field \(\hat{\beta }^*_h\sim \mu _\mathrm {pr}\) with \(a=0.1\), \(d=0.5\), \(\ell =\pi /4\), \(e_1=2\) and \(e_2=0.5\). Such a realization of \(\hat{\beta }^*_h\) is shown in Fig. 8.

Fig. 8
figure 8

True field \(\hat{\beta }^*_h(x)\). Notice the anisotropy on the field. The magenta points represent the receiver locations. The black line represents the zero-level set of the field

Furthermore, data is observed at \(N_R=5\) different receiver locations \(R_1=(1.0,2.0),R_2=(1.25,2.0),R_3=(1.5,2.0),R_4=(1.75,2.0)\), and \(R_5=(2.0,2.0)\) at \(N_T=600\) equally-spaced time instants between \(t=0\) and \(t=0.6\). The data measured by each receiver is polluted by an (independent) additive Gaussian noise \(\eta \sim {\mathcal {N}}(0,\sigma _\mathrm {noise}^2 I_{600\times 600})\), with \(\sigma _\mathrm {noise}=0.021\), which corresponds to roughly a 0.5% noise. Thus, we have that \((Y,\left\| \cdot \right\| _Y)=(\mathbb {R}^{5\times 600},\left\| \cdot \right\| _\Sigma )\). As in Sect. 5.6, the forward mapping operator \({\mathcal {F}}:\varTheta \mapsto Y\) can be understood as the numerical solution of Eq. (25) evaluated at 600 discrete time instants at each of the 5 receiver locations. The numerical implementation follows a set-up similar to that of Sect. 5.6; however, for simplicity, we use \(28\times 28\) triangular elements to approximate the forward operator \({\mathcal {F}}\). The Bayesian inverse problem at hand can thus be understood as sampling from the posterior measure \(\mu \), which has a Radon-Nikodym derivative with respect to the prior \(\mu _\mathrm {pr}\) given by

$$\begin{aligned} \pi (\theta )=\frac{\mathrm {d} \mu }{\mathrm {d} \mu _\mathrm {pr}}(\theta )=\frac{1}{Z}\exp \left( -\frac{1}{2}\left\| y-{\mathcal {F}}(\theta )\right\| ^2_\Sigma \right) . \end{aligned}$$
(27)

The previous BIP has several difficulties: clearly, it is a high-dimensional posterior. Furthermore, just as in the previous example, the underlying mathematical model for the forward operator is a costly time-dependent PDE. By choosing to invert for \(\hat{\beta }_h\sim \mu _\mathrm {pr}\) (instead of \(\beta _h^2\approx \beta ^2\)), and since \(\mu _\mathrm {pr}\) is centered at zero, we induce a multi-modal posterior: indeed, if the posterior concentrates around \(\hat{\beta }^*_h\), it will also have peaks at any other \(\hat{\beta }^j\) obtained by flipping the sign of \(\hat{\beta }^*_h\) in a concentrated region separated by the zero level set of \(\hat{\beta }^*_h\) (we identify 7 such regions in Fig. 8). This can be seen in Fig. 9, where we plot 4 posterior samples \(\hat{\beta }_h\sim \mu \). Notice the change in sign between some regions. Lastly, as quantities of interest, we consider \({\mathcal {Q}}_1=\int _D \exp (\hat{\beta }(x))\mathrm {d}x\) and \({\mathcal {Q}}_2=\exp (\hat{\beta }(1.5,1))\). We remark that, although these quantities of interest do not have any meaningful physical interpretation, they are affected by the multi-modality of the posterior, and as such, are well suited to exemplify the capabilities of our method.
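Once the nodal values are available, such quantities of interest are straightforward to approximate on a grid. A sketch (the Riemann-sum rule and the helper name are illustrative choices, not the quadrature used in the paper):

```python
import numpy as np

def qoi_1(beta_nodal, dx, dy):
    """Q1 = integral over D of exp(beta_h), approximated by a simple
    Riemann sum on the nodal grid; Q2 would read off exp(beta_h) at the
    node closest to (1.5, 1)."""
    return float(np.exp(beta_nodal).sum() * dx * dy)
```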

Given the high-dimensionality of the posterior, we present a slightly different experimental setup in order to estimate \(\mathbb {E}_{\mu }[{\mathcal {Q}}_i]\approx \hat{{\mathcal {Q}}}_i\), \(i=1,2\). In particular, we use the preconditioned Crank-Nicolson (pCN) algorithm as a base method, instead of RWM, for the transition kernel p. We compare the quality of our algorithms by examining the variance of the estimators \(\hat{{\mathcal {Q}}}_i\) computed over \(N_\text {runs}=50\) independent MCMC runs of each algorithm, with \(K=4\) temperatures for the tempered algorithms given by \(T_1=1, T_2=4.57, T_3=20.89, T_4=100\). For the tempered algorithms, each estimator is obtained by running the inversion experiment for \(N=4,800\) samples, discarding the first 20% of the samples (800) as a burn-in. For the untempered pCN algorithm, we run the inversion experiment for \(N_\mathrm {pCN}=K N=19,200\) iterations, and discard the first 20% of the samples obtained (3840) as a burn-in.

Each individual chain is constructed using pCN proposals \(q_{\mathrm {prop},k}(\theta ^{n}_k,\cdot )={\mathcal {N}}(\sqrt{1-\rho _k^2}\theta ^{n}_k,\rho _k^2 {\mathcal {A}}_h^{-2})\), \(k=1,2,3,4\). The simple, un-tempered pCN algorithm is run with a step size given by \(\rho =\rho _1\). The values of \(\rho _k\), reported in Table 9, are tuned in such a way that the acceptance rate of each chain is around 0.3. The variance of the estimators obtained with each method is presented in Table 10. Once again, even for this high-dimensional, highly multi-modal case, our proposed methods perform considerably better than the other algorithms.
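A pCN step accepts with the likelihood ratio only, since the proposal is reversible with respect to the Gaussian prior; this is what makes it robust to the dimension of the parameter space. A minimal sketch (function names are ours; the prior sampler stands in for draws from \({\mathcal {N}}(0,{\mathcal {A}}_h^{-2})\)):

```python
import numpy as np

def pcn_step(theta, log_lik, rho, prior_sample, rng):
    """One preconditioned Crank-Nicolson step: propose
    sqrt(1 - rho^2) * theta + rho * xi with xi drawn from the prior,
    and accept using only the log-likelihood difference."""
    xi = prior_sample(rng)
    proposal = np.sqrt(1.0 - rho ** 2) * theta + rho * xi
    if np.log(rng.uniform()) < log_lik(proposal) - log_lik(theta):
        return proposal, True
    return theta, False
```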

Fig. 9
figure 9

Posterior samples \(\hat{\beta }_h\sim \mu \) obtained with the UGPT algorithm. Notice the resemblance to Fig. 8

Table 9 Values of \(\rho _k\) for the pCN kernel for the high-dimensional wave inversion problem
Table 10 Results for the high-dimensional acoustic BIP problem. As for the previous examples, the computational cost is comparable across all algorithms

Conclusions and future work

In the current work, we have proposed, implemented, and analyzed two versions of the GPT algorithm, and applied these methods in a BIP context. We have demonstrated that such algorithms produce reversible and geometrically-ergodic chains under relatively mild conditions. As shown in Sect. 5, such sampling algorithms provide an attractive alternative to the more standard Parallel Tempering when sampling from difficult (i.e., multi-modal or concentrated around a manifold) posteriors. We remark that the framework considered herein can be combined with other, more advanced MCMC algorithms, such as the Metropolis-adjusted Langevin algorithm (MALA) or the Delayed Rejection Adaptive Metropolis (DRAM) algorithm (Haario et al. 2006).

We intend to carry out a number of future extensions of the work presented herein. One of our short-term goals is to extend the methodology developed in the current work to a Multi-level Markov Chain Monte Carlo context, as in Dodwell et al. (2015); Madrigal-Cianci et al. (2021). In addition, from a theoretical point of view, we would like to investigate the role that the number of chains and the choice of temperatures play in the convergence of the GPT algorithm, as has been done previously for Parallel Tempering in Woodard et al. (2009). Improving on the estimates presented here would likely be the focus of future work. We also believe that by excluding the identity permutation (i.e., \(\mathrm {id}\notin S_K\)) in the UGPT, one could obtain a swapping kernel which is better in the so-called Peskun sense; see Andrieu and Roberts (2009) for more details. We intend to carry out further numerical experiments to better understand and compare swapping strategies. Furthermore, from a computational perspective, given that the framework presented in this work is, in principle, dimension independent, the methods explored in this work can also be combined with dimension-independent samplers such as the ones presented in Beskos et al. (2017); Cui et al. (2016), thus providing a sampling algorithm robust to both multi-modality and large dimensionality of the parameter space. Given the additional computational cost of these methods, a non-trivial coupling of GPT and these methods needs to be devised. Lastly, we aim at applying the methods developed in the current work to more computationally challenging BIPs, in particular those arising in seismology and seismic source inversion, where it is not uncommon to find multi-modal posterior distributions when inverting for a point source.