Generalized Parallel Tempering on Bayesian Inverse Problems

In the current work we present two generalizations of the Parallel Tempering algorithm, inspired by the so-called continuous-time Infinite Swapping algorithm. Such a method found its origins in the molecular dynamics community, and can be understood as the limit case of the continuous-time Parallel Tempering algorithm, where the (random) time between swaps of states between two parallel chains goes to zero. Thus, swapping states between chains occurs continuously. Here, we extend this idea to the context of time-discrete Markov chains and present two Markov chain Monte Carlo algorithms that follow the same paradigm as the continuous-time infinite swapping procedure. We analyze the convergence properties of such discrete-time algorithms in terms of their spectral gap, and implement them to sample from different target distributions. Numerical results show that the proposed methods significantly improve over more traditional sampling algorithms such as Random Walk Metropolis and (traditional) Parallel Tempering.


Introduction
Modern computational facilities and recent advances in computational techniques have made the use of Markov Chain Monte Carlo (MCMC) methods feasible for some large-scale Bayesian inverse problems (BIP), where the goal is to characterize the posterior distribution of a set of parameters θ of a computational model which describes some physical phenomenon, conditioned on some (usually indirectly) measured data y. However, computational difficulties are prone to arise when dealing with difficult-to-explore posteriors, i.e., posterior distributions that are multi-modal, or that concentrate around a non-linear, lower-dimensional manifold, since some of the more commonly-used Markov transition kernels in MCMC algorithms, such as random walk Metropolis (RWM) or preconditioned Crank-Nicolson (pCN), are not well-suited to such situations. This in turn can make the computational time needed to properly explore these complicated target distributions arbitrarily long. Some recent works address these issues by employing Markov transition kernels that use geometric information [5]; however, this requires efficient computation of the gradient of the posterior density, which might not always be feasible, particularly when the underlying computational model is a so-called "black-box".
In recent years, there has been an active development of computational techniques and algorithms to overcome these issues using a tempering strategy [15,19,28,36,44]. Of particular importance for the work presented here is the Parallel Tempering (PT) algorithm [19,26,36] (also known as replica exchange), which finds its origins in the physics and molecular dynamics community. The general idea behind such methods is to simultaneously run K independent MCMC chains, where each chain is invariant with respect to a flattened (referred to as tempered) version of the posterior of interest µ, while, at the same time, proposing to swap states between any two chains every so often. Such a swap is then accepted using the standard Metropolis-Hastings (MH) acceptance-rejection rule. Intuitively, chains with a larger smoothing parameter (referred to as temperature) will be able to better explore the parameter space. Thus, by proposing to exchange states between chains that target posteriors at different temperatures, it is possible for the chain of interest (i.e., the one targeting µ) to mix faster, and to avoid the undesirable behavior of some MCMC samplers of getting "stuck" in a mode. Moreover, the fact that such an exchange of states is accepted with the typical MH acceptance-rejection rule guarantees that the chain targeting µ remains invariant with respect to this probability measure [19].
Tempering ideas have been successfully used to sample from posterior distributions arising in different fields of science, ranging from astrophysics to machine learning [14,19,36,42]. The works [31,45] have studied the convergence of the PT algorithm from a theoretical perspective and provided minimal conditions for its rapid mixing. Moreover, the idea of tempered distributions has not only been applied in combination with parallel chains. For example, the simulated tempering method [33] uses a single chain and varies the temperature within this chain. In addition, tempering forms the basis of efficient particle filtering methods for stationary model parameters in Sequential Monte Carlo settings [6,7,24,25,28] and Ensemble Kalman Inversion [11].
A generalization of the PT approach, originating from the molecular dynamics community, is the so-called Infinite Swapping (IS) algorithm [18,39]. As opposed to PT, this IS paradigm is a continuous-time Markov process and considers the limit where states between chains are swapped infinitely often. It is shown in [18] that such an approach can in turn be understood as a swap of dynamics, i.e., kernels and temperatures (as opposed to states), between chains. We remark that once such a change in dynamics is considered, it is not possible to distinguish particles belonging to different chains. However, since the stationary distribution of each chain is known, importance sampling can be employed to compute posterior expectations with respect to the target measure of interest. Infinite Swapping has been successfully applied in the context of computational molecular dynamics and rare event simulation [17,30,39]; however, to the best of our knowledge, such methods have not been implemented in the context of Bayesian inverse problems.
In light of this, the current work aims at importing such ideas to the BIP setting, by presenting them in a discrete-time Metropolis-Hastings Markov chain Monte Carlo context. We will refer to these algorithms as Generalized Parallel Tempering (GPT). We emphasize, however, that these methods are not a time discretization of the continuous-time Infinite Swapping presented in [18], but, in fact, a discrete-time Markov process inspired by the ideas presented therein with suitably defined state-dependent probabilities of swapping states or dynamics. We now summarize the main contributions of this work.
We begin by presenting a generalized framework for discrete time PT in the context of MCMC for BIP, and then proceed to propose, analyze and implement two novel state-dependent PT algorithms inspired by the ideas presented in [18].
Furthermore, we prove that our GPT methods have the right invariant measure, by showing reversibility of the generated Markov chains, and prove their ergodicity. Finally, we implement the proposed GPT algorithms for an array of Bayesian inverse problems, comparing their efficiency to that of an un-tempered (single-temperature) version of the underlying MCMC algorithm, and to standard PT. For the base method to sample at the cold temperature level, we use Random Walk Metropolis (RWM) (Sections 5.3-5.6) or preconditioned Crank-Nicolson (Section 5.7); however, we emphasize that our methods can be used together with any other, more efficient base sampler. Experimental results show improvements in terms of computational efficiency of GPT over un-tempered RWM and standard PT, thus making the proposed methods attractive from a computational perspective. From an implementation perspective, the swapping component of our proposed methods is rejection-free, thus effectively eliminating some tuning parameters of the PT algorithm, such as the swapping frequency.
We note that a PT algorithm with state-dependent swapping probabilities has been proposed in [26]; however, that work only considers pairwise swapping of chains and a different construction of the swapping probabilities, resulting in a less efficient sampler, at least for the BIPs addressed in this work.
Our ergodicity result relies on an $L_2$-spectral gap analysis. It is known [41] that when a Markov chain is both reversible and has a positive $L_2$-spectral gap, one can in turn provide non-asymptotic error bounds on the mean square error of an ergodic estimator of the chain. Our bounds on the $L_2$-spectral gap, however, are far from sharp and could possibly be improved using, e.g., domain decomposition ideas as in [45]. Such an analysis is left for future work.
The rest of this paper is organized as follows. Section 2 is devoted to the introduction of the notation, Bayesian inverse problems, and Markov chain Monte Carlo methods. In Section 3, we provide a brief review of (standard) PT (Section 3.2), and introduce the two versions of the GPT algorithm in Sections 3.3 and 3.4, respectively. In fact, we present a general framework that accommodates both the standard PT algorithm and our generalized versions. In Section 4, we recall some of the standard theory of Markov chains in Section 4.1 and state the main theoretical results of the current work (Theorems 1 and 2) in Section 4.2. The proofs of these theorems are given by a series of propositions in Section 4.2. We present some numerical experiments in Section 5, and draw some conclusions in Section 6.
2 Problem setting

Notation
Let $(W, \|\cdot\|)$ be a separable Banach space with associated Borel σ-algebra $\mathcal{B}(W)$, and let $\nu_W$ be a σ-finite "reference" measure on $W$. For any measure $\mu$ on $(W, \mathcal{B}(W))$ that is absolutely continuous with respect to $\nu_W$ (in short $\mu \ll \nu_W$), we define the Radon-Nikodym derivative $\pi_\mu := \frac{d\mu}{d\nu_W}$. We denote by $\mathcal{M}(W)$ the set of real-valued signed measures on $(W, \mathcal{B}(W))$, and by $\mathcal{M}_1(W) \subset \mathcal{M}(W)$ the set of probability measures on $(W, \mathcal{B}(W))$.
Let $W_1, W_2$ be two separable Banach spaces with reference measures $\nu_{W_1}, \nu_{W_2}$, and let $\mu_1 \ll \nu_{W_1}$, $\mu_2 \ll \nu_{W_2}$ be two probability measures, with corresponding densities $\pi_{\mu_1}, \pi_{\mu_2}$. The product of these two measures, defined by
\[ (\mu_1 \otimes \mu_2)(A_1 \times A_2) := \mu_1(A_1)\,\mu_2(A_2), \qquad A_1 \in \mathcal{B}(W_1),\ A_2 \in \mathcal{B}(W_2), \]
is a probability measure on $(W, \mathcal{B}(W))$, where $W := W_1 \times W_2$.
We denote by $P$ the Markov transition operator associated to a Markov kernel $p : W \times \mathcal{B}(W) \to [0,1]$, which acts on measures as $\nu \mapsto \nu P \in \mathcal{M}(W)$, with
\[ (\nu P)(A) := \int_W p(\theta, A)\, \nu(d\theta), \qquad A \in \mathcal{B}(W), \]
and on functions as $f \mapsto Pf$, with $Pf$ measurable on $(W, \mathcal{B}(W))$, such that
\[ (Pf)(\theta) := \int_W f(z)\, p(\theta, dz). \]
Let $P_k$, $k = 1, 2$, be Markov transition operators associated to kernels $p_k : W_k \times \mathcal{B}(W_k) \to [0,1]$. We define the tensor product Markov operator $\mathbf{P} := P_1 \otimes P_2$ as the Markov operator associated with the product kernel
\[ \mathbf{p}(\boldsymbol{\theta}, A_1 \times A_2) := p_1(\theta_1, A_1)\, p_2(\theta_2, A_2), \qquad \boldsymbol{\theta} = (\theta_1, \theta_2), \]
for all $A_1 \in \mathcal{B}(W_1)$ and $A_2 \in \mathcal{B}(W_2)$. Moreover, $(\mathbf{P}f) : W_1 \times W_2 \to \mathbb{R}$ is the function given by
\[ (\mathbf{P}f)(\boldsymbol{\theta}) := \int_{W_1} \int_{W_2} f(z_1, z_2)\, p_1(\theta_1, dz_1)\, p_2(\theta_2, dz_2), \]
for an appropriate $f : W_1 \times W_2 \to \mathbb{R}$. We say that a Markov operator $P$ (resp. $\mathbf{P}$) is invariant with respect to a measure $\nu$ (resp. $\boldsymbol{\nu}$) if $\nu P = \nu$ (resp. $\boldsymbol{\nu}\mathbf{P} = \boldsymbol{\nu}$). A related concept to invariance is that of reversibility; a Markov kernel $p$ is said to be $\nu$-reversible if
\[ \int_A p(\theta, B)\, \nu(d\theta) = \int_B p(\theta, A)\, \nu(d\theta), \qquad \forall A, B \in \mathcal{B}(W). \tag{1} \]
For two given $\nu$-invariant Markov operators $P_1, P_2$, we say that $P_1 P_2$ is a composition of Markov operators, not to be confused with $P_1 \otimes P_2$. Furthermore, given a composition of $K$ $\nu$-invariant Markov operators $P_c := P_1 P_2 \cdots P_K$, we say that $P_c$ is palindromic if $P_1 = P_K$, $P_2 = P_{K-1}, \ldots, P_k = P_{K-k+1}$, $k = 1, 2, \ldots, K$. It is known (see, e.g., [8, Section 1.12.17]) that a palindromic, $\nu$-invariant Markov operator $P_c$ has an associated Markov transition kernel $p_c$ which is $\nu$-reversible.
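These notions can be illustrated on a finite state space, where Markov operators are row-stochastic matrices: composition is the matrix product, the tensor product is the Kronecker product, and, for the uniform invariant measure, detailed balance reduces to matrix symmetry. The following sketch (the matrices are our own toy examples, not from the paper) shows that a palindromic composition of reversible kernels is again reversible, while a plain composition need not be:

```python
import numpy as np

# Two symmetric (hence uniform-reversible) 3-state transition matrices.
P1 = np.array([[0.8, 0.2, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.2, 0.8]])
P2 = np.array([[0.0, 1.0, 0.0],   # deterministic swap of states 0 and 1
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])

# Tensor product P1 (x) P2 acts on the 9-state product space,
# moving each coordinate independently (Kronecker product).
tensor = np.kron(P1, P2)

# The composition P1 P2 is generally NOT symmetric (not reversible)...
composition = P1 @ P2
# ...but the palindromic composition P1 P2 P1 is.
palindromic = P1 @ P2 @ P1
```

Here symmetry of `palindromic` is exactly detailed balance with respect to the uniform measure, matching the cited result on palindromic compositions.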

Bayesian inverse problems
Let $(\Theta, \|\cdot\|_\Theta)$ and $(Y, \|\cdot\|_Y)$ be separable Banach spaces with associated Borel σ-algebras $\mathcal{B}(\Theta)$, $\mathcal{B}(Y)$. In Bayesian inverse problems we aim at characterizing the probability distribution of a set of parameters $\theta \in \Theta$ conditioned on some measured data $y \in Y$, where the relation between $\theta$ and $y$ is given by
\[ y = \mathcal{F}(\theta) + \varepsilon. \tag{2} \]
Here $\varepsilon$ is some random noise with known distribution $\mu_{\mathrm{noise}}$ (assumed to have a density $\pi_{\mathrm{noise}}$ with respect to some reference measure $\nu_Y$ on $Y$) and $\mathcal{F} : \Theta \to Y$ is the so-called forward mapping operator. Such an operator can model, e.g., the numerical solution of a possibly non-linear Partial Differential Equation (PDE) which takes $\theta$ as a set of parameters. We will assume that the data $y$ is finite dimensional, i.e., $Y = \mathbb{R}^{d_y}$ for some $d_y \geq 1$, and that $\theta \sim \mu_{\mathrm{pr}}$. Furthermore, we define the potential function $\Phi(\cdot\,; y) : \Theta \to \mathbb{R}$,
\[ \Phi(\theta; y) := -\log \pi_{\mathrm{noise}}(y - \mathcal{F}(\theta)), \tag{3} \]
where the function $\Phi(\theta; y)$ is a measure of the misfit between the recorded data $y$ and the predicted value $\mathcal{F}(\theta)$, and often depends only on $\|y - \mathcal{F}(\theta)\|_Y$. The solution to the Bayesian inverse problem is given by (see, for example, [27, Theorem 2.5])
\[ \mu^y(d\theta) = \frac{1}{Z} \exp(-\Phi(\theta; y))\, \mu_{\mathrm{pr}}(d\theta), \]
where $\mu^y$ (with corresponding $\nu_\Theta$-density $\pi$) is referred to as the posterior measure and $Z := \int_\Theta \exp(-\Phi(\theta; y))\, \mu_{\mathrm{pr}}(d\theta)$.
The Bayesian approach to the inverse problem consists of updating our knowledge concerning the parameter $\theta$, i.e., the prior, given the information that we observed in Equation (2). One way of doing so is to generate samples from the posterior measure $\mu^y$. A common method for performing such a task is to use, for example, the Metropolis-Hastings algorithm, as detailed in the next section. Once samples $\{\theta^{(n)}\}_{n=0}^N$ have been obtained, the posterior expectation $\mathbb{E}_{\mu^y}[Q]$ of some $\mu^y$-integrable quantity of interest $Q : \Theta \to \mathbb{R}$ can be approximated by the following ergodic estimator:
\[ \widehat{Q} := \frac{1}{N - b} \sum_{n=b+1}^{N} Q(\theta^{(n)}), \]
where $b < N$ is the so-called burn-in period, used to reduce the bias typically associated with MCMC algorithms.
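A minimal sketch of this burn-in-discarding ergodic estimator (the function and variable names are ours, not the paper's):

```python
def ergodic_estimator(samples, Q, b):
    """Approximate the posterior expectation E[Q] by averaging Q over
    the post-burn-in samples theta^(b+1), ..., theta^(N)."""
    post_burn_in = samples[b + 1:]          # discard theta^(0), ..., theta^(b)
    return sum(Q(theta) for theta in post_burn_in) / len(post_burn_in)
```

For instance, with samples `[0.0, 1.0, 2.0, 3.0]`, `Q(t) = t**2` and `b = 1`, the estimator averages `Q` over the last two samples.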

Metropolis-Hastings and tempering
We briefly recall the Metropolis-Hastings algorithm [23,34]. Let $q_{\mathrm{prop}} : \Theta \times \mathcal{B}(\Theta) \to [0,1]$ be an auxiliary kernel. For $n = 1, 2, \ldots$, a candidate state $\theta^*$ is sampled from $q_{\mathrm{prop}}(\theta_n, \cdot)$, and proposed as the new state of the chain at step $n+1$. Such a state is then accepted (i.e., we set $\theta_{n+1} = \theta^*$) with probability
\[ \alpha_{\mathrm{MH}}(\theta_n, \theta^*) := \min\left\{1, \frac{\pi(\theta^*)\, q_{\mathrm{prop}}(\theta^*, \theta_n)}{\pi(\theta_n)\, q_{\mathrm{prop}}(\theta_n, \theta^*)}\right\};
\]
otherwise the current state is retained, i.e., $\theta_{n+1} = \theta_n$. Notice that, with a slight abuse of notation, we are denoting kernel and density by the same symbol $q_{\mathrm{prop}}$. The Metropolis-Hastings algorithm induces the Markov transition kernel
\[ p(\theta, A) := \int_A \alpha_{\mathrm{MH}}(\theta, z)\, q_{\mathrm{prop}}(\theta, dz) + \delta_\theta(A) \int_\Theta \left(1 - \alpha_{\mathrm{MH}}(\theta, z)\right) q_{\mathrm{prop}}(\theta, dz), \]
for every $\theta \in \Theta$ and $A \in \mathcal{B}(\Theta)$. In most practical algorithms, the proposal state $\theta^*$ is sampled from a state-dependent auxiliary kernel $q_{\mathrm{prop}}(\theta_n, \cdot)$. Such is the case for random walk Metropolis, where $q_{\mathrm{prop}}(\theta_n, \cdot) = \mathcal{N}(\theta_n, \Sigma)$, or preconditioned Crank-Nicolson, where $q_{\mathrm{prop}}(\theta_n, \cdot) = \mathcal{N}(\sqrt{1 - \rho^2}\,\theta_n, \rho^2 \Sigma)$, $0 < \rho < 1$. However, these types of localized proposals tend to present some undesirable behaviors when sampling from certain difficult measures, which are, for example, concentrated over a manifold or multi-modal [19]. In the first case, in order to avoid a large rejection rate, the "step-size" $\Sigma^{1/2}$ of the proposal kernel must be quite small, which in turn produces highly correlated samples. In the second case, chains generated by these localized kernels tend to get stuck in one of the modes. In either case, very long chains are required to properly explore the parameter space.

One way of overcoming such difficulties is to introduce tempering. Let $\mu_k, \mu_{\mathrm{pr}}$ be probability measures on $(\Theta, \mathcal{B}(\Theta))$, $k = 1, \ldots, K$, such that all $\mu_k$ are absolutely continuous with respect to $\mu_{\mathrm{pr}}$, and let $\{T_k\}_{k=1}^K$ be a set of $K$ temperatures such that $1 = T_1 < T_2 < \cdots < T_K \leq \infty$. In a Bayesian setting, $\mu_{\mathrm{pr}}$ corresponds to the prior measure and $\mu_k$, $k = 1, \ldots, K$, correspond to posterior measures associated to different temperatures.
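As an illustration, a single random walk Metropolis step with a symmetric Gaussian proposal can be sketched as follows (a toy implementation under our own naming; `log_post` stands for any unnormalized log-density):

```python
import numpy as np

def rwm_step(theta, log_post, step_size, rng):
    """One random walk Metropolis step targeting exp(log_post).
    The Gaussian proposal is symmetric, so the MH ratio reduces to a
    ratio of target densities."""
    proposal = theta + step_size * rng.standard_normal(theta.shape)
    log_alpha = log_post(proposal) - log_post(theta)  # log of the MH ratio
    if np.log(rng.uniform()) < log_alpha:
        return proposal   # accept the candidate state
    return theta          # reject: retain the current state
```

Iterating `rwm_step` produces the chain; for a flat target the acceptance probability is identically one.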
Denoting by $\pi_k$ the $\mu_{\mathrm{pr}}$-density of $\mu_k$, we set
\[ \pi_k(\theta) := \frac{d\mu_k}{d\mu_{\mathrm{pr}}}(\theta) = \frac{1}{Z_k} e^{-\Phi(\theta; y)/T_k}, \tag{5} \]
where $Z_k := \int_\Theta e^{-\Phi(\theta; y)/T_k}\, \mu_{\mathrm{pr}}(d\theta)$, and with $\Phi(\theta; y)$ the potential function defined in (3). In the case where $T_K = \infty$, we set $\mu_K = \mu_{\mathrm{pr}}$. Notice that $\mu_1$ corresponds to the target posterior measure. We say that, for $k = 2, \ldots, K$, each measure $\mu_k$ is a tempered version of $\mu_1$. In general, the $1/T_k$ term in (5) serves as a "smoothing"^1 factor, which in turn makes $\mu_k$ easier to explore as $T_k \to \infty$. In PT MCMC algorithms, we sample from all posterior measures $\mu_k$ simultaneously. Here, we first use a $\mu_k$-reversible Markov transition kernel $p_k$ on each chain, and then we propose to exchange states between chains at two consecutive temperatures, i.e., chains targeting $\mu_k, \mu_{k+1}$, $k \in \{1, \ldots, K-1\}$. Such a proposed swap is then accepted or rejected with a standard Metropolis-Hastings acceptance-rejection step. This procedure is presented in Algorithm 1. We remark that such an algorithm can be modified to, for example, propose to swap states every $N_s$ steps of the chain, or to swap states between two chains targeting $\mu_i, \mu_j$, with $i, j$ chosen randomly and uniformly from the index set $\{1, 2, \ldots, K\}$. In the next section we present the generalized PT algorithms, which swap states according to a random permutation of the indices drawn from a state-dependent probability.

^1 Here, smoothing is to be understood in the sense that it flattens the density.
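The flattening effect of the temperature can be seen directly on the unnormalized log-density $-\Phi(\theta; y)/T_k$ from (5): raising $T_k$ shrinks log-density differences between states, and hence the barriers between modes. A sketch with a toy quadratic potential (our own choice, data dependence suppressed):

```python
def tempered_logdensity(potential, T):
    """Unnormalized log-density of mu_k with respect to the prior:
    log pi_k(theta) = -Phi(theta; y)/T_k, up to the constant -log Z_k."""
    return lambda theta: -potential(theta) / T

# Toy potential Phi(theta) = theta^2.
phi = lambda theta: theta ** 2
cold = tempered_logdensity(phi, 1.0)   # T_1 = 1: the target posterior
hot = tempered_logdensity(phi, 8.0)    # T_K = 8: flattened, easier to explore
```

At `theta = 2.0` the cold log-density is eight times more negative than the hot one, so the hot chain moves far more freely.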

Generalizing Parallel Tempering
Infinite Swapping was initially developed in the context of continuous-time MCMC algorithms, which were used for molecular dynamics simulations. In continuous-time PT, the swapping of the states is controlled by a Poisson process on the set {1, . . . , K}. Infinite Swapping is the limiting algorithm obtained by letting the waiting times of this Poisson process go to zero. Hence, we swap the states of the chain infinitely often over a finite time interval. We refer to [18] for a thorough introduction and review of Infinite Swapping in continuous time. In Section 5 of the same article, the idea to use Infinite Swapping in time-discrete Markov chains was briefly discussed. Inspired by this discussion, we present two generalizations of the (discrete-time) Parallel Tempering strategy. To that end, we propose to either (i) swap states in the chains at every iteration of the algorithm in such a way that the swap is accepted with probability one, which we will refer to as Unweighted Generalized Parallel Tempering (UGPT), or (ii) swap dynamics (i.e., swap kernels and temperatures between chains) at every step of the algorithm. In this case, since this provides a Markov chain whose invariant measure is not the desired one, importance sampling must also be used when computing posterior expectations. We refer to this approach as Weighted Generalized Parallel Tempering (WGPT). We begin by introducing a framework common to both PT and the two versions of GPT.
Let $(\Theta, \|\cdot\|_\Theta)$ be a separable Banach space with associated Borel σ-algebra $\mathcal{B}(\Theta)$. Let us define the $K$-fold product space $\Theta^K := \times_{k=1}^K \Theta$, with associated product σ-algebra $\mathcal{B}^K := \otimes_{k=1}^K \mathcal{B}(\Theta)$, as well as the product measure on $(\Theta^K, \mathcal{B}^K)$
\[ \boldsymbol{\mu} := \times_{k=1}^K \mu_k, \tag{6} \]
where $\mu_k$, $k = 1, \ldots, K$, are the tempered measures with temperatures $1 = T_1 < T_2 < T_3 < \cdots < T_K \leq \infty$ introduced in the previous section. Similarly, we define the product prior measure $\boldsymbol{\mu}_{\mathrm{pr}} := \times_{k=1}^K \mu_{\mathrm{pr}}$. Notice that $\boldsymbol{\mu}$ has a density $\boldsymbol{\pi}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\mu}_{\mathrm{pr}}$ given by $\boldsymbol{\pi}(\boldsymbol{\theta}) = \prod_{k=1}^K \pi_k(\theta_k)$, $\boldsymbol{\theta} := (\theta_1, \ldots, \theta_K) \in \Theta^K$, with $\pi_k$ given as in (5). The idea behind the tempering methods presented herein is to sample from $\boldsymbol{\mu}$ (as opposed to solely sampling from $\mu_1$) by creating a Markov chain obtained from the successive application of two $\boldsymbol{\mu}$-invariant Markov kernels $\mathbf{p}$ and $\mathbf{q}$ to some initial distribution $\boldsymbol{\nu}$, usually chosen to be the prior $\boldsymbol{\mu}_{\mathrm{pr}}$. Each kernel acts as follows. Given the current state $\boldsymbol{\theta}^n = (\theta_1^n, \ldots, \theta_K^n)$, the kernel $\mathbf{p}$, which we will call the standard MCMC kernel, proposes a new, intermediate state $\hat{\boldsymbol{\theta}}^{n+1} = (\hat{\theta}_1^{n+1}, \ldots, \hat{\theta}_K^{n+1})$, possibly following the Metropolis-Hastings algorithm (or any other algorithm that generates a $\boldsymbol{\mu}$-invariant Markov operator). The Markov kernel $\mathbf{p}$ is a product kernel, meaning that each component $\hat{\theta}_k^{n+1}$, $k = 1, \ldots, K$, is generated independently of the others. Then, the swapping kernel $\mathbf{q}$ proposes a new state $\boldsymbol{\theta}^{n+1} = (\theta_1^{n+1}, \ldots, \theta_K^{n+1})$ by introducing an "interaction" between the components of $\hat{\boldsymbol{\theta}}^{n+1}$. This interaction step can be achieved, e.g., in the case of PT, by proposing to swap two components at two consecutive temperatures, i.e., components $k$ and $k+1$, and accepting this swap with a certain probability given by the usual Metropolis-Hastings acceptance-rejection rule. In general, the swapping kernel is applied every $N_s$ steps of the chain, for some $N_s \geq 1$.
We will devote the following subsection to the construction of the swapping kernel q.
Example 1 As a simple example, consider standard PT as in Algorithm 1 with $K = 4$. In this case, we attempt to swap two contiguous temperatures $T_i$ and $T_{i+1}$, $i = 1, 2, 3$. Thus, $S_K$ is the set of permutations $\{\sigma_{1,2}, \sigma_{2,3}, \sigma_{3,4}\}$ with
\[ \sigma_{1,2} = (2, 1, 3, 4), \qquad \sigma_{2,3} = (1, 3, 2, 4), \qquad \sigma_{3,4} = (1, 2, 4, 3). \]
Notice that each permutation is its own inverse; for example, $\sigma_{1,2} \circ \sigma_{1,2} = (1, 2, 3, 4) = \mathrm{id}$. To define the swapping kernel $\mathbf{q}$, we first need to define the swapping ratio and swapping acceptance probability.
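In code, such pairwise permutations and their self-inverse property can be checked directly (0-based indices and our own helper names):

```python
def pairwise_swap(K, i, j):
    """Permutation of (0, ..., K-1) exchanging positions i and j."""
    sigma = list(range(K))
    sigma[i], sigma[j] = sigma[j], sigma[i]
    return tuple(sigma)

def compose(sigma, tau):
    """Composition (sigma o tau)(k) = sigma(tau(k))."""
    return tuple(sigma[t] for t in tau)

identity = tuple(range(4))
s12 = pairwise_swap(4, 0, 1)  # corresponds to sigma_{1,2} in the text
```

Composing `s12` with itself recovers the identity, confirming that each pairwise swap is its own inverse.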
Definition 1 (Swapping ratio) We say that a function $r : \Theta^K \times S_K \to [0, 1]$ is a swapping ratio if it satisfies the following two conditions: 1. $\forall \boldsymbol{\theta} \in \Theta^K$, $r(\boldsymbol{\theta}, \cdot)$ is a probability mass function on $S_K$. 2. $\forall \sigma \in S_K$, $r(\cdot, \sigma)$ is measurable on $(\Theta^K, \mathcal{B}^K)$.

Definition 2 (Swapping acceptance probability)
Let $\boldsymbol{\theta} \in \Theta^K$ and $\sigma, \sigma^{-1} \in S_K$. We call swapping acceptance probability the function $\alpha_{\mathrm{swap}} : \Theta^K \times S_K \to [0, 1]$ given by
\[ \alpha_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma) := \min\left\{1, \frac{\boldsymbol{\pi}(\boldsymbol{\theta}_\sigma)\, r(\boldsymbol{\theta}_\sigma, \sigma^{-1})}{\boldsymbol{\pi}(\boldsymbol{\theta})\, r(\boldsymbol{\theta}, \sigma)}\right\}. \]
We can now define the swapping kernel $\mathbf{q}$:
\[ \mathbf{q}(\boldsymbol{\theta}, A) := \sum_{\sigma \in S_K} r(\boldsymbol{\theta}, \sigma) \left[ \alpha_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma)\, \delta_{\boldsymbol{\theta}_\sigma}(A) + \left(1 - \alpha_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma)\right) \delta_{\boldsymbol{\theta}}(A) \right], \qquad A \in \mathcal{B}^K. \tag{7} \]
The swapping mechanism should be understood in the following way: given a current state of the chain $\boldsymbol{\theta} \in \Theta^K$, the swapping kernel samples a permutation $\sigma$ from $S_K$ with probability $r(\boldsymbol{\theta}, \sigma)$ and generates $\boldsymbol{\theta}_\sigma$. This permuted state is then accepted as the new state of the chain with probability $\alpha_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma)$. Notice that the swapping kernel follows a Metropolis-Hastings-like procedure with "proposal" distribution $r(\boldsymbol{\theta}, \sigma)$ and acceptance probability $\alpha_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma)$. Moreover, as detailed in the next proposition, such a kernel is reversible with respect to $\boldsymbol{\mu}$, since it is a Metropolis-Hastings type kernel.

Proposition 1
The Markov kernel $\mathbf{q}$ defined in (7) is reversible with respect to the product measure $\boldsymbol{\mu}$ defined in (6).

Proof Let $A_\sigma := \{\mathbf{z} \in \Theta^K : \mathbf{z}_{\sigma^{-1}} \in A\}$, and, for notational simplicity, write $\min\{a, b\} = a \wedge b$, $a, b \in \mathbb{R}$. Splitting $\mathbf{q}(\boldsymbol{\theta}, A)$ in (7) into its acceptance part (I) and its rejection part (II), the required balance for (I) follows by noticing that $\boldsymbol{\mu}_{\mathrm{pr}}$ is permutation invariant, while the equality for (II) is immediate.

This generic form of the swapping kernel provides the foundation for both PT and GPT. We describe these algorithms in the following subsections.

The Parallel Tempering case
We first show how a PT algorithm that only swaps states between the $i$th and $j$th components of the chain can be cast in the general framework presented above. To that end, let $\sigma_{i,j}$ be the permutation of $(1, 2, \ldots, K)$ which only permutes the $i$th and $j$th components, while leaving the other components invariant (i.e., such that $\sigma(i) = j$, $\sigma(j) = i$, and $\sigma(k) = k$ for $k \neq i, j$). We can take $S_K = \{\sigma_{i,j},\ i, j = 1, \ldots, K\}$ and define the PT swapping ratio between components $i$ and $j$ by
\[ r^{(\mathrm{PT})}_{i,j}(\boldsymbol{\theta}, \sigma) := \mathbb{1}_{\{\sigma = \sigma_{i,j}\}}. \]
Notice that this implies that $r$ does not depend on $\boldsymbol{\theta}$, which in turn leads to the swapping acceptance probability
\[ \alpha^{(\mathrm{PT})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma_{i,j}) = \min\left\{1, \frac{\boldsymbol{\pi}(\boldsymbol{\theta}_{\sigma_{i,j}})}{\boldsymbol{\pi}(\boldsymbol{\theta})}\right\}. \]
Thus, we can define the swapping kernel for the Parallel Tempering algorithm that swaps components $i$ and $j$ as follows.

Definition 4 (Pairwise Parallel Tempering swapping kernel) Let $\boldsymbol{\theta} \in \Theta^K$, $\sigma_{i,j} \in S_K$. We define the Parallel Tempering swapping kernel, which proposes to swap states between the $i$th and $j$th chains, as
\[ \mathbf{q}_{i,j}(\boldsymbol{\theta}, A) := \alpha^{(\mathrm{PT})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma_{i,j})\, \delta_{\boldsymbol{\theta}_{\sigma_{i,j}}}(A) + \left(1 - \alpha^{(\mathrm{PT})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma_{i,j})\right) \delta_{\boldsymbol{\theta}}(A), \qquad A \in \mathcal{B}^K. \]
In practice, however, the PT algorithm considers various sequential swaps between chains, which can be understood as applying a composition of such kernels at every swapping step. In its most common form [8,19,36], the PT algorithm, hereafter referred to as standard PT (which with a slight abuse of notation we will denote by PT), proposes to swap states between chains at two consecutive temperatures, with swapping kernel
\[ \mathbf{q}^{(\mathrm{PT})} := \mathbf{q}_{1,2}\, \mathbf{q}_{2,3} \cdots \mathbf{q}_{K-1,K}. \]
Moreover, the algorithm described in [19] proposes to swap states every $N_s \geq 1$ steps of MCMC. The complete PT kernel is then given by [8,19,36]
\[ \mathbf{p}^{(\mathrm{PT})} := \mathbf{p}^{N_s}\, \mathbf{q}^{(\mathrm{PT})}, \tag{8} \]
where $\mathbf{p}$ is a standard reversible Markov transition kernel used to evolve the individual chains independently.
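A sketch of the pairwise PT swap move: since $\boldsymbol{\pi}$ is a product density, only the two swapped factors enter the acceptance ratio (a toy implementation with our own naming; `log_pi[k]` stands for the tempered log-density of chain `k`):

```python
import numpy as np

def pt_swap(theta, log_pi, i, j, rng):
    """Propose exchanging components i and j of the product chain and
    accept with the Metropolis-Hastings probability
    min{1, pi(theta_sigma)/pi(theta)}."""
    # Only the factors i and j change under the swap.
    log_alpha = (log_pi[i](theta[j]) + log_pi[j](theta[i])
                 - log_pi[i](theta[i]) - log_pi[j](theta[j]))
    if np.log(rng.uniform()) < log_alpha:
        theta = list(theta)
        theta[i], theta[j] = theta[j], theta[i]  # accepted swap
    return tuple(theta)
```

For a flat target the acceptance ratio is one and the swap always goes through.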
Remark 1 Although the kernel $\mathbf{p}$ as well as each of the $\mathbf{q}_{i,i+1}$ are $\boldsymbol{\mu}$-reversible, notice that (8) does not have a palindromic structure, and as such it is not necessarily $\boldsymbol{\mu}$-reversible. One way of making the PT algorithm reversible with respect to $\boldsymbol{\mu}$ is to consider the palindromic form
\[ \mathbf{p}^{(\mathrm{RPT})} := \mathbf{q}_{1,2} \cdots \mathbf{q}_{K-1,K}\, \mathbf{p}^{N_s}\, \mathbf{q}_{K-1,K} \cdots \mathbf{q}_{1,2}, \]
where RPT stands for Reversible Parallel Tempering. In practice, there is not much difference between $\mathbf{p}^{(\mathrm{RPT})}$ and $\mathbf{p}^{(\mathrm{PT})}$; however, under the additional assumption of geometric ergodicity of the chain (cf. Section 4), having a reversible kernel is useful to compute explicit error bounds on the non-asymptotic mean square error of an ergodic estimator [41].

Unweighted Generalized Parallel Tempering
The idea behind the Unweighted Generalized Parallel Tempering algorithm is to generalize PT so that (i) $N_s = 1$ provides a proper mixing of the chains, (ii) the algorithm is reversible with respect to $\boldsymbol{\mu}$, and (iii) the algorithm considers arbitrary sets $S_K$ of swaps (always closed w.r.t. inversion), instead of only pairwise swaps. We begin by constructing a kernel of the form (7). Let $r^{(\mathrm{UW})} : \Theta^K \times S_K \to [0, 1]$ be the function defined as
\[ r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma) := \frac{\boldsymbol{\pi}(\boldsymbol{\theta}_\sigma)}{\sum_{\sigma' \in S_K} \boldsymbol{\pi}(\boldsymbol{\theta}_{\sigma'})}. \tag{9} \]
Clearly, (9) is a swapping ratio according to Definition 1. As such, given some state $\boldsymbol{\theta} \in \Theta^K$, $r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma)$ assigns a state-dependent probability to each of the $|S_K|$ possible permutations in $S_K$. A permutation $\sigma \in S_K$ is then accepted with probability $\alpha^{(\mathrm{UW})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma)$, given by
\[ \alpha^{(\mathrm{UW})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma) := \min\left\{1, \frac{\boldsymbol{\pi}(\boldsymbol{\theta}_\sigma)\, r^{(\mathrm{UW})}(\boldsymbol{\theta}_\sigma, \sigma^{-1})}{\boldsymbol{\pi}(\boldsymbol{\theta})\, r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma)}\right\}. \tag{10} \]
Thus, we can define the swapping kernel for the UGPT algorithm, which takes the form of (7), with the particular choice $r(\boldsymbol{\theta}, \sigma) = r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma)$ and $\alpha_{\mathrm{swap}} = \alpha^{(\mathrm{UW})}_{\mathrm{swap}}$. Notice that $\alpha^{(\mathrm{UW})}_{\mathrm{swap}}(\boldsymbol{\theta}, \sigma) = 1$, $\forall \sigma \in S_K$. Indeed, if we further examine Equation (10), we see that
\[ \frac{\boldsymbol{\pi}(\boldsymbol{\theta}_\sigma)\, r^{(\mathrm{UW})}(\boldsymbol{\theta}_\sigma, \sigma^{-1})}{\boldsymbol{\pi}(\boldsymbol{\theta})\, r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma)} = 1. \]
In practice, this means that the proposed permuted state is always accepted with probability 1. The expression of the UGPT kernel then simplifies as follows.
Definition 5 (Unweighted swapping kernel) The unweighted swapping kernel is given by
\[ \mathbf{q}^{(\mathrm{UW})}(\boldsymbol{\theta}, A) := \sum_{\sigma \in S_K} r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma)\, \delta_{\boldsymbol{\theta}_\sigma}(A), \qquad A \in \mathcal{B}^K. \]
Applying this swapping kernel successively with the standard MCMC kernel $\mathbf{p}$ gives what we call the Unweighted Generalized Parallel Tempering kernel
\[ \mathbf{p}^{(\mathrm{UW})} := \mathbf{q}^{(\mathrm{UW})}\, \mathbf{p}\, \mathbf{q}^{(\mathrm{UW})}. \]
Lastly, we write the UGPT kernel in operator form as
\[ \mathbf{P}^{(\mathrm{UW})} := \mathbf{Q}^{(\mathrm{UW})} \mathbf{P} \mathbf{Q}^{(\mathrm{UW})}, \]
where $\mathbf{P}$ and $\mathbf{Q}^{(\mathrm{UW})}$ are the Markov operators corresponding to the kernels $\mathbf{p}$ and $\mathbf{q}^{(\mathrm{UW})}$, respectively. We now investigate the reversibility of the UGPT kernel. We start with a rather straightforward result.
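The rejection-free nature of the unweighted swapping kernel makes its implementation a single categorical draw: compute $\boldsymbol{\pi}(\boldsymbol{\theta}_\sigma)$ for every $\sigma \in S_K$, normalize, sample one permutation, and apply it. A sketch (our own naming; permutations use 0-based indices):

```python
import numpy as np

def ugpt_swap(theta, log_pi, permutations, rng):
    """Sample sigma with probability r(theta, sigma) proportional to
    pi(theta_sigma), as in (9), and apply it; no accept/reject step."""
    K = len(theta)
    # log pi(theta_sigma) = sum_k log pi_k(theta_{sigma(k)})
    log_w = np.array([sum(log_pi[k](theta[sigma[k]]) for k in range(K))
                      for sigma in permutations])
    w = np.exp(log_w - log_w.max())  # normalize in log-space for stability
    w /= w.sum()
    sigma = permutations[rng.choice(len(permutations), p=w)]
    return tuple(theta[sigma[k]] for k in range(K))
```

When the cold chain overwhelmingly prefers one component, the corresponding permutation is selected with probability essentially one, illustrating how the state-dependent ratio drives global moves.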
Proof We prove reversibility by confirming that equation (1) holds true. To that end, let $\boldsymbol{\theta} \in \Theta^K$ and $A, B \in \mathcal{B}^K$, where $A$ and $B$ tensorize, i.e., $A := \times_{k=1}^K A_k$ and $B := \times_{k=1}^K B_k$, with $A_k, B_k \in \mathcal{B}(\Theta)$. Showing that the reversibility equality holds for sets $A, B$ that tensorize is indeed sufficient to show that the claim holds for any $A, B \in \mathcal{B}^K$. This follows from Carathéodory's Extension Theorem, applied as in the proof of uniqueness of product measures; see [2, §1.3.10, 2.6.3] for details.
We can now prove the reversibility of the chain generated by p (UW) .
Proof It follows from Propositions 1 and 2 that the kernels $\mathbf{q}^{(\mathrm{UW})}$ and $\mathbf{p}$ are $\boldsymbol{\mu}$-reversible. Furthermore, since $\mathbf{p}^{(\mathrm{UW})}$ is a palindromic composition of kernels, each of which is reversible with respect to $\boldsymbol{\mu}$, it follows that $\mathbf{p}^{(\mathrm{UW})}$ is reversible with respect to $\boldsymbol{\mu}$ [8].
The UGPT algorithm proceeds by iteratively applying the kernel p (UW) to a predefined initial state. In particular, states are updated using the procedure outlined in Algorithm 2.
Let now $Q : \Theta \to \mathbb{R}$ be a quantity of interest. The posterior mean of $Q$, $\boldsymbol{\mu}(Q) := \mu_1(Q)$, is approximated using $N \in \mathbb{N}$ samples by the following ergodic estimator:
\[ \widehat{Q}^{(\mathrm{UW})} := \frac{1}{N - b} \sum_{n=b+1}^{N} Q\big(\theta_1^{(n)}\big). \]

A comment on the pairwise state-dependent PT method of [26]
The work [26] presents a similar state-dependent swapping scheme. We will refer to the method presented therein as Pairwise State-Dependent Parallel Tempering (PSDPT). Such a method, however, differs from UGPT in that (i) only pairwise swaps are considered and (ii) it is not rejection-free. We summarize the method for the sake of completeness. Let $S_{K,\mathrm{pairwise}}$ denote the set of pairwise permutations of $(1, 2, \ldots, K)$. Given a current state $\boldsymbol{\theta} \in \Theta^K$, the PSDPT algorithm samples a pairwise permutation $\sigma_{i,j} \in S_{K,\mathrm{pairwise}}$ with probability $r(\boldsymbol{\theta}, \sigma_{i,j})$, and then accepts this swap with the corresponding Metropolis-Hastings probability. This method is attractive from an implementation point of view in the sense that it promotes pairwise swaps between states of similar energy, which are therefore likely (yet not guaranteed) to be accepted. In contrast, UGPT always accepts the proposed permuted state, which in turn leads to a larger number of global moves, thus providing a more efficient algorithm. This is verified in the numerical experiments.

Weighted Generalized Parallel Tempering
Following the intuition of the continuous-time Infinite Swapping approach of [18,39], we propose a second discrete-time algorithm, which we will refer to as Weighted Generalized Parallel Tempering (WGPT). The idea behind this method is to swap the dynamics of the process, that is, the Markov kernels and temperatures, instead of the states, in such a way that any given swap is accepted with probability 1.
We will see that the Markov kernel obtained when swapping the dynamics is not invariant with respect to the product measure of interest $\boldsymbol{\mu}$; therefore, an importance sampling step is needed when computing posterior expectations. For a given permutation $\sigma \in S_K$, we define the swapped Markov kernel $\mathbf{p}_\sigma : \Theta^K \times \mathcal{B}^K \to [0, 1]$ and the swapped product posterior measure $\boldsymbol{\mu}_\sigma$ (on the measurable space $(\Theta^K, \mathcal{B}^K)$) as
\[ \mathbf{p}_\sigma := \otimes_{k=1}^K p_{\sigma(k)}, \qquad \boldsymbol{\mu}_\sigma := \times_{k=1}^K \mu_{\sigma(k)}, \]
where the swapped posterior measure has a density with respect to $\boldsymbol{\mu}_{\mathrm{pr}}$ given by
\[ \boldsymbol{\pi}_\sigma(\boldsymbol{\theta}) = \prod_{k=1}^K \pi_{\sigma(k)}(\theta_k). \]
Moreover, we define the swapping weights
\[ w_\sigma(\boldsymbol{\theta}) := \frac{\boldsymbol{\pi}_\sigma(\boldsymbol{\theta})}{\sum_{\sigma' \in S_K} \boldsymbol{\pi}_{\sigma'}(\boldsymbol{\theta})}. \tag{12} \]
Note that, in general, $\boldsymbol{\pi}_\sigma(\boldsymbol{\theta}) = \boldsymbol{\pi}(\boldsymbol{\theta}_{\sigma^{-1}})$, and as such, $w_\sigma(\boldsymbol{\theta}) = r^{(\mathrm{UW})}(\boldsymbol{\theta}, \sigma^{-1})$, with $w_\sigma$ defined as in (12).

Definition 6
We define the Weighted Generalized Parallel Tempering kernel $\mathbf{p}^{(\mathrm{W})} : \Theta^K \times \mathcal{B}^K \to [0, 1]$ as the following state-dependent, convex combination of kernels:
\[ \mathbf{p}^{(\mathrm{W})}(\boldsymbol{\theta}, A) := \sum_{\sigma \in S_K} w_\sigma(\boldsymbol{\theta})\, \mathbf{p}_\sigma(\boldsymbol{\theta}, A). \]
Thus, the WGPT chain is obtained by iteratively applying $\mathbf{p}^{(\mathrm{W})}$. We show in Proposition 4 that the resulting Markov chain has invariant measure
\[ \boldsymbol{\mu}^{\mathrm{W}} := \frac{1}{|S_K|} \sum_{\sigma \in S_K} \boldsymbol{\mu}_\sigma, \]
the average of the swapped product measures. Furthermore, $\boldsymbol{\mu}^{\mathrm{W}}$ has a density (w.r.t. the prior $\boldsymbol{\mu}_{\mathrm{pr}}$) given by
\[ \boldsymbol{\pi}^{\mathrm{W}}(\boldsymbol{\theta}) := \frac{1}{|S_K|} \sum_{\sigma \in S_K} \boldsymbol{\pi}_\sigma(\boldsymbol{\theta}). \]
We now proceed to show the following.

Proposition 4 (Reversibility of the WGPT chain) Suppose that, for every $k = 1, 2, \ldots, K$, $p_k$ is $\mu_k$-reversible. Then, the Markov chain generated by $\mathbf{p}^{(\mathrm{W})}$ is $\boldsymbol{\mu}^{\mathrm{W}}$-reversible.
Proof We show reversibility by showing that (1) holds true. Thus, let $\boldsymbol{\theta} \in \Theta^K$ and $A, B \in \mathcal{B}^K$, with $A := A_1 \times \cdots \times A_K$, $A_k \in \mathcal{B}(\Theta)$, and with $B$ defined in a similar way; once again, in light of Carathéodory's Extension Theorem, it is sufficient to show that reversibility holds for sets that tensorize. The claim then follows from Proposition 2 after multiplying and dividing by $\sum_{\sigma' \in S_K} \boldsymbol{\pi}_{\sigma'}(\boldsymbol{\theta})$.
We remark that the measure $\boldsymbol{\mu}^{\mathrm{W}}$ is not of interest per se. However, we can use importance sampling to compute posterior expectations. Let $Q(\boldsymbol{\theta}) := Q(\theta_1)$ be a $\boldsymbol{\mu}$-integrable quantity of interest. Since $\boldsymbol{\mu}^{\mathrm{W}}$ is invariant by permutation of coordinates, $\mathbb{E}_{\boldsymbol{\mu}}[Q]$ can be written as an expectation with respect to $\boldsymbol{\mu}^{\mathrm{W}}$ of a weighted sum of evaluations of $Q$ at all components. Thus, we can define the following (weighted) ergodic estimator $\widehat{Q}^{(\mathrm{W})}$ of the posterior mean of a quantity of interest $Q$:
\[ \widehat{Q}^{(\mathrm{W})} := \frac{1}{N - b} \sum_{n=b+1}^{N} \sum_{\sigma \in S_K} w_\sigma\big(\boldsymbol{\theta}^{(n)}\big)\, Q\big(\theta^{(n)}_{\sigma^{-1}(1)}\big), \]
where $N$ is the number of samples in the chain and the importance sampling weights $w_\sigma$ are given by (12). As a result, the WGPT algorithm produces an estimator based on $N \cdot K$ weighted samples, rather than "just" $N$, at the same computational cost as UGPT. Thus, the previous estimator evaluates the quantity of interest $Q$ not only at the points $\theta_1^{(n)}$, but at all components $\theta_k^{(n)}$, $k = 1, \ldots, K$, of the parallel chains.
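A sketch of the weighted estimator under the dynamics-swapping interpretation: each component $k$ of a post-burn-in state contributes $Q(\theta_k)$, weighted by the total mass of permutations that currently assign the cold temperature $T_1$ to component $k$ (our own naming; 0-based indices, so the cold level is index 0):

```python
import numpy as np

def wgpt_estimate(chain, log_pi, permutations, Q, b):
    """Importance-sampling estimate of E_mu[Q] from a WGPT chain:
    every post-burn-in state contributes K weighted evaluations of Q."""
    K = len(log_pi)
    total, count = 0.0, 0
    for theta in chain[b + 1:]:
        # w_sigma(theta) proportional to prod_k pi_{sigma(k)}(theta_k)
        log_w = np.array([sum(log_pi[sigma[k]](theta[k]) for k in range(K))
                          for sigma in permutations])
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        for k in range(K):
            # Mass of permutations assigning the cold level to component k.
            rho_k = sum(w[s] for s, sigma in enumerate(permutations)
                        if sigma[k] == 0)
            total += rho_k * Q(theta[k])
        count += 1
    return total / count
```

With flat tempered densities and the two permutations of $K = 2$ chains, each component is weighted $1/2$, so the estimate is the plain average over both components.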

Remark 3
Although it is known that, in some cases, an importance sampling estimator can be negatively affected by the dimensionality of the parameter space $\Theta$ (see, e.g., [3, Remark 1.17] or [38, Examples 9.1-9.3]), we argue that this is not the case for our estimator. Indeed, notice that the importance-sampling weights are always upper bounded by $|S_K|$, and do not blow up when the dimension goes to infinity. In Section 5.7 we present a numerical example on a high-dimensional problem. The results in that section evidence the robustness of WGPT with respect to the dimension of $\theta$.
The Weighted Generalized Parallel Tempering procedure is shown in Algorithm 3. To reiterate, we remark that sampling from $\mathbf{p}_\sigma(\boldsymbol{\theta}^{(n)}, \cdot)$ involves a swap of dynamics, i.e., kernels and temperatures.

Just as in Remark 2, one only needs to evaluate the posterior $K$ times (instead of $|S_K|$) to compute the weights $w_\sigma(\boldsymbol{\theta}^n)$.

Preliminaries
We assume that the chains generated by the MCMC kernels $p_k$, $k = 1, \ldots, K$, are aperiodic, $\mu_k$-irreducible [3], and have invariant measure $\mu_k$ on the measurable space $(\Theta, \mathcal{B}(\Theta))$. Let $r \in [1, \infty)$ and let $\mu \in \mathcal{M}_1(\Theta)$ be a "reference" probability measure; in a BIP setting, this reference measure is taken to be the posterior. We define the following spaces:
\[ L_r(\Theta, \mu) := \left\{ f : \Theta \to \mathbb{R} \text{ measurable} \,:\, \|f\|_{L_r(\Theta,\mu)} := \left( \int_\Theta |f|^r\, d\mu \right)^{1/r} < \infty \right\}, \]
\[ L^0_r(\Theta, \mu) := \left\{ f \in L_r(\Theta, \mu) \,:\, \int_\Theta f\, d\mu = 0 \right\}. \]
Moreover, when $r = \infty$, we define $L_\infty(\Theta, \mu)$ as the space of $\mu$-essentially bounded measurable functions. Notice that, clearly, $L^0_r(\Theta, \mu) \subset L_r(\Theta, \mu)$. In addition, we define the analogous spaces of signed measures with $L_r$-densities with respect to $\mu$. Notice that the definition of the $L_r$-norm depends on the reference measure $\mu$ and on $\Theta$.
Our path to prove ergodicity of the GPT algorithms will be to show the existence of an L 2 -spectral gap.
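For intuition, on a finite state space the L_2(µ)-spectral gap of a µ-reversible chain can be computed directly: conjugating the transition matrix by diag(√µ) yields a symmetric matrix with the same spectrum, and the gap is one minus the second-largest eigenvalue modulus. The following toy sketch (not part of the paper's analysis) illustrates this.

```python
# Toy illustration: L2-spectral gap of a reversible finite-state chain.
import numpy as np

def l2_spectral_gap(P, mu):
    """P: reversible transition matrix; mu: its invariant distribution."""
    d = np.sqrt(mu)
    S = (d[:, None] * P) / d[None, :]   # symmetric similar matrix D^{1/2} P D^{-1/2}
    evals = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
    return 1.0 - evals[1]               # 1 minus the second-largest modulus

# Two-state reversible chain with uniform invariant measure:
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
mu = np.array([0.5, 0.5])
gap = l2_spectral_gap(P, mu)            # eigenvalues are 1 and 0.8, so gap = 0.2
```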

Geometric ergodicity and L 2 -spectral gap for GPT
The main results of this section are presented in Theorem 1 and Theorem 2, which show the existence of an L 2 -spectral gap for both the UGPT and WGPT algorithms, respectively.
We begin with the definition of overlap between two probability measures. Such a concept will later be used to bound the spectral gap of the GPT algorithms.
These assumptions are relatively mild. In particular, C1 and C2 are known to hold for many commonly-used Markov transition kernels, such as RWM, the Metropolis-adjusted Langevin algorithm, Hamiltonian Monte Carlo, and the (generalized) preconditioned Crank-Nicolson, among others, under mild regularity conditions on π [3,22]. Assumption C3 holds true given the construction of the product measures in Section 3.
We now present an auxiliary result that we will use to bound the spectral gap of both the Weighted and Unweighted GPT algorithms.
We can use the previous result to prove the geometric ergodicity of the UGPT algorithm: Theorem 1 (Ergodicity of UGPT) Suppose Assumption 1 holds and denote by µ the invariant measure of the UGPT Markov operator P^(UW). Then, P^(UW) has an L_2(Θ^K, µ)-spectral gap. Moreover, the chain generated by P^(UW) is L_r(Θ^K, µ)-geometrically ergodic for any r ∈ [1, ∞].
Proof Recall that P^(UW) := Q^(UW) P Q^(UW). From the definition of the operator norm, we have that

||P^(UW)||_{L_2^0(Θ^K, µ)} ≤ ||Q^(UW)||_{L_2^0} ||P||_{L_2^0} ||Q^(UW)||_{L_2^0} ≤ ||P||_{L_2^0} < 1,

where the last two inequalities follow from Proposition 6 and the fact that Q^(UW) is a weak contraction in L_2(Θ^K, µ) (see, e.g., [4, Proposition 1]). Lastly, L_r(Θ^K, µ)-geometric ergodicity for all r ∈ [1, ∞] follows from Proposition 5 and the fact that P^(UW) is µ-reversible by Proposition 3.
We now turn to proving geometric ergodicity for the WGPT algorithm. We begin with an auxiliary result, lower-bounding the variance of a µ W -integrable functional f ∈ L 2 (Θ K , µ W ).
Proposition 7 Let f ∈ L_2^0(Θ^K, µ_W) be a µ_W-integrable function such that ||f||_{L_2(Θ^K, µ_W)} = 1, and denote by V_{µ_W}[f] and V_{µ_σ}[f] the variance of f with respect to µ_W and µ_σ, respectively, with σ ∈ S_K. In addition, suppose Assumption 1 holds. Then, it can be shown that
Proof The proof is technical and tedious and is presented in Appendix A.
We are finally able to prove the ergodicity of the WGPT algorithm.

Proof Let f ∈ L_2^0(Θ^K, µ_W) be such that ||f||_{L_2(Θ^K, µ_W)} = 1. Then, from the definition of the operator norm,

where the second-to-last line follows from the convexity of (·)^2 and the last line follows from the definitions of w_σ and µ_W. Now let f̄_σ := µ_σ(f). Thus, multiplying and dividing I by the appropriate normalization, we obtain from the definition of ||P_σ||^2_{L_2^0} that

Replacing Equation (19) into Equation (17), we get

Thus, P^(W) has an L_2(Θ^K, µ_W)-spectral gap. Once again, L_r(Θ^K, µ_W)-geometric ergodicity (with r ∈ [1, ∞]) follows from Proposition 5 and the fact that P^(W) is µ_W-reversible by Proposition 4.

Discussion and comparison to similar theoretical results
Theorems 1 and 2 state the existence of an L_2-spectral gap, and hence L_r-geometric ergodicity, for both the UGPT and the WGPT algorithms. Their proofs also provide a quantification of the L_2-spectral gap in terms of the L_2-spectral gap of each individual Markov operator P_k. Such a bound is, however, not satisfactory, as it does not use any information on the temperatures: it merely states that the L_2-spectral gap of the UGPT and WGPT chains is no worse than the smallest L_2-spectral gap among the individual chains (without swapping). This result is not sharp, as evidenced in the numerical section, where a substantial improvement in convergence is achieved by our methods.
Convergence results for the standard parallel tempering algorithm have been obtained in [36] and [45]. In particular, [36] proved geometric ergodicity of the pairwise parallel tempering algorithm using the standard drift condition construction of [35]; it is unclear from that work which convergence rate is obtained for the whole algorithm. In comparison, our results are given in terms of spectral gaps. On the other hand, [45] presents conditions for rapid mixing of a particular type of parallel tempering algorithm whose transition kernel is to be understood as a convex combination of kernels, as opposed to our case, where it is to be understood as a tensorization. For their setting, the results obtained there provide a better convergence rate than the one we obtained for the UGPT. We believe that their result can be extended to the UGPT algorithm, and this will be the focus of future work. The use of the ideas in [45] for the WGPT algorithm, however, seems more problematic.

Numerical experiments
We now present several academic examples to illustrate the efficiency of both GPT algorithms discussed herein, comparing them to the more traditional random walk Metropolis and standard PT algorithms. Notice that we compare the different algorithms in their simplest version, which uses random walk Metropolis as the base transition kernel. The only exception is in Section 5.7, which presents a high-dimensional BIP for which the preconditioned Crank-Nicolson (pCN) [10] is used as the base method in all algorithms instead of RWM. More advanced samplers, such as Adaptive Metropolis [20,21], or other transition kernels, could be used as well to replace RWM or pCN. Experiments 5.3, 5.4, and 5.5 were run on a Dell(R) Precision(TM) T3620 workstation with an Intel(R) Core(TM) i7-7700 CPU and 32 GB of RAM. The numerical simulations in Sections 5.3 and 5.5 were run on a single thread, while those in Section 5.4 were run in an embarrassingly parallel fashion over 8 threads using the Message Passing Interface (MPI) and the Python package MPI4py [13]. Lastly, experiments 5.6 and 5.7 were run on the Fidis cluster at EPFL. The scripts used to generate the results presented in this section were written in Python 3.6 and can be found at DOI: 10.5281/zenodo.4736623.

Implementation remarks
In most Bayesian inverse problems, particularly those dealing with large-scale computational models, the computational cost is dominated by the evaluation of the forward operator, which can be, for example, the numerical approximation of a possibly non-linear partial differential equation. In the case where all possible permutations are considered (i.e., S_K is the full symmetric group), there are K! possible permutations of the states, and the computation of the swapping ratio in the GPT algorithms can become prohibitively expensive if one were to evaluate K! forward models, even for moderate values of K. This problem can be circumvented by storing the values π(θ^(n)_k), k = 1, . . . , K, n = 1, . . . , N, since the swapping ratio for GPT consists of permutations of these values, divided by the temperature parameters. Thus, "only" K forward model evaluations need to be computed at each step, and the swapping ratio can be computed at negligible additional cost for moderate values of K.
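A sketch of this caching strategy follows, under the (assumed) convention that chain k targets π^{1/T_k} and that a permutation σ reassigns the stored log-posterior values to the temperatures; only the K cached values are touched, with no further forward solves.

```python
# Sketch: form every permutation's tempered log-density from K cached values.
import itertools
import numpy as np

def swap_log_ratios(log_post, inv_temps):
    """log_post[k]: cached log pi(theta_k) (one forward solve per chain).
    inv_temps[k]: 1 / T_k.
    Returns log pi_sigma(theta) - log pi_id(theta) for every permutation."""
    K = len(log_post)
    log_pi_id = np.dot(inv_temps, log_post)
    out = {}
    for sigma in itertools.permutations(range(K)):
        # tempered product density under the permuted state assignment
        log_pi_sigma = sum(inv_temps[k] * log_post[sigma[k]] for k in range(K))
        out[sigma] = log_pi_sigma - log_pi_id
    return out
```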
There is, however, a clear trade-off between the choice of K (which has a direct impact on the efficiency of the method) and the computational cost associated with (G)PT. Intuitively, a large K provides better mixing; however, it requires a larger number of forward model evaluations, which tend to be costly. We remark that such a trade-off between efficiency and number of function evaluations is also present in some advanced MCMC methods, such as Hamiltonian Monte Carlo, where one needs to choose the number of steps of the time integration (see, e.g., [5]). Furthermore, there is an additional constraint when choosing S_K = S_K: the permutation cost associated with computing r^(UW)(θ, σ) and w_σ(θ). In particular, the computation of either of those quantities has complexity O(K!); thus, this cost will eventually surpass the cost of evaluating the forward model K times. This is illustrated in Figure 1, where we plot the cost per sample versus K for two different posteriors, taken from the numerical examples in Sections 5.5 and 5.7. The posterior in Section 5.5 is rather inexpensive to evaluate, since one can compute the forward map F analytically (the difficulty associated with sampling from that posterior comes from its high multi-modality). On the contrary, evaluating the posterior in Section 5.7 requires numerically approximating the solution of a time-dependent, second-order PDE, and as such is costly. As we can see, for K ≤ 6 the computational cost in both cases is dominated by the forward model evaluation; for the more expensive posterior (27), the cost per sample remains dominated by the forward model evaluation up to K ≤ 9.
Thus, for large values of K, it is advisable to consider only the union of properly chosen semi-groups A, B of S_K, with A ∩ B = ∅, such that A and B generate S_K (i.e., the smallest semi-group containing A and B is S_K itself) and |A ∪ B| < |S_K| = K!. This is referred to as partial infinite swapping in the continuous-time case [18]. One particular way of choosing A and B is to take, for example, A to be the set of permutations that only permute the indices associated with relatively low temperatures while leaving the other indices unchanged, and B the set of permutations of the indices of relatively high temperatures, again leaving the other indices unchanged. Intuitively, swaps between temperatures that are, in a sense, "close" to each other tend to be chosen with a higher probability. We refer the reader to [18, Section 6.2] for a further discussion of this approach in the continuous-time setting. An additional idea would be to consider swapping schemes that, for example, only permute states between µ_i and µ_{i+1}, µ_{i+2}, . . . , µ_{i+ℓ} for some user-defined ℓ ≥ 1 and any given i = 1, 2, . . . , K − 1, the intuition again being that swaps between posteriors at close temperatures are more likely to occur than swaps between posteriors with a large temperature difference. We intend to explore this further in future work.
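One simple instance of such a restricted swapping set, written here as a hypothetical helper (not the paper's construction), keeps the identity plus all transpositions that exchange temperatures at most ℓ apart:

```python
# Hypothetical helper: identity plus transpositions of nearby temperatures.
import itertools

def neighbor_transpositions(K, ell=1):
    """Return the identity and every transposition (i, j) with j - i <= ell."""
    perms = [tuple(range(K))]                      # identity permutation
    for i, j in itertools.combinations(range(K), 2):
        if j - i <= ell:
            p = list(range(K))
            p[i], p[j] = p[j], p[i]                # swap temperatures i and j
            perms.append(tuple(p))
    return perms
```

With ℓ = 1 this yields K permutations in total instead of K!, so the permutation cost grows linearly in K.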
We reiterate that the total number of temperatures K depends heavily on the problem and the computational budget available [17,42,46]. For the experiments considered in this work we chose K = 4 or K = 5, which provides an acceptable compromise between acceleration and cost.

Experimental setup
We now describe an experimental setup common to all the numerical examples presented in the following subsections. In particular, all the experiments in this work use a base method given by either RWM (experiments 5.3 through 5.6) or pCN (experiment 5.7) for the Markov transition kernels p. Furthermore, we take S_K = S_K for all experiments, with K = 5 for experiment 5.5 and K = 4 for the other four experiments. In addition, we follow the rule of thumb of [19] for the choice of temperatures, setting, for each experiment, T_k = a^(k−1), k = 1, . . . , K, for some constant a > 1. The particular choice of a is problem-dependent and is generally made so that µ_K becomes sufficiently simple to explore. For each experiment we implement 5 MCMC algorithms to sample from a given posterior µ = µ_1, namely, the base (untempered) method (either RWM or pCN), and the same method combined with the standard PT algorithm (PT) with N_s = 1, the PSDPT algorithm of [26], and both versions of GPT. In our setting, the tempered algorithms have a cost (in terms of the number of likelihood evaluations) that is K times larger than that of the base method; thus, to obtain a fair comparison across all algorithms, we run the chain for the base method K times longer. Lastly, given some problem-dependent quantity of interest Q, we assess the efficiency of our proposed algorithms for computing the posterior expectation of Q by comparing either the mean square error (experiments 5.3-5.5), for which the exact value of E_µ[Q] is known, or the variance (experiments 5.6-5.7) of the ergodic estimator Q̂ obtained over N_runs independent runs of each algorithm.
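The rule of thumb T_k = a^(k−1) can be implemented by fixing the hottest temperature T_K and solving for a; the helper below is a trivial sketch of this choice.

```python
# Geometric temperature ladder T_k = a^(k-1), with a chosen so that the
# hottest chain reaches a user-specified maximum temperature T_max.
def temperature_ladder(K, T_max):
    a = T_max ** (1.0 / (K - 1))
    return [a ** k for k in range(K)]
```

For example, K = 4 and T_max = 5000 reproduce the ladder 1, 17.1, 292.4, 5000 used in the quarter-circle experiment below.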

Density concentrated over a quarter circle-shaped manifold
Let µ be a probability measure that has density π with respect to the uniform (Lebesgue) measure on the unit square, µ_pr = U([0, 1]^2), where θ = (θ_1, θ_2), Z is the normalization constant, and 1_{[0,1]^2} is the indicator function of the unit square. We remark that this example is not of particular interest per se; however, it can be used to illustrate some of the advantages of the algorithms discussed herein. The difficulty of sampling from such a distribution comes from the fact that its density is concentrated over a quarter circle-shaped manifold, as can be seen in the left-most plot in Figure 2. This in turn implies that a single-level RWM chain would need to take very small steps in order to properly explore the density. We aim at estimating Q_i = E_µ[θ_i] ≈ θ̂_i, for i = 1, 2. For the tempered algorithms (PT, PSDPT, UGPT, and WGPT), we consider K = 4 temperatures and choose T_4 = 5000, so that the tempered density π_4 becomes sufficiently simple to explore. This gives T_1 = 1, T_2 = 17.1, T_3 = 292.4, T_4 = 5000. We compare the quality of our algorithms by examining the variance of the estimators θ̂_i, i = 1, 2, computed over N_runs = 100 independent MCMC runs of each algorithm. For the tempered algorithms, each estimator is obtained by running the inversion experiment for N = 25,000 samples per run, discarding the first 20% of the samples (5000) as a burn-in. Accordingly, we run the single-chain random walk Metropolis algorithm for N_RWM = KN = 100,000 iterations per run and discard the first 20% of the samples obtained with the RWM algorithm (20,000) as a burn-in. Figure 2 shows the tempered densities (up to T_4 = 5000) for this example: the density becomes less concentrated as the temperature increases, which allows us to use RWM proposals with larger step sizes.
The untempered RWM algorithm uses Gaussian proposals with covariance matrix Σ_RWM = ρ_1^2 I_{2×2}, where I_{2×2} is the identity matrix in R^{2×2} and ρ_1^2 = 0.022 is chosen in order to obtain an acceptance rate of around 0.23. For the tempered algorithms (i.e., PT, PSDPT, and both versions of GPT), we use K = 4 RWM kernels p_k, k = 1, 2, 3, 4, with proposal densities q_prop,k(θ^(n)_k, ·) = N(θ^(n)_k, ρ_k^2 I_{2×2}), with the step sizes ρ_k reported in Table 1 (step sizes of the RWM proposal distribution for the manifold experiment). This choice of ρ_k gives an acceptance rate of around 0.23 for each chain. Notice that ρ_1 corresponds to the "step size" of the single-temperature RWM algorithm. Experimental results for the ergodic run are shown in Table 2 (results for the density concentrated around a quarter circle-shaped manifold experiment; the computational cost is comparable across all algorithms). We can see that both GPT algorithms provide a gain over the RWM, PT, and PSDPT algorithms, with the WGPT algorithm providing the largest gain. Scatter plots of the samples obtained with each method are presented in Figure 3. Here, the subplot titled "WGPT" (bottom row, middle) corresponds to weighted samples from µ_W, with weights ŵ as in (13), while the one titled "WGPT (inv)" (bottom row, right) corresponds to samples from µ_W without any post-processing. Notice how the samples from the latter concentrate over a thicker manifold, which in turn makes the target density easier to explore when using state-dependent Markov transition kernels.
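For concreteness, a single RWM step on the tempered density π^{1/T_k} with step size ρ_k can be sketched as follows; the function and argument names are illustrative, not the paper's implementation.

```python
# One random-walk Metropolis step on the tempered density pi^(1/T).
import numpy as np

def rwm_step(theta, log_pi, inv_temp, rho, rng):
    """theta: current state; log_pi: untempered log-density;
    inv_temp: 1/T; rho: proposal step size; rng: numpy Generator."""
    prop = theta + rho * rng.standard_normal(theta.shape)
    log_alpha = inv_temp * (log_pi(prop) - log_pi(theta))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True        # accept the proposed move
    return theta, False          # reject and keep the current state
```

Larger T (smaller 1/T) flattens the acceptance ratio, which is why the hotter chains can use larger step sizes at a comparable acceptance rate.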

Multiple source elliptic BIP
We now consider a slightly more challenging problem, in which we try to recover the probability distribution of the location of a source term in a Poisson equation (Eq. (20)) from some noisy measured data. Let (Θ, B(Θ), µ_pr) be the measure space, set Θ = D̄ := [0, 1]^2, with Lebesgue (uniform) measure µ_pr, and consider Poisson's equation with homogeneous boundary conditions (Eq. (20)). Such an equation can model, for example, the electrostatic potential u := u(x, θ) generated by a charge density f(x, θ) depending on an uncertain location parameter θ ∈ Θ. Data y is recorded on an array of 64 × 64 equally-spaced points in D by solving (20) with a forcing term (Eq. (21)) given by the superposition of four sources at the true locations s^(i), i = 1, 2, 3, 4. The data is assumed to be polluted by additive Gaussian noise with distribution N(0, η^2 I_{64×64}), with η = 3.2 × 10^{−6} (which corresponds to 1% noise), where I_{64×64} is the identity matrix. Thus, we set (Y, ||·||_Y) = (R^{64×64}, ||·||_Σ). We assume a misspecified model in which we only consider a single source in Eq. (21); that is, we construct our forward operator F : Θ → Y by solving (20) with a single-source term (Eq. (22)). In this particular setting, this leads to a posterior distribution with four modes, since the prior density is uniform in the domain and the likelihood has a local maximum whenever (θ_1, θ_2) = (s_1^(i), s_2^(i)), i = 1, 2, 3, 4. The Bayesian inverse problem at hand can be understood as sampling from the posterior measure µ, which has a density with respect to the prior µ_pr = U(D) for some (intractable) normalization constant Z as in (4). We remark that the solution of (20) with a forcing term of the form (22) is approximated using a second-order accurate finite difference approximation with grid size h = 1/64 in each spatial component. The difficulty in sampling from the current BIP arises from the fact that the resulting posterior µ is multi-modal and the number of modes is not known a priori (see Figure 4).
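A forward solve of this type can be illustrated with a minimal 5-point finite-difference discretization of the (negative) Laplacian with homogeneous Dirichlet boundary conditions. This dense-matrix sketch is for exposition only and does not reproduce the paper's 1/64-grid solver.

```python
# Minimal 5-point finite-difference solve of -Laplacian(u) = f on the unit
# square with homogeneous Dirichlet data; f is sampled on the interior grid.
import numpy as np

def solve_poisson(f_grid, h):
    """f_grid: (n, n) array of the source on the interior nodes; h: grid size."""
    n = f_grid.shape[0]
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0                          # center of the 5-point stencil
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:    # neighbors outside are zero (Dirichlet)
                    A[k, ii * n + jj] = -1.0
    u = np.linalg.solve(A / h ** 2, f_grid.ravel())
    return u.reshape(n, n)
```

A quick sanity check: f(x, y) = sin(πx) sin(πy) is an exact eigenvector of this discrete Laplacian, so the computed solution matches f divided by the discrete eigenvalue to machine precision.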
We follow an experimental setup similar to the previous example and aim at estimating Q_i = E_µ[θ_i] ≈ θ̂_i, for i = 1, 2. We use K = 4 temperatures and N_runs = 100. For the PT, PSDPT, and GPT algorithms, four different temperatures are used, with T_1 = 1, T_2 = 7.36, T_3 = 54.28, and T_4 = 400 (Figure 4 shows the true tempered densities for the elliptic BIP example; notice that the density is not symmetric, due to the additive random noise). For each run, we obtain N = 25,000 samples with the PT, PSDPT, and both GPT algorithms, and N = 100,000 samples with RWM, discarding the first 20% of the samples in both cases (5000 and 20,000, respectively) as a burn-in. On each of the tempered chains we use RWM proposals, with step sizes shown in Table 3. This choice of step size provides an acceptance rate of about 0.24 across all tempered chains and all tempered algorithms. For the single-temperature RWM run, we choose a larger step size (ρ_RWM = 0.16) so that the RWM algorithm is able to explore the whole distribution. Such a choice, however, yields a smaller acceptance rate of about 0.01 for the single-chain RWM.
Experimental results are shown in Table 4. Once again, we can see how both GPT algorithms provide a gain over RWM and both variations of the PT algorithm, with the WGPT algorithm providing a larger gain. Scatter-plots of the obtained samples are shown in Figure 4.

1D wave source inversion
We consider a small variation of Example 5.1 in [37].

Table 3 Step sizes of the RWM proposal distribution for the elliptic BIP experiment.
              k = 1   k = 2   k = 3   k = 4
ρ_k,Tempered  0.030   0.100   0.400   0.600
ρ_k,RWM       0.160   -       -       -

Table 4 Results for the elliptic BIP problem. The computational cost is comparable across all algorithms, given that the cost of each iteration is dominated by the cost of solving the underlying PDE. Samples obtained with UGPT, WGPT (after re-weighting the samples), and WGPT before re-weighting the samples show that WGPT (before re-weighting) is able to "connect" the parameter space.

Let (Θ, B(Θ), µ_pr) be a measure space, with Θ = [−5, 5] and uniform (Lebesgue) measure µ_pr, and let
I = (0, T] be a time interval. Consider the following Cauchy problem for the 1D wave equation (Eq. (23)), where h(x, θ) acts as a source term generating an initial wave pulse. Notice that Equation (23) can be easily solved using d'Alembert's formula (Eq. (24)). Synthetic data y is generated by solving Equation (23) with initial data given by a two-pulse source with θ_1 = −3, θ_2 = 3, observed at N_R = 11 equally-spaced receiver locations between R_1 = −5 and R_2 = 5 at N_T = 1000 time instants between t = 0 and T = 5. The signal recorded by each receiver is assumed to be polluted by additive Gaussian noise N(0, η^2 I_{1000×1000}), with η = 0.01, which corresponds to roughly 1% noise. We set (Y, ||·||_Y) = (R^{11×1000}, ||·||_Σ), with y ∈ R^{11×1000}. Once again, we assume a misspecified model in which we construct our forward operator F : Θ → Y by solving (23) with a single-pulse source term. The Bayesian inverse problem at hand can be understood as sampling from the posterior measure µ, which has a density with respect to the prior µ_pr = U([−5, 5]) for some (intractable) normalization constant Z as in (4). The difficulty in solving this BIP comes from the high multi-modality of the potential Φ(θ; y), as can be seen in Figure 6. This shape of Φ(θ; y) makes the posterior difficult to explore using local proposals.
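Since (23) admits a closed-form solution, the forward map can be evaluated exactly. The sketch below uses d'Alembert's formula for an initial displacement with zero initial velocity; the two-bump pulse `h` is a stand-in for the paper's source term, with assumed unit wave speed.

```python
# d'Alembert's solution of u_tt = c^2 u_xx with u(x,0) = h(x), u_t(x,0) = 0.
import numpy as np

def dalembert(h, x, t, c=1.0):
    """Evaluate u(x, t) = (h(x - c t) + h(x + c t)) / 2 on an array of points x."""
    return 0.5 * (h(x - c * t) + h(x + c * t))

# Stand-in for the two-pulse initial datum with theta_1 = -3, theta_2 = 3:
h = lambda x: np.exp(-((x + 3.0) ** 2)) + np.exp(-((x - 3.0) ** 2))
```

At t = 0 the solution coincides with the pulse itself, and for later times the two half-amplitude copies travel left and right at speed c.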
In this case, we consider K = 5 and set T_1 = 1, T_2 = 5, T_3 = 25, T_4 = 125, and T_5 = 625. Notice from Figure 1 that the computational cost per sample is dominated by the evaluation of (24) for values of K ≤ 6. Once again, we obtain N = 25,000 samples with the PT, PSDPT, and both GPT algorithms, and N = 125,000 samples with RWM, discarding the first 20% of the samples in both cases (5000 and 25,000, respectively) as a burn-in. On each of the tempered chains we use RWM proposals, with step sizes shown in Table 5. This choice of step size provides an acceptance rate of about 0.4 across all tempered chains and all tempered algorithms. The step size for the single-chain RWM algorithm is chosen in such a way that it can "jump" between modes, which are at a distance of roughly 1/2 from each other. We consider Q = θ as the quantity of interest. Experimental results are shown in Table 6. Once again, we can see that both GPT algorithms provide a gain over RWM and both variations of the PT algorithm, with the WGPT algorithm providing the largest gain. Notice that, given the high multi-modality of the posterior at hand, the simple RWM algorithm is not well-suited for this type of distribution, as can be seen from its large variance; this suggests that RWM usually gets "stuck" at one mode of the posterior. Intuitively, due to the symmetric nature of the potential, one would expect the true mean of θ to be close to 0; this value was computed by means of numerical integration.

Acoustic wave source inversion
We now consider a more challenging problem, in which we try to recover the probability distribution of the spatial location of a (point-like) source term, together with the material properties of the medium, in an acoustic wave equation (Eq. (25)), from some noisy measured data. We begin by describing the mathematical model of the wave phenomena. Let (Θ, B(Θ), µ_pr) be the measure space with Lebesgue (uniform) measure µ_pr, set D̄ := [0, 3] × [0, 2], ∂D = Γ̄_N ∪ Γ̄_Abs, Γ̄_N ∩ Γ̄_Abs = ∅, |Γ_N|, |Γ_Abs| > 0, and define Θ = D × Θ_α × Θ_β, where Θ_α = [6, 14] and Θ_β = [4500, 5500]. Here we consider a rectangular spatial domain D, with the top boundary denoted by Γ_N and the side and bottom boundaries denoted by Γ_Abs. Lastly, let θ := (s_1, s_2, α, β) ∈ Θ. Consider the acoustic wave equation with absorbing boundary conditions (Eq. (25)), where u = u(x, t, θ) and f = f(x, t, θ). Here the condition on the top boundary Γ_N is a Neumann boundary condition, while the condition on Γ_Abs is a so-called absorbing boundary condition, a type of artificial boundary condition used to minimize the reflection of waves hitting the boundary. Data y ∈ Y is obtained by solving Equation (25) with a force term given by (26), with a true set of parameters θ* := (s_1, s_2, α, β) given by s_1 = 1.5, s_2 = 1.0, α = 10, β = 5000, observed at N_R = 3 receiver locations R_1 = (1.0, 2.0), R_2 = (1.5, 2.0), R_3 = (2.0, 2.0) at N_T = 117 equally-spaced time instants between t = 0 and t = 0.004. In physical terms, the parameters s_1, s_2 represent the source location, while the parameters α, β are related to the material properties of the medium. Notice that, in a slight abuse of notation, we have used the symbol π to represent the number 3.14159... in Equation (26); it should not be confused with the symbol for the density.
The data measured by each receiver is polluted by additive Gaussian noise N(0, η^2 I_{117×117}), with η = 0.013, which corresponds to roughly 2% noise. Thus, we have that (Y, ||·||_Y) = (R^{3×117}, ||·||_Σ), and the forward operator F : Θ → Y can be understood as the numerical solution of Equation (25) evaluated at 117 discrete time instants at each of the 3 receiver locations. Such a numerical approximation is obtained by the finite element method using piece-wise linear elements, and the time stepping is done using a forward Euler scheme with sufficiently small time steps to respect the so-called Courant-Friedrichs-Lewy condition [40]. The numerical solution is implemented using the Python library FEniCS [29], with 40 × 40 triangular elements. The Bayesian inverse problem at hand can thus be understood as sampling from the posterior measure µ, which has a density with respect to the prior µ_pr = U(Θ). This BIP presents two difficulties. On the one hand, Equation (25) is typically expensive to solve, which in turn translates into expensive evaluations of the posterior density. On the other hand, the log-likelihood has an extremely complicated structure, which makes its exploration difficult. This can be seen in Figure 7, where we plot the log-likelihood for different source locations (s_1, s_2) and fixed values of the material properties α = 10, β = 5000. More precisely, we plot Φ̂((s_1, s_2); y) := −(1/2)||y − F(s_1, s_2, 10, 5000)||^2_Σ on a grid of 100 × 100 equally-spaced points (s_1, s_2) in D. It can be seen that, even though the log-likelihood has a clear peak around the true value of (s_1, s_2), there are also regions of relatively high log-probability surrounded by regions with significantly smaller log-probability, making this a suitable problem for our setting.
Following the same setup as in the previous experiments, we aim at estimating Q_i = E_µ[θ_i] ≈ θ̂_i, for i = 1, 2. Once again, we consider K = 4 temperatures for the tempered algorithms (PT, PSDPT, UGPT, and WGPT) and set the temperatures to T_1 = 1, T_2 = 7.36, T_3 = 54.28, T_4 = 400. We compare the quality of our algorithms by examining the variance of the estimators θ̂_i, i = 1, 2, computed over N_runs = 50 independent MCMC runs of each algorithm. For each run, we run the tempered algorithms to obtain N = 7,000 samples, discarding the first 20% of the samples (1400) as a burn-in. For the RWM algorithm, we run the inversion experiment for N_RWM = KN = 28,000 iterations and discard the first 20% of the samples obtained (5600) as a burn-in.
Each individual chain is constructed using Gaussian RWM proposals q_prop,k(θ^(n)_k, ·) = N(θ^(n)_k, C_k), k = 1, 2, 3, 4, with covariances C_k described in Table 7 (step sizes of the RWM proposal distribution for the acoustic BIP experiment, with covariances proportional to Diag(1, 1, 1, 50)). The covariance is tuned in such a way that the acceptance rate of each chain is around 0.2. The variance of the estimators obtained with each method is presented in Table 8. Once again, our GPT algorithms outperform all other tested methods in this particular setting. In particular, our methods provide large computational gains when compared to RWM and the PSDPT algorithm of [26], as well as moderate computational gains when compared to standard PT.

High-dimensional acoustic wave inversion
Lastly, we present a high-dimensional example in which we invert for the material property β^2 in (25). For simplicity, we consider fixed values α = 1, s_1 = 1.5, and s_2 = 1. In this case, we set β^2 = 10 + β̂^2(x), where β̂(x) is taken to be a realization of a random field discretized on a mesh of N_x × N_y triangular elements. This modeling choice ensures that β^2 is bounded below. We will invert for the nodal values of (the finite element discretization of) β̂, which naturally results in a high-dimensional problem. We remark that one is usually interested in placing the randomness in β^2 instead of β̂; however, we purposely choose the latter to induce an explicitly multi-modal posterior and, as such, to better illustrate the advantages of our proposed methods when sampling from these types of distributions. We begin by formalizing the finite-element discretization of the parameter space (see, e.g., [9] for a more detailed discussion). Let D̄ = [0, 3] × [0, 2] denote the physical space of the problem and let V_h be a finite-dimensional subspace of L_2(D) arising from a given finite element discretization. We write the finite element approximation β̂_h ∈ V_h of β̂ as

β̂_h(x) = Σ_{n=1}^{N_v} θ_n φ_n(x),

where {φ_n}_{n=1}^{N_v} are the Lagrange basis functions corresponding to the nodal points, {θ_n}_{n=1}^{N_v} is the set of nodal parameters, and N_v is the number of vertices used in the FE discretization. Thus, the problem of inferring the distribution of β̂ given some data y can be understood as inferring the distribution of θ given y. For our particular case, we discretize D using 28 × 28 (non-overlapping) piece-wise linear finite elements, which results in N_v = 841 and, as such, Θ = R^841.
We consider a Gaussian prior µ_pr,∞ = N(0, A^{-2}), where A is a differential operator acting on L_2(D) of the form A := −a∇·(H∇) + dI, a, d > 0, together with Robin boundary conditions ∇(·)·n + √(ad)(·) = 0, where, following [43], H is taken of the form

H := [ e_1 sin^2(ℓ) + e_2 cos^2(ℓ)    (e_1 − e_2) sin(ℓ) cos(ℓ)
       (e_1 − e_2) sin(ℓ) cos(ℓ)      e_1 cos^2(ℓ) + e_2 sin^2(ℓ) ],

for some angle ℓ.
Here H models the spatial anisotropy of a Gaussian random field sampled from µ_pr,∞. It is known that, for a two-dimensional (spatial) space, the covariance operator A^{-2} is symmetric and trace-class [9], and as such the (infinite-dimensional) prior measure is well-defined. Thus, we set µ_pr,∞ = N(0, A^{-2}), which in turn induces the discretized prior µ_pr = N(0, A_h^{-2}), where A_h^{-2} is a finite-element approximation of A^{-2} using 28 × 28 (non-overlapping) piece-wise linear finite elements. Samples from µ_pr are obtained using the FEniCS package [29] and the hIPPYlib library [43].
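The link between the precision operator and prior samples can be illustrated in a toy 1-D analogue: if A is symmetric and z ∼ N(0, I), then x = A^{-1}z has covariance A^{-1}A^{-⊤} = A^{-2}. The sketch below is not the hIPPYlib implementation; it uses a plain finite-difference Laplacian with the assumed parameters a and d and ignores the Robin boundary terms.

```python
# Toy 1-D analogue of sampling from N(0, A^{-2}) with A = -a d^2/dx^2 + d I:
# draw z ~ N(0, I) and solve A x = z, so that Cov(x) = A^{-2} for symmetric A.
import numpy as np

def sample_prior(n, a=0.1, d=0.5, h=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    lap = (np.diag(np.full(n, -2.0)) + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / h ** 2   # 1-D Dirichlet Laplacian
    A = -a * lap + d * np.eye(n)                     # symmetric precision-root
    z = rng.standard_normal(n)
    return np.linalg.solve(A, z)
```

Larger a strengthens the smoothing (more spatially correlated samples), while larger d shrinks the pointwise variance.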
We follow an approach similar to our previous example. We collect data y ∈ Y by solving Equation (25) with a force term given by (26) and a true field β̂*_h ∼ µ_pr with a = 0.1, d = 0.5, ℓ = π/4, e_1 = 2, and e_2 = 0.5. Such a realization of β̂*_h is shown in Figure 8.
Furthermore, data is observed at N_R = 5 receiver locations R_1 = (1.0, 2.0), R_2 = (1.25, 2.0), R_3 = (1.5, 2.0), R_4 = (1.75, 2.0), and R_5 = (2.0, 2.0) at N_T = 600 equally-spaced time instants between t = 0 and t = 0.6. The data measured by each receiver is polluted by (independent) additive Gaussian noise η ∼ N(0, σ_noise^2 I_{600×600}), with σ_noise = 0.021, which corresponds to roughly 0.5% noise. Thus, we have that (Y, ||·||_Y) = (R^{5×600}, ||·||_Σ). As in Section 5.6, the forward operator F : Θ → Y can be understood as the numerical solution of Equation (25) evaluated at 600 discrete time instants at each of the 5 receiver locations. The numerical implementation follows a setup similar to that of Section 5.6; however, for simplicity, we use 28 × 28 triangular elements to approximate the forward operator F. The Bayesian inverse problem at hand can thus be understood as sampling from the posterior measure µ, which has a Radon-Nikodym derivative with respect to the prior µ_pr. This BIP has several difficulties. Clearly, it is a high-dimensional posterior. Furthermore, just as in the previous example, the underlying mathematical model for the forward operator is a costly time-dependent PDE. Moreover, by choosing to invert for β̂_h ∼ µ_pr (instead of β_h^2 ≈ β^2), and since µ_pr is centered at zero, we induce a multi-modal posterior: indeed, if the posterior concentrates around β̂*_h, it will also have peaks at any other β̂_j obtained by flipping the sign of β̂*_h in a concentrated region separated by the zero level set of β̂*_h (we identify 7 such regions in Figure 8). This can be seen in Figure 9, where we plot 4 posterior samples β̂_h ∼ µ; notice the change in sign between some regions. Lastly, as quantities of interest, we consider Q_1 = ∫_D exp(β̂(x)) dx and Q_2 = exp(β̂(1.5, 1)).
We remark that, although these quantities of interest do not have a meaningful physical interpretation, they are affected by the multi-modality of the posterior and are thus well suited to exemplify the capabilities of our method.
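Under the Gaussian noise model above, the log of the Radon-Nikodym derivative reduces, up to an additive constant, to a scaled data misfit. The following minimal sketch illustrates this evaluation; the callable `forward` is a placeholder for the PDE-based operator F, which in practice is an expensive numerical solve.

```python
import numpy as np

SIGMA_NOISE = 0.021  # noise standard deviation used in the experiment

def log_likelihood(theta, y, forward, sigma_noise=SIGMA_NOISE):
    """Log of dmu/dmu_pr up to a constant: -0.5 * ||y - F(theta)||_Sigma^2.

    With i.i.d. Gaussian noise of standard deviation sigma_noise at each
    of the 5 x 600 space-time observation points, the Sigma-weighted
    norm reduces to a scaled Euclidean norm of the residual.
    """
    residual = y - forward(theta)  # residual over receivers x time, e.g. shape (5, 600)
    return -0.5 * float(np.sum(residual**2)) / sigma_noise**2
```

This is the quantity each chain of a tempered sampler evaluates at every iteration, which is why the cost of the forward solve dominates the overall run time.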
Given the high dimensionality of the posterior, we present a slightly different experimental setup to estimate E_µ[Q_i] ≈ Q̂_i, i = 1, 2. In particular, we use the preconditioned Crank-Nicolson (pCN) proposal as the base method for the transition kernel p, instead of RWM. We compare the quality of our algorithms by examining the variance of the estimators Q̂_i computed over N_runs = 50 independent MCMC runs of each algorithm, with K = 4 temperatures for the tempered algorithms given by T₁ = 1, T₂ = 4.57, T₃ = 20.89, T₄ = 100. For the tempered algorithms, each estimator is obtained by running the inversion experiment for N = 4,800 samples, discarding the first 20% of the samples (800) as burn-in. For the untempered pCN algorithm, we run the inversion experiment for N_pCN = KN = 19,200 iterations and discard the first 20% of the samples (3,840) as burn-in. Each individual chain is constructed using pCN proposals

q_prop,k(θⁿ_k, ·) = N(√(1 − ρ²_k) θⁿ_k, ρ²_k A⁻²_h),  k = 1, 2, 3, 4,

with ρ_k as reported in Table 9. The simple, untempered pCN algorithm is run with step size ρ = ρ₁. The values of ρ_k are tuned so that the acceptance rate of each chain is around 0.3. The variance of the estimators obtained with each method is presented in Table 10. Once again, even for this high-dimensional, highly multi-modal case, our proposed methods perform considerably better than the other algorithms.

      k = 1   k = 2   k = 3   k = 4
ρ_k   0.1     0.2     0.4     0.8

Table 9: Values of ρ_k for the pCN kernel for the high-dimensional wave inversion problem.

Table 10: Results for the high-dimensional acoustic BIP. As in the previous examples, the computational cost is comparable across all algorithms.
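A single pCN Metropolis step at temperature T_k can be sketched as follows. This is a minimal illustration, not the full GPT sampler: `phi` stands for the negative log-likelihood (misfit), and `prior_sample` is an assumed callable drawing ξ ∼ N(0, C) from the prior covariance (a stand-in for A⁻²_h above).

```python
import numpy as np

def pcn_step(theta, phi, rho, temperature, prior_sample, rng):
    """One pCN Metropolis step targeting the tempered posterior
    proportional to exp(-phi(theta)/T) with respect to the prior.

    theta        : current state (NumPy array)
    phi          : negative log-likelihood (misfit) callable
    rho          : pCN step size in (0, 1]
    temperature  : temperature T of this chain
    prior_sample : callable rng -> xi, a draw from the N(0, C) prior
    """
    xi = prior_sample(rng)
    proposal = np.sqrt(1.0 - rho**2) * theta + rho * xi
    # The pCN proposal leaves the prior invariant, so the acceptance
    # ratio involves only the (tempered) likelihood.
    log_alpha = (phi(theta) - phi(proposal)) / temperature
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return theta, False
```

In a tempered sampler, one such step would be taken per chain per iteration, followed by the swapping (or weighted mixture) move; note that the tempered acceptance ratio makes hot chains (large T_k) nearly insensitive to the misfit, which is what allows them to cross between posterior modes.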

Conclusions and future work
In the current work, we have proposed, implemented, and analyzed two versions of the GPT algorithm and applied them in a BIP context. We have demonstrated that such algorithms produce reversible and geometrically ergodic chains under relatively mild conditions. As shown in Section 5, these sampling algorithms provide an attractive alternative to standard Parallel Tempering when sampling from difficult (i.e., multi-modal or concentrated around a manifold) posteriors. We remark that the framework considered herein can be combined with other, more advanced MCMC algorithms, such as the Metropolis-adjusted Langevin algorithm (MALA) or Delayed Rejection Adaptive Metropolis (DRAM) [20]. We intend to carry out a number of future extensions of the work presented herein. One of our short-term goals is to extend the methodology developed in the current work to a Multi-level Markov Chain Monte Carlo context, as in [16,32]. In addition, from a theoretical point of view, we would like to investigate the role that the number of chains and the choice of temperatures play in the convergence of the GPT algorithm, as has been done previously for Parallel Tempering in [45]. Improving on the estimates presented here would likely be the focus of future work. We also believe that by excluding the identity permutation (i.e., id ∉ S_K) in the UGPT, one could obtain a swapping kernel that is better in the so-called Peskun sense; see [1] for more details. We intend to carry out further numerical experiments to better understand and compare swapping strategies. Furthermore, from a computational perspective, given that the framework presented in this work is, in principle, dimension-independent, the methods explored here can also be combined with dimension-independent samplers such as those presented in [5,12], thus providing a sampling algorithm robust to both multi-modality and large dimensionality of the parameter space.
Given the additional computational cost of these methods, a non-trivial coupling of GPT with such samplers needs to be devised. Lastly, we aim to apply the methods developed in the current work to more computationally challenging BIPs, in particular those arising in seismology and seismic source inversion, where it is not uncommon to find multi-modal posterior distributions when inverting for a point source.