Convergence rate of multiple-try Metropolis independent sampler

The multiple-try Metropolis method is an interesting extension of the classical Metropolis–Hastings algorithm. However, theoretical understanding of its usefulness and convergence behavior is still lacking. We here derive the exact convergence rate for the multiple-try Metropolis Independent sampler (MTM-IS) via an explicit eigen analysis. As a by-product, we prove that a naive application of the MTM-IS is less efficient than the simpler approach of the "thinned" independent Metropolis–Hastings method at the same computational cost. We further explore more variants and find it possible to design more efficient algorithms by applying MTM to part of the target distribution or creating correlated multiple trials.


1 Introduction

Fundamental Metropolis-Hastings method
Markov chain Monte Carlo (MCMC) methods have played important roles in statistical computing and Bayesian inference and have attracted much attention from both theoretical researchers and practitioners. In a nutshell, this set of methods provides general and practical recipes for generating random draws from a target probability distribution.

Geometric convergence
A Markov chain with transition function A is said to be geometrically ergodic if, for π-almost every x, ‖A^n(x, ·) − π(·)‖ ≤ C(x) r^n holds with some constant r ∈ (0, 1). Here ‖·‖ denotes a distance metric between two probability measures, usually taken as the total variation (TV) distance. Other modes of convergence, such as convergence in the χ²-distance (which implies convergence in total variation), have also been investigated (Liu et al., 1995; Liu, 2008). Establishing this inequality and deriving sharp bounds on the rate r are seen as central tasks in studying MCMC algorithms (Tierney, 1994; Liu et al., 1995; Roberts and Tweedie, 1996).
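As a quick numerical illustration (ours, not from the paper), the geometric decay ‖A^n(x, ·) − π(·)‖ ≤ C(x) r^n can be observed for a toy two-state chain, where the rate r is governed by the second-largest eigenvalue of the transition matrix:

```python
import numpy as np

# Two-state chain with transition matrix A; its stationary distribution pi
# solves pi A = pi, and its eigenvalues are 1 and 0.7.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])

def tv_after(n):
    """TV distance between the n-step distribution started at state 0 and pi."""
    row = np.linalg.matrix_power(A, n)[0]
    return 0.5 * np.abs(row - pi).sum()

# The ratio tv_after(n+1)/tv_after(n) stays at the second eigenvalue 0.7,
# i.e. the TV distance decays geometrically with rate r = 0.7.
ratios = [tv_after(n + 1) / tv_after(n) for n in range(1, 10)]
```

For this two-state chain the decay is exactly geometric; in general the eigen-analysis of Sections 2 and 3 plays the role of the second eigenvalue here.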
As a generalization of the standard Metropolis-Hastings algorithm, the Multiple-Try Metropolis (MTM) scheme, as formalized in Liu et al. (2000), allows one to draw multiple trials at each step and select one according to a specially designed probability distribution. Although intuitively the MTM scheme enables one to escape from local optima more easily, there is little theoretical understanding of the convergence rate of any form of the MTM algorithm, making it a challenging practical concern to decide whether an MTM approach should be employed for a specific problem. Existing theoretical results on the Metropolis-Hastings algorithm clearly cannot be easily extended to the MTM algorithm. Indeed, obtaining sharp bounds on the convergence rate of any general-purpose Metropolis-Hastings algorithm can be extremely challenging, except for the Independent Metropolis-Hastings (IMH) algorithm (also called the Metropolised independence sampler by Liu (1996) and the independence Metropolis chain by Tierney (1994)). We are therefore tempted to consider whether the IMH's multiple-try version, which we call the Multiple-Try Metropolis Independent sampler (MTM-IS), can be tackled theoretically.

Convergence rate of the Independent Metropolis-Hastings algorithm
Geometric ergodicity is not guaranteed for a general Metropolis-Hastings algorithm unless suitable restrictions are imposed (Roberts and Tweedie, 1996), and exact convergence rates for Metropolis-Hastings algorithms are rarely available (Diaconis and Saloff-Coste, 1998). In practice, geometric ergodicity is often established under the 'drift-and-minorization' framework (Diaconis et al., 2008), but this technique usually results in a very conservative bound on the convergence rate that is not of much practical use. Because of the very special structure of the IMH algorithm, an explicit eigen-analysis of its transition matrix in the finite discrete state-space case was obtained by Liu (1996), which yields the exact convergence rate of the IMH algorithm (together with a very tight bound on the constant in front of the rate) and offers a comparison with classical rejection sampling and importance sampling. Atchadé and Perron (2007) study the continuous case by determining the full spectrum of the transition operator of the IMH algorithm. A recent preprint of Wang (2020) combines previous results and provides a matching lower bound, hence determining the exact convergence rate. In this paper, we impose similar conditions on the MTM-IS and study its exact convergence rate.

Multiple-Try Metropolis and its variants
The original idea of Multiple-Try Metropolis (MTM) comes from chemical physicists interested in molecular simulations (Frenkel et al., 1996). Its general formulation constructed in Liu et al. (2000) inspires the development of Ensemble MCMC methods by Neal (2011), connects with particle filtering (Martino et al., 2014), and stimulates ideas of parallelizing MCMC (Calderhead, 2014; Yang et al., 2018). We refer interested readers to the review of Martino (2018).
Intuitively, the MTM approach enables one to explore the sample space more broadly, and thus potentially gains efficiency in avoiding being trapped in local modes. The method has been incorporated in some applications such as model selection (Pandolfi et al., 2010) and Bayes factor estimation (Dai and Liu, 2020).
In the context of molecular simulations (Frenkel et al., 1996), the multiple-try strategy is often applied to a target distribution whose state space can be partitioned into two parts, position and orientation, i.e., x = (x_p, x_o). For a given x_p, evaluating multiple configurations corresponding to different orientations, π(x_p, x_{o1}), ..., π(x_p, x_{om}), is not much more expensive than evaluating a single π(x_p, x_o). Thus, MTM can be quite useful in facilitating an efficient move: we can propose the new configuration by (a) first proposing the position x_p^(new); (b) associating with it multiple orientations x_{o1}^(new), ..., x_{om}^(new); (c) picking one of them properly; and (d) using the MTM rule to do acceptance/rejection. In addition to this case, MTM is also particularly useful when combined with directional sampling, as in Liu et al. (2000) and Dai and Liu (2020). Specifically, given a sampling direction e at position x, multiple trials are drawn simultaneously as r_1, ..., r_m ∼ p(r) to construct y_j = x + r_j e.
Several variants of the MTM are worth mentioning: Craiu and Lemieux (2007) propose to use correlated trials to accelerate MTM and introduce antithetic and stratified sampling to induce correlation; Casarin et al. (2013) argue that multiple independent trials from different distributions are worth considering, and draw connections to interacting sampling algorithms. Theoretically, Bédard et al. (2012) conduct a scaling analysis for MTM. However, despite our best efforts, we could not find any existing result on the convergence rate of an MTM algorithm.
In this paper, we report the exact convergence rate of the MTM-IS for a general target π(·) and proposal p(·). The result is somewhat surprising, as it shows that the MTM-IS with k multiple tries is not as efficient as simply repeating the standard IMH algorithm k times, thus suggesting that we may want to design the k multiple proposals to be "over-dispersed" (e.g., negatively correlated) in order to take advantage of the MTM structure. Another useful scenario, as discussed previously and detailed in Section 5.1, is to help propose a better configuration for a general Metropolis-Hastings algorithm by better orienting part of the proposal via MTM.
The rest of the article is organized as follows. Section 2 carries out an eigenvalue analysis of MTM-IS; Section 3 specifies the exact convergence rate of MTM-IS under the total variation distance and offers an inequality to compare MTM-IS with its corresponding "thinned" IMH algorithm (i.e., taking one draw from every k iterations of the sampler); Section 4 provides some empirical results for multivariate Gaussian and Gaussian mixtures; Section 5 discusses several variants and extensions of MTM; and Section 6 concludes the article with a short remark.
2 Eigen-analysis of Multiple-Try Metropolis Independent sampler

Notations
Throughout the article, we use X to denote the state space, which can be either discrete or continuous. The notations π(x) and p(x, y) represent the target and proposal distributions, respectively, with x, y ∈ X. If the proposal distribution is independent of the current state x, we write it as p(y). The actual transition function/probability/density of the MCMC algorithm is denoted by A(x, y).
A collection of multiple trials of size k is written as y = (y_1, ..., y_k). We consider the total variation distance for any two (signed) measures P and Q, which is defined as ‖P − Q‖_TV = sup_{B ∈ F} |P(B) − Q(B)|, where F denotes the σ-field common to P and Q (e.g., the Borel σ-field for most common uses).
In Section 5, we slightly abuse the notation by letting p(x, y) be the proposal distribution for x ∈ X and y = (y_1, ..., y_k) ∈ X^k, as we consider multiple correlated trials in that section. Besides, we write p(x, y_{(−j)} | y_j) for the conditional distribution of y_{(−j)} ≡ (y_1, ..., y_{j−1}, y_{j+1}, ..., y_k) given y_j and x. Lastly, p_j(x, y_j) = ∫ p(x, y) dy_{(−j)} denotes the marginal distribution of y_j given x.

Description of the algorithms
The general framework of the MTM as formulated in Liu et al. (2000) is summarized in Algorithm 1. Let the current state be x, and let the number of multiple tries be k. With a proposal transition function p(x, y) that defines the conditional distribution of y given x, we define the generalized importance weight as

w(y | x) = π(y)λ(x, y)/p(x, y),  (1)

where λ is a symmetric non-negative function (i.e., λ(x, y) = λ(y, x) ≥ 0 for all x, y). Thus, the acceptance/rejection ratio in a general MH algorithm is just the ratio of the generalized importance weights.
Algorithm 1 Multiple-Try Metropolis: the current state is x.
1: Draw k trials y_1, ..., y_k independently from p(x, ·) and compute w(y_j | x), j = 1, ..., k.
2: Select index J with probability proportional to w(y_j | x) and set y = y_J.
3: Draw balancing trials x*_1, ..., x*_{k−1} independently from p(y, ·) and set x*_k = x.
4: Accept y with probability ρ = min{1, (Σ_{j=1}^k w(y_j | x)) / (Σ_{j=1}^k w(x*_j | y))}.
Here, x*_1, x*_2, ..., x*_{k−1} are called balancing trials, which are drawn to guarantee detailed balance. Liu et al. (2000) also extend the MTM to generate non-independent multiple trials, such as semi-deterministic ones along a direction. If we choose p(x, y) = p(y), we can modify this algorithm to avoid drawing additional balancing trials: the algorithm remains valid if we simply replace x*_j by y_j in computing ρ. This modified version is summarized in Algorithm 2 and named the MTM-IS(k). In this case, we further select λ(x, y) ≡ 1, so that the generalized importance weight (1) becomes w(y | x) = π(y)/p(y), coinciding with the standard notion of the importance ratio. To simplify notations, we write

w(y) = π(y)/p(y).  (2)

In theory, we assume that π is absolutely continuous with respect to p, so that this importance weight can be interpreted as the Radon-Nikodym derivative. In practice, one should always choose p so that its support covers that of π for the algorithm to work well.

Algorithm 2 MTM-IS: the current state is x.
1: Draw k i.i.d. trials y_1, ..., y_k ∼ p(·) and compute w(y_j) = π(y_j)/p(y_j), j = 1, ..., k.
2: Select index J with probability proportional to w(y_j) and define y = y_J.
3: Accept y with probability ρ = min{1, (Σ_{j=1}^k w(y_j)) / (w(x) + Σ_{j≠J} w(y_j))}.

The main result of this section is stated in Theorem 2, which can be viewed as a generalization of the results in Liu (1996) and Atchadé and Perron (2007) and provides the exact convergence rate of MTM-IS.
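As a concrete sketch of one MTM-IS(k) step (our illustration, not the authors' code; the helper names `log_pi`, `log_p`, and `sample_p` are ours), note that unnormalized log-densities suffice because only weight ratios enter the selection and acceptance steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def mtm_is_step(x, k, log_pi, log_p, sample_p):
    """One step of MTM-IS(k) (Algorithm 2); no balancing trials are needed
    because the proposal p(.) does not depend on the current state."""
    ys = np.array([sample_p() for _ in range(k)])
    log_w = np.array([log_pi(y) - log_p(y) for y in ys])   # log w(y_j)
    probs = np.exp(log_w - log_w.max())
    J = rng.choice(k, p=probs / probs.sum())               # J proportional to w(y_j)
    w = np.exp(log_w)
    w_x = np.exp(log_pi(x) - log_p(x))
    # Accept with ratio (sum_j w(y_j)) / (w(x) + sum_{j != J} w(y_j)).
    rho = min(1.0, w.sum() / (w_x + w.sum() - w[J]))
    return ys[J] if rng.random() < rho else x

# Toy run: target N(0,1), proposal N(0, 2^2); both log-densities are
# unnormalized, since constants cancel in the ratios.
log_pi = lambda x: -0.5 * x**2
log_p = lambda x: -0.5 * (x / 2.0)**2
x, draws = 0.0, []
for _ in range(2000):
    x = mtm_is_step(x, 5, log_pi, log_p, lambda: rng.normal(0.0, 2.0))
    draws.append(x)
```

The chain's draws should match the N(0, 1) target in its first two moments.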

Transition distribution decomposition
Theorem 1 The transition distribution of MTM-IS can be decomposed as

A(x, y) = min{H_k[w(x)], H_k[w(y)]} π(y) + R(x) δ_x(y),  (3)

where H_k is defined as

H_k(z) = ∫ ··· ∫ [k / (z + w(x_1) + ··· + w(x_{k−1}))] p(x_1) ··· p(x_{k−1}) dx_1 ··· dx_{k−1},  (4)

and

R(x) = 1 − ∫_X min{H_k[w(x)], H_k[w(y)]} π(y) dy ∈ [0, 1]  (5)

denotes the rejection probability when the current state is x ∈ X. In particular, H_k(z) is a strictly decreasing function of z. For k = 1, H_k degenerates to H_1(z) = z^{−1}.
Proof Let B ⊂ X be measurable with x ∉ B, and write S = w(X_1) + ··· + w(X_{k−1}) for the total weight of k − 1 auxiliary draws X_1, ..., X_{k−1} from p(·). The probability of proposing an element of B and accepting it is

∫_B k p(y) E[ (w(y)/(w(y) + S)) min{1, (w(y) + S)/(w(x) + S)} ] dy
= ∫_B k π(y) E[ min{1/(w(y) + S), 1/(w(x) + S)} ] dy
= ∫_B min{H_k[w(x)], H_k[w(y)]} π(y) dy,

where H_k is as defined in (4). The selection step alone appears irrelevant to x, but the importance ratio w(x) = π(x)/p(x) matters when deciding whether or not the chosen y_J is accepted. Thus, the overall rejection probability is R(x) = 1 − ∫_X min{H_k[w(x)], H_k[w(y)]} π(y) dy, and the prescribed decomposition (3) is thus proved.
Let w* = inf{u > 0 : π(x : w(x) ≤ u) = 1} be the essential supremum of w(x) on X w.r.t. π(·) (i.e., w* is the smallest value such that w(x) ≤ w* with π-probability 1). Since H_k(w) is a monotone decreasing function of w (Theorem 1), we have the upper bound R(x) ≤ 1 − H_k(w*). Furthermore, since min{H_k[w(x)], H_k[w(y)]} ≥ H_k(w*), we have the following mixture representation of the transition function, convenient for comparing with π:

A(x, ·) = H_k(w*) π(·) + (1 − H_k(w*)) q_res(x, ·),  (6)

where q_res(x, ·) = [A(x, ·) − H_k(w*) π(·)] / (1 − H_k(w*)) is a valid transition distribution. This representation can be used to facilitate a coupling argument to prove the geometric convergence of the Markov chain (more details in Section 3).

Spectrum of the transition operator
Now we provide a result to fully characterize the spectrum of the transition operator induced by the MTM-IS algorithm. A similar result was derived for the IMH algorithm by Liu (1996) for the discrete state-space case, and then by Atchadé and Perron (2007) in general. To be concrete, we introduce the following definitions.
Definition 1 Let A(x, y) be the transition function of a Markov chain with π as its invariant distribution. We define its transition operator K : L²(π) → L²(π) as (Kf)(x) = ∫_X f(y) A(x, dy) = E[f(X_{t+1}) | X_t = x]. It computes the conditional mean and is called the forward operator in Liu et al. (1995).
Definition 2 Let K_0 be the restriction of K to L²_0(π), the orthogonal complement of the constant functions in L²(π). Then the spectrum of K_0 is σ(K_0) = {λ ∈ ℂ : K_0 − λI is not invertible}. The essential range of a function R is ess-ran(R) = {α : π{y : |R(y) − α| < ε} > 0 for every ε > 0}.

Theorem 2 Let K be the transition operator defined by the MTM-IS algorithm, and let K_0 be its restriction as in Definition 2. Then, σ(K_0) ⊆ ess-ran(R), where R is the rejection probability defined in (5). The equality holds if ∀ α ∈ ess-ran(R), π{y : R(y) = α} = 0.
Since the proof is mostly technical, we defer it to the Appendix. From (5) and Theorem 1, it is immediate that an upper bound of R(x) is 1 − H_k(w*). This implies that there is a gap between 1 and the upper edge 1 − H_k(w*) of the spectrum σ(K_0), provided that w* < ∞. For the finite discrete state-space case with k = 1, H_1(w*) = 1/w*, and 1 − 1/w* is the exact convergence rate of the chain.
3 Convergence Rate and Algorithmic Comparison

The χ²-distance between two probability distributions π and p is defined as d_χ(π, p) = (∫_X (p(x) − π(x))²/π(x) dx)^{1/2}. Let p_n(x) = A^n(p_0, x) denote the distribution of X_n, the state of the Markov chain after n steps from initialization p_0. It was shown in Liu et al. (1995) that d_χ(π, p_n) ≤ ρ^n d_χ(π, p_0), where ρ is the spectral radius of K_0 (Liu et al., 1995), which is equal to the maximum of σ(K_0) in absolute value. As shown in Theorem 2, this is bounded by 1 − H_k(w*). Thus, d_χ(π, p_n) ≤ (1 − H_k(w*))^n d_χ(π, p_0). It also follows from the Cauchy-Schwarz inequality that

‖p_n − π‖_{L¹} ≤ d_χ(π, p_n).  (11)

Thus, the L¹ distance between p_n and the target π, also known (up to a factor of two) as their total variation distance and denoted ‖p_n − π‖_TV, decreases geometrically with the same rate.

Maximal total variation distance
Definition 3 Let the transition function of a Markov chain be A(·, ·), with corresponding stationary distribution π(·). The maximal total variation distance between the Markov chain's n-step distribution and π is

d(n) = sup_{x∈X} ‖A^n(x, ·) − π(·)‖_TV.  (10)

Moreover, the quantity r = lim sup_{n→∞} d(n)^{1/n} is called the exact convergence rate of the Markov chain.
Since the total variation distance is equivalent to the L¹ distance between two probability measures π and p, with ‖p − π‖_{L¹} = 2‖p − π‖_TV, it is easy to see from the definition (10) and inequality (11) that r ≤ ρ. In the following, we use a coupling argument to validate this upper bound on r. Moreover, we will also show that for the transition kernel defined by Algorithm 2, the inequality r ≥ ρ also holds. We need the following lemmas to prove our results.
Lemma 1 (Coupling) (Levin and Peres, 2017) Suppose {Ψ_t}_{t≥0} and {Ψ'_t}_{t≥0} are a pair of Markov chains with the same transition rule satisfying: (i) if Ψ_i = Ψ'_i for some i, then Ψ_j = Ψ'_j for all j ≥ i; and (ii) Ψ'_0 ∼ π. Then, for τ = min{n : Ψ_n = Ψ'_n}, we have the bound ‖P(Ψ_n ∈ ·) − π(·)‖_TV ≤ P(τ > n).

Lemma 2 (Lower bound) (Wang, 2020) Let R(x) denote the rejection probability (5) given current state x; that is, R(x) is the probability that the chain stays at x in one step. Then, we have the lower bound ‖A^n(x, ·) − π(·)‖_TV ≥ R(x)^n − π({x}).

Theorem 3 Consider the MTM-IS defined in Algorithm 2 and let w* < ∞ be the essential supremum of w(x) = π(x)/p(x). Then, the maximal total variation distance of the algorithm to its target distribution π is

d(n) = (1 − H_k(w*))^n.

Thus, the exact convergence rate of the MTM-IS is 1 − H_k(w*).
Proof We will establish that upper and lower bounds of d(n) are equal in the limit.
Upper Bound. An upper bound can be obtained using the coupling idea of Lemma 1. Consider two Markov chains {x_t} and {x'_t} defined by MTM-IS. Because of the decomposition (6), we can interpret the actual transition measure A(x, ·) as a mixture of π(·) and q_res(x, ·), and define the following coupling rule for the two chains. First, we let x_0 = x (for some arbitrary x ∈ X) and let x'_0 ∼ π(·) as the initialization of the two chains. Then, suppose that the two chains are at x_t and x'_t, respectively, at time t. If x_t = x'_t, sample x_{t+1} from A(x_t, ·) and set x'_{t+1} = x_{t+1}; their future paths thus coalesce into one. If x_t ≠ x'_t, we draw z ∼ Bernoulli(H_k(w*)) and sample x̄ ∼ π(·). If z = 1, we set x_{t+1} = x'_{t+1} = x̄. Otherwise, we sample x_{t+1} ∼ q_res(x_t, ·) and x'_{t+1} ∼ q_res(x'_t, ·) independently.
Our construction of {x_t} and {x'_t} has the following properties: (i) marginally, the two chains both evolve according to A(·, ·); (ii) the distribution of x_t is exactly A^t(x, ·) and the distribution of x'_t is π(·); (iii) once x_t = x'_t for some t, the two chains coalesce afterwards. Applying Lemma 1, we have ‖A^n(x, ·) − π(·)‖_TV ≤ P(τ > n) ≤ (1 − H_k(w*))^n. Taking the supremum over x ∈ X, we have d(n) ≤ (1 − H_k(w*))^n.

Lower Bound. For the lower bound, we consider the worst case as demonstrated in the proof of Lemma 2 in Wang (2020). In particular, if we can find some x* such that w(x*) = w*, then the proof is complete; but sometimes this is not achievable, in which case we take advantage of the continuity and monotonicity of H_k. For any ε > 0, there exists δ > 0 such that H_k(w) < H_k(w*) + ε once w* − δ < w ≤ w*. By the definition of the essential supremum, we can always find some x_δ with w* − δ < w(x_δ) ≤ w*, so that R(x_δ) ≥ 1 − H_k(w*) − ε and Lemma 2 applies. Letting ε → 0, we derive the final result.

Comparison with the IMH sampler
Since one iteration of MTM-IS is computationally as expensive as k iterations of the IMH algorithm, we are interested in knowing which one has a better convergence rate. We denote the MTM-IS algorithm with k trials as MTM-IS(k) to emphasize the role of k. Correspondingly, we denote the k-fold thinned IMH algorithm by IMH(k) (i.e., collecting one draw after every k steps of the standard IMH). Note, however, that a clear advantage of MTM-IS(k) over IMH(k) is that the former is straightforward to parallelise, as suggested in Calderhead (2014), which can considerably speed up the algorithm in practice.
Previously, we obtained the exact convergence rate of MTM-IS(k) as 1 − H_k(w*). To gain some insight, we rewrite (4) in expectation form:

H_k(z) = E[ k / (z + w(X_1) + ··· + w(X_{k−1})) ],  (15)

where X_1, ..., X_{k−1} are independent samples from p(·). Setting k = 1, the formula reduces to H_1(z) = z^{−1}, which gives rise to the exact convergence rate 1 − 1/w* of the IMH algorithm, as shown in Liu (1996) and Atchadé and Perron (2007). The convergence rate of IMH(k) is then exactly (1 − 1/w*)^k. We have the following main result, whose proof is deferred to the Appendix.
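The expectation form of H_k lends itself to a direct Monte Carlo estimate of the two rates. The sketch below (ours, with hypothetical helper names) checks the ordering on the discrete target of Example 1 in Section 4, where the weights w(i) = (2m + 1 − 2i)/m are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

def H_k(z, k, sample_w, n_mc=200_000):
    """Monte Carlo estimate of H_k(z) = E[ k / (z + w(X_1)+...+w(X_{k-1})) ],
    with the X_i drawn i.i.d. from the proposal p and w(x) = pi(x)/p(x)."""
    if k == 1:
        return 1.0 / z                       # H_1(z) = 1/z exactly
    S = sample_w((n_mc, k - 1)).sum(axis=1)  # w(X_1) + ... + w(X_{k-1})
    return np.mean(k / (z + S))

# Example 1 setup: X = {1,...,m}, p uniform, pi(i) = (2m+1-2i)/m^2, so
# w(i) = (2m+1-2i)/m and w* = 2 - 1/m.
m = 1000
w_vals = (2 * m + 1 - 2 * np.arange(1, m + 1)) / m
sample_w = lambda size: rng.choice(w_vals, size=size)
w_star = 2 - 1 / m

rate_mtm = 1 - H_k(w_star, 5, sample_w)   # MTM-IS(5) rate, 1 - H_5(w*)
rate_imh = (1 - 1 / w_star) ** 5          # IMH(5) rate, (1 - 1/w*)^5
```

Consistent with Theorem 4, the estimated MTM-IS(5) rate comes out larger (slower) than the IMH(5) rate.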
Theorem 4 With the same notations as in Theorem 3, we have for any k ≥ 1,

H_k(w*) = E[ k / (w* + w(X_1) + ··· + w(X_{k−1})) ] ≤ 1 − (1 − 1/w*)^k,

where all X_i's are taken independently from p(·). Thus, MTM-IS(k) is no more efficient than IMH(k), although the two algorithms are of similar computational cost.
This theorem provides the first theoretical guidance on the use of MTM methods. It implies that, in this rather simple MTM-IS framework, multiple independent proposals are not helpful in improving the mixing of the algorithm. It is not surprising that IMH is preferable when the target distribution is "easy": after all, the IMH is perfect if the proposal matches the target exactly, and having multiple trials is then simply a waste. It is surprising to us, though, that such a preference holds universally.
We speculate that k independent multiple proposals in a general MTM framework are also no more efficient than the corresponding k-fold thinned MCMC algorithm. This casts doubt on the utility of MTM. Our numerical experience in the past suggests that the MTM strategy is most helpful in jumping among multiple modes of the target distribution (Liu et al., 2000; Dai and Liu, 2020). Also, as demonstrated in the molecular simulation literature (Frenkel et al., 1996), a form of partial MTM is very useful in building part of the proposal and will be examined in more detail in Section 5.1. More general correlated multiple proposals may also help (Craiu and Lemieux, 2007) and will be discussed in Sections 5.2 and 5.4.

Numerical Illustrations
We illustrate the discrepancy between the convergence rates of MTM-IS(k) and IMH(k) numerically. As expected, if the proposal p is already very close to the target π, IMH(k) is significantly better than MTM-IS(k). The performance difference between the two algorithms becomes quite minimal if the proposal distribution differs considerably from the target, i.e., when w* is large. In these examples, the explicit convergence rate formula for MTM-IS(k) is still complicated, so we use Monte Carlo to approximate the expectation in (15).

Univariate examples
The first two examples were previously used in Liu (1996) to compare the IMH algorithm with importance sampling and rejection sampling, and are reexamined here. The third example is a continuous case with an unbounded domain.
Example 1 Let the state space be X = {1, ..., m}, with uniform proposal p(i) = 1/m and target π(i) = (2m + 1 − 2i)/m². In this case, w* = 2 − 1/m is close to 2, leading to an approximate convergence rate of 0.5 for the IMH algorithm. Figure 1 displays the convergence rates of MTM-IS(k) and IMH(k) with m = 1000 and k ranging from 1 to 10, computed from 50000 independent uniform Monte Carlo samples.

Example 2 We consider the case where the target distribution is binomial Bin(m, θ), and the proposal p(x) = 1/(m + 1) is uniform on {0, 1, ..., m}. Then w* = (m + 1) max_x C(m, x) θ^x (1 − θ)^{m−x}. Using the standard normal approximation, we find that w* ≈ (m + 1)/√(2πmθ(1 − θ)). Figure 2 is computed from 50000 independent uniform Monte Carlo samples with m = 100 for two θ values. We see that in the latter case, when the distribution is very skewed, the discrepancy between MTM-IS(k) and IMH(k) is much smaller.
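As a numerical check on Example 2 (our own sketch), one can compare the exact w* with its normal approximation, and confirm that a skewed θ yields a larger w*:

```python
import math

# Example 2 setup: target Bin(m, theta), proposal uniform on {0,...,m}, so
# w(x) = (m+1) * C(m,x) theta^x (1-theta)^(m-x), maximized at the mode.
def w_star_binom(m, theta):
    pmf = lambda x: math.comb(m, x) * theta**x * (1 - theta)**(m - x)
    return (m + 1) * max(pmf(x) for x in range(m + 1))

# Normal approximation to the modal pmf gives
# w* ~ (m+1) / sqrt(2 pi m theta (1-theta)).
def w_star_normal(m, theta):
    return (m + 1) / math.sqrt(2 * math.pi * m * theta * (1 - theta))

m = 100
exact_half = w_star_binom(m, 0.5)
approx_half = w_star_normal(m, 0.5)
exact_skew = w_star_binom(m, 0.9)   # the skewed case has a larger w*
```

With m = 100 and θ = 0.5, the exact and approximate values of w* agree to within about one percent.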
Example 3 We investigate a one-dimensional continuous case with the target being N(0, 1) and the proposal distribution being a scaled t-distribution with 10 degrees of freedom, p(x) = c t_10(cx) with c ≥ 1. For practical uses of both importance sampling and IMH-type algorithms, we strongly recommend choosing a proposal distribution that has a heavier tail than, but does not differ too much from, the target. In our case, the t-distribution proposals satisfy the fat-tail requirement, but a larger c leads to a larger discrepancy between the target and the proposal.

Multivariate Gaussian and Gaussian mixture
We first use multivariate Gaussian distributions as both the target and the proposal to show some practical implications of our result. Let π = N(0, I_d) and p = N(μ, σ²I_d). Then we find that the importance weight can be expressed as

w(x) = σ^d exp( −‖x‖²/2 + ‖x − μ‖²/(2σ²) ).

Therefore, w* = sup_x w(x) < ∞ if either σ > 1 with an arbitrary μ, or σ = 1 with μ = 0. When σ > 1, the maximal importance weight satisfies w* ∼ σ^d, and thus the mixing time of IMH, τ_IMH(δ) = Ω(w* log(1/δ)), scales exponentially with the dimension d. In the same manner, the mixing time of MTM-IS also scales exponentially with d, and becomes worse as σ increases. Figure 4 supports that MTM-IS and consecutive IMH have almost the same mixing rates.

Next, we consider the Gaussian mixture distribution π = ½N(−1, I_d) + ½N(1, I_d), where 1 is a d-dimensional vector filled with all 1's. Employing the proposal p = N(0, σ²I_d), we have the importance weight w(x) = π(x)/p(x). It is easy to see that w* < ∞ if and only if σ > 1. Figure 5 depicts theoretical convergence rates and log mixing times for varying dimension and proposal standard deviation σ. Again the mixing times scale exponentially with dimension d. Unlike the single Gaussian case, however, Figure 5(b) shows that the slope of the log mixing times is not a monotone function of σ.
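The claim w* ∼ σ^d in the single-Gaussian case can be verified with a short computation (our sketch; `log_w_star` is our own helper). For σ > 1, log w(x) is maximized at x = −μ/(σ² − 1), which gives a closed form for log w*:

```python
import numpy as np

# Single-Gaussian case: pi = N(0, I_d), p = N(mu, sigma^2 I_d), so
#   log w(x) = d log sigma - ||x||^2/2 + ||x - mu||^2/(2 sigma^2).
# For sigma > 1 the maximizer is x = -mu/(sigma^2 - 1), yielding
#   log w* = d log sigma + ||mu||^2 / (2 (sigma^2 - 1)).
def log_w_star(d, mu_norm, sigma):
    return d * np.log(sigma) + mu_norm**2 / (2 * (sigma**2 - 1))

# With mu = 0, w* = sigma^d grows exponentially in d, so the IMH rate
# 1 - 1/w* approaches 1 as the dimension increases.
rates = [1 - np.exp(-log_w_star(d, 0.0, 1.5)) for d in (2, 10, 50)]
```

The rates climb toward 1 as d grows, matching the exponential scaling of the mixing time; a nonzero μ only makes w* larger.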
Figure 6 explores the optimization over σ. Specifically, Figure 6(a) plots the convergence rates against varying σ when d = 2, showing that the optimal choice is σ ≈ 1.594949. When d grows, the optimal σ remains approximately in the range 1.55∼1.62. Figure 6(c) indicates that the mixing time still scales exponentially with d even if σ is optimized.

Partial MTM-IS: an efficient variant
To reflect how MTM has actually been used in molecular simulations (Frenkel et al., 1996), we assume a partition of the state space, x = (x_a, x_b), and the corresponding factorization of the target distribution π(x) ∝ q(x_a, x_b) = q_a(x_a) q_b(x_b | x_a), where q_b may not be normalized. We assume that q_a(x_a) is much more expensive to evaluate than q_b(x_b | x_a). An important point to note is that we want to move (x_a, x_b) jointly instead of iterating between conditional draws of x_a | x_b and x_b | x_a (for reasons such as the two components being tightly coupled). We consider the independent proposal p(x) = p_a(x_a) p_b(x_b | x_a). The resulting Partial MTM-IS algorithm is summarized in Algorithm 3.

Remark 1 (PMTM-IS versus MTM-IS) Note that, compared with the vanilla MTM-IS (Algorithm 2), PMTM-IS needs to draw extra balancing samples. Since we assume that sampling x_b and evaluating q_b are both very cheap, this is still worthwhile. In this case, there are no standard IMH or MCMC variants for comparison.
Algorithm 3 PMTM-IS: the current state is x = (x_a, x_b).
1: Draw y_a ∼ p_a(·), and draw conditional trials y_{b1}, ..., y_{bk} ∼ p_b(· | y_a) independently; compute w_j = q(y_a, y_{bj})/p(y_a, y_{bj}).
2: Draw balancing trials x_{b1}, ..., x_{b,k−1} ∼ p_b(· | x_a) independently and set x_{bk} = x_b; compute w'_j = q(x_a, x_{bj})/p(x_a, x_{bj}).
3: Set W_y = Σ_{j=1}^k w_j and W_x = Σ_{j=1}^k w'_j.
4: Select index J with probability proportional to w_j and define y = (y_a, y_{bJ}).
5: Accept y with probability ρ = min{1, W_y/W_x}.

Typically, one iteration of IMH involves evaluating q_a/p_a twice (on x_a and y_a) and evaluating q_b/p_b twice (on x_b | x_a and y_b | y_a). In contrast, one iteration of Algorithm 3 consists of evaluating q_a/p_a twice (on x_a and y_a) and evaluating q_b/p_b 2k times (on x_{bj} | x_a and y_{bj} | y_a with j = 1, ..., k). When evaluating q_b is significantly cheaper than evaluating q_a, Algorithm 3 nearly matches the computational cost of one-step IMH. Under certain reasonable regularity conditions, the following proposition shows that Algorithm 3 provably converges faster.
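A sketch of one PMTM-IS step under these assumptions might look as follows (our illustration; the factorized log-density helpers are hypothetical names, and unnormalized q_a, q_b, p_a, p_b are fine because only ratios enter the selection and acceptance steps):

```python
import numpy as np

rng = np.random.default_rng(2)

def pmtm_is_step(xa, xb, k, log_qa, log_qb, log_pa, log_pb, sample_pa, sample_pb):
    """One PMTM-IS step (Algorithm 3 as we read it): target
    q(xa, xb) = qa(xa) qb(xb | xa), proposal p = pa(xa) pb(xb | xa)."""
    ya = sample_pa()
    ybs = [sample_pb(ya) for _ in range(k)]
    # Forward weights w_j = q(ya, yb_j) / p(ya, yb_j).
    w_y = np.exp([log_qa(ya) - log_pa(ya) + log_qb(ya, b) - log_pb(ya, b) for b in ybs])
    # Balancing trials: k-1 fresh conditional draws at xa, plus the current xb.
    xbs = [sample_pb(xa) for _ in range(k - 1)] + [xb]
    w_x = np.exp([log_qa(xa) - log_pa(xa) + log_qb(xa, b) - log_pb(xa, b) for b in xbs])
    J = rng.choice(k, p=w_y / w_y.sum())
    if rng.random() < min(1.0, w_y.sum() / w_x.sum()):
        return ya, ybs[J]
    return xa, xb

# Toy target: xa ~ N(0,1), xb | xa ~ N(xa,1); wider independent proposal.
log_qa = lambda a: -0.5 * a**2
log_qb = lambda a, b: -0.5 * (b - a)**2
log_pa = lambda a: -0.5 * (a / 2.0)**2
log_pb = lambda a, b: -0.5 * ((b - a) / 2.0)**2
xa, xb, xs = 0.0, 0.0, []
for _ in range(3000):
    xa, xb = pmtm_is_step(xa, xb, 4, log_qa, log_qb, log_pa, log_pb,
                          lambda: rng.normal(0.0, 2.0), lambda a: rng.normal(a, 2.0))
    xs.append(xa)
```

The x_a-marginal of the toy target is N(0, 1), which the chain's x_a draws should reproduce.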
Proposition 1 Write π(x) ∝ q_a(x_a) q_b(x_b | x_a), where π_a and π_b denote the corresponding normalized marginal and conditional distributions. Under suitable regularity conditions on the proposal p (all parts normalized), with w* = ess sup_x π(x)/p(x), IMH converges with rate 1 − 1/w*. In contrast, the partial MTM-IS (Algorithm 3) has a convergence rate no larger than 1 − 1/w*.
Proof Noting that ess sup_x π(x_a, x_b)/p(x_a, x_b) = w*, we obtain the convergence rate of IMH as 1 − 1/w* by Theorem 3. As for Algorithm 3, we decompose the transition kernel by conditioning on the balancing trials, in which y_{bk} = y_b and x_{bk} = x_b. This gives rise to the inequality

A(x, y) ≥ (1/w*) π(y) for y ≠ x.  (17)

Surprisingly, (17) leads to a mixture decomposition like (3) and is thus sufficient to construct the upper bound in Theorem 3 by the coupling argument and Lemma 1. Therefore, the convergence rate of Algorithm 3 is no larger than 1 − 1/w*. However, the argument for establishing a matching lower bound does not directly apply, owing to the extra balancing trials x_{bj}, 1 ≤ j ≤ k − 1. So the exact convergence rate of Algorithm 3 remains unknown.

Correlated multiple trials
Compared with the original MTM, the partial MTM-IS differs in that its multiple trials (y_a, y_{b1}), ..., (y_a, y_{bk}) are correlated due to the state-space partitioning. As also demonstrated by Craiu and Lemieux (2007), we believe that generating correlated multiple trials is key to designing efficient MTM algorithms. Although a rigorous theoretical analysis for a general correlated MTM design is beyond our reach, we present theoretical results for two special cases on finite state spaces, which may also generalize to continuous state spaces. The implication derived from the analysis applies more generally: good correlated multiple tries can be obtained with the aid of a deterministic step.
Stratified sampling: Suppose X is a finite state space. We partition it into a few subgroups X_1, ..., X_B so that X_i ∩ X_j = ∅ for all i ≠ j and ∪_j X_j = X. We begin with a blockwise IMH step: sample a block from {X_1, ..., X_B} with probability p(X_j) and accept it according to the block weight w(X_j) = π(X_j)/p(X_j). Then, we draw y within the sampled block with probability proportional to π(y). It is easy to see that the chain becomes stationary once it converges at the subgroup level. Thus, the convergence rate of this algorithm is 1 − 1/w(X*), where X* = arg max_j w(X_j). This is not generally better than IMH(k), which has a convergence rate of (1 − 1/w*)^k > 1 − k/w*. But if the weights w(y) are very uneven and we can partition the states so that the block weights w(X_j) are more balanced, then the stratified IMH can improve upon IMH(k) significantly. We also note that the computational cost of this block-based MTM-IS(k) algorithm is no worse than that of IMH(2) (the first step of block sampling costs no more than one IMH step, and so does the second step of sampling within a block), which is much better than IMH(k) when k is large.
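A minimal sketch of the stratified (blockwise) IMH step just described, with our own helper names and a small uneven target whose blocks are paired to balance the block weights:

```python
import numpy as np

rng = np.random.default_rng(3)

def stratified_imh_step(x, pi, p, blocks):
    """One stratified IMH step: a block-level IMH move using block weights
    w(X_j) = pi(X_j)/p(X_j), followed by an exact within-block draw from pi.
    On rejection of the block move we stay in the current block and still
    refresh within it (our reading of the construction)."""
    bx = next(j for j, blk in enumerate(blocks) if x in blk)
    w = lambda j: pi[blocks[j]].sum() / p[blocks[j]].sum()
    j = rng.choice(len(blocks), p=[p[b].sum() for b in blocks])
    blk = blocks[j] if rng.random() < min(1.0, w(j) / w(bx)) else blocks[bx]
    probs = pi[blk] / pi[blk].sum()
    return blk[rng.choice(len(blk), p=probs)]

# Uneven target on 6 states; pairing heavy and light states balances
# the block weights (block pi-masses: 0.45, 0.35, 0.20).
pi = np.array([0.40, 0.30, 0.10, 0.10, 0.05, 0.05])
p = np.full(6, 1.0 / 6.0)
blocks = [np.array([0, 5]), np.array([1, 4]), np.array([2, 3])]
x, counts = 0, np.zeros(6)
for _ in range(20000):
    x = stratified_imh_step(x, pi, p, blocks)
    counts[x] += 1
freqs = counts / counts.sum()
```

The empirical state frequencies should match π closely, since the block-level chain mixes fast when the block weights are balanced.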
Sampling without replacement: Another obvious way of introducing correlation among multiple proposals is to do sampling without replacement. Let X = {1, ..., N}. To simplify the discussion, we focus here on simple random sampling without replacement (SRSWOR, i.e., p(y) ∝ 1), although it is possible to extend the method to sampling without replacement with unequal probabilities using one of the schemes in Chen et al. (1994). The algorithm is as follows.
The actual transition probability from x to y ≠ x under this scheme can be written out explicitly by averaging over all SRSWOR proposal sets of size k that contain y.

Algorithm 4 MTM-SRSWOR(k): the current state is x.
1: Draw S = (y_1, ..., y_k) ⊂ (X \ {x}) jointly via SRSWOR.
2: Select index J with probability proportional to w(y_j) = (N − 1)π(y_j) and define y = y_J.
3: Accept y with probability ρ = min{1, (w(y_J) + Σ_{i≠J} w(y_i)) / (w(x) + Σ_{i≠J} w(y_i))}.

Doing an exact eigenvalue decomposition of the transition matrix A would have brought us a tight bound on the convergence rate. But A does not possess a nice low-rank property as it does for the IMH sampler or the MTM-IS.
For S ⊂ X, we define π(S) = Σ_{x∈S} π(x), S* = arg max_{S : |S|=k} π(S), and x* = arg max_x π(x). During each iteration, the chain stays at the current state if and only if the new proposal is rejected, since in our construction of Algorithm 4 the proposal set is not allowed to contain the current state. We observe that ρ ≡ 1 whenever x = x_* ≡ arg min_x π(x), leading to A(x_*, x_*) = 0. This fact prevents us from using the previous coupling arguments directly. However, by specializing to particular circumstances, we can still obtain satisfactory results.
We can then write out A explicitly in special cases. Note that e is a common right eigenvector for both A and A − G, corresponding to the largest eigenvalue 1. Since A − G is of rank 1, the remaining eigenvalues of A and G have to be the same. Hence, the eigenvalues of A are 1, a_2 − a_3, −a_4, ..., −a_4. This decoupling trick has also been used in Liu (1996) for the IMH algorithm. Comparing with the convergence rate of IMH(2), one can prove that MTM-SRSWOR(2) is faster than IMH(2) in certain cases; for example, this holds for the distribution with π_1 = 1/2 + 1/(2N) and π_2 = ··· = π_N = 1/(2N). We note that designing a suitable parallel construction for SRSWOR can speed up the algorithm considerably. Furthermore, when proposing multiple trials, we may also choose not to exclude x from the proposal set. In this case, we need to modify Algorithm 4 slightly, yielding Algorithm 5.
Algorithm 5 MTM-SRSWOR(k) without excluding the current state: the current state is x.
1: Draw a subset S ⊂ X of size k at random, denoted as S = (y_1, ..., y_k).
2: Select index J with probability proportional to w(y_j) = Nπ(y_j) and define y = y_J.
3: Accept y with probability ρ = min{1, (w(y_J) + Σ_{i≠J} w(y_i)) / (w(x) + Σ_{i≠J} w(y_i))}.

Independent non-identical proposals
Besides introducing correlations between multiple trials, Craiu and Lemieux (2007) also suggest using different proposals for generating the multiple trials in each MTM iteration and provide some supporting empirical evidence.
Here we consider a special case of MTM-IS(k) in which the multiple trials are generated from different proposals, i.e., y_j ∼ p_j(·) independently for j = 1, …, k. In this case, we also do not have to draw balancing trials. Defining w_j(x) := π(x)/p_j(x), we summarize the procedure in Algorithm 6.

Algorithm 6 MTM with independent non-identical proposals. Suppose the current state is x.
1: Draw multiple trials y_j ∼ p_j(·), j = 1, …, k, independently, and compute w_j(y_j) = π(y_j)/p_j(y_j).
2: Select index J with probability proportional to w_j(y_j), and define y = y_J.
3: Accept y with probability $\rho = \min\left\{1,\ \frac{w_J(y) + \sum_{i\neq J} w_i(y_i)}{w_J(x) + \sum_{i\neq J} w_i(y_i)}\right\}$.
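A minimal sketch of Algorithm 6 on a finite state space (the target, proposals, and helper names below are illustrative choices of ours):

```python
import random

def mtm_nonid_step(x, pi, proposals, rng):
    """One step of MTM with independent non-identical proposals (Algorithm 6)
    on a finite space: trial j comes from proposals[j]; no balancing trials."""
    k = len(proposals)
    states = range(len(pi))
    ys = [rng.choices(states, weights=pj)[0] for pj in proposals]
    w = [pi[ys[j]] / proposals[j][ys[j]] for j in range(k)]  # w_j(y_j) = pi/p_j
    J = rng.choices(range(k), weights=w)[0]
    rest = sum(w[i] for i in range(k) if i != J)
    w_J_x = pi[x] / proposals[J][x]                          # w_J(x)
    rho = min(1.0, (w[J] + rest) / (w_J_x + rest))
    return ys[J] if rng.random() < rho else x

rng = random.Random(1)
pi = [0.5, 0.3, 0.2]
proposals = [[0.6, 0.2, 0.2], [0.2, 0.4, 0.4]]  # two different proposals
n_iter = 200000
counts = [0] * len(pi)
x = 0
for _ in range(n_iter):
    x = mtm_nonid_step(x, pi, proposals, rng)
    counts[x] += 1
freqs = [c / n_iter for c in counts]
```

The long-run frequencies should match π, reflecting that the acceptance ratio in step 3 preserves detailed balance without balancing trials.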
To demonstrate the effect of the multiple-try design employed in Algorithm 6, it should be compared with a sequential k-step IMH sampler. During one iteration, this sampler runs an interior loop of length k, in which the j-th step proposes an independent draw from p_j and then accepts or rejects it by the MH rule, as in the ordinary IMH sampler. This sequential IMH sampler has the same computational cost as Algorithm 6. The following theorem provides tight upper bounds for the convergence rates of the two algorithms; its proof is deferred to the Appendix.
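On a finite state space, the sequential k-step IMH sampler just described can be checked exactly: build each vanilla IMH transition matrix, compose them, and verify both π-stationarity and the product-form norm bound Π_j(1 − 1/w*_j) used in Part 2 of the proof of Theorem 6. The target and proposals below are our own toy choices:

```python
import numpy as np

def imh_matrix(pi, p):
    """Exact transition matrix of an IMH chain with target pi and proposal p."""
    n = len(pi)
    w = pi / p                               # importance weights w(x) = pi(x)/p(x)
    A = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x:
                A[x, y] = p[y] * min(1.0, w[y] / w[x])
        A[x, x] = 1.0 - A[x].sum()           # rejection mass stays at x
    return A

pi = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.6, 0.2, 0.2])
p2 = np.array([0.2, 0.4, 0.4])

# One iteration of the sequential sampler = an IMH sub-step with p1, then one with p2.
A_seq = imh_matrix(pi, p1) @ imh_matrix(pi, p2)

stationary_gap = np.abs(pi @ A_seq - pi).max()   # pi should remain stationary
eigs = np.sort(np.abs(np.linalg.eigvals(A_seq)))
bound = (1 - 1 / (pi / p1).max()) * (1 - 1 / (pi / p2).max())
```

The second-largest eigenvalue modulus of the composed chain cannot exceed the product of the per-step operator-norm bounds, which is what the assertion below confirms numerically.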
Theorem 6 Suppose the target π is absolutely continuous with respect to every proposal p_j. Then Algorithm 6 and its corresponding sequential IMH sampler are geometrically convergent, with their respective convergence rates upper bounded in terms of w*_j := sup_{x∈X} w_j(x). Furthermore, inequality (23) holds, implying that the upper bound for Algorithm 6 is worse than that for the corresponding sequential IMH sampler.
Remark 2 (Tightness of the lower bounds) Suppose there exists x* such that condition (24) holds, i.e., the different proposals have importance weight functions w_j attaining their respective suprema at the same point x*. Then the convergence rates of both aforementioned algorithms attain their respective upper bounds. When p_1 = ⋯ = p_k, condition (24) automatically holds, recovering the convergence rate result of Theorem 3. However, when no such x* as required by (24) exists, the quantities claimed in Theorem 6 are only upper bounds. It remains unknown under what other conditions one algorithm can be provably better than the other. Our empirical study shows that their computational efficiencies are almost indistinguishable when the target distribution is "hard" relative to the proposals.
Example 6 We conducted simulations to examine the convergence behaviors of Algorithm 6 and the corresponding sequential IMH sampler at the same computational cost. As shown in Figure 7, we considered target densities in the form of a mixture of two standard distributions with various dimensions. The top plots in Figure 7 correspond to Gaussian mixture targets, π = ½N(0, I_d) + ½N(3, I_d), with d = 3, 4, and 5, respectively. Two different proposal distributions are employed: p_1 = N(0, I_d) and p_2 = N(0, 9I_d). During one iteration of the MTM-IS(k) algorithm, k/2 trials are independently drawn from p_1, and another k/2 trials from p_2. The bottom plots correspond to t-mixture targets, π = ½t_3(0) + ½t_3(4), for d = 1, 2, and 3. The two proposal distributions are p_1 = t_3(0) and p_2 = t_5(0), and the same implementation of MTM-IS(k) as in the previous case is employed. These plots show that Algorithm 6 and its corresponding sequential IMH sampler differ very little in their convergence rates, although theoretically we cannot claim one is necessarily better than the other without condition (24). All simulations are based on 10^6 iterations on an Apple M2 chip with 16GB memory, each taking a few minutes.
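A one-dimensional analogue of this experiment (our simplification of the d ≥ 3 setup above) can be sketched as follows; it compares mean estimates from Algorithm 6 and the sequential IMH sampler on π = ½N(0,1) + ½N(3,1) with proposals p_1 = N(0,1) and p_2 = N(0,9):

```python
import math
import random

def dnorm(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def target(x):  # 1-d analogue: pi = 0.5 N(0,1) + 0.5 N(3,1), true mean 1.5
    return 0.5 * dnorm(x, 0.0, 1.0) + 0.5 * dnorm(x, 3.0, 1.0)

props = [(0.0, 1.0), (0.0, 3.0)]          # p1 = N(0,1), p2 = N(0, 3^2)

def mtm_nonid_step(x, rng):
    """Algorithm 6 with k = 2: one trial from each proposal, no balancing trials."""
    ys = [rng.gauss(mu, sd) for mu, sd in props]
    w = [target(ys[j]) / dnorm(ys[j], *props[j]) for j in range(2)]
    J = rng.choices([0, 1], weights=w)[0]
    rest = w[1 - J]
    w_J_x = target(x) / dnorm(x, *props[J])
    rho = min(1.0, (w[J] + rest) / (w_J_x + rest))
    return ys[J] if rng.random() < rho else x

def seq_imh_step(x, rng):
    """Sequential 2-step IMH at the same computational cost."""
    for mu, sd in props:
        y = rng.gauss(mu, sd)
        ratio = (target(y) / dnorm(y, mu, sd)) / (target(x) / dnorm(x, mu, sd))
        if rng.random() < min(1.0, ratio):
            x = y
    return x

rng = random.Random(7)
n_iter = 100000
x1 = x2 = 0.0
sum1 = sum2 = 0.0
for _ in range(n_iter):
    x1 = mtm_nonid_step(x1, rng)
    x2 = seq_imh_step(x2, rng)
    sum1 += x1
    sum2 += x2
mean_mtm, mean_seq = sum1 / n_iter, sum2 / n_iter   # both should be near 1.5
```

Consistent with the figure, the two samplers behave very similarly; both mean estimates land close to the true mean 1.5.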

A general framework
Inspired by the variants of MTM just discussed, we propose a general framework combining them in Algorithm 7. With π(·) as the target distribution on X, we let p(x, y) denote the proposal transition function for multiple correlated proposals, where x ∈ X and y = (y_1, …, y_k) ∈ X^k. We further write the j-th marginal of p(x, y) as $p_j(x, y_j) = \int p(x, \mathbf{y})\, d\mathbf{y}_{(-j)}$, and define the j-th generalized importance weight as $w_j(y, x) := \pi(y)\, p_j(y, x)\, \lambda_j(x, y)$ for j = 1, …, k, where λ_j is a symmetric function. Assuming the current state is x, the updating rule is summarized in Algorithm 7.
Algorithm 7 Generalized MTM. Suppose the current state is x.
1: Draw multiple trials y = (y_1, …, y_k) from p(x, ·) and compute w_j(y_j, x) for j = 1, …, k.
2: Select index J with probability proportional to w_j(y_j, x), and define y = y_J.
3: Draw balancing trials x*_j, j ≠ J, from the conditional distribution of p(y, ·) given that its J-th variable equals x, and set x*_J = x.
4: Accept y with probability $\rho = \min\left\{1,\ \frac{\sum_j w_j(y_j, x)}{\sum_j w_j(x^*_j, y)}\right\}$.
If we require p(x, y) to have the same marginals for different y_j's, the algorithm reduces to that of Craiu and Lemieux (2007); if we require p(x, y) to be independent among the y_j's, it reduces to that of Casarin et al. (2013). Note that the balancing proposals are drawn to facilitate the computation of ρ, which guarantees the detailed balance of the MTM design. The following result is expected, and its detailed proof is deferred to the Appendix.
Theorem 7 The generalized MTM transition rule (Algorithm 7) satisfies the detailed balance condition and hence induces a reversible Markov chain with π as its invariant distribution.
Defining x*(j) := (x*_1, …, x*_{j−1}, x, x*_{j+1}, …, x*_k), one can determine the transition density of the generalized MTM framework in the same spirit as the proof of Theorem 1, where we write u_j(x, y) for the corresponding minimum term, for any x = (x_1, …, x_k) and y = (y_1, …, y_k). A detailed derivation of this formula can be found in the proof of Theorem 7.
As demonstrated in Algorithms 4, 5, and 6, we sometimes do not need to draw balancing trials for MTM to retain the detailed balance. A natural question then arises: can we find a general condition under which MTM can avoid drawing balancing trials? The following theorem provides a sufficient condition that covers all the cases we discussed.
Theorem 8 If, for any pair (x, y) and ∀j, the joint proposal distribution satisfies p(x, y_{(−j)} | y_j = y) = p(y, y_{(−j)} | y_j = x), then we can maintain the detailed balance by setting x*_j := y_j for j ≠ J in Algorithm 7.
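The condition can be verified by brute-force enumeration for the uniform SRSWOR proposal of Algorithm 4, where p(x, ·) is uniform over ordered k-tuples of distinct states excluding x (a check of our own, with illustrative helper names):

```python
from fractions import Fraction
from itertools import permutations

def proposal_prob(x, tup, N):
    """p(x, tup): uniform over ordered k-tuples of distinct states != x."""
    k = len(tup)
    if x in tup or len(set(tup)) != k:
        return Fraction(0)
    total = 1
    for i in range(k):
        total *= N - 1 - i
    return Fraction(1, total)

def conditional(x, rest, j, yj, N):
    """p(x, y_(-j) | y_j = yj), with `rest` the other k-1 coordinates in order."""
    tup = rest[:j] + (yj,) + rest[j:]
    k = len(tup)
    marginal = sum(proposal_prob(x, t, N)
                   for t in permutations(range(N), k) if t[j] == yj)
    return proposal_prob(x, tup, N) / marginal if marginal else Fraction(0)

N, k = 5, 3
# Theorem 8's condition: p(x, y_(-j) | y_j = y) = p(y, y_(-j) | y_j = x) for all j.
ok = all(
    conditional(x, rest, j, y, N) == conditional(y, rest, j, x, N)
    for x in range(N) for y in range(N) if x != y
    for j in range(k)
    for rest in permutations([s for s in range(N) if s not in (x, y)], k - 1)
)
```

Exact rational arithmetic makes the equality check unambiguous; the enumeration confirms the symmetry for this proposal, consistent with Algorithms 4 and 5 not needing balancing draws.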
Remark 3 (Correlated multiple trials) As demonstrated in Sections 5.1 and 5.2, letting the proposed multiple trials be correlated (especially negatively) can help improve the chain's convergence. A useful strategy is to use multiple trials as stepping stones to move from one mode of the distribution to another, similar in spirit to Hamiltonian/hybrid Monte Carlo (Qin and Liu, 2001; Liu, 2008) and the griddy Gibbs MTM (Liu et al., 2000). Indeed, it was shown empirically in Qin and Liu (2001) that applying MTM to HMC trajectories may further improve the sampling efficiency. However, an in-depth theoretical analysis like the one carried out here is much more challenging due to the semi-deterministic nature of the aforementioned algorithms.
Remark 4 (Employing multiple distributions in MTM) Intuitively, one may hope that using different distributions for each trial could help explore the state space better. Our results in Section 5.3, however, demonstrate that this is still not very useful under the IMH framework if the multiple trials are independent. It may be helpful for the partial MTM framework discussed in Section 5.1.

Concluding Remarks
We have presented a complete eigen-decomposition and convergence rate analysis for the MTM-IS, and compared it with the "thinned" IMH sampler of the same computational cost. With the exact form of the eigenvalues of the MTM-IS, we proved rigorously that the sampler is not as efficient as the simpler "thinned" IMH approach. To the best of our knowledge, this is the first exact rate result for an MTM-type algorithm, although its implication is less than encouraging. The good news is that, in a more realistic setting of MTM applications as explained in Section 5.1, we can show that MTM improves upon the standard IMH and has no suitable competitor.
In a quest for advantages that MTM may offer, we considered a slightly modified framework that encompasses several variants of MTM published in the literature. We found that, even under the IMH framework, it is possible to construct an MTM algorithm that gains efficiency by using stratified sampling, partial sampling, or sampling without replacement. A key to such efficiency gains is to allow the multiple trials to be either more dispersed than independent ones (Section 5) or applied only to certain "low-cost" parts (Section 5.1).
Detailed theoretical understanding and guiding principles, however, are still lacking and awaiting further endeavors.
Step 2. Given this, combined with the decomposition, it suffices to prove that σ_d(K_0) ⊂ ess-ran(R), i.e., all eigenvalues of K_0 are in the essential range of R. To proceed, suppose there exist f_0 ∈ L²_0(π) and λ ∉ ess-ran(R) with K_0 f_0 = λ f_0. Direct computations, carried out for any f ∈ L²_0(π), can be simplified as N f_0 = f_0, with N an operator well defined on L²(π) (rather than L²_0(π), on which K_0 is defined). We then aim to derive a contradiction about the spectral radius radii(G) := sup{|λ| : λ ∈ σ(G)} for some linear operator G on L²(π) induced by N.
We know that the second inequality follows from the fact that y ∉ D_1 and w(y) ≥ w(x) would together imply x ∉ D_1. In the same manner, we obtain analogous identities, where f_{0,D_i} := M_{D_i} f_0. Rearranging these formulae yields the recursive relations used below.
We claim that (A3) implies radii(M_{D_i} N M_{D_i}) ≥ 1 for at least one index i ∈ {1, …, n}. Suppose the contrary. Then M_{D_1} N M_{D_1} f_{0,D_1} = f_{0,D_1} implies f_{0,D_1} = 0 (since 1 cannot be an eigenvalue of M_{D_1} N M_{D_1}). Consequently, h_2 = 0 follows automatically from its definition (A2), and M_{D_2} N M_{D_2} f_{0,D_2} = f_{0,D_2} implies f_{0,D_2} = 0. Carrying this argument recursively up to n indicates that f_0 has to vanish on {x ∈ X : u < w(x) ≤ w}, resulting in a contradiction!
Step 4. Finally, we show that for sufficiently small increments, we can make radii(M Di N M Di ) < 1, ∀i.
First, the mapping is continuous, at least on [u, w]. Second, for all g ∈ L²_i(π) with ‖g‖ = 1, the Cauchy–Schwarz inequality gives the needed bound. Finally, choosing the partition to be sufficiently fine makes radii(M_{D_i} N M_{D_i}) < 1 for all i. We then derive a final contradiction, asserting that σ_d(K_0) ⊂ ess-ran(R) and ending the proof.

Convergence of MTM-IS

Proof of Theorem 4 In this proof, every random variable X is drawn independently from p. The inequality is proved by induction. First, for k = 1, the inequality reduces to an equality by a previous result of Liu (1996) and Atchadé and Perron (2007). For k = 2, the claim follows by direct computation. For k ≥ 3, we prove the following recursive inequality, which leads to the conclusion of the theorem, by computing the difference between the two sides.
We note that (i) can be modified accordingly.
For (ii), we have the corresponding bound.

Proof of Theorem 6
Plugging max{w_j(y), w_j(x)} ≤ w*_j into this formula gives a lower bound on A(x, y), where X_i is drawn independently from p_i(·). This inequality is sufficient to derive a decomposition of A(x, ·) as in (6). As shown in the proof of Theorem 3, we upper bound the convergence rate via the coupling argument of Lemma 1. Specifically, when there exists x* such that w_j(x*) = w*_j for all j = 1, …, k, the rejection probability at x* can be computed explicitly for any y. We then lower bound the convergence rate via Lemma 2.
Part 2. We turn to the corresponding sequential IMH sampler. For simplicity, we utilize the L² operators introduced in Section 2 to derive upper bounds. Within one iteration, the sampler runs an interior loop of length k, each step being a vanilla IMH step using proposal p_i. The transition probability of a vanilla IMH step is
\[
A^{(i)}(x, y) = \frac{\pi(y)}{\max\{w_i(x), w_i(y)\}} + \left(1 - \int_{\mathcal{X}} \frac{\pi(y')}{\max\{w_i(x), w_i(y')\}}\, dy'\right) \delta_x(y).
\]
Denote by K^{(i)} the operator defined on L²(π) by $K^{(i)} f(x) = \int f(y) A^{(i)}(x, y)\, dy$, and by K^{(i)}_0 the restriction of K^{(i)} to L²_0(π), the orthogonal complement of the constant functions in L²(π). Theorem 2 implies ‖K^{(i)}_0‖ ≤ 1 − 1/w*_i. Denote the whole one-iteration transition probability by Ā and the associated operators by K̄ and K̄_0. Consequently, $\|\bar{K}_0\| \le \prod_{i=1}^{k} (1 - 1/w^*_i)$.
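The bound ‖K_0^{(i)}‖ ≤ 1 − 1/w*_i is in fact attained for the IMH sampler (Liu, 1996): on a finite state space, the second-largest eigenvalue modulus of the IMH transition matrix equals 1 − 1/w*. A quick numerical check on a toy example of our own:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])
p = np.full(3, 1.0 / 3.0)                 # independence proposal
w = pi / p                                # importance weights; w* = w.max() = 1.5

n = len(pi)
A = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            A[x, y] = p[y] * min(1.0, w[y] / w[x])   # IMH acceptance
    A[x, x] = 1.0 - A[x].sum()                       # rejection mass stays at x

slem = np.sort(np.abs(np.linalg.eigvals(A)))[-2]     # second-largest |eigenvalue|
rate = 1.0 - 1.0 / w.max()                           # predicted exact rate = 1/3
```

Here the spectrum works out to {1, 1/3, 1/9}, so the second-largest eigenvalue coincides with 1 − 1/w* exactly, matching Liu's exact-rate result.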
Going through the full interior loop within one iteration, the overall rejection probability can be bounded from below, and by Lemma 2 a matching lower bound is thus obtained.
Part 3. We then establish (23). For k = 2, the inequality follows by direct computation. For larger k > 2, we fix an arbitrary l ∈ {1, …, k} and proceed by induction. The proofs of Theorem 4 and Theorem 6 are essentially the same, both using induction to handle a general integer k.
Finally, note that $\pi(x)\, w_j(y, x)\, p_j(x, y) = \pi(x)\pi(y)\, p_j(x, y)\, p_j(y, x)\, \lambda_j(x, y)$ is symmetric in x and y by our construction, which implies that π(x)A(x, y) is symmetric in x and y, proving the detailed balance condition.
Proof of Theorem 8 If we simply set x*_j := y_j for all j ≠ J in Algorithm 7, the quantity π(x)A(x, y) becomes
\[
\pi(x) A(x, y) = \pi(x) \sum_{j=1}^{k} \int p(x, \mathbf{y}(j))\, w_j(y, x)\, \min\left\{ \frac{1}{w_j(y,x) + \sum_{i\neq j} w_i(y_i, x)},\ \frac{1}{w_j(x,y) + \sum_{i\neq j} w_i(y_i, y)} \right\} \prod_{i\neq j} dy_i
\]
\[
= \sum_{j=1}^{k} \int \pi(x)\, p_j(x, y)\, w_j(y, x)\, p(x, \mathbf{y}_{(-j)} \mid y_j = y)\, \min\left\{ \frac{1}{w_j(y,x) + \sum_{i\neq j} w_i(y_i, x)},\ \frac{1}{w_j(x,y) + \sum_{i\neq j} w_i(y_i, y)} \right\} \prod_{i\neq j} dy_i.
\]
Since π(x) w_j(y, x) p_j(x, y) is symmetric in x and y, the theorem follows easily from condition (16) in the main text.

Fig. 1 Convergence rates for Example 1 with a finite discrete target distribution.
Figure 3 is computed based on 50000 independent Monte Carlo samples with two choices of c, demonstrating that IMH(k) and MTM-IS(k) are nearly indistinguishable if the proposal does not align with the target well.

1: Draw y^a from p^a(·), and draw multiple trials y^b_1, …, y^b_k independently from p^b(· | y^a);
2: Draw i.i.d. "balancing trials" x^b_1, …, x^b_{k−1} from p^b(· | x^a), and let x^b_k = x^b;
3: For j = 1, …, k, compute the corresponding weights involving p^b(y^b_j | y^a) and p^b(x^b_j | x^a).

Part 1 derives the convergence rate of Algorithm 6; Part 2 derives the convergence rate of the corresponding sequential IMH sampler; Part 3 finishes by deriving inequality (23) via induction.

Part 1. Via straightforward computation, the transition probability of Algorithm 6 has, for x ≠ y, the form
\[
A(x, y) = \pi(y) \sum_{j=1}^{k} \mathbb{E}\left[ \min\left\{ \frac{1}{w_j(y) + \sum_{i \neq j} w_i(X_i)},\ \frac{1}{w_j(x) + \sum_{i \neq j} w_i(X_i)} \right\} \right],
\]
where the X_i are drawn independently from p_i(·). For the induction in Part 3, suppose the result holds for k − 1; the last step of the induction is then the same as in the proof of Theorem 4.