1 Introduction

1.1 Fundamental Metropolis–Hastings method

Markov chain Monte Carlo (MCMC) methods have played important roles in statistical computing and Bayesian inference and have attracted much attention from both theoretical researchers and practitioners. In a nutshell, this family of methods provides general and practical recipes for generating random draws from any given target probability distribution known up to a normalizing constant. Specifically, such an algorithm generates a time-homogeneous Markov chain whose stationary distribution is the target one. Under mild assumptions, this chain converges to the target distribution geometrically (Roberts and Tweedie 1996; Liu et al. 1995). See Liu (2008) and Brooks et al. (2011) for more comprehensive reviews. The scheme first proposed by Metropolis et al. (1953) and then generalized by Hastings (1970) is arguably the most popular and fundamental construction among all MCMC methods. Let \(\pi (\cdot )\) denote the target probability distribution/density function on the state space \({\mathcal {X}}\). The Metropolis–Hastings method constructs a Markov chain \(x^{(1)}, x^{(2)},\ldots \) on \({\mathcal {X}}\) as follows. At step \(t+1\), it proposes a new state y from a user-specified transition function p(x, y), i.e., \(y\sim p(x^{(t)},\cdot )\). Then, the next state \(x^{(t+1)}\) is equal to y with probability \(\rho \) and to \(x^{(t)}\) with probability \(1-\rho \), where

$$\begin{aligned} \rho =\min \left\{ 1,\frac{\pi (y)p(y,x^{(t)})}{\pi (x^{(t)}) p(x^{(t)},y)}\right\} . \end{aligned}$$

This design ensures that the generated Markov chain satisfies detailed balance with respect to \(\pi \), which guarantees the chain’s reversibility and its convergence under mild conditions.
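As a concrete illustration (ours, not part of the original article), the following minimal sketch shows one Metropolis–Hastings update for a user-supplied proposal; the function names and the use of log densities are our own choices.

```python
import numpy as np

def mh_step(x, log_pi, propose, log_p, rng=np.random):
    """One Metropolis-Hastings update.

    log_pi(x)  : log of the target density, up to an additive constant
    propose(x) : draws y ~ p(x, .)
    log_p(x, y): log of the proposal density p(x, y)
    """
    y = propose(x)
    # log of the acceptance ratio pi(y) p(y, x) / [pi(x) p(x, y)]
    log_rho = (log_pi(y) + log_p(y, x)) - (log_pi(x) + log_p(x, y))
    if np.log(rng.random()) < min(0.0, log_rho):
        return y  # accept the proposal
    return x      # stay at the current state
```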

1.2 Geometric convergence

A Markov chain with transition function A is said to be geometrically ergodic if, for \(\pi \)-almost every x, \( \Vert A^n(x,\cdot )-\pi (\cdot )\Vert \le C(x)r^n\) holds for some constant \(r\in (0,1)\) and a finite function C(x). Here \(\Vert \cdot \Vert \) denotes a distance metric between two probability measures, usually taken to be the total variation (TV) distance. Other modes of convergence, such as convergence in the \(\chi ^2\)-distance (which implies convergence in total variation), have also been investigated (Liu et al. 1995; Liu 2008). Establishing this inequality and deriving sharp bounds on the rate r are central tasks in the study of MCMC algorithms (Tierney 1994; Liu et al. 1995; Roberts and Tweedie 1996).

As a generalization of the standard Metropolis–Hastings algorithm, the multiple-try Metropolis (MTM) scheme, formalized in Liu et al. (2000), allows one to draw multiple trials at each step and select one of them according to a specially designed probability distribution. Although intuitively the MTM scheme enables one to escape from local optima more easily, there is little theoretical understanding of the convergence rate of any form of the MTM algorithm, which makes it a challenging practical question whether an MTM approach should be employed for a specific problem. Existing theoretical results on the Metropolis–Hastings algorithm cannot be easily extended to the MTM algorithm. Indeed, obtaining sharp bounds on the convergence rate of any general-purpose Metropolis–Hastings algorithm can be extremely challenging, except for the independent Metropolis–Hastings (IMH) algorithm (also called the Metropolised independence sampler by Liu (1996) and the independence Metropolis chain by Tierney (1994)). We are therefore tempted to ask whether the IMH's multiple-try version, which we call the multiple-try Metropolis independent sampler (MTM-IS), can be tackled theoretically.

1.3 Convergence rate of independent Metropolis–Hastings algorithm

Geometric ergodicity is not guaranteed for a general Metropolis–Hastings algorithm unless suitable restrictions are imposed (Roberts and Tweedie 1996), and exact convergence rates for Metropolis–Hastings algorithms are rarely available (Diaconis and Saloff-Coste 1998). In practice, geometric ergodicity is often established under the ‘drift-and-minorization’ framework (Diaconis et al. 2008), but this technique usually yields a very conservative bound on the convergence rate that is of limited practical use. Because of the very special structure of the IMH algorithm, an explicit eigen-analysis of its transition matrix for the finite discrete state-space case was carried out by Liu (1996), which yields the exact convergence rate of the IMH algorithm (together with a very tight bound on the constant in front of the rate) and offers a comparison with classical rejection sampling and importance sampling. Atchadé and Perron (2007) study the continuous case by determining the full spectrum of the transition operator of the IMH algorithm. A recent preprint of Wang (2020) combines previous results and provides a lower bound, hence determining the exact convergence rate. In this paper, we impose similar conditions on the MTM-IS and study its exact convergence rate.

1.4 Multiple-try Metropolis and its variants

The original idea of multiple-try Metropolis (MTM) comes from chemical physicists interested in molecular simulations (Frenkel et al. 1996). Its general formulation in Liu et al. (2000) inspired the development of ensemble MCMC methods by Neal (2011), connects with particle filtering (Martino et al. 2014), and stimulated ideas for parallelizing MCMC (Calderhead 2014; Yang et al. 2018). We refer interested readers to the review of Martino (2018). Intuitively, the MTM approach enables one to explore the sample space more broadly, thus potentially gaining efficiency by avoiding being trapped in local modes. The method has been incorporated in applications such as model selection (Pandolfi et al. 2010) and Bayes factor estimation (Dai and Liu 2020).

In the context of molecular simulations (Frenkel et al. 1996), the multiple-try strategy is often applied to a target distribution whose state space can be partitioned into two parts, position and orientation, i.e., \(\textbf{x}=(\textbf{x}^p, \textbf{x}^o)\). For a given \(\textbf{x}^p\), evaluating multiple configurations corresponding to different orientations, \(\pi (\textbf{x}^p,\textbf{x}^{o1}),\ldots , \pi (\textbf{x}^p,\textbf{x}^{om})\), is not much more expensive than evaluating a single \(\pi (\textbf{x}^p,\textbf{x}^o)\). Thus, MTM can be quite useful in facilitating an efficient move: we can propose the new configuration by (a) first proposing the position \(\textbf{x}^p_{(new)}\); (b) associating with it multiple orientations \(\textbf{x}^{o1}_{(new)},\ldots ,\textbf{x}^{om}_{(new)}\); (c) picking one of them properly; and (d) using the MTM rule to accept or reject. In addition, MTM is also particularly useful when combined with directional sampling, as in Liu et al. (2000) and Dai and Liu (2020). Specifically, given a sampling direction \(\textbf{e}\) at position \(\textbf{x}\), multiple trials are constructed simultaneously as \(\textbf{y}_j=\textbf{x}+r_j\textbf{e}\) with \(r_1,\ldots ,r_m\sim p(r)\).

Several variants of MTM are worth mentioning. Craiu and Lemieux (2007) propose using correlated trials to accelerate MTM and introduce antithetic and stratified sampling to induce the correlation; Casarin et al. (2013) argue that multiple independent trials from different distributions are worth considering and connect them to interacting sampling algorithms. Theoretically, Bédard et al. (2012) conduct a scaling analysis for MTM. However, despite our best efforts, we could not find any existing result on the convergence rate of an MTM algorithm.

In this paper, we report the exact convergence rate of the MTM-IS for a general target \(\pi (\cdot )\) and proposal \(p(\cdot )\). The result is somewhat surprising, as it shows that the MTM-IS with k multiple tries is not as efficient as simply repeating the standard IMH algorithm k times, thus suggesting that we may want to design the k multiple proposals to be “over-dispersed” (e.g., negatively correlated) in order to take advantage of the MTM structure. Another useful scenario, as discussed previously and detailed in Sect. 5.1, is to use MTM to better orient part of the proposal, thereby helping to propose a better configuration for a general Metropolis–Hastings algorithm.

The rest of the article is organized as follows. Section 2 carries out an eigenvalue analysis of MTM-IS; Sect. 3 specifies the exact convergence rate of MTM-IS under the total variation distance and offers an inequality to compare MTM-IS with its corresponding “thinned” IMH algorithm (i.e., taking one draw from every k iterations of the sampler); Sect. 4 provides some empirical results for multivariate Gaussian and Gaussian mixtures; Sect. 5 discusses several variants and extensions of MTM; and Sect. 6 concludes the article with a short remark.

2 Eigen-analysis of multiple-try Metropolis independent sampler

2.1 Notations

Throughout the article, we use \({\mathcal {X}}\) to denote the state space, which can be either discrete or continuous. The notations \(\pi (x)\) and p(x, y) represent the target and proposal distributions, respectively, with \(x,y\in {\mathcal {X}}\). If the proposal distribution is independent of the current state x, we write it as p(y). The actual transition function/probability/density of the MCMC algorithm is denoted by A(x, y). A collection of multiple trials of size k is written as \(\textbf{y}=(y_1,\ldots ,y_k)\). We consider the total variation distance between any two (signed) measures P and Q, defined as \(\Vert P-Q\Vert _{TV} =\sup _{A\in {\mathcal {F}}} |P(A)-Q(A)|\), where \({\mathcal {F}}\) denotes the \(\sigma \)-field common to P and Q (e.g., the Borel \(\sigma \)-field in most common cases). In Sect. 5, we slightly abuse notation by letting \(p(x,\textbf{y})\) be the proposal distribution for \(x\in {\mathcal {X}}\) and \(\textbf{y}=(y_1,\ldots ,y_k)\in {\mathcal {X}}^k\), as we there consider multiple correlated trials. In addition, we write \(p(x,\textbf{y}_{(-j)}\mid y_j)\) for the conditional distribution of \(\textbf{y}_{(-j)}\equiv (y_1,\ldots ,y_{j-1},y_{j+1},\ldots ,y_k)\) given \(y_j\) and x. Lastly, \(p_j(x,y_j)=\int p(x, \textbf{y}) \textrm{d} \textbf{y}_{(-j)}\) denotes the marginal distribution of \(y_j\) given x.

2.2 Description of the algorithms

The general framework of MTM as formulated in Liu et al. (2000) is summarized in Algorithm 1. Let the current state be x, and let the number of multiple tries be k. With a proposal transition function p(x, y) that defines the conditional distribution of y given x, we define the generalized importance weight as

$$\begin{aligned} w(y\vert x)=\frac{\pi (y)}{p(x, y)}\lambda (x,y) \end{aligned}$$
(1)

where \(\lambda \) is a symmetric non-negative function (i.e., \(\lambda (x,y)=\lambda (y,x)\ge 0\), \(\forall x, y\)). Thus, the acceptance/rejection ratio in a general MH algorithm is just the ratio of the generalized importance weights.

[Algorithm 1: the general multiple-try Metropolis (MTM) update]
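The published algorithm box is not reproduced here; the following sketch reflects our reading of the MTM step described in Liu et al. (2000), with function names of our own choosing.

```python
import numpy as np

def mtm_step(x, k, w, propose, rng=np.random):
    """One multiple-try Metropolis update (cf. Algorithm 1).

    w(y, x)    : generalized importance weight w(y|x) as in Eq. (1)
    propose(x) : draws a single trial y ~ p(x, .)
    """
    # draw k trials from p(x, .) and select one proportionally to its weight
    ys = [propose(x) for _ in range(k)]
    wy = np.array([w(y, x) for y in ys])
    J = rng.choice(k, p=wy / wy.sum())
    y = ys[J]
    # draw k-1 balancing trials from p(y, .) and append the current state
    xs = [propose(y) for _ in range(k - 1)] + [x]
    wx = np.array([w(xj, y) for xj in xs])
    # generalized acceptance ratio: ratio of summed importance weights
    rho = min(1.0, wy.sum() / wx.sum())
    return y if rng.random() < rho else x
```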

Here, \(x_1^*,x_2^*,\ldots ,x_{k-1}^*\) are called balancing trials, which are drawn to guarantee the detailed balance. Liu et al. (2000) also extend the MTM to generate non-independent multiple trials, such as semi-deterministic ones along a direction. If we choose \(p(x,y)=p(y)\), we can modify this algorithm to avoid drawing additional balancing trials: it remains valid if we simply replace the \(x_j^*\) by the unselected \(y_j\)'s in computing \(\rho \). This modified version is summarized in Algorithm 2 and named MTM-IS(k). If we further select \(\lambda (x,y)\equiv 1\), the generalized importance weight (1) becomes \(w(y\mid x)=\pi (y)/p(y)\), coinciding with the standard notion of the importance ratio. To simplify notation, we write

$$\begin{aligned} w(y)=\pi (y)/p(y). \end{aligned}$$
(2)
[Algorithm 2: the MTM-IS(k) update]
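Again the algorithm box is not reproduced here; a minimal sketch of one MTM-IS(k) update, in the same hypothetical notation as above, is:

```python
import numpy as np

def mtm_is_step(x, k, w, sample_p, rng=np.random):
    """One MTM-IS(k) update (cf. Algorithm 2); no balancing trials are drawn.

    w(y)       : importance weight pi(y)/p(y) as in Eq. (2)
    sample_p() : draws a single trial y ~ p(.)
    """
    ys = [sample_p() for _ in range(k)]
    wy = np.array([w(y) for y in ys])
    J = rng.choice(k, p=wy / wy.sum())
    # the unselected trials play the role of the balancing trials x_j*
    denom = w(x) + wy.sum() - wy[J]
    rho = min(1.0, wy.sum() / denom)
    return ys[J] if rng.random() < rho else x
```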

In theory, we assume that \(\pi \) is absolutely continuous with respect to p, so that this importance weight can be interpreted as the Radon-Nikodym derivative. In practice, one should always choose p so that its support covers that of \(\pi \) for the algorithm to work well. The main result of this section is stated in Theorem 2, which can be viewed as a generalization of the results in Liu (1996) and Atchadé and Perron (2007) and provides the exact convergence rate of MTM-IS.

2.3 Transition distribution decomposition

Theorem 1

The transition distribution of MTM-IS can be decomposed as

$$\begin{aligned} A(x,\textrm{d}y)=R(x)\delta _x(\textrm{d}y)+\min \{H_k[w(x)], H_k[w(y)]\}\pi (y)\textrm{d}y, \end{aligned}$$
(3)

where \(H_k\) is defined as

$$\begin{aligned} H_k(z)=k\underbrace{\int \ldots \int }_{k-1} \frac{1}{z+\sum _{i=1}^{k-1}w(y_i)}\prod _{i=1}^{k-1}p(y_i)\textrm{d}y_i, \end{aligned}$$
(4)

and \(R(x)=1-\int _{\mathcal {X}}\min \left\{ H_k[w(x)],H_k[w(y)]\right\} \pi (y)\textrm{d}y\in [0,1]\) denotes the rejection probability when the current state is \(x\in {\mathcal {X}}\). In particular, \(H_k(z)\) is a strictly decreasing function in z. For \(k=1\), \(H_k\) degenerates to \(H_1(z)=z^{-1}\).

Proof

Let \(B\subset {\mathcal {X}}\) be a measurable set with \(x\notin B\). The probability of proposing an element in B and accepting it is

$$\begin{aligned} A(x,B)&={\mathbb {P}} \left[ \bigcup _{j=1}^{k} \left\{ \left( y_j\in B\right) \cap \left( J=j\right) \cap \left( y_J\text { gets accepted}\right) \right\} \right] \\&=k{\mathbb {P}} \left[ \left\{ \left( y_{k}\in B\right) \cap \left( J=k\right) \cap \left( y_k\text { gets accepted}\right) \right\} \right] . \end{aligned}$$

The last expression appears not to depend on x, but the importance ratio \(w(x)=\pi (x)/p(x)\) enters when deciding whether or not the chosen \(y_J\) is accepted. Furthermore,

$$\begin{aligned}&A(x,B)\\&\quad =k\underbrace{\int \ldots \int }_{k-1}\int _B \frac{w(y)}{w(y)+\sum _{j=1}^{k-1}w(y_j)}\\&\qquad \min \left[ 1,\frac{w(y)+\sum _{j=1}^{k-1}w(y_j)}{w(x) +\sum _{j=1}^{k-1}w(y_j)}\right] p(y)\textrm{d}y \prod _{j=1}^{k-1}p(y_j)\textrm{d}y_j\\&\quad =k\underbrace{\int \ldots \int }_{k-1} \int _B \\&\qquad \min \left[ \frac{w(y)}{w(y)+\sum _{j=1}^{k-1}w(y_j)}, \frac{w(y)}{w(x)+\sum _{j=1}^{k-1}w(y_j)}\right] \\&\qquad p(y)\textrm{d}y \prod _{j=1}^{k-1}p(y_j)\textrm{d}y_j\\&\quad =\int _B \min \left\{ H_k[w(x)], H_k[w(y)]\right\} \pi (y)\textrm{d}y, \end{aligned}$$

where \(H_k\) is as defined in (4). Thus, the overall rejection probability is

$$\begin{aligned} R(x)=1-\int _{\mathcal {X}}\min \left\{ H_k[w(x)],H_k[w(y)]\right\} \pi (y)\textrm{d}y, \end{aligned}$$
(5)

and the prescribed decomposition (3) is thus proved. \(\square \)

Let \(w^*\triangleq \inf \{u>0:\pi (x:w(x)\le u)=1\}\) be the essential supremum of w(x) on \({\mathcal {X}}\) w.r.t. \(\pi (\cdot )\) (i.e., \(w^*\) is the smallest value such that \(w(x)\le w^*\) with \(\pi \)-probability 1). Since \(H_k(w)\) is a monotone decreasing function of w (Theorem 1), we have an upper bound \(R(x)\le 1-H_k(w^*)\). Furthermore, since

$$\begin{aligned} A(x,\textrm{d}y)&=R(x)\delta _x(\textrm{d}y)+\min \{H_k[w(x)], H_k[w(y)]\}\pi (y)\textrm{d}y\\ &\ge H_k(w^*)\pi (y)\textrm{d}y, \end{aligned}$$

we have the following mixture representation of the transition function, convenient for comparing with \(\pi \):

$$\begin{aligned} A(x,\textrm{d}y)=H(w^*)\pi (y)\textrm{d}y+[1-H(w^*)] q_{\textrm{res}}(x,\textrm{d}y), \end{aligned}$$
(6)

where \(q_{\textrm{res}}(x,B):=\dfrac{A(x,B)-H(w^*) \pi (B)}{1-H(w^*)}\). This representation can be used to facilitate a coupling argument to prove the geometric convergence of the Markov chain (more details in Sect. 3).

2.4 Spectrum of the transition operator

Now we provide a result to fully characterize the spectrum of the transition operator induced by the MTM-IS algorithm. A similar result was derived for the IMH algorithm by Liu (1996) for the discrete state-space case, and then by Atchadé and Perron (2007) in general. To be concrete, we introduce the following definitions.

Definition 1

Let A(x, y) be the transition function of a Markov chain with \(\pi \) as its invariant distribution. We define its transition operator \(K: \ L^2(\pi ) \rightarrow L^2(\pi )\) as

$$\begin{aligned} K f(x)= \int f(y) A(x,y) dy. \end{aligned}$$
(7)

It computes the conditional mean and is called the forward operator in Liu et al. (1995).

Definition 2

Let \(K_0\) be the restriction of K onto \(L^2_0(\pi )\), the orthogonal complement of the constant functions in \(L^2(\pi )\). Then the spectrum of \(K_0\) is

$$\begin{aligned} \sigma (K_0)\triangleq \{\lambda \in {\mathbb {R}} : K_0-\lambda I\text { is non-invertible}\}. \end{aligned}$$
(8)

The essential range of a function R is

$$\begin{aligned} \text {ess-ran}(R)\triangleq \{\lambda \in {\mathbb {R}} : \pi (x:\mid R(x)-\lambda \mid <\epsilon )>0,\forall \epsilon >0\}. \end{aligned}$$

Theorem 2

Let K be the transition operator defined by the MTM-IS algorithm, and let \(K_0\) be its restriction as in Definition 2. Then, \(\sigma (K_0)\subseteq \text {ess-ran}(R)\), where R is the rejection probability defined in (5). Equality holds if, for every \(\alpha \in \text {ess-ran}(R)\), \(\pi \{y:\ R(y)=\alpha \}=0\).

Since the proof is mostly technical, we defer it to the Appendix. From (5) and Theorem 1, it is easy to see that \(1-H_k(w^*)\) is an upper bound of R(x). This implies that there is a gap between 1 and the upper edge \(1-H_k(w^*)\) of the spectrum \(\sigma (K_0)\), provided that \(w^*<\infty \). For the IMH algorithm (\(k=1\)) on a finite discrete state space, \(H_1(w^*)=1/w^*\), and \(1-1/w^*\) is the exact convergence rate of the chain (Liu 1996).

3 Convergence rate and algorithmic comparison

3.1 Convergence in \(\chi ^2\)-distance

The \(\chi ^2\)-distance between two probability distributions \(\pi \) and p is defined as

$$\begin{aligned} d_\chi ^2 (\pi ,p) = \text {var}_\pi [p(x)/\pi (x)]. \end{aligned}$$
(9)

Let \(p_n(x)= A^n(p_0, x)\) denote the distribution of \(X_n\), the state of the Markov chain after n steps from initialization \(p_0\). It was shown in Liu et al. (1995) that \(d_\chi (\pi ,p_n) \le \Vert K_0^n \Vert _2 d_\chi (\pi ,p_0)\), where \(\Vert \cdot \Vert _2\) denotes the \(L^2\) operator norm. It is easy to show (Liu et al. 1995) that

$$\begin{aligned} \rho =\lim _{n\rightarrow \infty } \Vert K_0^n \Vert _2^{1/n} \end{aligned}$$
(10)

is the spectral radius of \(K_0\), which is equal to the maximum of \(\sigma (K_0)\) in absolute value. As shown in Theorem 2, this is bounded by \(1-H(w^*)\). Thus, \(d_\chi (\pi ,p_n) \le (1-H(w^*))^n d_\chi (\pi ,p_0)\). It also follows from the Cauchy–Schwarz inequality that

$$\begin{aligned} \Vert p_n -\pi \Vert _{L_1}&= \int \frac{\mid p_n(x) -\pi (x)\mid }{\sqrt{\pi (x)}} \sqrt{\pi (x)} dx \nonumber \\&\le \left[ \int \frac{(p_n(x) -\pi (x))^2}{\pi (x)} dx\right] ^{1/2} =d_\chi (\pi ,p_n) \nonumber \\&\le (1-H(w^*))^n d_\chi (\pi ,p_0). \end{aligned}$$
(11)

Thus, the \(L_1\) distance between \(p_n\) and the target \(\pi \), and hence the total variation distance \(\Vert p_n-\pi \Vert _{TV}=\frac{1}{2}\Vert p_n-\pi \Vert _{L_1}\), decreases geometrically with the same rate.

3.2 Maximal total variation distance

Definition 3

Let the transition function of a Markov chain be \(A(\cdot ,\cdot )\), with the corresponding stationary distribution \(\pi (\cdot )\). The maximal total variation distance between the Markov chain’s n-step distribution and \(\pi \) is

$$\begin{aligned} d(n)=\text {ess}\sup _{x\in {\mathcal {X}}}\Vert A^n(x,\cdot ) -\pi (\cdot )\Vert _{TV}. \end{aligned}$$
(12)

Moreover, the quantity

$$\begin{aligned} r= \lim \sup \limits _{n\rightarrow \infty }d(n)^\frac{1}{n} \end{aligned}$$
(13)

is called the exact convergence rate of the Markov chain.

Since the total variation distance between two probability measures \(\pi \) and p is half of their \(L_1\) distance, \(\Vert p-\pi \Vert _{TV}=\frac{1}{2}\Vert p-\pi \Vert _{L^1}\), it is easy to see from the definition (10) and Eq. (11) that the rate satisfies \(r\le \rho \). In the following, we use a coupling argument to validate this upper bound on r. Moreover, we will also show that, for the transition kernel defined by Algorithm 2, the inequality \(r\ge \rho \) also holds. We need the following lemmas to prove our results.

Lemma 1

(Coupling) (Levin and Peres 2017) Suppose \((\Psi _t,\widetilde{\Psi }_t)_{t=0}^\infty \) is a pair of Markov chains with the same transition rule satisfying: (i) if \(\Psi _i=\widetilde{\Psi }_i\) for some i, then \(\Psi _j=\widetilde{\Psi }_j\) for all \(j\ge i\); and (ii) \(\widetilde{\Psi }_0\sim \pi \). Then, for \(\tau =\min \{n:\Psi _n=\widetilde{\Psi }_n\}\), we have the bound

$$\begin{aligned} \Vert A^n(x,\cdot )-\pi (\cdot )\Vert _{TV}\le {\mathbb {P}} (\tau \ge n). \end{aligned}$$

Lemma 2

(Lower bound) (Wang 2020) Let R(x) denote the rejection probability (5) given current state x. That is,

$$\begin{aligned} R(x)=1-\int \min \left\{ H[w(x)],H[w(y)]\right\} \pi (y)\textrm{d}y. \end{aligned}$$

Then, we have a lower bound

$$\begin{aligned} \Vert A^n(x,\cdot )-\pi (\cdot )\Vert _{TV}\ge [R(x)]^n. \end{aligned}$$

Theorem 3

Consider the MTM-IS defined in Algorithm 2 and let \(w^*<\infty \) be the essential supremum of \(w(x)=\pi (x)/p(x)\). Then, the maximal total variation distance of the algorithm to its target distribution \(\pi \) is

$$\begin{aligned} d(n)=[1-H_k(w^*)]^n. \end{aligned}$$

Thus, the exact convergence rate of the MTM-IS is \(1-H_k(w^*)\).

Proof

We will establish that upper and lower bounds of d(n) are equal in the limit.

3.2.1 Upper bound

An upper bound can be obtained by using the coupling idea of Lemma 1. Consider two Markov chains \(\{x_t\}\) and \(\{\tilde{x}_t\}\) defined by MTM-IS. Because of the decomposition (6), we can interpret the actual transition measure \(A(x,\cdot )\) as a mixture of \(\pi (\cdot )\) and \(q_{\textrm{res}}(x,\cdot )\), and define the following coupling rule for the two chains. First, we let \(x_0=x\) (for some arbitrary \(x\in {\mathcal {X}}\)) and assume that \(\tilde{x}_0\sim \pi (\cdot )\) as the initialization of these two chains. Then, suppose that the two chains are at \(x_t\) and \(\tilde{x}_t\), respectively, at time t. If \(x_t=\tilde{x}_t\), then sample \(x_{t+1}\) from \(A(x_t,\cdot )\) and set \(\widetilde{x}_{t+1}=x_{t+1}\). Thus, their future paths coalesce into one. If \(x_t\ne \tilde{x}_t\), we draw \(z\sim \text {Bernoulli}(H(w^*))\) and sample \(x\sim \pi (\cdot )\). We set \(x_{t+1}=\widetilde{x}_{t+1}=x\) if \(z=1\). Otherwise, we sample \(x_{t+1}\sim q_{\textrm{res}}(x_t,\cdot )\) and \(\widetilde{x}_{t+1}\sim q_{\textrm{res}}(\tilde{x}_t,\cdot )\), independently.

Our constructions of \(\{x_t\}\) and \(\{\tilde{x}_t\}\) have the following properties: (i) marginally these two chains both evolve according to \(A(\cdot ,\cdot )\); (ii) the distribution of \(x_t\) is exactly \(A^t(x,\cdot )\) and the distribution of \(\tilde{x}_t\) is \(\pi (\cdot )\); (iii) once \(x_t=\tilde{x}_t\) for some t, the two chains coalesce into one afterwards. Applying Lemma 1, we have

$$\begin{aligned} \Vert A^n(x,\cdot )-\pi (\cdot )\Vert _{TV}\le {\mathbb {P}} (\tau \ge n)\le [1-H(w^*)]^n. \end{aligned}$$
(14)

Taking the supremum over \(x\in {\mathcal {X}}\) we have \(d(n)\le [1-H(w^*)]^n\).

3.2.2 Lower bound

For the lower bound, we consider the worst case as demonstrated in the proof of Lemma 2 in Wang (2020). In particular, if we can find some \(x^*\) such that \(w(x^*)=w^*\), then we are done; otherwise, we take advantage of the continuity and monotonicity of \(H_k\). For any \(\epsilon >0\), there exists \(\delta >0\) such that \(H(w)<H(w^*)+\epsilon \) whenever \(w^*-\delta <w\le w^*\). By the definition of the essential supremum, we can always find some \(x_\delta \in {\mathcal {X}}\) such that \(w^*-\delta <w(x_\delta )\le w^*\), thus

$$\begin{aligned} d(n)\ge \Vert A^n(x_\delta ,\cdot )-\pi (\cdot )\Vert _{TV} \ge R(x_\delta )^n\ge [1-H(w^*)-\epsilon ]^n, \end{aligned}$$

since we know from (5) that

$$\begin{aligned} R(x_\delta ) \ge 1-\int _{\mathcal {X}}H_k[w(x_\delta )] \pi (\textrm{d}y)\ge 1-H(w^*)-\epsilon . \end{aligned}$$

Letting \(\epsilon \rightarrow 0\), we derive the final result. \(\square \)

3.3 Comparison with the IMH sampler

Since one iteration of MTM-IS is computationally as expensive as k iterations of the IMH algorithm, we are interested in knowing which one has a better convergence rate. We denote the MTM-IS algorithm with k trials as MTM-IS(k) to emphasize the role of k. Correspondingly, we denote by IMH(k) the k-fold thinned IMH algorithm (i.e., collecting one draw after every k steps of the standard IMH). Note, however, that a clear advantage of MTM-IS(k) over IMH(k) is that the former is straightforward to parallelize, as suggested in Calderhead (2014), which can considerably speed up the algorithm in practice.

We obtained above the exact convergence rate of MTM-IS(k) as \(1-H_k(w^*)\). We rewrite (4) in expectation form to gain some insight:

$$\begin{aligned} H_k(z)&=k\underbrace{\int \ldots \int }_{k-1} \frac{1}{z+\sum _{i=1}^{k-1}w(y_i)}\prod _{i=1}^{k-1}p(y_i)\textrm{d}y_i\\ &={\mathbb {E}} _p\left[ \frac{k}{z+\sum _{i=1}^{k-1}w(X_i)}\right] , \end{aligned}$$

where \(X_1,\ldots ,X_{k-1}\) are independent samples from \(p(\cdot )\). Setting \(k=1\), the formula reduces to \(H_1(z)=z^{-1}\), which gives rise to the exact convergence rate \(1-1/{w^*}\) of the IMH algorithm as shown in Liu (1996) and Atchadé and Perron (2007). The convergence rate of IMH(k) is then exactly \((1-1/w^*)^k\). We have the following main result, whose proof is deferred to the Appendix.

Theorem 4

With the same notations as in Theorem 3, we have

$$\begin{aligned} 1-H_k(w^*)=1-{\mathbb {E}} _p \left[ \frac{k}{w^*+\sum _{i=1}^{k-1}w(X_i)}\right] \ge \left( 1-\frac{1}{w^*}\right) ^k \end{aligned}$$
(15)

for any \(k\ge 1\), where all \(X_i\)’s are taken independently from \(p(\cdot )\). Thus, MTM-IS(k) is no more efficient than IMH(k) although the two algorithms are of similar computational cost.

This theorem provides the first theoretical guidance on the use of MTM methods. It implies that, in this rather simple MTM-IS framework, multiple independent proposals do not help improve the mixing of the algorithm. It is not surprising that IMH is preferable when the target distribution is “easy”: after all, the IMH is perfect if the proposal matches the target exactly, and having multiple trials is simply a waste. It is surprising to us, though, that such a preference holds universally.

We speculate that k independent multiple proposals in a general MTM framework are also no more efficient than the corresponding k-fold thinned MCMC algorithm, which casts doubt on the utility of MTM. Our past numerical experience suggests that the MTM strategy is most helpful for jumping among multiple modes of the target distribution (Liu et al. 2000; Dai and Liu 2020). As demonstrated in the molecular simulation literature (Frenkel et al. 1996), a form of partial MTM is very useful for building part of the proposal and will be examined in more detail in Sect. 5.1. More general correlated multiple proposals may also help (Craiu and Lemieux 2007) and will be discussed in Sects. 5.2 and 5.4.

4 Numerical illustrations

We illustrate the discrepancy between the convergence rates of MTM-IS(k) and IMH(k) numerically. As expected, if the proposal p is already very close to the target \(\pi \), IMH(k) is significantly better than MTM-IS(k). The performance difference between the two algorithms becomes minimal if the proposal distribution differs considerably from the target, i.e., when \(w^*\) is large. In these examples, the explicit convergence rate formula for MTM-IS(k) is still complicated, so we use Monte Carlo to approximate the expectation in (15).

4.1 Univariate examples

The first two examples were previously used in Liu (1996) to compare the IMH algorithm with importance sampling and rejection sampling and are reexamined here. The third example is a continuous case with an unbounded domain.

Example 1

Let the state space be \({\mathcal {X}}=\{1,\ldots ,m\}\), with \(p(i)=1/m\) and \(\pi (i)=(2m+1-2i)/m^2\). In this case, \(w^*=2-1/m\) is close to 2, leading to an approximate convergence rate of 0.5 for the IMH algorithm. Figure 1 displays the convergence rates of MTM-IS(k) and IMH(k) with \(m=1000\) and k ranging from 1 to 10, computed from 50,000 independent uniform Monte Carlo samples.
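A sketch of how the Monte Carlo approximation of the rates in (15) can be carried out for this example (the code and helper names are ours; the exact simulation scheme used for the figures may differ):

```python
import numpy as np

def mtm_is_rate(w_star, k, w_samples):
    """Monte Carlo estimate of the MTM-IS(k) rate 1 - H_k(w*); see (15).

    w_samples : array of shape (n, >= k-1) holding weights w(X_i), X_i ~ p
    """
    if k == 1:
        return 1.0 - 1.0 / w_star
    return 1.0 - np.mean(k / (w_star + w_samples[:, : k - 1].sum(axis=1)))

def imh_rate(w_star, k):
    """Exact rate of the k-fold thinned IMH, (1 - 1/w*)^k."""
    return (1.0 - 1.0 / w_star) ** k

# Example 1: pi(i) = (2m+1-2i)/m^2, p uniform on {1,...,m}, so w(i) = (2m+1-2i)/m.
m, n = 1000, 50_000
w_star = 2.0 - 1.0 / m
draws = np.random.randint(1, m + 1, size=(n, 9))   # X_i ~ Uniform{1,...,m}
w = (2 * m + 1 - 2 * draws) / m                    # importance weights w(X_i)
rates = [(k, mtm_is_rate(w_star, k, w), imh_rate(w_star, k)) for k in range(1, 11)]
```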

Fig. 1: Convergence rates for Example 1 with a finite discrete target distribution.

Example 2

We consider the case where the target distribution is binomial \(\text {Bin}(m,\theta )\), and \(p(x)=1/(m+1)\) is uniform. Then

$$\begin{aligned} w(x)=(m+1)\frac{m!}{x!(m-x)!}\theta ^x(1-\theta )^{m-x}. \end{aligned}$$

Using the standard normal approximation, we find that

$$\begin{aligned} w^*\approx \sqrt{\frac{m}{2\pi \theta (1-\theta )}}. \end{aligned}$$
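The approximation can be seen as follows (a short derivation we add; it is not spelled out in the original): the maximum of the binomial mass function is attained near \(x\approx m\theta \) and is approximately the normal density at its mean, so

$$\begin{aligned} w^*=(m+1)\max _x \frac{m!}{x!(m-x)!}\theta ^x(1-\theta )^{m-x}\approx \frac{m+1}{\sqrt{2\pi m\theta (1-\theta )}}\approx \sqrt{\frac{m}{2\pi \theta (1-\theta )}} \end{aligned}$$

for large m.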

Figure 2 is computed from 50,000 independent uniform Monte Carlo samples with \(m=100\) for two values of \(\theta \). We see that in the latter case, when the distribution is very skewed, the discrepancy between MTM-IS(k) and IMH(k) is much smaller.

Fig. 2: Convergence rates for a binomial target with (a) \(\theta =0.5\) and (b) \(\theta =0.05\) (Example 2).

Example 3

We investigate a one-dimensional continuous case with the target being \({\mathcal {N}}(0,1)\) and the proposal distribution being a scaled t-distribution with 10 degrees of freedom, \(p(x)=ct_{10}(cx)\) with \(c\ge 1\). For practical uses of both importance sampling and IMH-type algorithms, we strongly recommend choosing a proposal distribution that has a heavier tail than, but does not differ too much from, the target. In our case, both t-distribution proposals satisfy the fat-tail requirement, but a larger c leads to a larger discrepancy between the target and the proposal. Figure 3 is computed based on 50,000 independent Monte Carlo samples with two choices of c, demonstrating that IMH(k) and MTM-IS(k) are nearly indistinguishable if the proposal does not align with the target well.

Fig. 3: Convergence rates for a standard normal target (Example 3) with the sampling distribution p(x) being a scaled t-distribution \(ct_{10}(cx)\) with (a) \(c=2\) and (b) \(c=20\).

4.2 Multivariate Gaussian and Gaussian mixture

We first use multivariate Gaussian distributions as both the target and the proposal to show some practical implications of our result. Let the target be \(\pi ={\mathcal {N}}(0,{\mathbb {I}} _d)\) and the proposal be \(p={\mathcal {N}}(\vec {\mu },\sigma ^2{\mathbb {I}} _d)\). Then the importance weight can be expressed as:

$$\begin{aligned} w(\vec {x})=\frac{\pi (\vec {x})}{p(\vec {x})}=\sigma ^d \exp \left[ -\frac{1}{2}\left( 1-\frac{1}{\sigma ^2}\right) \Vert \vec {x}\Vert ^2-\frac{1}{\sigma ^2}\langle \vec {x}, \vec {\mu }\rangle +\frac{1}{2\sigma ^2}\Vert \vec {\mu }\Vert ^2\right] . \end{aligned}$$

Therefore, \(w^*=\sup w(\vec {x})<\infty \) if either \(\sigma >1\) with an arbitrary \(\vec {\mu }\), or \(\sigma =1\) with \(\vec {\mu }=0\). When \(\sigma >1\), the maximal importance weight \(w^*\sim \sigma ^d\), and thus the mixing time of IMH, \(\tau _{\textrm{IMH}}(\delta )=\Omega (w^*\log (1/\delta ))\), scales exponentially with the dimension d. In the same manner, the mixing time of MTM-IS also scales exponentially with d and becomes worse as \(\sigma \) increases. Figure 4 shows that MTM-IS and the k-fold thinned IMH have almost the same mixing rates.
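For \(\sigma >1\), the supremum can be made explicit (a short calculation added here for completeness): the quadratic exponent above is maximized at \(\vec {x}=-\vec {\mu }/(\sigma ^2-1)\), which gives

$$\begin{aligned} w^*=\sup _{\vec {x}}w(\vec {x})=\sigma ^d\exp \left[ \frac{\Vert \vec {\mu }\Vert ^2}{2(\sigma ^2-1)}\right] , \end{aligned}$$

so \(w^*\) indeed grows like \(\sigma ^d\) in the dimension d for fixed \(\vec {\mu }\) and \(\sigma \).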

Fig. 4: Convergence rates (left) and log mixing times (right) for the standard multivariate Gaussian target \(\pi ={\mathcal {N}}(0,{\mathbb {I}} _d)\) with proposal \(p={\mathcal {N}}(0,4{\mathbb {I}} _d)\).

Fig. 5: Convergence rates (left) and log mixing times (right) for a multivariate Gaussian mixture target \(\pi =\frac{1}{3}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{2}{3}{\mathcal {N}}(\vec {1},{\mathbb {I}} _d)\) with proposal \(p={\mathcal {N}}(0,\sigma ^2{\mathbb {I}} _d)\). Solid lines: \(\sigma =1.1\); dashed lines: \(\sigma =2\); dotted lines: \(\sigma =5\). MTM-IS and IMH are nearly indistinguishable.

Next, we consider a Gaussian mixture distribution \(\pi =\frac{1}{3}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{2}{3}{\mathcal {N}}(\vec {1},{\mathbb {I}} _d)\), where \(\vec {1}\) is a d-dimensional vector of all 1’s. Employing the proposal \(p={\mathcal {N}}(0,\sigma ^2{\mathbb {I}} _d)\), we have the importance weight

$$\begin{aligned} w(\vec {x})&=\frac{1}{3}\sigma ^d\exp \left[ -\frac{1}{2} \left( 1-\frac{1}{\sigma ^2}\right) \Vert \vec {x}\Vert ^2\right] \\&\quad +\frac{2}{3}\sigma ^d\exp \left[ -\frac{1}{2} \left( 1-\frac{1}{\sigma ^2}\right) \Vert \vec {x}\Vert ^2-\frac{1}{\sigma ^2}\langle \vec {x},\vec {1}\rangle +\frac{d}{2\sigma ^2}\right] . \end{aligned}$$

It is easy to see that \(w^*<\infty \) if and only if \(\sigma >1\). Figure 5 depicts theoretical convergence rates and log mixing times for varying dimension and proposal standard deviation \(\sigma \). Again the mixing times scale exponentially with dimension d. Unlike the single Gaussian case, however, Fig. 5b shows that the slope of log mixing times is not a monotone function of \(\sigma \).

Figure 6 explores the optimization over \(\sigma \). Specifically, Fig. 6a plots the convergence rates against varying \(\sigma \) when \(d=2\), showing that the optimal choice is \(\sigma \approx 1.594949\). As d grows, the optimal \(\sigma \) remains approximately in the range 1.55–1.62. Figure 6c indicates that the mixing time still scales exponentially with d even when \(\sigma \) is optimized.

Fig. 6: Multidimensional Gaussian mixture, \(\pi =\frac{1}{3}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{2}{3}{\mathcal {N}}(\vec {1},{\mathbb {I}} _d)\), with proposal \(p={\mathcal {N}}(0,\sigma ^2{\mathbb {I}} _d)\). (a) Convergence rates against \(\sigma \) for \(1.1\le \sigma \le 6\) when d=2; (b) and (c) convergence rates and log mixing times, respectively, against the dimension under the optimized \(\sigma \).

5 Variants of multiple-try Metropolis

5.1 Partial MTM-IS: an efficient variant

To reflect how MTM has actually been used in molecular simulations (Frenkel et al. 1996), we assume a partition of the state space, \(\textbf{x}=(\textbf{x}^a,\textbf{x}^b)\), and the corresponding factorization of the target distribution, \(\pi (\textbf{x})\propto q(\textbf{x}^a,\textbf{x}^b)=q_a(\textbf{x}^a)q_b(\textbf{x}^b| \textbf{x}^a)\), where \(q_b\) may not be normalized. We assume that \(q_a(\textbf{x}^a)\) is much more expensive to evaluate than \(q_b(\textbf{x}^b|\textbf{x}^a)\). An important point is that we want to move \((\textbf{x}^a,\textbf{x}^b)\) jointly instead of iterating between conditional draws of \(\textbf{x}^a|\textbf{x}^b\) and \(\textbf{x}^b|\textbf{x}^a\) (for example, because the two components may be tightly coupled). We consider the independent proposal \(p(\textbf{x})=p_a(\textbf{x}^a)p_b(\textbf{x}^b|\textbf{x}^a)\). A partial MTM-IS algorithm is as follows:

[Algorithm 3: the partial MTM-IS (PMTM-IS) update]
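The algorithm box is not reproduced here; the following sketch is our reading of Algorithm 3, consistent with the transition kernel written out in the proof of Proposition 1 below (all function names are ours).

```python
import numpy as np

def pmtm_is_step(xa, xb, k, q, pa_pdf, pb_pdf, sample_pa, sample_pb, rng=np.random):
    """One partial MTM-IS update (cf. Algorithm 3).

    q(a, b)      : unnormalized target q_a(a) * q_b(b | a)
    pa_pdf(a)    : proposal density p_a(a);  pb_pdf(b, a) is p_b(b | a)
    sample_pa()  : draws a ~ p_a;  sample_pb(a) draws b ~ p_b(. | a)
    """
    w = lambda a, b: q(a, b) / (pa_pdf(a) * pb_pdf(b, a))
    ya = sample_pa()                          # the expensive part, drawn once
    ybs = [sample_pb(ya) for _ in range(k)]   # k cheap trials for the other part
    wy = np.array([w(ya, yb) for yb in ybs])
    J = rng.choice(k, p=wy / wy.sum())
    # balancing trials for the current state, with x^b_k set to the current x^b
    xbs = [sample_pb(xa) for _ in range(k - 1)] + [xb]
    wx = np.array([w(xa, b) for b in xbs])
    rho = min(1.0, wy.sum() / wx.sum())
    return (ya, ybs[J]) if rng.random() < rho else (xa, xb)
```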

Remark 1

(PMTM-IS versus MTM-IS) Note that, compared with the vanilla MTM-IS (Algorithm 2), PMTM-IS needs to draw extra balancing samples. Since we assume that sampling \(\textbf{x}^b\) and evaluating it are both very cheap, it is still worth doing. In this case, there are no standard IMH or MCMC variants for comparisons.

Typically, one iteration of IMH involves evaluating \(q_a/p_a\) twice (on \(\textbf{x}^a\) and \(\textbf{y}^a\), respectively) and evaluating \(q_b/p_b\) twice (on \(\textbf{x}^b|\textbf{x}^a\) and \(\textbf{y}^b|\textbf{y}^a\), respectively). In contrast, one iteration of Algorithm 3 consists of evaluating \(q_a/p_a\) twice (on \(\textbf{x}^a\) and \(\textbf{y}^a\)) and evaluating \(q_b/p_b\) 2k times (on \(\textbf{x}^b_j|\textbf{x}^a\) and \(\textbf{y}^b_j|\textbf{y}^a\) for \(j=1,\ldots ,k\)). When evaluating \(q_a\) is significantly more expensive than evaluating \(q_b\), Algorithm 3 nearly matches the computational cost of one IMH step. Under certain reasonable regularity conditions, the following proposition shows that Algorithm 3 provably converges faster.

Proposition 1

Let \(\textbf{x}=(\textbf{x}^a,\textbf{x}^b)\), and \(\pi (\textbf{x})=\pi _a(\textbf{x}^a)\pi _b(\textbf{x}^b| \textbf{x}^a)\propto q_a(\textbf{x}^a) q_b (\textbf{x}^b|\textbf{x}^a)\), where \(\pi _a\) and \(\pi _b\) are the normalized marginal and conditional distributions. Under the following regularity condition on the proposal p (all parts normalized):

$$\begin{aligned} \text {ess}\sup _{\textbf{x}^a,\textbf{x}^b}\frac{\pi (\textbf{x}^a,\textbf{x}^b)}{p(\textbf{x}^a,\textbf{x}^b)} =w^*<\infty , \end{aligned}$$
(16)

IMH converges with rate \(1-1/w^*\). In contrast, the partial MTM-IS (Algorithm 3) converges at least as fast: its convergence rate is no larger than \(1-1/w^*\).

Proof

Noting that \(\text {ess}\sup _{\textbf{x}}\frac{\pi (\textbf{x}^a,\textbf{x}^b)}{p(\textbf{x}^a, \textbf{x}^b)}=w^*\), we obtain the convergence rate of IMH as \(1-1/w^*\) by Theorem 3 (with \(k=1\)). As for Algorithm 3, we decompose the transition kernel as

$$\begin{aligned}&A((\textbf{x}^a,\textbf{x}^b),(\textbf{y}^a,\textbf{y}^b))\\&\quad =k{\mathbb {P}} \left[ \left\{ (\textbf{y}^a\text { gets proposed}) \cap (\textbf{y}^b_k=\textbf{y}^b)\right. \right. \\&\qquad \quad \qquad \left. \left. \cap (J=k)\cap (\text {joint }(\textbf{y}^a,\textbf{y}^b) \text { gets accepted})\right\} \right] \\&\quad =k\underbrace{\int _{{\mathcal {X}}^b}\ldots \int _{{\mathcal {X}}^b}}_{k-1}\underbrace{\int _{{\mathcal {X}}^b} \ldots \int _{{\mathcal {X}}^b}}_{k-1}\frac{p_a(\textbf{y}^a)p_b (\textbf{y}^b|\textbf{y}^a)w_k}{\sum _{j=1}^k w_j}\\&\qquad \min \left\{ 1,\frac{\sum _{j=1}^k w_j}{\sum _{j=1}^k w_j^\prime }\right\} \\&\qquad \prod _{j=1}^{k-1}p_b(\textbf{y}^b_j|\textbf{y}^a)p_b(\textbf{x}^b_j|\textbf{x}^a) \textrm{d}\textbf{y}_j^b\textrm{d}\textbf{x}_j^b. \end{aligned}$$

Suppose the normalizing constant of \(q(\textbf{x}^a,\textbf{x}^b)\) is C, i.e., \(\pi (\textbf{x}^a,\textbf{x}^b)=q(\textbf{x}^a,\textbf{x}^b)/C\). Then,

$$\begin{aligned}&\frac{p_a(\textbf{y}^a)p_b(\textbf{y}^b|\textbf{y}^a)w_k}{\sum _{j=1}^k w_j} \min \left\{ 1,\frac{\sum _{j=1}^k w_j}{\sum _{j=1}^k w_j^\prime }\right\} \\&\quad =\frac{q(\textbf{y}^a,\textbf{y}^b)}{\max \left\{ \sum _{j=1}^{k}w_j, \sum _{j=1}^{k}w_j^\prime \right\} }\\&\quad =\frac{q(\textbf{y}^a,\textbf{y}^b)/C}{\max \left\{ \sum _{j=1}^k\frac{q(\textbf{y}^a,\textbf{y}^b_j)/C}{p(\textbf{y}^a,\textbf{y}^b_j)}, \sum _{j=1}^k\frac{q(\textbf{x}^a,\textbf{x}^b_j)/C}{p(\textbf{x}^a,\textbf{x}^b_j)}\right\} }\\&\quad =\frac{\pi (\textbf{y}^a,\textbf{y}^b)}{\max \left\{ \sum _{j=1}^k \frac{\pi (\textbf{y}^a,\textbf{y}^b_j)}{p(\textbf{y}^a,\textbf{y}^b_j)},\sum _{j=1}^k \frac{\pi (\textbf{x}^a,\textbf{x}^b_j)}{p(\textbf{x}^a,\textbf{x}^b_j)}\right\} }, \end{aligned}$$

in which \(\textbf{y}_k^b=\textbf{y}^b\) and \(\textbf{x}^b_k=\textbf{x}^b\). Therefore, it gives rise to

$$\begin{aligned}&A((\textbf{x}^a,\textbf{x}^b),(\textbf{y}^a,\textbf{y}^b))\\&\quad =k\pi (\textbf{y}^a,\textbf{y}^b)\underbrace{\int _{{\mathcal {X}}^b} \ldots \int _{{\mathcal {X}}^b}}_{k-1}\underbrace{\int _{{\mathcal {X}}^b} \ldots \int _{{\mathcal {X}}^b}}_{k-1}\\&\qquad \frac{\prod _{j=1}^{k-1}p_b (\textbf{y}^b_j|\textbf{y}^a)p_b(\textbf{x}^b_j|\textbf{x}^a)\textrm{d}\textbf{y}_j^b\textrm{d}\textbf{x}_j^b}{\max \left\{ W(\textbf{y}^a;\textbf{y}^b_{1:k-1},\textbf{y}^b),W(\textbf{x}^a;\textbf{x}^b_{1:k-1},\textbf{x}^b)\right\} }, \end{aligned}$$

where \(W(\textbf{x}^a;\textbf{x}^b_{1:k})\triangleq \sum _{j=1}^k \frac{\pi (\textbf{x}^a,\textbf{x}^b_j)}{p(\textbf{x}^a,\textbf{x}^b_j)}\) for any \(\textbf{x}^a\in {\mathcal {X}}^a,\textbf{x}^b_{1:k} =(\textbf{x}^b_1,\ldots ,\textbf{x}^b_k) \in ({\mathcal {X}}^b)^k\). By definition and condition (16), we find

$$\begin{aligned} W(\textbf{x}^a;\textbf{x}^b_{1:k})=\sum _{j=1}^k\frac{\pi (\textbf{x}^a, \textbf{x}^b_j)}{p(\textbf{x}^a,\textbf{x}^b_j)}\le kw^*. \end{aligned}$$

The following inequality immediately follows:

$$\begin{aligned} A((\textbf{x}^a,\textbf{x}^b),(\textbf{y}^a,\textbf{y}^b))\ge \frac{\pi (\textbf{y}^a,\textbf{y}^b)}{w^*}. \end{aligned}$$
(17)

Inequality (17) leads to a mixture decomposition analogous to (6) and is thus sufficient to establish the upper bound in Theorem 3 via the coupling argument and Lemma 1. Therefore, the convergence rate of Algorithm 3 is no larger than \(1-1/w^*\). However, the argument for establishing a matching lower bound does not directly apply because of the extra balancing trials \(\textbf{x}^b_j,1\le j\le k-1\), so the exact convergence rate of Algorithm 3 remains unknown. \(\square \)

5.2 Correlated multiple trials

Compared with the original MTM, the partial MTM-IS differs in that its multiple trials \((\textbf{y}^a,\textbf{y}^b_1),\ldots ,(\textbf{y}^a,\textbf{y}^b_k)\) are correlated because they share the same \(\textbf{y}^a\) component. As also demonstrated by Craiu and Lemieux (2007), we believe that generating correlated multiple trials is a key to designing efficient MTM algorithms. Although rigorous theoretical analysis for a general correlated MTM design is beyond our reach, we present some theoretical results for two special cases on finite state spaces, which may also be generalized to continuous state spaces. The implications of the analysis apply more generally: good correlated multiple tries can be obtained with the aid of a deterministic step.

5.2.1 Stratified sampling

Suppose \({\mathcal {X}}\) is a finite state space. We partition it into a few subgroups \({\mathcal {X}}_1, \ldots , {\mathcal {X}}_B\) so that \({\mathcal {X}}_{i}\cap {\mathcal {X}}_j=\emptyset \) for all \(i\ne j\) and \(\cup _j {\mathcal {X}}_j ={\mathcal {X}}\). We begin with a block-wise IMH step: propose a block from \(\{{\mathcal {X}}_1, \ldots , {\mathcal {X}}_B\}\) with probability \(p({\mathcal {X}}_j)\) and accept or reject it using the block importance weight \(w({\mathcal {X}}_j)=\pi ({\mathcal {X}}_j)/p({\mathcal {X}}_j)\). Then, we draw y within the resulting block with probability proportional to \(\pi (y)\) (see the sketch after this paragraph). It is easy to see that the chain becomes stationary once it converges at the subgroup level. Thus, the convergence rate of this algorithm is

$$\begin{aligned} r_B =1-{w({\mathcal {X}}^*)}^{-1}, \end{aligned}$$

where \({\mathcal {X}}^*=\arg \max _j w({\mathcal {X}}_j)\). This is not generally better than IMH(k), which has a convergence rate of \((1-1/w^*)^k> 1- k/w^*\). But if the weights w are very uneven and we can partition the states so that the block weights \(w({\mathcal {X}}_j)\) are more balanced, then the stratified IMH can improve upon IMH(k) significantly. We also note that the computational cost of this block-based MTM-IS(k) algorithm is no worse than that of IMH(2) (the first step of block sampling is no more expensive than one IMH step, and so is the second step of sampling within a block), which is much better than IMH(k) when k is large.
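One possible implementation of the block-wise scheme just described, for a finite state space (our reading of the text; names and data structures are ours):

```python
import numpy as np

def stratified_imh_step(x, blocks, pi, p_block, rng=np.random):
    """One stratified (block-wise) IMH update as described in Sect. 5.2.1.

    blocks  : list of lists partitioning the finite state space
    pi      : dict mapping each state to its target probability
    p_block : array of block proposal probabilities p(X_j), summing to one
    """
    pi_block = np.array([sum(pi[s] for s in b) for b in blocks])
    w_block = pi_block / p_block               # block weights w(X_j)
    cur = next(j for j, b in enumerate(blocks) if x in b)
    # block-level IMH move
    j = rng.choice(len(blocks), p=p_block)
    if rng.random() < min(1.0, w_block[j] / w_block[cur]):
        cur = j
    # draw a state within the current block proportionally to pi
    b = blocks[cur]
    probs = np.array([pi[s] for s in b], dtype=float)
    return b[rng.choice(len(b), p=probs / probs.sum())]
```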

Example 4

(Example 1 continued) Let \({\mathcal {X}}=\{1,\ldots , N\}\), and suppose that the target is \(\pi (x)\propto x\) and the proposal is \(p(x)\propto 1\). Then the original weights are \(w(x) \propto x\) and \(w^*=\frac{2N}{1+N}\approx 2\). Let \(k=2\); then IMH(2) has a rate of \((1-1/w^*)^{2}\approx 0.25\), which is quite good. Assume that N is even and partition the space as \({\mathcal {X}}_j=\{j, N-j+1\}\) for \(j=1,\ldots , N/2\). Then \(w({\mathcal {X}}_j)\propto 1\), and the resulting MTM-IS(2) converges in one step. More generally, for an arbitrary distribution \(\pi (x)\) and the uniform proposal \(p(x)=N^{-1}\), we have \(w^*=N\pi (x^*)\) with \(x^*=\arg \max _x \pi (x)\). Thus, if we can partition the state space so that the \(\pi ({\mathcal {X}}_j)\) are approximately equal for \(j=1,\ldots ,B\), the algorithm can be much improved.

5.2.2 Sampling without replacement

Another obvious way to introduce correlations among multiple proposals is to sample without replacement. Let \({\mathcal {X}}=\{1,\ldots , N\}\). To simplify the discussion, we focus here on simple random sampling without replacement (SRSWOR, i.e., \(p(\textbf{y})\propto 1\)), although it is possible to extend the method to sampling without replacement with unequal probabilities using one of the schemes in Chen et al. (1994). The algorithm is as follows.

[Algorithm 4: MTM with simple random sampling without replacement (MTM-SRSWOR)]
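The algorithm box is not reproduced here; a minimal sketch matching the transition probability (18) below (function names are ours):

```python
import numpy as np

def mtm_srswor_step(x, k, pi, states, rng=np.random):
    """One MTM step with simple random sampling without replacement (cf. Algorithm 4).

    pi     : dict mapping states to (possibly unnormalized) target probabilities
    states : list of all states; the current state is excluded from the proposal set
    """
    pool = [s for s in states if s != x]
    idx = rng.choice(len(pool), size=k, replace=False)   # SRSWOR from X \ {x}
    ys = [pool[i] for i in idx]
    w = np.array([pi[s] for s in ys], dtype=float)       # p is uniform, so w is prop. to pi
    J = rng.choice(k, p=w / w.sum())
    rest = w.sum() - w[J]                                # weights of the unselected trials
    rho = min(1.0, (pi[ys[J]] + rest) / (pi[x] + rest))
    return ys[J] if rng.random() < rho else x
```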

The actual transition probability from x to \(y\ne x\) for this scheme is

$$\begin{aligned} A(x,y)&=\sum _{S^{(k-1)}_y}\frac{1}{\left( {\begin{array}{c}N-1\\ k\end{array}}\right) }\pi (y)\nonumber \\&\quad \min \left[ \frac{1}{\pi (y)+\sum _{i<k}\pi (y_i)},\frac{1}{\pi (x) +\sum _{i<k}\pi (y_i)}\right] , \end{aligned}$$
(18)

where \(S^{(k-1)}_y\subset {\mathcal {X}}\backslash \{x,y\}\), \(|S^{(k-1)}_y|=k-1\), and \( y_j \in S^{(k-1)}_y\) for all \(j<k\). An exact eigenvalue decomposition of the matrix A would give a tight bound on the convergence rate, but A does not possess a nice low-rank structure like that of the IMH sampler or the MTM-IS.

For \(S\subset {\mathcal {X}}\), we define \(\pi (S)=\sum _{x\in S}\pi (x)\), \(S^*=\arg \max _{\{S: \ |S|=k\} } \pi (S)\), and \(x^*=\arg \max _x\pi (x)\). We find that the following inequality holds:

$$\begin{aligned} A(x,y)\ge \frac{k\pi (y)}{(N-1)\pi (S^*)}, \ \ x\ne y. \end{aligned}$$

During each iteration, the chain stays at the current state if and only if the new proposal is rejected, since in our construction of Algorithm 4 the proposal set is not allowed to contain the current state. We observe that \(\rho \equiv 1\) whenever \(x=x_*\triangleq \arg \min _x\pi (x)\), leading to \(A(x_*,x_*)=0\). This fact prevents us from using the previous coupling arguments directly. However, by specializing to particular cases, we can still obtain satisfactory results.

Example 5

Choosing \(k=2\) and \({\mathcal {X}}=\{1,\ldots ,N\}\), we set

$$\begin{aligned} \pi _1=1-p,\ \ \pi _2=\cdots =\pi _N=\frac{p}{N-1}, \end{aligned}$$
(19)

where \(0\le p\le (N-1)/N\), which guarantees that \(x^*=1\) and \(\{2,\ldots ,N\}\subseteq \arg \min _x\pi (x)\). As a result, we know that \(A(2,2)=\cdots =A(N,N)=0\). Furthermore, the matrix A is completely determined by the following four quantities:

$$\begin{aligned} a_1&=A(1,2)=\frac{2\pi _2}{(N-1)(\pi _1+\pi _2)},\\ a_2&=A(1,1)=\frac{\pi _1-\pi _2}{\pi _1+\pi _2},\\ a_3&=A(2,1)=\frac{2\pi _1}{(N-1)(\pi _1+\pi _2)},\\ a_4&=A(2,3)=\frac{(N-3)}{(N-1)(N-2)}\\&\quad +\frac{2\pi _2}{(N-1)(N-2)(\pi _1+\pi _2)}. \end{aligned}$$

We can then write out A as follows:

$$\begin{aligned} A=\left[ \begin{array}{cccccc} a_2 &{} a_1 &{} a_1 &{} a_1 &{} \ldots &{} a_1\\ a_3 &{} 0 &{} a_4 &{} a_4 &{} \ldots &{} a_4\\ a_3 &{} a_4 &{} 0 &{} a_4 &{} \ldots &{} a_4\\ a_3 &{} a_4 &{} a_4 &{} 0 &{} \ldots &{} a_4\\ \ldots &{}&{}&{}&{}&{}\ldots \\ a_3 &{} a_4 &{} a_4 &{} a_4 &{} \ldots &{} 0\\ \end{array}\right] . \end{aligned}$$
(20)

Now this matrix admits a useful low-rank decoupling: \(A=G+ep^T\), where \(e=\left[ 1,\ldots ,1\right] ^T\), \(p=\left[ a_3,a_4,\ldots ,a_4\right] ^T\) and

$$\begin{aligned} G=\left[ \begin{array}{cccccc} a_2-a_3 &{} a_1-a_4 &{} a_1-a_4 &{} a_1-a_4 &{} \ldots &{} a_1-a_4\\ 0 &{} -a_4 &{} 0 &{} 0 &{} \ldots &{} 0\\ 0 &{} 0 &{} -a_4 &{} 0 &{} \ldots &{} 0\\ 0 &{} 0 &{} 0 &{} -a_4 &{} \ldots &{} 0\\ \ldots &{}&{}&{}&{}&{}\ldots \\ 0 &{} 0 &{} 0 &{} 0 &{} \ldots &{} -a_4\\ \end{array}\right] . \end{aligned}$$
(21)

Note that e is a common right eigenvector for both A and \(A-G\), corresponding to the largest eigenvalue 1. Since \(A-G\) is of rank 1, the remaining eigenvalues of A and G have to be the same. Hence the eigenvalues for A are \(1,a_2-a_3,-a_4,\ldots ,-a_4\). This decoupling trick has also been used in Liu (1996) for the IMH algorithm. Given the convergence rate \(\left( 1-1/(N\pi _1)\right) ^2\) of IMH(2), it suffices to show

$$\begin{aligned} \mid a_2-a_3\mid \le \left( 1-1/(N\pi _1)\right) ^2,\ \ a_4 \le \left( 1-1/(N\pi _1)\right) ^2, \end{aligned}$$
(22)

to prove that MTM-SRSWOR(2) is faster than IMH(2). Clearly, this holds true for \(p=\dfrac{N-1}{2N}\), which leads to \(\pi _1=1/2+1/(2N)\), \(\pi _2=1/(2N)\). In this case,

$$\begin{aligned} a_2-a_3&=1-\frac{4N}{(N+2)(N-1)}<1-\frac{4}{N+1}\\&<\left( 1-\frac{2}{N+1}\right) ^2=\left( 1-\frac{1}{N\pi _1}\right) ^2,\\ a_4&=\frac{(N-3)}{(N-1)(N-2)}+\frac{2}{(N-1)(N-2)(N+2)}\\&<\left( 1-\frac{1}{N\pi _1}\right) ^2. \end{aligned}$$

We note that designing a suitable parallel construction to do SRSWOR can speed up the algorithm considerably. Furthermore, when proposing multiple trials, we may also choose not to exclude x from the proposal set. In this case, we need to modify Algorithm 4 slightly to become Algorithm 5.

[Algorithm 5: MTM-SRSWOR with the current state allowed in the proposal set]

5.3 Independent non-identical proposals

Besides introducing correlations between multiple trials, Craiu and Lemieux (2007) also suggest using different proposals for generating the multiple trials in each MTM iteration and provide some supporting empirical evidence. Here we consider a special case of MTM-IS(k) in which the multiple trials are generated from different proposals, i.e., \(y_j\sim p_j(\cdot )\) independently for \(j=1,\ldots ,k\). In this case, we again do not have to draw balancing trials. Defining \(w_j(x):=\pi (x)/p_j(x)\), we summarize the procedure in Algorithm 6.

[Algorithm 6: MTM-IS with independent non-identical proposals]

To demonstrate the effect of the multiple-try design employed in Algorithm 6, it should be compared with a sequential k-step IMH sampler. During one iteration, this sampler runs an inner loop of length k, in which the j-th step draws a proposal from \(p_j\) and then accepts or rejects it using the MH rule, as in the ordinary IMH sampler. This sequential IMH sampler has the same computational cost as Algorithm 6. The following theorem provides tight upper bounds on the convergence rates of the two algorithms; its proof is deferred to the Appendix.
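A sketch of the two samplers being compared, reflecting our reading of Algorithm 6 (the algorithm box is not reproduced here) and of the sequential comparator; all names are ours.

```python
import numpy as np

def mtm_is_multi_step(x, pi, samplers, pdfs, rng=np.random):
    """One MTM-IS update with independent non-identical proposals (cf. Algorithm 6).

    samplers[j]() draws y_j ~ p_j(.);  pdfs[j](y) evaluates p_j(y);  w_j(y) = pi(y)/p_j(y).
    """
    k = len(samplers)
    ys = [samplers[j]() for j in range(k)]
    wy = np.array([pi(ys[j]) / pdfs[j](ys[j]) for j in range(k)])
    J = rng.choice(k, p=wy / wy.sum())
    # as in Algorithm 2, the unselected y_j's stand in for balancing trials,
    # while the current state enters through w_J(x)
    denom = pi(x) / pdfs[J](x) + wy.sum() - wy[J]
    rho = min(1.0, wy.sum() / denom)
    return ys[J] if rng.random() < rho else x

def sequential_imh_step(x, pi, samplers, pdfs, rng=np.random):
    """The comparator: k consecutive IMH steps, the j-th using proposal p_j."""
    for j in range(len(samplers)):
        y = samplers[j]()
        rho = min(1.0, (pi(y) / pdfs[j](y)) / (pi(x) / pdfs[j](x)))
        if rng.random() < rho:
            x = y
    return x
```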

Theorem 5

Suppose the target \(\pi \) is absolutely continuous with respect to every proposal \(p_j\). Then Algorithm 6 and its corresponding sequential IMH sampler are geometrically convergent, with their respective convergence rates upper bounded by \(1-\sum _{j=1}^{k}{\mathbb {E}}_{p}\left[ \frac{1}{w_j^*+\sum _{1\le i\le k,i\ne j}w_i(X_i)}\right] \) and \(\prod _{j=1}^k \left( 1-\frac{1}{w_j^*}\right) \), where \(w_j^*:=\sup _{x\in {\mathcal {X}}}w_j(x)\). Furthermore, the following inequality holds:

$$\begin{aligned} 1-\sum _{j=1}^{k}{\mathbb {E}} _{p}\left[ \frac{1}{w_j^*+\sum _{1\le i\le k,i\ne j}w_i(X_i)}\right] \ge \prod _{i=1}^k \left( 1-\frac{1}{w_i^*}\right) , \end{aligned}$$
(23)

implying that the upper bound for Algorithm 6 is worse than that for the corresponding sequential IMH.

Remark 2

(Tightness of the upper bounds) Suppose \(\exists \ x^*\) such that

$$\begin{aligned} w_j(x^*)=w_j^*:=\sup _{x\in {\mathcal {X}}}w_j(x) =\sup _{x\in {\mathcal {X}}}\pi (x)/p_j(x)<\infty , \ \text {for all } j, \end{aligned}$$
(24)

i.e., the importance weight functions \(w_j\) of the different proposals attain their respective suprema at the same point \(x^*\). Then, the convergence rates of both aforementioned algorithms attain their respective upper bounds. When \(p_1=\cdots =p_k\), condition (24) automatically holds, recovering the convergence rate result of Theorem 3. However, when there is no such \(x^*\) as required by (24), the quantities claimed in Theorem 5 are only upper bounds. It remains unknown under what other conditions one algorithm can be provably better than the other. Our empirical study shows that their computational efficiencies are almost indistinguishable when the target distribution is “hard” relative to the proposals.

Example 6

We conducted a few simulations to examine convergence behaviors of Algorithm 6 and the corresponding sequential IMH sampler at the same computational cost.

As shown in Fig. 7, we considered target densities in the form of a mixture of two standard distributions in various dimensions. The top plots in Fig. 7 correspond to Gaussian mixture targets, \(\pi =\frac{1}{2}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{1}{2}{\mathcal {N}}(\textbf{3},{\mathbb {I}} _d)\), with d=3, 4, and 5, respectively. Two different proposal distributions are employed: \(p_1={\mathcal {N}}(0,{\mathbb {I}} _d)\) and \(p_2={\mathcal {N}}(0,9{\mathbb {I}} _d)\). During one iteration of the MTM-IS(k) algorithm, k/2 trials are independently drawn from \(p_1\) and another k/2 trials from \(p_2\). The bottom plots correspond to t-mixture targets, \(\pi =\frac{1}{2}t_3(0)+\frac{1}{2}t_3(\textbf{4})\), for d=1, 2, and 3, with the two proposal distributions \(p_1=t_3(0)\) and \(p_2=t_5(0)\) and the same implementation of MTM-IS(k) as in the previous case. These plots show that Algorithm 6 and its corresponding sequential IMH sampler differ very little in their convergence rates, although theoretically we cannot claim that one is necessarily better than the other without condition (24). All simulations are based on \(10^6\) iterations on an Apple M2 chip with 16GB memory, each taking a few minutes.

Fig. 7: Top: auto-correlation plots for the Gaussian mixture targets in Example 6; from left to right, dimension d = 3, 4, 5. Bottom: auto-correlation plots for the t-mixture targets in Example 6; from left to right, dimension d = 1, 2, 3. Solid lines: \(k=2\); dashed lines: \(k=6\); dotted lines: \(k=10\).

5.4 A general framework

Inspired by the variants of MTM just discussed, we propose a general framework combining these variants in Algorithm 7. With \(\pi (\cdot )\) as the target distribution on \({\mathcal {X}}\), we let \(p(x,\textbf{y})\) denote the proposal transition function for multiple correlated proposals, where \(x\in {\mathcal {X}}\) and \(\textbf{y}=(y_1,\ldots ,y_k)\in {\mathcal {X}}^k\). We further write the j-th marginal of \(p(x,\textbf{y})\) as \(p_j(x,y_j)=\int p(x,\textbf{y})\textrm{d}\textbf{y}_{(-j)}\), and define the j-th generalized importance weight as

$$\begin{aligned} w_j(y\mid x)=\frac{\pi (y)}{p_j(x,y)}\lambda _j(x,y), \end{aligned}$$
(25)

for \(j=1,\ldots ,k\), where \(\lambda _j\) is a symmetric function. Assuming the current state is x, the updating rule is summarized in Algorithm 7.

[Algorithm 7: the generalized MTM update]

If we require \(p(x,\textbf{y})\) to have the same marginals for the different \(y_j\)’s, the algorithm reduces to that of Craiu and Lemieux (2007); if we require \(p(x,\textbf{y})\) to be independent across the \(y_j\)’s, it reduces to that of Casarin et al. (2013). Note that the balancing proposals are drawn to facilitate the computation of \(\rho \), which guarantees the detailed balance of the MTM design. The following result is expected, and its detailed proof is deferred to the Appendix.

Theorem 6

The generalized MTM transition rule (Algorithm 7) satisfies the detailed balance condition and hence induces a reversible Markov chain with \(\pi \) as its invariant distribution.

Defining \(\textbf{x}^*(j)\triangleq (x^*_1,\ldots ,x^*_{j-1}, x,x^*_{j+1},\ldots ,x^*_k)\), one can determine the transition density of the generalized MTM framework in the same spirit as the proof of Theorem 1:

$$\begin{aligned} A(x,y)&=\pi (y) \sum _{j=1}^k \Bigg [ p_j(x,y)p_j(y,x) \lambda _j(x,y) \times \\&\quad \int u_j(\textbf{x}^*(j),\textbf{y}) p(x,\textbf{y}_{(-j)} \mid y_j=y) \\&\quad p(y,\textbf{x}^*_{(-j)}\mid x_j^*=x)\prod _{i\ne j}\textrm{d} y_i\textrm{d}x_i^*\Bigg ], \end{aligned}$$

where we write

$$\begin{aligned} u_j(\textbf{x},\textbf{y})\triangleq \min \left\{ \left( \sum _{i=1}^k w_i(y_i\mid x_j)\right) ^{-1},\left( \sum _{i=1}^kw_i(x_i\mid y_j)\right) ^{-1}\right\} \end{aligned}$$

for any \(\textbf{x}=(x_1,\ldots ,x_k)\) and \(\textbf{y}=(y_1,\ldots ,y_k)\). A detailed derivation of this formula can be found in the proof of Theorem 6.

As demonstrated in Algorithms 4, 5 and 6, we sometimes do not need to draw balancing trials for MTM to retain the detailed balance. A natural question then arises: can we find a general condition under which MTM can avoid drawing balancing trials? The following theorem provides a sufficient condition that covers all the cases we discussed.

Theorem 7

If, for any pair (x, y) and for all j, the joint proposal distribution satisfies

$$\begin{aligned} p(x,\textbf{y}_{(-j)}\mid y_j=y)=p(y,\textbf{y}_{(-j)}\mid y_j=x), \end{aligned}$$
(26)

we can maintain the detailed balance by setting \(x^*_j\triangleq y_j\) for \(j\ne J\) in Algorithm 7.

Remark 3

(Correlated multiple trials) As demonstrated in Sects. 5.1 and 5.2, letting the proposed multiple trials be correlated (especially negatively) can help improve the chain’s convergence. A useful strategy is to use multiple trials as stepping stones to move from one mode of the distribution to another, similar in spirit to Hamiltonian/hybrid Monte Carlo (Qin and Liu 2001; Liu 2008) and the griddy Gibbs MTM (Liu et al. 2000). Indeed, it was shown empirically in Qin and Liu (2001) that applying MTM to HMC trajectories may further improve the sampling efficiency. However, an in-depth theoretical analysis of the kind carried out here is much more challenging due to the semi-deterministic nature of the aforementioned algorithms.

Remark 4

(Employing multiple distributions in MTM) Intuitively, one may hope that using different distributions for each trial could help us explore the state space better. Our results in Sect. 5.3, however, demonstrate that it is still not very useful under the IMH framework if the multiple trials are independent. It may be helpful for the partial MTM framework discussed in Sect. 5.1.

6 Concluding remarks

We have presented a complete eigen-decomposition and convergence rate analysis for the MTM-IS and compared it with the “thinned” IMH sampler of the same computational cost. With the exact form of the eigenvalues of the MTM-IS, we proved rigorously that the sampler is not as efficient as the simpler “thinned” IMH approach. To the best of our knowledge, this is the first exact rate result for an MTM-type algorithm, although its implication is less than encouraging. The good news is that, in the more realistic setting of MTM applications explained in Sect. 5.1, we can show that MTM improves upon the standard IMH and does not have a suitable competitor.

In a quest to find advantages that MTM may offer, we considered a slightly modified framework that encompasses several variants of MTM published in the literature. We found that, even under the IMH framework, it is possible to construct an MTM algorithm, using stratified sampling, partial sampling, or sampling without replacement, to gain efficiency. A key to such efficiency gains is to allow the multiple trials to be either more dispersed than independent ones (Sect. 5) or applied only to certain “low-cost” parts (Sect. 5.1). Detailed theoretical understanding and guiding principles, however, are still lacking and await further endeavors.