Abstract
The multiple-try Metropolis method is an interesting extension of the classical Metropolis–Hastings algorithm. However, theoretical understanding of its usefulness and convergence behavior is still lacking. We here derive the exact convergence rate for the multiple-try Metropolis Independent sampler (MTM-IS) via an explicit eigen analysis. As a by-product, we prove that a naive application of the MTM-IS is less efficient than the simpler approach of a “thinned” independent Metropolis–Hastings method at the same computational cost. We further explore more variants and find it possible to design more efficient algorithms by applying MTM to part of the target distribution or creating correlated multiple trials.
1 Introduction
1.1 Fundamental Metropolis–Hastings method
Markov chain Monte Carlo (MCMC) methods have played important roles in statistical computing and Bayesian inference and have attracted much attention from both theoretical researchers and practitioners. In a nutshell, this set of methods provides general and practical recipes for generating random draws from any given target probability distribution known up to a normalizing constant. Specifically, such an algorithm generates a time-homogeneous Markov chain whose stationary distribution is the target one. Under mild assumptions, this chain converges to the target distribution geometrically (Roberts and Tweedie 1996; Liu et al. 1995). See Liu (2008) and Brooks et al. (2011) for more comprehensive reviews. The scheme first proposed by Metropolis et al. (1953) and then generalized by Hastings (1970) is arguably the most popular and fundamental construction among all MCMC methods. Let \(\pi (\cdot )\) denote the target probability distribution/density function on the state space \({\mathcal {X}}\). The Metropolis–Hastings method constructs a Markov chain \(x^{(1)}, x^{(2)},\ldots , \) on \({\mathcal {X}}\) as follows. At step \(t+1\), it proposes a new state y from a user-specified transition function p(x, y), i.e., \(y\sim p(x^{(t)},\cdot )\). Then, the next state \(x^{(t+1)}\) is equal to y with probability \(\rho \) and to \(x^{(t)}\) with probability \(1-\rho \), where

\(\rho =\min \left\{ 1,\ \frac{\pi (y)\,p(y,x^{(t)})}{\pi (x^{(t)})\,p(x^{(t)},y)}\right\} .\)
This design ensures that the generated Markov chain satisfies the detailed balance with respect to \(\pi \), which guarantees the chain’s reversibility and convergence under mild conditions.
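As a minimal illustration of this transition rule, the step can be sketched in a few lines of Python; the function names and the toy random-walk example below are ours, not part of the original presentation:

```python
import numpy as np

def mh_step(x, log_pi, propose, log_p):
    """One Metropolis-Hastings transition.

    log_pi(x)   : log target density (unnormalized is fine).
    propose(x)  : draws y ~ p(x, .).
    log_p(x, y) : log proposal density p(x, y) (up to a constant).
    """
    y = propose(x)
    # log acceptance ratio: pi(y) p(y, x) / (pi(x) p(x, y))
    log_rho = log_pi(y) + log_p(y, x) - log_pi(x) - log_p(x, y)
    if np.log(np.random.rand()) < min(0.0, log_rho):
        return y  # accept the proposal
    return x      # stay at the current state

# Toy run: standard normal target with a symmetric random-walk proposal,
# for which the proposal terms cancel in the ratio.
scale = 0.8
log_pi = lambda x: -0.5 * x ** 2
propose = lambda x: x + scale * np.random.randn()
log_p = lambda x, y: -0.5 * ((y - x) / scale) ** 2

chain = [0.0]
for _ in range(20000):
    chain.append(mh_step(chain[-1], log_pi, propose, log_p))
```

After discarding a burn-in period, the draws behave like (correlated) samples from the standard normal target.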
1.2 Geometric convergence
A Markov chain with transition function A is said to be geometrically ergodic if, for \(\pi \)-almost every x, \( \Vert A^n(x,\cdot )-\pi (\cdot )\Vert \le C(x)r^n\) holds for some constant \(r\in (0,1)\). Here \(\Vert \cdot \Vert \) denotes a distance metric between two probability measures, usually taken as the total variation (TV) distance. Other modes of convergence, such as convergence in \(\chi ^2\)-distance (which implies the convergence in total variation), have also been investigated (Liu et al. 1995; Liu 2008). Establishing this inequality and deriving sharp bounds on the rate r are seen as central tasks in studying MCMC algorithms (Tierney 1994; Liu et al. 1995; Roberts and Tweedie 1996).
As a generalization of the standard Metropolis–Hastings algorithm, the multiple-try Metropolis (MTM) scheme as formalized in Liu et al. (2000) allows one to draw multiple trials at each step and select one according to a specially designed probability distribution. Although intuitively the MTM scheme enables one to escape from local modes more easily, there is little theoretical understanding of the convergence rate of any form of the MTM algorithm, making it a challenging practical question whether an MTM approach should be employed for a specific problem. Existing theoretical results on the Metropolis–Hastings algorithm clearly cannot be easily extended to the MTM algorithm. Indeed, getting sharp bounds on the convergence rate of any general-purpose Metropolis–Hastings algorithm can be extremely challenging, except for the Independent Metropolis–Hastings (IMH) algorithm (which is also called the Metropolised independence sampler by Liu (1996) and the independence Metropolis chain by Tierney (1994)). We are therefore tempted to consider whether the IMH’s multiple-try version, which we call the multiple-try Metropolis Independent sampler (MTM-IS), can be tackled theoretically.
1.3 Convergence rate of independent Metropolis–Hastings algorithm
Geometric ergodicity is not guaranteed for a general Metropolis–Hastings algorithm unless we impose suitable restrictions (Roberts and Tweedie 1996), and exact convergence rates for Metropolis–Hastings algorithms are rarely available (Diaconis and Saloff-Coste 1998). In practice, geometric ergodicity is often established under the ‘drift-and-minorization’ framework (Diaconis et al. 2008), but this technique usually yields a very conservative bound on the convergence rate that is of limited practical use. Because of the very special structure of the IMH algorithm, an explicit eigen-analysis of its transition matrix for the finite discrete state-space case was obtained by Liu (1996), which yields the exact convergence rate of the IMH algorithm (and a very tight bound on the constant in front of the rate) and offers a comparison with classical rejection sampling and importance sampling. Atchadé and Perron (2007) study the continuous case by determining the full spectrum of the transition operator of the IMH algorithm. A recent preprint of Wang (2020) combines previous results and provides a lower bound, hence determining the exact convergence rate. In this paper, we impose similar conditions on the MTM-IS and study its exact convergence rate.
1.4 Multiple-try Metropolis and its variants
The original idea of multiple-try Metropolis (MTM) comes from chemical physicists interested in molecular simulations (Frenkel et al. 1996). Its general formulation constructed in Liu et al. (2000) inspires the development of Ensemble MCMC methods by Neal (2011), connects with particle filtering (Martino et al. 2014), and stimulates ideas of parallelizing MCMC (Calderhead 2014; Yang et al. 2018). We refer interested readers to the review of Martino (2018). Intuitively, the MTM approach enables one to explore the sample space more broadly, and thus potentially gains efficiency in avoiding being trapped in local modes. The method has been incorporated in some applications such as model selection (Pandolfi et al. 2010) and Bayes factor estimation (Dai and Liu 2020).
In the context of molecular simulations (Frenkel et al. 1996), the multiple-try strategy is often applied to a target distribution in which the state space can be partitioned into two parts: position and orientation, i.e., \(\textbf{x}=(\textbf{x}^p, \textbf{x}^o)\). For a given \(\textbf{x}^p\), evaluating multiple configurations corresponding to different orientations, \(\pi (\textbf{x}^p,\textbf{x}^{o1}),\ldots , \pi (\textbf{x}^p,\textbf{x}^{om})\) is not much more expensive than evaluating a single \(\pi (\textbf{x}^p,\textbf{x}^o)\). Thus, MTM can be quite useful in facilitating an efficient move: we can propose the new configuration by (a) first proposing the position \(\textbf{x}^p_{(new)}\); (b) associating with it multiple orientations \(\textbf{x}^{o1}_{(new)},\ldots ,\textbf{x}^{om}_{(new)}\); (c) picking one from them properly, and (d) using the MTM rule to do acceptance/rejection. In addition to this case, MTM is also particularly useful when combined with directional sampling, as in (Liu et al. 2000; Dai and Liu 2020). Specifically, given a sampling direction \(\textbf{e}\) at position \(\textbf{x}\), multiple trials are drawn simultaneously as \(r_1,\ldots ,r_m\sim p(r)\) to construct \(\textbf{y}_j=\textbf{x}+r_j\textbf{e}\).
Several variants of the MTM are worth mentioning: Craiu and Lemieux (2007) propose to use correlated trials to accelerate MTM and introduce antithetic and stratified sampling to induce correlation; Casarin et al. (2013) argue that multiple independent trials from different distributions are worth considering, and connect MTM to interacting sampling algorithms. Theoretically, Bédard et al. (2012) conduct a scaling analysis for MTM. However, despite our best efforts, we could not find any existing result on the convergence rate of an MTM algorithm.
In this paper, we report the exact convergence rate of the MTM-IS for general target \(\pi (\cdot )\) and proposal \(p(\cdot )\). The result is somewhat surprising as it shows that the MTM-IS with k multiple tries is not as efficient as simply repeating the standard IMH algorithm k times, thus suggesting that we may want to design the k multiple proposals to be “over-dispersed” (e.g., negatively correlated) in order to take advantage of the MTM structure. Another useful scenario, as discussed previously and detailed in Sect. 5.1, is to help propose a better configuration for a general Metropolis–Hastings algorithm by orienting part of the proposal better via MTM.
The rest of the article is organized as follows. Section 2 carries out an eigenvalue analysis of MTM-IS; Sect. 3 specifies the exact convergence rate of MTM-IS under the total variation distance and offers an inequality to compare MTM-IS with its corresponding “thinned” IMH algorithm (i.e., taking one draw from every k iterations of the sampler); Sect. 4 provides some empirical results for multivariate Gaussian and Gaussian mixtures; Sect. 5 discusses several variants and extensions of MTM; and Sect. 6 concludes the article with a short remark.
2 Eigen-analysis of multiple-try Metropolis independent sampler
2.1 Notations
Throughout the article, we use \({\mathcal {X}}\) to denote the state space, which can be either discrete or continuous. The notations \(\pi (x)\) and p(x, y) represent the target and proposal distributions, respectively, with \(x,y\in {\mathcal {X}}\). If the proposal distribution is independent of the current state x, we write it as p(y). The actual transition function/probability/density of the MCMC algorithm is denoted by A(x, y). A collection of multiple trials of size k is written as \(\textbf{y}=(y_1,\ldots ,y_k)\). We consider the total variation distance for any two (signed) measures P and Q, which is defined as \(\Vert P-Q\Vert _{TV} =\sup _{A\in {\mathcal {F}}} |P(A)-Q(A)|\), where \({\mathcal {F}}\) denotes the \(\sigma \)-field common to P and Q (e.g., the Borel \(\sigma \)-field for most common uses). In Sect. 5, we slightly abuse the notation by letting \(p(x,\textbf{y})\) be the proposal distribution for \(x\in {\mathcal {X}}\) and \(\textbf{y}=(y_1,\ldots ,y_k)\in {\mathcal {X}}^k\), as we would consider multiple correlated trials in that section. In addition, we write \(p(x,\textbf{y}_{(-j)}\mid y_j)\) for the conditional distribution of \(\textbf{y}_{(-j)}\equiv (y_1,\ldots ,y_{j-1},y_{j+1},\ldots ,y_k)\) given \(y_j\) and x. Lastly, \(p_j(x,y_j)=\int p(x, \textbf{y}) \textrm{d} \textbf{y}_{(-j)}\) denotes the conditional marginal distribution of \(y_j\) given x.
2.2 Description of the algorithms
The general framework of the MTM as formulated in Liu et al. (2000) is summarized in Algorithm 1. Let the current state be x, and let the number of multiple tries be k. With a proposal transition function p(x, y) that defines the conditional distribution of y, we define the generalized importance weight as
where \(\lambda \) is a symmetric non-negative function (i.e., \(\lambda (x,y)=\lambda (y,x)\ge 0\), \(\forall x, y\)). Thus, the acceptance/rejection ratio in a general MH algorithm is just the ratio of the generalized importance weights.
Here, \(x_1^*,x_2^*,\ldots ,x_{k-1}^*\) are called balancing trials, which are drawn to guarantee the detailed balance. Liu et al. (2000) also extend the MTM for generating non-independent multiple trials such as semi-deterministic ones along a direction. If we choose \(p(x,y)=p(y)\), we can modify this algorithm to avoid drawing additional balancing trials, as the algorithm remains valid if we simply replace the \(x_j^*\) by \(y_j\) in computing \(\rho \). This modified version is summarized in Algorithm 2 and named the MTM-IS(k). In this case, if we further select \(\lambda (x,y)\equiv 1\), the generalized importance weight (1) reduces to \(w(y\mid x)=\pi (y)/p(y)\), coinciding with the standard notation of the importance ratio. To simplify the notations, we write
In theory, we assume that \(\pi \) is absolutely continuous with respect to p, so that this importance weight can be interpreted as the Radon-Nikodym derivative. In practice, one should always choose p so that its support covers that of \(\pi \) for the algorithm to work well. The main result of this section is stated in Theorem 2, which can be viewed as a generalization of the results in Liu (1996) and Atchadé and Perron (2007) and provides the exact convergence rate of MTM-IS.
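A minimal sketch of one MTM-IS(k) transition, assuming the selection and acceptance rules summarized above (select a trial with probability proportional to \(w(y_j)=\pi (y_j)/p(y_j)\), then accept it with probability \(\min \{1, \sum _j w(y_j)/(w(x)+\sum _{j\ne J}w(y_j))\}\)); all names and the toy Gaussian example are ours:

```python
import numpy as np

def mtm_is_step(x, w, sample_p, k):
    """One MTM-IS(k) transition (our sketch of Algorithm 2).

    w(x)       : importance ratio pi(x)/p(x) (up to a constant).
    sample_p() : draws one proposal from p(.).
    """
    ys = [sample_p() for _ in range(k)]
    wy = np.array([w(y) for y in ys])
    # Select one trial with probability proportional to its weight.
    J = np.random.choice(k, p=wy / wy.sum())
    # Accept y_J, reusing the remaining trials in place of balancing samples.
    num = wy.sum()
    den = w(x) + wy.sum() - wy[J]
    if np.random.rand() < min(1.0, num / den):
        return ys[J]
    return x

# Toy run: target N(0,1) with an over-dispersed proposal N(0, 2^2).
sigma = 2.0
w = lambda x: sigma * np.exp(-0.5 * x ** 2 + 0.5 * (x / sigma) ** 2)
sample_p = lambda: sigma * np.random.randn()

chain = [0.0]
for _ in range(5000):
    chain.append(mtm_is_step(chain[-1], w, sample_p, k=5))
```

With k=1 this reduces to the standard IMH step.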
2.3 Transition distribution decomposition
Theorem 1
The transition distribution of MTM-IS can be decomposed as

\(A(x,\textrm{d}y)=\min \left\{ H_k[w(x)],H_k[w(y)]\right\} \pi (y)\,\textrm{d}y+R(x)\,\delta _x(\textrm{d}y),\)
where \(H_k\) is defined as

\(H_k(z)=k\int \cdots \int \frac{1}{z+\sum _{i=1}^{k-1}w(x_i)}\prod _{i=1}^{k-1}p(x_i)\,\textrm{d}x_1\cdots \textrm{d}x_{k-1},\)
and \(R(x)=1-\int _{\mathcal {X}}\min \left\{ H_k[w(x)],H_k[w(y)]\right\} \pi (y)\textrm{d}y\in [0,1]\) denotes the rejection probability when the current state is \(x\in {\mathcal {X}}\). In particular, \(H_k(z)\) is a strictly decreasing function in z. For \(k=1\), \(H_k\) degenerates to \(H_1(z)=z^{-1}\).
Proof
Let \(B\subset {\mathcal {X}}\) be measurable with \(x\notin B\). The probability of proposing an element in B and accepting it is
The last expression appears not to depend on x, but the importance ratio \(w(x)=\pi (x)/p(x)\) matters when deciding whether or not the chosen \(y_J\) is accepted. Furthermore,
where \(H_k\) is as defined in (4). Thus, the overall rejection probability is
and the prescribed decomposition (3) is thus proved. \(\square \)
Let \(w^*\triangleq \inf \{u>0:\pi (x:w(x)\le u)=1\}\) be the essential supremum of w(x) on \({\mathcal {X}}\) w.r.t. \(\pi (\cdot )\) (i.e., \(w^*\) is the smallest value such that \(w(x)\le w^*\) with \(\pi \)-probability 1). Since \(H_k(w)\) is a monotone decreasing function of w (Theorem 1), we have an upper bound \(R(x)\le 1-H_k(w^*)\). Furthermore, since
we have the following mixture representation of the transition function, convenient for comparing with \(\pi \):

\(A(x,\cdot )=H(w^*)\,\pi (\cdot )+\left[ 1-H(w^*)\right] q_{\textrm{res}}(x,\cdot ),\)
where \(q_{\textrm{res}}(x,B):=\dfrac{A(x,B)-H(w^*) \pi (B)}{1-H(w^*)}\). This representation can be used to facilitate a coupling argument to prove the geometric convergence of the Markov chain (more details in Sect. 3).
2.4 Spectrum of the transition operator
Now we provide a result to fully characterize the spectrum of the transition operator induced by the MTM-IS algorithm. A similar result was derived for the IMH algorithm by Liu (1996) for the discrete state-space case, and then by Atchadé and Perron (2007) in general. To be concrete, we introduce the following definitions.
Definition 1
Let A(x, y) be the transition function of a Markov chain with \(\pi \) as its invariant distribution. We define its transition operator \(K: \ L^2(\pi ) \rightarrow L^2(\pi )\) as
It computes the conditional mean and is called the forward operator in Liu et al. (1995).
Definition 2
Let \(K_0\) be the restriction of K onto \(L^2_0(\pi )\), the orthogonal complement of the constant function of \(L^2(\pi )\). Then the spectrum of \(K_0\) is
The essential range of a function R is
Theorem 2
Let K be the transition operator defined by the MTM-IS algorithm, and let \(K_0\) be similarly defined as in Definition 1. Then, \(\sigma (K_0)\subseteq \text {ess-ran}(R)\), where R is the rejection probability defined in (5). The equality holds if \(\forall \) \(\alpha \in \text {ess-ran}(R)\), \(\pi \{y:\ R(y)=\alpha \}=0\).
Since the proof is mostly technical, we defer it to the Appendix. From (5) and Theorem 1, it is easy to see that an upper bound of R(x) is \(1-H_k(w^*)\). This implies that there is a gap between 1 and the upper edge \(1-H(w^*)\) of the spectrum \(\sigma (K_0)\), provided that \(w^*<\infty \). For the finite discrete state-space case, \(H(w^*)=1/w^*\), and \(1-H(w^*)\) is the exact convergence rate of the chain.
3 Convergence rate and algorithmic comparison
3.1 Convergence in \(\chi ^2\)-distance
The \(\chi ^2\)-distance between two probability distributions \(\pi \) and p is defined as

\(d_\chi (\pi ,p)=\left[ \int \frac{\left( p(x)-\pi (x)\right) ^2}{\pi (x)}\,\textrm{d}x\right] ^{1/2}.\)
Let \(p_n(x)= A_n(p_0, x)\) denote the distribution of \(X_n\), the state of the Markov chain after n steps from initialization \(p_0\). It was shown in Liu et al. (1995) that \(d_\chi (\pi ,p_n) \le \Vert K_0^n \Vert _2\, d_\chi (\pi ,p_0)\), where \(\Vert \cdot \Vert _2\) is the \(L^2\) operator norm. Since the chain is reversible, \(K_0\) is self-adjoint, and it is easy to show (Liu et al. 1995) that \(\Vert K_0\Vert _2\) is the spectral radius of \(K_0\), which is equal to the maximum of \(\sigma (K_0)\) in absolute value. As shown in Theorem 2, this is bounded by \(1-H(w^*)\). Thus, \(d_\chi (\pi ,p_n) \le (1-H(w^*))^n d_\chi (\pi ,p_0)\). It also follows from the Cauchy–Schwarz inequality that
Thus, the \(L_1\) distance between \(p_n\) and the target \(\pi \), which is proportional to their total variation distance \(\Vert p_n-\pi \Vert _{TV}\), also decreases geometrically at the same rate.
3.2 Maximal total variation distance
Definition 3
Let the transition function of a Markov chain be \(A(\cdot ,\cdot )\), with the corresponding stationary distribution \(\pi (\cdot )\). The maximal total variation distance between the Markov chain’s n-step distribution and \(\pi \) is
Moreover, the quantity
is called the exact convergence rate of the Markov chain.
Since the total variation distance is equivalent to the \(L_1\) distance between two probability measures, \(\Vert p-\pi \Vert _{L^1}=2\Vert p-\pi \Vert _{TV}\), it is easy to see from the definition (10) and Eq. (11) that the rate satisfies \(r\le \rho \). In the following, we use a coupling argument to validate this upper bound. Moreover, we will also show that for the transition kernel defined by Algorithm 2, the inequality \(r\ge \rho \) also holds. We need the following lemmas to prove our results.
Lemma 1
(Coupling) (Levin and Peres 2017) Suppose \((\Psi _t,\widetilde{\Psi }_t)_{t=0}^\infty \) is a pair of Markov chains with the same transition rule satisfying: (i) if \(\Psi _i=\widetilde{\Psi }_i\) for some i, then for any \(j\ge i\), \(\Psi _j=\widetilde{\Psi }_j\); and (ii) \(\widetilde{\Psi }_0\sim \pi \). Then, for \(\tau =\min \{n:\Psi _n=\widetilde{\Psi }_n\}\), we have the bound

\(\Vert {\mathbb {P}}(\Psi _n\in \cdot )-\pi (\cdot )\Vert _{TV}\le {\mathbb {P}}(\tau >n).\)
Lemma 2
(Lower bound) (Wang 2020) Let R(x) denote the rejection probability (5) given current state x. That is,
Then, we have a lower bound
Theorem 3
Consider the MTM-IS defined in Algorithm 2 and let \(w^*<\infty \) be the essential supremum of \(w(x)=\pi (x)/p(x)\). Then, the maximal total variation distance of the algorithm to its target distribution \(\pi \) is
Thus, the exact convergence rate of the MTM-IS is \(1-H_k(w^*)\).
Proof
We will establish that upper and lower bounds of d(n) are equal in the limit.
3.2.1 Upper bound
An upper bound can be obtained by using the coupling idea of Lemma 1. Consider two Markov chains \(\{x_t\}\) and \(\{\tilde{x}_t\}\) defined by MTM-IS. Because of the decomposition (6), we can interpret the actual transition measure \(A(x,\cdot )\) as a mixture of \(\pi (\cdot )\) and \(q_{\textrm{res}}(x,\cdot )\), and define the following coupling rule for the two chains. First, we let \(x_0=x\) (for some arbitrary \(x\in {\mathcal {X}}\)) and assume that \(\tilde{x}_0\sim \pi (\cdot )\) as the initialization of these two chains. Then, suppose that the two chains are at \(x_t\) and \(\tilde{x}_t\), respectively, at time t. If \(x_t=\tilde{x}_t\), then sample \(x_{t+1}\) from \(A(x_t,\cdot )\) and set \(\widetilde{x}_{t+1}=x_{t+1}\). Thus, their future paths coalesce into one. If \(x_t\ne \tilde{x}_t\), we draw \(z\sim \text {Bernoulli}(H(w^*))\) and sample \(x\sim \pi (\cdot )\). We set \(x_{t+1}=\widetilde{x}_{t+1}=x\) if \(z=1\). Otherwise, we sample \(x_{t+1}\sim q_{\textrm{res}}(x_t,\cdot )\) and \(\widetilde{x}_{t+1}\sim q_{\textrm{res}}(\tilde{x}_t,\cdot )\), independently.
Our constructions of \(\{x_t\}\) and \(\{\tilde{x}_t\}\) have the following properties: (i) marginally these two chains both evolve according to \(A(\cdot ,\cdot )\); (ii) the distribution of \(x_t\) is exactly \(A^t(x,\cdot )\) and the distribution of \(\tilde{x}_t\) is \(\pi (\cdot )\); (iii) once \(x_t=\tilde{x}_t\) for some t, the two chains coalesce into one afterwards. Applying Lemma 1, we have
Taking the supremum over \(x\in {\mathcal {X}}\) we have \(d(n)\le [1-H(w^*)]^n\).
3.2.2 Lower bound
For the lower bound, we consider the worst case as demonstrated in the proof of Lemma 2 in Wang (2020). In particular, if we can find some \(x^*\) such that \(w(x^*)=w^*\), then the proof is over; but sometimes this is not achievable, in which case we take advantage of the continuity and monotonicity of \(H_k\). For any \(\epsilon >0\), there exists \(\delta >0\) such that \(H(w)<H(w^*)+\epsilon \) once \(w^*-\delta <w\le w^*\). By the definition of essential supremum, we can always find some \(x_\delta \in {\mathcal {X}}\) such that \(w^*-\delta <w(x_\delta )\le w^*\), thus
since we know from (5) that
Letting \(\epsilon \rightarrow 0\), we derive the final result. \(\square \)
3.3 Comparison with the IMH sampler
Since one iteration of MTM-IS is computationally as expensive as k iterations of the IMH algorithm, we are interested in knowing which one has a better convergence rate. We denote the MTM-IS algorithm with k trials as MTM-IS(k) to emphasize the role of k. Correspondingly, we denote the k-fold thinned IMH algorithm as IMH(k) (i.e., collecting one draw after every k steps of the standard IMH). Note, however, that a clear advantage of MTM-IS(k) over IMH(k) is that the former is straightforward to parallelize, as suggested in Calderhead (2014), which can considerably speed up the algorithm in practice.
Previously, we obtained the exact convergence rate of MTM-IS(k) as \(1-H_k(w^*)\). We rewrite (4) in an expectation form to gain some insights:

\(H_k(z)={\mathbb {E}}\left[ \frac{k}{z+\sum _{i=1}^{k-1}w(X_i)}\right] ,\)
where \(X_1,\ldots ,X_{k-1}\) are independent samples from \(p(\cdot )\). Setting \(k=1\), the formula reduces to \(H_1(z)=z^{-1}\), which gives rise to the exact convergence rate \(1-1/{w^*}\) of the IMH algorithm as shown in Liu (1996) and Atchadé and Perron (2007). The convergence rate of IMH(k) is then exactly \((1-1/w^*)^k\). We have the following main result, whose proof is deferred to the Appendix.
Theorem 4
With the same notations as in Theorem 3, we have

\({\mathbb {E}}\left[ \frac{k}{w^*+\sum _{i=1}^{k-1}w(X_i)}\right] \le 1-\left( 1-\frac{1}{w^*}\right) ^k\)
for any \(k\ge 1\), where all \(X_i\)’s are taken independently from \(p(\cdot )\). Thus, MTM-IS(k) is no more efficient than IMH(k) although the two algorithms are of similar computational cost.
This theorem provides the first theoretical guidance on the use of MTM methods. It implies that in this rather simple MTM-IS framework, multiple independent proposals are not helpful in improving the mixing of the algorithm. It is not surprising that IMH is preferable when the target distribution is “easy”: after all, the IMH is perfect if the proposal matches the target exactly, and having multiple trials is simply a waste. It is surprising to us, though, that such a preference holds universally.
We speculate that k independent multiple proposals in a general MTM framework are also no more efficient than the corresponding k-fold thinned MCMC algorithm. This casts doubt on the utility of naive MTM. Our numerical experience in the past suggests that the MTM strategy is most helpful in jumping among multiple modes of the target distribution (Liu et al. 2000; Dai and Liu 2020). Also, as demonstrated in the molecular simulation literature (Frenkel et al. 1996), a form of partial MTM is very useful in building part of the proposal and will be examined in more detail in Sect. 5.1. More general correlated multiple proposals may also help (Craiu and Lemieux 2007) and will be discussed in Sects. 5.2 and 5.4.
4 Numerical illustrations
We illustrate the discrepancy between the convergence rates of MTM-IS(k) and IMH(k) numerically. As expected, if the proposal p is already very close to the target \(\pi \), IMH(k) is significantly better than MTM-IS(k). The performance difference between the two algorithms becomes quite minimal if the proposal distribution differs from the target considerably, i.e., when \(w^*\) is large. In these examples, the explicit convergence rate formula for MTM-IS(k) is still complicated, so we use Monte Carlo to approximate the expectation in (15).
4.1 Univariate examples
The first two examples were previously used in Liu (1996) to compare the IMH algorithm with importance sampling and rejection sampling and are reexamined here. The third example is a continuous case with an unbounded domain.
Example 1
Let the state space be \({\mathcal {X}}=\{1,\ldots ,m\}\), with uniform proposal \(p(i)=1/m\) and target \(\pi (i)=(2m+1-2i)/m^2\). In this case, \(w^*=2-1/m\) is close to 2, leading to an approximate convergence rate of 0.5 for the IMH algorithm. Figure 1 displays the convergence rates of MTM-IS(k) and IMH(k) with \(m=1000\) and k ranging from 1 to 10, computed from 50,000 independent uniform Monte Carlo samples.
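The curves in Fig. 1 can be approximated with a few lines of Monte Carlo, assuming the expectation form \(H_k(z)={\mathbb {E}}\big [k/(z+\sum _{i=1}^{k-1}w(X_i))\big ]\) with \(X_i\sim p(\cdot )\) used in Sect. 3.3; the variable names below are ours:

```python
import numpy as np

m = 1000
i = np.arange(1, m + 1)
w_vals = (2 * m + 1 - 2 * i) / m  # w(i) = pi(i)/p(i) with p(i) = 1/m
w_star = w_vals.max()             # = 2 - 1/m

def H_k(z, k, n_mc=50_000):
    """Monte Carlo estimate of H_k(z) = E[k / (z + w(X_1)+...+w(X_{k-1}))]."""
    if k == 1:
        return 1.0 / z
    # X_i ~ p = Uniform{1,...,m}, so w(X_i) is a uniform draw from w_vals.
    sums = w_vals[np.random.randint(0, m, size=(n_mc, k - 1))].sum(axis=1)
    return float(np.mean(k / (z + sums)))

rates = {}
for k in (1, 2, 5, 10):
    rates[k] = (1 - H_k(w_star, k),      # exact rate of MTM-IS(k)
                (1 - 1 / w_star) ** k)   # exact rate of IMH(k)
```

For k=1 the two rates coincide at \(1-1/w^*\approx 0.5\); for larger k the MTM-IS(k) rate stays above the IMH(k) rate, as Theorem 4 predicts.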
Example 2
We consider the case where the target distribution is binomial \(\text {Bin}(m,\theta )\), and \(p(x)=1/(m+1)\) is uniform. Then
Using the standard normal approximation, we find that
Figure 2 is computed from 50,000 independent uniform Monte Carlo samples with \(m=100\) for two \(\theta \) values. We see that in the latter case, when the distribution is very skewed, the discrepancy between MTM-IS(k) and IMH(k) is much smaller.
Example 3
We investigate a one-dimensional continuous case with the target being \({\mathcal {N}}(0,1)\), and the proposal distribution being a scaled t-distribution with 10 degrees of freedom, \(p(x)=ct_{10}(cx)\) with \(c\ge 1\). For practical uses of both importance sampling and IMH-type algorithms, we strongly recommend choosing a proposal distribution that has a heavier tail than the target but does not differ from it too much. In our case, both t-distribution proposals satisfy the fat-tail requirement, but a larger c leads to a larger discrepancy between the target and the proposal. Figure 3 is computed based on 50,000 independent Monte Carlo samples with two choices of c, demonstrating that IMH(k) and MTM-IS(k) are nearly indistinguishable if the proposal does not align with the target well.
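A quick grid search illustrates how \(w^*\) grows with c in this example; the sketch below implements the \(t_{10}\) density from its textbook formula, and the grid bounds are our own choices:

```python
import numpy as np
from math import gamma, pi, sqrt

def t10_pdf(x):
    """Density of Student's t with 10 degrees of freedom."""
    df = 10
    c0 = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c0 * (1 + x ** 2 / df) ** (-(df + 1) / 2)

def norm_pdf(x):
    return np.exp(-0.5 * x ** 2) / sqrt(2 * pi)

# Importance ratio w(x) = N(0,1) density over the scaled-t proposal
# p(x) = c * t10(c x); w is symmetric, so a one-sided grid suffices.
xs = np.linspace(0, 30, 300_001)
w_star = {c: (norm_pdf(xs) / (c * t10_pdf(c * xs))).max() for c in (1.0, 2.0)}
```

For c=1 the proposal nearly matches the target (\(w^*\) barely exceeds 1), whereas c=2 inflates \(w^*\) by more than an order of magnitude.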
4.2 Multivariate Gaussian and Gaussian mixture
We first use multivariate Gaussian distributions as both the target and the proposal to show some practical implications of our result. Let the target be \(\pi ={\mathcal {N}}(0,{\mathbb {I}} _d)\) and the proposal be \(T={\mathcal {N}}(\vec {\mu },\sigma ^2{\mathbb {I}} _d)\). Then we find that the importance weight can be expressed as:
Therefore, \(w^*=\sup w(\vec {x})<\infty \) if either \(\sigma >1\) with an arbitrary \(\vec {\mu }\) or \(\sigma =1\) with \(\vec {\mu }=0\). When \(\sigma >1\), the maximal importance weight scales as \(w^*\sim \sigma ^d\), and thus the mixing time of IMH, \(\tau _{\textrm{IMH}}(\delta )=\Omega (w^*\log (1/\delta ))\), scales exponentially with the dimension d. In the same manner, the mixing time of MTM-IS also scales exponentially with d and becomes worse as \(\sigma \) increases. Figure 4 confirms that MTM-IS and consecutive IMH have almost the same mixing rates.
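The scaling \(w^*=\sigma ^d\) (for \(\vec {\mu }=0\) and \(\sigma >1\)) can be checked directly: the weight is \(w(\vec {x})=\sigma ^d\exp \{-\Vert \vec {x}\Vert ^2(1-1/\sigma ^2)/2\}\), maximized at the origin. A small numerical sanity check (names ours):

```python
import numpy as np

def log_w(x, sigma):
    """log importance weight for target N(0, I_d), proposal N(0, sigma^2 I_d):
    w(x) = sigma^d * exp(-0.5 * ||x||^2 * (1 - 1/sigma^2))."""
    d = len(x)
    return d * np.log(sigma) - 0.5 * (x @ x) * (1 - 1 / sigma ** 2)

# For sigma > 1 the exponent is non-positive, so w peaks at the origin and
# w* = sigma^d; the IMH rate 1 - 1/w* then approaches 1 exponentially in d.
sigma = 1.5
w_star = {d: np.exp(log_w(np.zeros(d), sigma)) for d in (1, 5, 20)}
rate_imh = {d: 1 - 1 / w for d, w in w_star.items()}
```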
Next, we consider a Gaussian mixture distribution \(\pi =\frac{1}{3}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{2}{3}{\mathcal {N}}(\vec {1},{\mathbb {I}} _d)\), where \(\vec {1}\) is a d-dimensional vector filled with all 1’s. Employing \(T={\mathcal {N}}(0,\sigma ^2{\mathbb {I}} _d)\), we have the importance weight
It is easy to see that \(w^*<\infty \) if and only if \(\sigma >1\). Figure 5 depicts theoretical convergence rates and log mixing times for varying dimension and proposal standard deviation \(\sigma \). Again the mixing times scale exponentially with dimension d. Unlike the single Gaussian case, however, Fig. 5b shows that the slope of log mixing times is not a monotone function of \(\sigma \).
Figure 6 explores the optimization with \(\sigma \). Specifically, Fig. 6a plots the convergence rates against varying \(\sigma \) when \(d=2\), showing that the optimal choice is \(\sigma \approx 1.594949\). When d grows, the optimal \(\sigma \) remains approximately in the range of 1.55–1.62. Figure 6c indicates that the mixing time still scales exponentially with d even if \(\sigma \) is optimized.
5 Variants of multiple-try Metropolis
5.1 Partial MTM-IS: an efficient variant
To reflect how MTM has actually been used in molecular simulations (Frenkel et al. 1996), we assume a partition of the state space, \(\textbf{x}=(\textbf{x}^a,\textbf{x}^b)\), and the corresponding factorization of the target distribution \(\pi (\textbf{x})\propto q(\textbf{x}^a,\textbf{x}^b)=q_a(\textbf{x}^a)q_b(\textbf{x}^b| \textbf{x}^a)\), where \(q_b\) may not be normalized. We assume that \(q_a(\textbf{x}^a)\) is much more expensive to evaluate than \(q_b(\textbf{x}^b|\textbf{x}^a)\). An important point to note is that we want to move \((\textbf{x}^a,\textbf{x}^b)\) jointly instead of iterating between conditional draws of \(\textbf{x}^a|\textbf{x}^b\) and \(\textbf{x}^b|\textbf{x}^a\) (for reasons such as the two components being tightly coupled). We consider the independent proposal \(p(\textbf{x})=p_a(\textbf{x}^a)p_b(\textbf{x}^b|\textbf{x}^a)\). A Partial MTM-IS algorithm is as follows:
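Algorithm 3 is not reproduced here; the following Python sketch encodes our reading of the description above, with a k-trial selection for the \(\textbf{x}^b\) part and \(k-1\) balancing draws around the current state. The acceptance rule written below is an assumption on our part, patterned after the weight \(W(\textbf{x}^a;\textbf{x}^b_{1:k})\) used in the proof of Proposition 1:

```python
import numpy as np

def pmtm_is_step(xa, xb, k, w_a, w_b, sample_a, sample_b):
    """One Partial MTM-IS transition (hedged sketch, not the paper's listing).

    w_a(a)      ~ q_a(a)/p_a(a)       (importance weight of the 'a' part)
    w_b(b, a)   ~ q_b(b|a)/p_b(b|a)   (conditional weight of the 'b' part)
    sample_a()  draws ya ~ p_a(.)
    sample_b(a) draws one b ~ p_b(.|a)
    """
    ya = sample_a()
    ys = [sample_b(ya) for _ in range(k)]       # k trials for the new b-part
    wy = np.array([w_b(y, ya) for y in ys])
    J = np.random.choice(k, p=wy / wy.sum())    # select one trial
    # k-1 balancing draws around the current state, plus x^b itself.
    xs = [sample_b(xa) for _ in range(k - 1)] + [xb]
    wx = np.array([w_b(x, xa) for x in xs])
    ratio = (w_a(ya) * wy.sum()) / (w_a(xa) * wx.sum())
    if np.random.rand() < min(1.0, ratio):
        return ya, ys[J]
    return xa, xb

# Toy target pi(a, b) proportional to exp(-a^2/2) exp(-(b-a)^2/2), with
# over-dispersed proposals p_a = N(0, 2^2) and p_b(.|a) = N(a, 2^2).
s = 2.0
w_a = lambda a: s * np.exp(-0.5 * a ** 2 + 0.5 * (a / s) ** 2)
w_b = lambda b, a: s * np.exp(-0.5 * (b - a) ** 2 + 0.5 * ((b - a) / s) ** 2)
sample_a = lambda: s * np.random.randn()
sample_b = lambda a: a + s * np.random.randn()

a, b = 0.0, 0.0
draws = []
for _ in range(8000):
    a, b = pmtm_is_step(a, b, 3, w_a, w_b, sample_a, sample_b)
    draws.append((a, b))
```

Under the toy target, the stationary marginals are \(a\sim {\mathcal {N}}(0,1)\) and \(b\sim {\mathcal {N}}(0,2)\).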
Remark 1
(PMTM-IS versus MTM-IS) Note that, compared with the vanilla MTM-IS (Algorithm 2), PMTM-IS needs to draw extra balancing samples. Since we assume that sampling \(\textbf{x}^b\) and evaluating it are both very cheap, it is still worth doing. In this case, there are no standard IMH or MCMC variants for comparisons.
Typically, one iteration of IMH involves evaluating \(q_a/p_a\) twice (on \(\textbf{x}^a\) and \(\textbf{y}^a\), respectively) and evaluating \(q_b/p_b\) twice (on \(\textbf{x}^b|\textbf{x}^a\) and \(\textbf{y}^b|\textbf{y}^a\), respectively). In contrast, one iteration of Algorithm 3 consists of evaluating \(q_a/p_a\) twice (on \(\textbf{x}^a\) and \(\textbf{y}^a\)) and evaluating \(q_b/p_b\) 2k times (on \(\textbf{x}^b_j|\textbf{x}^a\) and \(\textbf{y}^b_j|\textbf{y}^a\) with \(j=1,\ldots ,k\)). When evaluating \(q_a\) is significantly more expensive than \(q_b\), Algorithm 3 thus nearly matches the computational cost of one-step IMH. Under certain reasonable regularity conditions, the following proposition shows that Algorithm 3 provably converges faster.
Proposition 1
Let \(\textbf{x}=(\textbf{x}^a,\textbf{x}^b)\), and \(\pi (\textbf{x})=\pi _a(\textbf{x}^a)\pi _b(\textbf{x}^b| \textbf{x}^a)\propto q_a(\textbf{x}^a) q_b (\textbf{x}^b|\textbf{x}^a)\), where \(\pi _a\) and \(\pi _b\) are normalized marginal and conditional distributions. Under the following regularity conditions with proposal p (all parts normalized):
IMH converges with rate \(1-1/w^*\). In contrast, the partial MTM-IS (Algorithm 3) has a convergence rate no slower than \(1-1/w^*\).
Proof
Noting that \(\text {ess}\sup _{\textbf{x}}\frac{\pi (\textbf{x}^a,\textbf{x}^b)}{p(\textbf{x}^a, \textbf{x}^b)}=w^*\), we obtain the convergence rate of IMH as \(1-1/w^*\) by Theorem 3. As for Algorithm 3, we decompose the transition kernel as
Suppose the normalizing constant of \(q(\textbf{x}^a,\textbf{x}^b)\) is C, i.e., \(\pi (\textbf{x}^a,\textbf{x}^b)=q(\textbf{x}^a,\textbf{x}^b)/C\). Then,
in which \(\textbf{y}_k^b=\textbf{y}^b\) and \(\textbf{x}^b_k=\textbf{x}^b\). Therefore, it gives rise to
where \(W(\textbf{x}^a;\textbf{x}^b_{1:k})\triangleq \sum _{j=1}^k \frac{\pi _b(\textbf{x}^a,\textbf{x}^b_j)}{p_b(\textbf{x}^a,\textbf{x}^b_j)}\) for any \(\textbf{x}^a\in {\mathcal {X}}^a,\textbf{x}^b_{1:k} =(\textbf{x}^b_1,\ldots ,\textbf{x}^b_k) \in ({\mathcal {X}}^b)^k\). By definition, we find
The following inequality immediately follows:
Surprisingly, (3) leads to a mixture decomposition like (17) and thus is sufficient to construct the upper bound in Theorem 3 by the coupling argument and Lemma 1. Therefore, the convergence rate of Algorithm 3 is no larger than \(1-1/w^*\). However, the arguments for establishing matching lower bounds cannot directly apply due to the extra balancing trials \(\textbf{x}^b_j,1\le j\le k-1\). So the exact convergence rate of Algorithm 3 remains unknown. \(\square \)
5.2 Correlated multiple trials
Compared with the original MTM, the partial MTM-IS differs in that its multiple trials \((\textbf{y}^a,\textbf{y}^b_1),\ldots ,(\textbf{y}^a,\textbf{y}^b_k)\) are correlated due to the state space partitioning. As also demonstrated by Craiu and Lemieux (2007), we believe that generating correlated multiple trials is key to designing efficient MTM algorithms. Although a rigorous theoretical analysis for general correlated MTM designs is beyond our reach, we present theoretical results for two special cases on finite state spaces, which may also be generalized to continuous state spaces. The implications of the analysis apply more generally: good correlated multiple trials can be obtained with the aid of a deterministic step.
5.2.1 Stratified sampling
Suppose \({\mathcal {X}}\) is a finite state space. We partition it into a few subgroups, \({\mathcal {X}}_1, \ldots , {\mathcal {X}}_B\), so that \({\mathcal {X}}_{i}\cap {\mathcal {X}}_j=\emptyset \), \(\forall i\ne j\), and \(\cup _j {\mathcal {X}}_j ={\mathcal {X}}\). We begin with a blockwise IMH step: sample a block from \(\{{\mathcal {X}}_1, \ldots , {\mathcal {X}}_B\}\) with probability \(p({\mathcal {X}}_j)\) and accept it according to the IMH rule with weight \(w({\mathcal {X}}_j)=\pi ({\mathcal {X}}_j)/p({\mathcal {X}}_j)\). Then, we draw y within the sampled block with probability proportional to \(\pi (y)\). It is easy to see that the chain becomes stationary once it converges at the block level. Thus, the convergence rate of this algorithm is
where \({\mathcal {X}}^*=\arg \max _j w({\mathcal {X}}_j)\). This is not always better than IMH(k), which has a convergence rate of \((1-1/w^*)^k> 1- k/w^*\). But if the weights w(x) are very uneven and we can partition the states so that the block weights \(w({\mathcal {X}}_j)\) are more balanced, then the stratified IMH can improve upon IMH(k) significantly. We also note that the computational cost of this block-based MTM-IS(k) algorithm is no worse than that of IMH(2) (the first step of block sampling costs no more than a one-step IMH, and neither does the second step of sampling within a block), which is much better than IMH(k) when k is large.
Example 4
(Example 1 continued) Let \({\mathcal {X}}=\{1,\ldots , N\}\), and suppose that the target is \(\pi (x)\propto x\) and the proposal is \(p(x)\propto 1\). Then, the original weights are \(w(x) \propto x\) and \(w^*=\frac{2N}{1+N}\approx 2\). Let \(k=2\); then IMH(2) has a rate of \((1-1/w^*)^{2}\approx 0.25\), which is quite good. Assume that N is even and partition the space as \({\mathcal {X}}_j=\{j, N-j+1\}\) for \(j=1,\ldots , N/2\). Then \(w({\mathcal {X}}_j)\propto 1\), and the resulting MTM-IS(2) converges in one step. More generally, for an arbitrary distribution \(\pi (x)\) and the uniform proposal \(p(x)=N^{-1}\), we have \(w^*=N\pi (x^*)\) with \(x^*=\arg \max _x \pi (x)\). Thus, if we can partition the state space so that the \(\pi ({\mathcal {X}}_j)\) are approximately equal for \(j=1,\ldots ,B\), the algorithm can be much improved.
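The one-step convergence claimed in Example 4 can be checked directly: with the pairing \({\mathcal {X}}_j=\{j, N-j+1\}\) the block weights are exactly balanced, so the block-level IMH move always accepts and one blockwise step already produces a draw from \(\pi \). A small numerical check (with a toy value of N):

```python
import numpy as np

N = 10                                    # even, as in Example 4
states = np.arange(1, N + 1)
pi = states / states.sum()                # target pi(x) proportional to x
blocks = [(j, N - j + 1) for j in range(1, N // 2 + 1)]
p_block = np.full(N // 2, 2 / N)          # uniform proposal mass of each block

pi_block = np.array([pi[a - 1] + pi[b - 1] for a, b in blocks])
w_block = pi_block / p_block              # block-level importance weights
assert np.allclose(w_block, w_block[0])   # perfectly balanced blocks

# Since all block weights are equal, the block-level IMH move always
# accepts, and sampling within the block proportionally to pi lands on y
# with probability p(block) * pi(y) / pi(block) = pi(y): exact in one step.
one_step = np.array([p_block[j] * pi[y - 1] / pi_block[j]
                     for j, (a, b) in enumerate(blocks) for y in (a, b)])
order = np.argsort([y for a, b in blocks for y in (a, b)])
assert np.allclose(one_step[order], pi)   # stationary after a single step
```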
5.2.2 Sampling without replacement
Another obvious way of introducing correlations among multiple proposals is to sample without replacement. Let \({\mathcal {X}}=\{1,\ldots , N\}\). To simplify the discussion, we here focus on simple random sampling without replacement (SRSWOR, i.e., \(p(\textbf{y})\propto 1\)), although it is possible to extend the method to sampling without replacement with unequal probabilities using one of the schemes in Chen et al. (1994). The algorithm is as follows.
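Since the algorithm box is not reproduced here, the sketch below reconstructs one MTM-SRSWOR(k) iteration on a finite space. The acceptance rule, with the balancing set obtained by swapping the selected trial for the current state as in the no-balancing convention of Theorem 7, is our assumption; exhaustive enumeration over proposal sets then confirms detailed balance:

```python
import itertools
from math import comb

N, k = 6, 2
pi = [x / sum(range(1, N + 1)) for x in range(1, N + 1)]  # pi(x) propto x

def kernel(x, y):
    """A(x, y) for y != x: propose a k-subset S of X \\ {x} by SRSWOR,
    select y within S with probability pi(y)/pi(S) (weights are
    proportional to pi since p is uniform), and accept with
    min{1, pi(S) / (pi(x) + pi(S) - pi(y))}."""
    p_set = 1.0 / comb(N - 1, k)          # each k-subset equally likely
    total = 0.0
    for S in itertools.combinations([s for s in range(N) if s != x], k):
        if y not in S:
            continue
        W_y = sum(pi[s] for s in S)       # total weight of the trial set
        W_x = pi[x] + W_y - pi[y]         # balancing: x plus the rest
        total += p_set * (pi[y] / W_y) * min(1.0, W_y / W_x)
    return total

for x in range(N):
    for y in range(N):
        if x != y:   # detailed balance holds pairwise
            assert abs(pi[x] * kernel(x, y) - pi[y] * kernel(y, x)) < 1e-12
```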
The actual transition probability from x to \(y\ne x\) for this scheme is
where \(S^{(k-1)}_y\subset {\mathcal {X}}\backslash \{x,y\}\), \(|S^{(k-1)}_y|=k-1\), and \( y_j \in S^{(k-1)}_y, \forall j<k\). An exact eigenvalue decomposition of the matrix A would give a tight bound on the convergence rate, but A does not possess the nice low-rank structure enjoyed by the IMH sampler or the MTM-IS.
For \(S\subset {\mathcal {X}}\), we define \(\pi (S)=\sum _{x\in S}\pi (x)\), \(S^*=\arg \max _{\{S: \ |S|=k\} } \pi (S)\), and \(x^*=\arg \max _x\pi (x)\). We find the following inequality to hold:
During each iteration, the chain stays at the current state if and only if the new proposal is rejected, since in our construction of Algorithm 4 the proposal set is not allowed to contain the current state. We observe that \(\rho \equiv 1\) whenever \(x=x_*\triangleq \arg \min _x\pi (x)\), leading to \(A(x_*,x_*)=0\). This fact prevents us from using the previous coupling arguments directly. However, by specializing to particular cases, we can still obtain satisfactory results.
Example 5
Choosing \(k=2\) and \({\mathcal {X}}=\{1,\ldots ,N\}\), we set
where \(0\le p\le (N-1)/N\), which guarantees that \(x^*=1\) and \(\{2,\ldots ,N\}\subset \arg \min _x\pi (x)\). As a result, we know that \(A(2,2)=\cdots =A(N,N)=0\). Furthermore, the matrix A is completely determined by the following four quantities:
We can then write out A as follows:
Now this matrix admits a useful low-rank decoupling: \(A=G+ep^T\), where \(e=\left[ 1,\ldots ,1\right] ^T\), \(p=\left[ a_3,a_4,\ldots ,a_4\right] ^T\) and
Note that e is a common right eigenvector of both A and \(A-G\), corresponding to the largest eigenvalue 1 of A. Since \(A-G\) has rank 1, the remaining eigenvalues of A and G must coincide. Hence the eigenvalues of A are \(1,a_2-a_3,-a_4,\ldots ,-a_4\). This decoupling trick was also used in Liu (1996) for the IMH algorithm. Given the convergence rate \(\left( 1-1/(N\pi _1)\right) ^2\) of IMH(2), it suffices to show
to prove that MTM-SRSWOR(2) is faster than IMH(2). Clearly, this holds true for \(p=\dfrac{N-1}{2N}\), which leads to \(\pi _1=1/2+1/(2N)\), \(\pi _2=1/(2N)\). In this case,
We note that a suitable parallel implementation of the SRSWOR step can speed up the algorithm considerably. Furthermore, when proposing multiple trials, we may also choose not to exclude x from the proposal set; in this case, Algorithm 4 needs a slight modification, giving Algorithm 5.
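The rank-one decoupling used in Example 5 can be checked numerically. The claim is generic: if \(Ae=e\) (rows of A sum to one) and \(A-G=ep^T\) has rank one, then \(Ge=(1-p^Te)e\), and the spectra of A and G agree except that G's eigenvalue \(1-p^Te\) is replaced by 1 in A. The specific entries \(a_1,\ldots ,a_4\) of Example 5 are not reproduced here, so a generic random matrix is used instead:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
p = rng.uniform(0.01, 0.2, size=n)
e = np.ones(n)
G = rng.uniform(0.0, 0.1, size=(n, n))
G += np.diag((1 - p.sum()) - G @ e)   # force G e = (1 - p^T e) e
A = G + np.outer(e, p)                # rows of A now sum to one

assert np.allclose(A @ e, e)          # e is a right eigenvector of A
eig_A = list(np.linalg.eigvals(A))
eig_G = list(np.linalg.eigvals(G))
# drop the paired eigenvalues: 1 (of A) and 1 - p^T e (of G)
eig_A.pop(int(np.argmin(np.abs(np.array(eig_A) - 1.0))))
eig_G.pop(int(np.argmin(np.abs(np.array(eig_G) - (1 - p.sum())))))
for ev in eig_A:                      # the remaining spectra coincide
    i = int(np.argmin(np.abs(np.array(eig_G) - ev)))
    assert abs(eig_G[i] - ev) < 1e-8
    eig_G.pop(i)
```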
5.3 Independent non-identical proposals
Besides introducing correlations between multiple trials, Craiu and Lemieux (2007) also suggest using different proposals for generating the multiple trials in each MTM iteration and provide supporting empirical evidence. Here we consider a special case of MTM-IS(k) in which the multiple trials are generated from different proposals, i.e., \(y_j\sim p_j(\cdot )\) independently for \(j=1,\ldots ,k\). In this case, we do not have to draw balancing trials. Defining \(w_j(x):=\pi (x)/p_j(x)\), we summarize the procedure in Algorithm 6.
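Algorithm 6 is not reproduced above, so the sketch below assumes the natural rule: draw \(y_j\sim p_j\) independently, select index J with probability proportional to \(w_J(y_J)\), and accept with \(\rho =\min \{1,\sum _j w_j(y_j)/(w_J(x)+\sum _{j\ne J}w_j(y_j))\}\), i.e., the no-balancing convention of Theorem 7. Exhaustive enumeration on a toy discrete space confirms detailed balance under this rule:

```python
import itertools

X = range(4)
q = [1.0, 3.0, 2.0, 4.0]                 # unnormalized target
Z = sum(q)
pi = [v / Z for v in q]
p = [[0.4, 0.3, 0.2, 0.1],               # proposal p_1
     [0.1, 0.2, 0.3, 0.4]]               # proposal p_2
k = 2
w = [[pi[s] / p[j][s] for s in X] for j in range(k)]  # w_j(x) = pi(x)/p_j(x)

def kernel(x, y):
    """A(x, y) for y != x, summed over all trial configurations:
    y_j ~ p_j, select J w.p. w_J(y_J)/sum_j w_j(y_j), accept with
    rho = min{1, sum_j w_j(y_j) / (w_J(x) + sum_{j != J} w_j(y_j))}."""
    total = 0.0
    for trials in itertools.product(X, repeat=k):
        prob = 1.0
        for j, t in enumerate(trials):
            prob *= p[j][t]
        Wy = sum(w[j][trials[j]] for j in range(k))
        for J in range(k):
            if trials[J] != y:
                continue
            S = Wy - w[J][y]             # weights of the unselected trials
            total += prob * (w[J][y] / Wy) * min(1.0, Wy / (w[J][x] + S))
    return total

for x in X:
    for y in X:
        if x != y:   # detailed balance: pi(x) A(x, y) = pi(y) A(y, x)
            assert abs(pi[x] * kernel(x, y) - pi[y] * kernel(y, x)) < 1e-12
```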
To assess the effect of the multiple-try design employed in Algorithm 6, we compare it with a sequential k-step IMH sampler. In one iteration, this sampler runs an interior loop of length k, in which the j-th step draws an independent proposal from \(p_j\) and then accepts/rejects it by the MH rule, as in the ordinary IMH sampler. This sequential IMH sampler has the same computational cost as Algorithm 6. The following theorem provides tight upper bounds on the convergence rates of the two algorithms; its proof is deferred to the Appendix.
Theorem 5
Suppose the target \(\pi \) is absolutely continuous with respect to every proposal \(p_j\). Then Algorithm 6 and its corresponding sequential IMH sampler are geometrically convergent, with their respective convergence rates upper bounded by \(1-\sum _{j=1}^{k}{\mathbb {E}}_{p}\left[ \frac{1}{w_j^*+\sum _{1\le i\le k,i\ne j}w_i(X_i)}\right] \) and \(\prod _{j=1}^k \left( 1-\frac{1}{w_j^*}\right) \), respectively, where \(w_j^*:=\sup _{x\in {\mathcal {X}}}w_j(x)\). Furthermore, the following inequality holds,
implying that the upper bound for Algorithm 6 is worse than that for the corresponding sequential IMH.
Remark 2
(Tightness of the lower bounds) Suppose \(\exists \ x^*\) such that
i.e., different proposals have their importance weight functions \(w_j\) to attain their respective supremums at a same point \(x^*\). Then, the convergence rates for both aforementioned algorithms attain their respective upper bounds. When \(p_1=\cdots =p_k\), condition (24) automatically holds, recovering the convergence rate result of Theorem 3. However, when there is no such a \(x^*\) as required by (24), the quantities claimed in Theorem 5 are only upper bounds. It remains unknown under what other conditions one algorithm can be provably better than the other. Our empirical study shows that their computational efficiencies are almost indistinguishable when the target distribution is “hard” relative to the proposals.
Example 6
We conducted a few simulations to examine convergence behaviors of Algorithm 6 and the corresponding sequential IMH sampler at the same computational cost.
As shown in Fig. 7, we considered target densities that are mixtures of two standard distributions in various dimensions. The top plots in Fig. 7 correspond to Gaussian-mixture targets, \(\pi =\frac{1}{2}{\mathcal {N}}(0,{\mathbb {I}} _d)+\frac{1}{2}{\mathcal {N}}(\textbf{3},{\mathbb {I}} _d)\), with \(d=3\), 4, and 5, respectively. Two different proposal distributions are employed: \(p_1={\mathcal {N}}(0,{\mathbb {I}} _d)\) and \(p_2={\mathcal {N}}(0,9{\mathbb {I}} _d)\). In one iteration of the MTM-IS(k) algorithm, k/2 trials are drawn independently from \(p_1\) and another k/2 trials from \(p_2\). The bottom plots correspond to t-mixture targets, \(\pi =\frac{1}{2}t_3(0)+\frac{1}{2}t_3(\textbf{4})\), for \(d=1\), 2, and 3, with proposals \(p_1=t_3(0)\) and \(p_2=t_5(0)\) and the same MTM-IS(k) implementation as above. These plots show that Algorithm 6 and its corresponding sequential IMH sampler differ very little in their convergence rates, although theoretically we cannot claim that one is necessarily better than the other without condition (24). All simulations are based on \(10^6\) iterations on an Apple M2 chip with 16 GB memory, each taking a few minutes.
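A scaled-down, one-dimensional analogue of the t-mixture experiment can be sketched as follows. The concrete acceptance rule for Algorithm 6 again follows the no-balancing convention of Theorem 7 and is our assumption, not necessarily the exact implementation behind Fig. 7; both samplers get the same proposal budget:

```python
import math
import numpy as np

def t_pdf(x, df, loc=0.0):
    """Density of a Student-t distribution with df degrees of freedom."""
    z = x - loc
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + z * z / df) ** (-(df + 1) / 2)

def target(x):               # 1-d analogue: 0.5 t3(0) + 0.5 t3(4), mean 2
    return 0.5 * t_pdf(x, 3) + 0.5 * t_pdf(x, 3, 4.0)

def w(j, x):                 # w_j = pi / p_j with p_1 = t3(0), p_2 = t5(0)
    return target(x) / t_pdf(x, 3 if j == 0 else 5)

rng = np.random.default_rng(1)
def draw(j):
    return rng.standard_t(3 if j == 0 else 5)

n = 30_000
# MTM-IS(2) with two different proposals (no-balancing acceptance rule)
x = 0.0
mtm = np.empty(n)
for i in range(n):
    ys = [draw(0), draw(1)]
    ws = [w(0, ys[0]), w(1, ys[1])]
    J = 0 if rng.random() * (ws[0] + ws[1]) < ws[0] else 1
    if rng.random() < min(1.0, (ws[0] + ws[1]) / (w(J, x) + ws[1 - J])):
        x = ys[J]
    mtm[i] = x
# sequential IMH: two vanilla IMH steps (one per proposal) per iteration
z = 0.0
seq = np.empty(n)
for i in range(n):
    for j in (0, 1):
        y = draw(j)
        if rng.random() < min(1.0, w(j, y) / w(j, z)):
            z = y
    seq[i] = z
```

With this budget, both chains estimate the target mean (which is 2) with comparable accuracy, in line with the near-indistinguishable curves in Fig. 7.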
5.4 A general framework
Inspired by the variants of MTM just discussed, we propose a general framework combining these variants in Algorithm 7. With \(\pi (\cdot )\) as the target distribution on \({\mathcal {X}}\), we let \(p(x,\textbf{y})\) denote the proposal transition function for multiple correlated proposals, where \(x\in {\mathcal {X}}\) and \(\textbf{y}=(y_1,\ldots ,y_k)\in {\mathcal {X}}^k\). We further write the j-th marginal of \(p(x,\textbf{y})\) as \(p_j(x,y_j)=\int p(x,\textbf{y})\textrm{d}\textbf{y}_{(-j)}\), and define the j-th weight function as \(w_j(x,y)\triangleq \pi (x)p_j(x,y)\lambda _j(x,y)\)
for \(j=1,\ldots ,k\), where \(\lambda _j\) is a symmetric function. Assuming the current state is x, the updating rule is summarized in Algorithm 7.
If we require \(p(x,\textbf{y})\) to have the same marginal for every \(y_j\), the algorithm reduces to that of Craiu and Lemieux (2007); if we require the \(y_j\)'s to be independent under \(p(x,\textbf{y})\), it reduces to that of Casarin et al. (2013). Note that the balancing proposals are drawn to facilitate the computation of \(\rho \) and to guarantee the detailed balance of the MTM design. The following result is as expected; its detailed proof is deferred to the Appendix.
Theorem 6
The generalized MTM transition rule (Algorithm 7) satisfies the detailed balance condition and hence induces a reversible Markov chain with \(\pi \) as its invariant distribution.
Defining \(\textbf{x}^*(j)\triangleq (x^*_1,\ldots ,x^*_{j-1}, x,x^*_{j+1},\ldots ,x^*_k)\), one can determine the transition density of the generalized MTM framework in the same spirit as the proof of Theorem 1:
where we write
for any \(\textbf{x}=(x_1,\ldots ,x_k)\) and \(\textbf{y}=(y_1,\ldots ,y_k)\). A detailed derivation of this formula can be found in the proof of Theorem 6.
As demonstrated in Algorithms 4, 5 and 6, we sometimes do not need to draw balancing trials for MTM to retain detailed balance. A natural question then arises: can we find a general condition under which MTM can avoid drawing balancing trials? The following theorem provides a sufficient condition that covers all the cases we discussed.
Theorem 7
If, for any pair (x, y) and \(\forall j\), the joint proposal distribution satisfies
we can maintain the detailed balance by setting \(x^*_j\triangleq y_j\) for \(j\ne J\) in Algorithm 7.
Remark 3
(Correlated multiple trials) As demonstrated in Sects. 5.1 and 5.2, letting the proposed multiple trials be correlated (especially negatively) can help improve the chain's convergence. A useful strategy is to use multiple trials as stepping stones to move from one mode of the distribution to another, similar in spirit to Hamiltonian/hybrid Monte Carlo (Qin and Liu 2001; Liu 2008) and the griddy-Gibbs MTM (Liu et al. 2000). Indeed, Qin and Liu (2001) showed empirically that applying MTM to HMC trajectories may further improve sampling efficiency. However, an in-depth theoretical analysis like the one carried out here is much more challenging due to the semi-deterministic nature of these algorithms.
Remark 4
(Employing multiple distributions in MTM) Intuitively, one may hope that using a different proposal for each trial could help explore the state space better. Our results in Sect. 5.3, however, demonstrate that this is still not very useful under the IMH framework when the multiple trials are independent. It may be helpful under the partial MTM framework discussed in Sect. 5.1.
6 Concluding remarks
We have presented a complete eigen-decomposition and convergence rate analysis for the MTM-IS, and compared it with the “thinned” IMH sampler of the same computational cost. With the exact form of the eigenvalues of the MTM-IS, we proved rigorously that the sampler is not as efficient as the simpler “thinned” IMH approach. To the best of our knowledge, this is the first exact rate result for an MTM-type algorithm, although its implication is less than encouraging. The good news is that, in the more realistic setting of MTM applications explained in Sect. 5.1, we can show that MTM improves upon the standard IMH and has no comparably simple competitor.
In a quest for advantages that MTM may offer, we considered a slightly modified framework that encompasses several variants of MTM published in the literature. We found that, even under the IMH framework, it is possible to construct an MTM algorithm that gains efficiency by using stratified sampling, partial sampling, or sampling without replacement. A key to such efficiency gains is to allow the multiple trials to be either more dispersed than independent ones (Sect. 5.2) or applied only to certain “low-cost” parts of the target (Sect. 5.1). Detailed theoretical understanding and guiding principles, however, are still lacking and await further endeavors.
References
Atchadé, Y.F., Perron, F.: On the geometric ergodicity of Metropolis–Hastings algorithms. Statistics 41(1), 77–84 (2007)
Bédard, M., Douc, R., Moulines, E.: Scaling analysis of multiple-try MCMC methods. Stoch. Process. Appl. 122(3), 758–786 (2012)
Brooks, S., Gelman, A., Jones, G., Meng, X.-L.: Handbook of Markov Chain Monte Carlo. CRC Press (2011)
Calderhead, B.: A general construction for parallelizing Metropolis–Hastings algorithms. Proc. Natl. Acad. Sci. 111(49), 17408–17413 (2014)
Casarin, R., Craiu, R., Leisen, F.: Interacting multiple try algorithms with different proposal distributions. Stat. Comput. 23(2), 185–200 (2013)
Chen, X.-H., Dempster, A.P., Liu, J.S.: Weighted finite population sampling to maximize entropy. Biometrika 81(3), 457–469 (1994)
Craiu, R.V., Lemieux, C.: Acceleration of the multiple-try Metropolis algorithm using antithetic and stratified sampling. Stat. Comput. 17(2), 109–120 (2007)
Dai, C., Liu, J.S.: Monte Carlo approximation of Bayes factors via mixing with surrogate distributions. J. Am. Stat. Assoc. 117, 765 (2020)
Diaconis, P., Khare, K., Saloff-Coste, L.: Gibbs sampling, exponential families and orthogonal polynomials. Stat. Sci. 23(2), 151–178 (2008)
Diaconis, P., Saloff-Coste, L.: What do we know about the Metropolis algorithm? J. Comput. Syst. Sci. 57(1), 20–36 (1998)
Frenkel, D., Smit, B., Ratner, M.A.: Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, San Diego (1996)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
Levin, D.A., Peres, Y.: Markov Chains and Mixing Times, vol. 107. American Mathematical Society (2017)
Liu, J.S.: Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6(2), 113–119 (1996)
Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer (2008)
Liu, J.S., Liang, F., Wong, W.H.: The multiple-try method and local optimization in Metropolis sampling. J. Am. Stat. Assoc. 95(449), 121–134 (2000)
Liu, J.S., Wong, W.H., Kong, A.: Covariance structure and convergence rate of the Gibbs sampler with various scans. J. R. Stat. Soc. Ser. B 57(1), 157–169 (1995)
Martino, L.: A review of multiple try MCMC algorithms for signal processing. Digit. Signal Process. 75, 134–152 (2018)
Martino, L., Leisen, F., Corander, J.: On multiple try schemes and the particle Metropolis–Hastings algorithm. arXiv preprint arXiv:1409.0051 (2014)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Neal, R.M.: MCMC using ensembles of states for problems with fast and slow variables such as Gaussian process regression. arXiv preprint arXiv:1101.0387 (2011)
Pandolfi, S., Bartolucci, F., Friel, N.: A generalization of the multiple-try Metropolis algorithm for Bayesian estimation and model selection. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 581–588. JMLR Workshop and Conference Proceedings (2010)
Qin, Z.S., Liu, J.S.: Multipoint Metropolis method with application to hybrid Monte Carlo. J. Comput. Phys. 172(2), 827–840 (2001)
Roberts, G.O., Tweedie, R.L.: Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83(1), 95–110 (1996)
Tierney, L.: Markov chains for exploring posterior distributions. Ann. Stat. 22, 1701–1728 (1994)
Wang, G.: Exact convergence analysis of the independent Metropolis–Hastings algorithms. arXiv preprint arXiv:2008.02455 (2020)
Yang, S., Chen, Y., Bernton, E., Liu, J.S.: On parallelizable Markov chain Monte Carlo algorithms with waste-recycling. Stat. Comput. 28(5), 1073–1081 (2018)
Acknowledgements
We thank the National Science Foundation of the United States (DMS-1903139 and DMS-2015411) for partially supporting this research. Part of this work was done when Yang was a student in the School of the Gifted Young, University of Science and Technology of China.
Ethics declarations
Conflict of interest
The authors have no competing interests that are directly or indirectly related to the work submitted for publication.
Appendix A: Detailed proofs
Proof of Theorem 2
Before proving the theorem, we first introduce some additional notation. Let \(A(\cdot ,\cdot )\) denote the Markov transition kernel implied by our algorithm. The operator K associated with the resulting Markov chain is defined as follows: for any measurable function f defined on \({\mathcal {X}}\), operator K maps f to another function on \({\mathcal {X}}\):
We require that \(f\in L^2(\pi )\). It is easy to see that \(Kf\in L^2(\pi )\) as well, so K defines a bounded linear operator on the Hilbert space \(L^2(\pi )\) with operator norm 1. For any set \(S\subset {\mathcal {X}}\), we also write \(\chi _S:{\mathcal {X}}\rightarrow \{0,1\}\) for the indicator function of S, which equals 1 if and only if \(x\in S\). Intuitively, K is just a conditional expectation operator. Note that the constant function 1 is automatically an eigenfunction with eigenvalue 1. We are interested in the spectral gap, i.e., the difference between 1 and the second largest eigenvalue. We thus focus on the restricted operator \(K_0\) defined on the orthogonal complement of the constant function:
Given Theorem 1, we divide the operator \(K_0\) into two parts: \(\forall f\in L^2_0(\pi )\),
Before presenting the formal proof, we remark that this decomposition is of the same nature as that in Section 2.1 of Liu (1996): the multiplication operator \(M_R\) is a low-rank component, and the integral-like operator U resembles the upper-triangular matrix in the discrete case. The proof is analogous to that in Atchadé and Perron (2007) and is divided into the following steps.
Step 1. We first show that operator U is compact. Under the following condition,
operator U is Hilbert-Schmidt, and therefore compact. Hence, by Weyl’s perturbation theorem, we have
Step 2. Given this, combined with the decomposition
we know that it suffices to prove that \(\sigma _d(K_0)\subset \text {ess-ran}(R)\), i.e., all eigenvalues of \(K_0\) lie in the essential range of R. To argue by contradiction, we assume that there exist \(f_0\in L^2_0(\pi )\) and \(\lambda \notin \text {ess-ran}(R)\) such that \(K_0f_0=\lambda f_0\).
Direct computations yield that for any \(f\in L^2_0(\pi )\)
Since we assume that \(\lambda \notin \text {ess-ran}(R)\), we have \(\kappa =\text {ess}\inf \left( \mid R(x)-\lambda \mid \right) >0\). We can rearrange equation \(K_0f_0=\lambda f_0\) to arrive at
which can be rewritten as \(Nf_0=f_0\) with N an operator well defined on \(L^2(\pi )\) (rather than on \(L^2_0(\pi )\), where \(K_0\) is defined). We then aim to derive a contradiction concerning the spectral radius \(\text {radii}(G)\triangleq \sup \{\mid \lambda \mid :\lambda \in \sigma (G)\}\) of a suitable linear operator G on \(L^2(\pi )\) induced by N.
Step 3. Since \(f_0\) is not identically vanishing, we can find \(u<w^*\) so that \(f_0\) is not null on \(\{x\in {\mathcal {X}}:u<w(x)\le w^*\}\). For any partition \(I_n=(u=u_n\le u_{n-1}\le \ldots \le u_0=w^*)\), we denote \(D_i=\{x\in {\mathcal {X}}:u_i<w(x)\le u_{i-1}\}\) and \(L^2_i(\pi )=\{h\in L^2_0(\pi ):h(x)=0,\forall x\notin D_i\}\) for \(i=1,\ldots ,n\). Then \(L^2_i(\pi )\) is a closed subspace of \(L^2_0(\pi )\), thus a Hilbert space. Moreover, we introduce \(M_{D_i}\) as the restriction operator onto \(D_i\) on \(L^2(\pi )\), by letting \(M_{D_i}g(x)=\chi _{D_i}(x)g(x)\) for any \(g\in L^2(\pi )\).
We know that
where the second inequality follows from the fact that \(y\notin D_1\) and \(w(y)\ge w(x)\) would together imply that \(x\notin D_1\). From \(Nf_0=f_0\) and \(M^2_{D_1}=M_{D_1}\), we then have \(f_{0,D_1}\triangleq M_{D_1}f_0=M_{D_1}Nf_0=M_{D_1}NM_{D_1} f_{0}=M_{D_1}NM_{D_1}f_{0,D_1}\).
In the same manner, we have
where \(f_{0,D_i}\triangleq M_{D_i}f_0\) and
Rearranging these formulae, we know that
We claim that (A3) implies \(\text {radii}(M_{D_i}NM_{D_i})\ge 1\) for at least one index \(i\in \{1,\ldots ,n\}\). Suppose, to the contrary, that this fails for every i. Then \(M_{D_1}NM_{D_1}f_{0,D_1}=f_{0,D_1}\) implies \(f_{0,D_1}=0\) (since 1 cannot be an eigenvalue of \(M_{D_1}NM_{D_1}\)). Consequently, \(h_2=0\) follows from its definition (A2), and \(M_{D_2}NM_{D_2}f_{0,D_2}=f_{0,D_2}\) implies \(f_{0,D_2}=0\). Carrying this argument out recursively up to n shows that \(f_0\) vanishes on \(\{x\in {\mathcal {X}}:u<w(x)\le \bar{w}\}\), a contradiction.
Step 4. Finally, we show that for sufficiently small increments, we can make
First, the mapping
is continuous, at least on \([u,\bar{w}]\).
Second, \(\forall g\in L^2_i(\pi )\) with \(\Vert g\Vert =1\), by the Cauchy-Schwarz inequality we have
where \(\text {osc}_{[u_i,u_{i-1}]}H\triangleq \max _{[u_i,u_{i-1}]}H-\min _{[u_i,u_{i-1}]}H\) denotes the oscillation of H within \([u_i,u_{i-1}]\). Therefore, \(\Vert M_{D_i}NM_{D_i}\Vert \le \text {osc}_{[u_i,u_{i-1}]}H/\kappa \). Finally, choosing the partition fine enough yields \(\text {radii}(M_{D_i}NM_{D_i})<1\) for all i, the desired contradiction. This establishes \(\sigma _d(K_0)\subset \text {ess-ran}(R)\) and ends the proof. \(\square \)
Proof of Theorem 4
In this proof, every random variable X is drawn independently from p. The inequality is proved by induction. First, for \(k=1\), the inequality reduces to an equality by the previous results of Liu (1996) and Atchadé and Perron (2007). For \(k=2\), we see that
For \(k\ge 3\), we will prove the following recursive inequality, which leads to the conclusion of the theorem:
We prove by simply computing the difference between the two sides:
We note that (i) can be modified as
For (ii), we have
In conclusion, we have
Consequently, suppose the inequality (15) holds for \(k-1\), i.e.,
from (A6) it immediately follows
By induction, the final result (15) holds for arbitrary \(k\ge 1\). \(\square \)
Proof of Theorem 5
Part 1 derives the convergence rate of Algorithm 6. Part 2 derives the convergence rate of the corresponding sequential IMH sampler. Part 3 finishes by deriving the inequality (23) via induction.
Part 1. By straightforward computation, the transition probability of Algorithm 6 has the following form \((x\ne y)\):
Plugging \(\max \{w_j(y),w_j(x)\}\le w_j^*\) into this formula, we get
where \(X_i\) is drawn independently from \(p_i(\cdot )\). This inequality is sufficient to derive a decomposition of \(A(x,\cdot )\) as in (6). As in the proof of Theorem 3, we upper bound the convergence rate by \(1-\sum _{j=1}^{k}{\mathbb {E}} \left[ \frac{1}{w_j^*+\sum _{i=1,i\ne j}^{k}w_i(X_i)}\right] \) via the coupling argument and Lemma 1.
Specifically, when there exists \(x^*\) such that \(w_j(x^*)=w_j^*\) for all \(j=1,\ldots ,k\), we find for any \(y\ne x^*\),
Consequently, the rejection probability at \(x^*\) is
Then we lower bound the convergence rate via Lemma 2.
Part 2. We turn to the corresponding sequential IMH sampler. For simplicity, we use the \(L^2\) operator framework introduced in Sect. 2 to derive upper bounds. Within one iteration, the sampler runs an interior loop of length k, each step being a vanilla IMH step with proposal \(p_i\). The transition probability of a vanilla IMH step is
Denote \(K^{(i)}\) as the operator defined in \(L^2(\pi )\) by \(K^{(i)}f(x)=\int f(y)A^{(i)}(x,y)\textrm{d}y\), and denote \(K_0^{(i)}\) as the restriction of \(K^{(i)}\) onto \(L^2_0(\pi )\), the orthogonal complement of the constant function of \(L^2(\pi )\). Theorem 2 implies \(\Vert K_0^{(i)}\Vert \le 1-1/w_i^*\). Denote the whole transition probability of one iteration as \(\bar{A}\) and associated operators as \(\bar{K}\) and \(\bar{K}_0\). Consequently,
Let \(p_n(x)= \bar{A}_n(p_0, x)\) denote the distribution of the Markov chain after n steps from initialization \(p_0\). Liu et al. (1995) establish
Furthermore, we obtain an upper bound on the convergence rate defined in (13): \(r\le \Vert \bar{K}_0\Vert \le \prod _{i=1}^{k}(1-1/w_i^*)\).
For a matching lower bound, we consider the special point \(x^*\in {\mathcal {X}}\) such that for all i,
Going through the full interior loop within one iteration, the whole rejection probability is at least
By Lemma 2, a matching lower bound is thus obtained.
Part 3. We then establish (23). For \(k=2\),
For larger \(k>2\), we have, for an arbitrary fixed \(l\in \{1,\ldots ,k\}\),
where we denote \(B_{jl}=\sum _{i=1,i\ne j,i\ne l}^k w_i(X_i)\) for simplicity. The last inequality is mainly due to
applied in the denominators of the two positive terms. The last step of induction is the same as the proof of Theorem 4. Suppose the result holds for \(k-1\), i.e.,
it immediately follows from (A7) with \(l=k\) that
The proofs of Theorem 4 and Theorem 5 are essentially the same, both utilizing induction to recursively handle a general integer k. \(\square \)
Proof of Theorem 6
To make our notation explicit, we assume that every distribution mentioned here has a density with respect to the Lebesgue measure. Denoting by A(x, y) the actual transition density, we compute directly that
where we write \(\textbf{x}^*_{(-j)}=(x_1^*,x_2^*,\ldots , x_{j-1}^*,x_{j+1}^*,\ldots ,x_k^*)\in {\mathcal {X}}^{k-1}\) and \(\textbf{y}(j)=(y_1,\ldots ,y_{j-1},y,y_{j+1},\ldots ,y_k) \in {\mathcal {X}}^k\). Plugging in the definition of \(\rho \), we use the notations \(\textbf{x}^*(j)\triangleq (x^*_1,\ldots ,x^*_{j-1}, x,x^*_{j+1},\ldots ,x^*_k)\) and \(u_j(\textbf{x},\textbf{y})\triangleq \min \left\{ \frac{1}{\sum _{i=1}^k w_i(y_i,x_j)}, \frac{1}{\sum _{i=1}^kw_i(x_i,y_j)}\right\} \) to get
In the above formula, we use the identity
Finally, note that \(\pi (x)w_j(y,x)p_j(x,y)=\pi (x)\pi (y) p_j(x,y) p_j(y,x) \lambda _j(x,y)\) is symmetric in x and y by our construction, which implies that \(\pi (x)A(x,y)\) is symmetric in x and y, proving the detailed balance condition. \(\square \)
Proof of Theorem 7
If we simply set \(x^*_j:=y_j\) for any \(j\ne J\) in Algorithm 7, the conditional probability becomes
Since \(\pi (x)w_j(y,x)p_j(x,y)\) is symmetric in x and y, the theorem follows easily from condition (16) in the main text. \(\square \)
Yang, X., Liu, J.S. Convergence rate of multiple-try Metropolis independent sampler. Stat Comput 33, 79 (2023). https://doi.org/10.1007/s11222-023-10241-3