# Weight-preserving simulated tempering

## Abstract

Simulated tempering is a popular method of allowing MCMC algorithms to move between modes of a multimodal target density $$\pi$$. One problem with simulated tempering for multimodal targets is that the weights of the various modes change for different inverse-temperature values, sometimes dramatically so. In this paper, we provide a fix to overcome this problem, by adjusting the mode weights to be preserved (i.e. constant) over different inverse-temperature settings. We then apply simulated tempering algorithms to multimodal targets using our mode weight correction. We present simulations in which our weight-preserving algorithm mixes between modes much more successfully than traditional tempering algorithms. We also prove a diffusion limit for a version of our algorithm, which shows that under appropriate assumptions, our algorithm mixes in time $$O(d [\log d]^2)$$.

## Keywords

Simulated tempering, Parallel tempering, MCMC, Multimodality and Monte Carlo

## 1 Introduction

Consider the problem of drawing samples from a target distribution, $$\pi (x)$$ on a d-dimensional state space $$\mathcal {X}$$ where $$\pi (\cdot )$$ is only known up to a scaling constant. A popular approach is to use Markov chain Monte Carlo (MCMC) which uses a Markov chain that is designed in such a way that the invariant distribution of the chain is $$\pi (\cdot )$$.

However, if $$\pi (\cdot )$$ exhibits multimodality, then the majority of MCMC algorithms which use tuned localised proposal mechanisms, e.g. Roberts et al. (1997) and Roberts and Rosenthal (2001), fail to explore the state space, which leads to biased samples. Two approaches to overcome this multimodality issue are the simulated and parallel tempering algorithms. These methods augment the state space with auxiliary target distributions that enable the chain to rapidly traverse the entire state space.

The major problem with these auxiliary targets is that in general they do not preserve regional mass; see Woodard et al. (2009a, b) and Bhatnagar and Randall (2016). This problem can result in the required run time of the simulated and parallel tempering algorithms growing exponentially with the dimensionality of the problem.

In this paper, we provide a fix to overcome this problem, by adjusting the mode weights to be preserved (i.e. constant) over different inverse temperatures. We apply our mode weight correction to produce new simulated and parallel tempering algorithms for multimodal target distributions. We show that assuming the chain mixes at the hottest temperature, our mode-preserving algorithm will mix well for the original target as well.

This paper is organised as follows. Section 2 reviews the simulated and parallel tempering algorithms and the existing literature for their optimal set-up. Section 3 describes the problems with modal weight preservation that are inherent with the traditional approaches to tempering, and introduces a prototype solution called the HAT algorithm that is similar to the parallel tempering algorithm but uses novel auxiliary targets. Section 4 presents some simulation studies of the new algorithms. Section 5 provides a theoretical analysis of a diffusion limit and the resulting computational complexity of the HAT algorithm in high dimensions. Section 6 concludes and provides a discussion of further work.

## 2 Tempering algorithms

There is an array of methodology available to overcome the issues of multimodality in MCMC, the majority of which use state space augmentation, e.g. Wang and Swendsen (1990), Geyer (1991), Marinari and Parisi (1992), Neal (1996), Kou et al. (2006) and Nemeth et al. (2017). Auxiliary distributions that allow a Markov chain to explore the entirety of the state space are targeted, and their mixing information is then passed on to aid intermodal mixing in the desired target. A convenient approach for augmentation methods, such as the popular simulated tempering (ST) and parallel tempering (PT) algorithms introduced in Geyer (1991) and Marinari and Parisi (1992), is to use power-tempered target distributions, for which the target distribution at inverse temperature level $$\beta$$ is defined as
\begin{aligned}\pi _\beta (x)\propto \left[ \pi (x)\right] ^\beta \end{aligned}
for $$\beta \in (0,1]$$. For each algorithm, one needs to choose a sequence of $$n+1$$ “inverse temperatures” such that $$\varDelta :=\{\beta _0,\ldots ,\beta _n\}$$ where $$0 \le \beta _n<\beta _{n-1}<\cdots<\beta _1<\beta _0=1$$, so that $$\pi _{\beta _0}$$ equals the original target density $$\pi$$, and hopefully the hottest distribution $$\pi _{\beta _n}(x)$$ is sufficiently flat that it can be easily sampled.
The ST algorithm augments the original state space with a single-dimensional component indicating the current inverse-temperature level creating a $$(d+1)$$-dimensional chain, $$(\beta ,X)$$, defined on the extended state space $$\{\beta _0,\ldots ,\beta _n\} \times \mathcal {X}$$ that targets
\begin{aligned} \pi (\beta ,x)\propto K(\beta )\pi (x)^{\beta } \end{aligned}
(1)
where ideally $$K(\beta )=\left[ \int _{x}\pi (x)^{\beta }\text{ d }x\right] ^{-1}$$, resulting in a uniform marginal distribution over the temperature component of the chain. Techniques to learn $$K(\beta )$$ exist in the literature, e.g. Wang and Landau (2001) and Atchadé and Liu (2004), but these techniques can be misleading unless used with care. The ST algorithm procedure is given in Algorithm 1.
The PT approach is designed to overcome the issues arising due to the typically unknown marginal normalisation constants. The PT algorithm runs a Markov chain on an augmented state space $$\mathcal {X}^{(n+1)}$$ with target distribution defined as
\begin{aligned} \pi _n(x_0,x_1,\ldots ,x_n) \propto \pi _{\beta _0}(x_0)\pi _{\beta _1}(x_1)\ldots \pi _{\beta _n}(x_n). \end{aligned}
The PT algorithm procedure is given in Algorithm 2.
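As a rough illustration of the temperature moves in the two schemes, the following Python sketch shows a simulated tempering level move for the joint target in (1) and a parallel tempering swap move. It is a sketch of the standard Metropolis-type moves rather than a transcription of Algorithms 1 and 2; the function names, the symmetric choice of an adjacent level, and the user-supplied `log_pi` and `log_K` are all assumptions made for illustration.

```python
import math
import random


def st_temperature_move(x, k, betas, log_pi, log_K, rng=random):
    """Propose moving from level k to an adjacent level; return the new level.

    Targets the joint density K(beta) * pi(x)^beta. log_K[k] is an assumed,
    user-supplied estimate of log K(betas[k]).
    """
    k_new = k + rng.choice((-1, 1))  # symmetric +/-1 proposal
    if not 0 <= k_new < len(betas):
        return k  # proposal falls off the temperature ladder: reject
    log_ratio = (log_K[k_new] - log_K[k]) + (betas[k_new] - betas[k]) * log_pi(x)
    accept_prob = math.exp(min(0.0, log_ratio))
    return k_new if rng.random() < accept_prob else k


def pt_swap_log_ratio(x_i, x_j, beta_i, beta_j, log_pi):
    """Log Metropolis ratio for swapping the states held at beta_i and beta_j."""
    return (beta_i - beta_j) * (log_pi(x_j) - log_pi(x_i))


def pt_swap_move(states, i, j, betas, log_pi, rng=random):
    """Attempt to swap states[i] and states[j] in place; return True if accepted."""
    log_ratio = pt_swap_log_ratio(states[i], states[j], betas[i], betas[j], log_pi)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        states[i], states[j] = states[j], states[i]
        return True
    return False
```

Note that the swap ratio involves only evaluations of $$\pi$$, so the normalising constants $$K(\beta )$$ needed by the ST move never appear in the PT move.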

### 2.1 Optimal scaling for the ST and PT algorithms

Atchadé et al. (2011) and Roberts and Rosenthal (2014) investigated the problem of selecting optimal inverse-temperature spacings for the ST and PT algorithms. Specifically, if a move between two consecutive temperature levels, $$\beta$$ and $$\beta '=\beta +\epsilon$$, is to be proposed, then what is the optimal choice of $$\epsilon$$? Too large, and the move will probably be rejected; too small, and the move will accomplish little (similar to the situation for the Metropolis algorithm, cf. Roberts et al. 1997 and Roberts and Rosenthal 2001).

For ease of analysis, Atchadé et al. (2011) and Roberts and Rosenthal (2014) restrict to d-dimensional target distributions of the i.i.d. form:
\begin{aligned} \pi (x) \propto \prod _{i=1}^d f(x_i). \end{aligned}
(4)
They assume that the process mixes immediately (i.e. infinitely quickly) within each temperature, to allow them to concentrate solely on the mixing of the inverse-temperature process itself. To achieve non-degeneracy of the limiting behaviour of the inverse-temperature process as $$d \rightarrow \infty$$, the spacings are scaled as $$O(d^{-1/2})$$, i.e. $$\epsilon = \ell /d^{1/2}$$ where $$\ell = \ell (\beta )$$ is a positive value to be chosen optimally.
Under these assumptions, Atchadé et al. (2011) and Roberts and Rosenthal (2014) prove that the inverse-temperature processes of the ST and PT algorithms converge, when speeded up by a factor of d, to a specific diffusion limit, independent of dimension, which thus mixes in time O(1), implying that the original ST and PT algorithms mix in time O(d) as $$d\rightarrow \infty$$. They also prove that the mixing times of the ST and PT algorithms are optimised when the value of $$\ell$$ is chosen to maximise the quantity
\begin{aligned}\ell ^2 \times 2 \varPhi \left( -\ell \frac{\sqrt{I(\beta )}}{2} \right) \end{aligned}
where $$I(\beta )=\text {Var}_{\pi ^\beta }\big ( \log f(x) \big )$$. Furthermore, this optimal choice of $$\ell$$ corresponds to an acceptance rate of inverse-temperature moves of 0.234 (to three decimal places), similar to the earlier Metropolis algorithm results of Roberts et al. (1997) and Roberts and Rosenthal (2001).
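The link between the optimal $$\ell$$ and the 0.234 acceptance rate can be checked numerically. The sketch below maximises $$\ell ^2 \times 2\varPhi (-\ell \sqrt{I(\beta )}/2)$$ by a plain grid search for an assumed, user-supplied value of $$I(\beta )$$; the function names and the grid search itself are illustrative choices, not part of the cited results.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def efficiency(ell, info):
    """The quantity ell^2 * 2 * Phi(-ell * sqrt(I) / 2) to be maximised."""
    return ell ** 2 * 2.0 * std_normal_cdf(-ell * math.sqrt(info) / 2.0)


def optimal_spacing(info, n_grid=100000):
    """Grid search for the maximising ell; also returns the implied acceptance rate."""
    upper = 10.0 / math.sqrt(info)  # the maximiser scales as 1 / sqrt(I)
    best_ell, best_val = 0.0, 0.0
    for k in range(1, n_grid + 1):
        ell = upper * k / n_grid
        val = efficiency(ell, info)
        if val > best_val:
            best_ell, best_val = ell, val
    acceptance = 2.0 * std_normal_cdf(-best_ell * math.sqrt(info) / 2.0)
    return best_ell, acceptance


if __name__ == "__main__":
    ell, acc = optimal_spacing(1.0)
    print(f"optimal ell = {ell:.3f}, acceptance rate = {acc:.3f}")
```

Because the objective depends on $$\ell$$ only through $$\ell \sqrt{I(\beta )}$$ apart from the leading factor, the maximiser scales as $$1/\sqrt{I(\beta )}$$ while the implied acceptance rate does not depend on $$I(\beta )$$.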

From a practical perspective, setting up the temperature levels to achieve optimality can be done via a stochastic approximation approach (Robbins and Monro 1951), similarly to Miasojedow et al. (2013) who use an adaptive MCMC framework (see, e.g. Roberts and Rosenthal 2009).

### 2.2 Torpid mixing of ST and PT algorithms

The above optimal scaling results suggest that the mixing time of the ST and PT algorithms through the temperature schedule is O(d), i.e. grows only linearly with the dimension of the problem, which is very promising. However, this optimal, non-degenerate scaling was derived under the assumption of immediate, infinitely fast within-temperature mixing, which is almost certainly violated in any real application. Indeed, this assumption appears to be overly strong once one considers the contrasting results regarding the scalability of the ST algorithm from Woodard et al. (2009a, b). Their approach instead relies on a detailed analysis of the spectral gap of the ST Markov chain and how it behaves asymptotically in dimension. They show that in cases where the different modal structures/scalings are distinct, this can lead to mixing times that grow exponentially in dimension, and one can only hope to attain polynomial mixing times in special cases where the modes are all symmetric.

The fundamental issue with the ST/PT approaches is that in cases where the modes are not symmetric, the tempered targets do not preserve the regional/modal weights. That motivates the current work, which is designed to preserve the modal weights even when performing tempering transformations, as we discuss next.

Interestingly, a lack of modal symmetry in the multimodal target will affect essentially all the standard multimodal-focused methods: the Equi-Energy Sampler of Kou et al. (2006), the Tempered Transitions of Neal (1996) and the Mode Jumping Proposals of Tjelmeland and Hegstad (2001) all suffer in this setting. Hence, the work in this paper is applicable beyond the immediate setting of the ST/PT approaches.

## 3 Weight-stabilised tempering

In this section, we present our modifications which preserve the weights of the different modes when performing tempering transformations. We first motivate our algorithm by considering mixtures of Gaussian distributions.

Consider a d-dimensional bimodal Gaussian target distribution with means, covariance matrices and weights given by $$\mu _i,~\varSigma _i,~ w_i$$ for $$i=1,2$$ respectively. Hence, the target density is given by:
\begin{aligned} \pi (x) = w_1 \phi (x,\mu _1,\varSigma _1)+ w_2 \phi (x,\mu _2,\varSigma _2), \end{aligned}
(5)
where $$\phi (x,\mu ,\varSigma )$$ is the pdf of a multivariate Gaussian with mean $$\mu$$ and covariance matrix $$\varSigma$$. Assuming the modes are well separated, the power-tempered target from (1) can be approximated by a bimodal Gaussian mixture (cf. Woodard et al. 2009b; Tawn 2017):
\begin{aligned} \pi _\beta (x) \approx W_{(1,\beta )} \phi \left( x,\mu _1,\frac{\varSigma _1}{\beta }\right) + W_{(2,\beta )} \phi \left( x,\mu _2,\frac{\varSigma _2}{\beta }\right) , \end{aligned}
(6)
where the weights are approximated as
\begin{aligned} W_{(i,\beta )}= & {} \frac{ w_i^{\beta } |\varSigma _i|^\frac{1-\beta }{2} }{w_1^{\beta } |\varSigma _1|^\frac{1-\beta }{2}+ w_2^{\beta } |\varSigma _2|^\frac{1-\beta }{2}}\nonumber \\\propto & {} w_i^{\beta } |\varSigma _i|^\frac{1-\beta }{2}. \end{aligned}
(7)
A one-dimensional example of this is illustrated in Fig. 1, which shows plots of a bimodal Gaussian mixture distribution as $$\beta \rightarrow 0$$. Clearly, the second mode, which was originally wide but very short and hence of low weight, takes on larger and larger weight as $$\beta \rightarrow 0$$, thus distorting the problem and making it very difficult for a tempering algorithm to move from the second mode to the first when $$\beta$$ is small.

Fig. 1 Power-tempered target densities of a bimodal Gaussian mixture using inverse-temperature levels $$\beta =\{ 1,0.1,0.05,0.005 \}$$, respectively. At the hot state, it is evident that the mode centred on 40 begins to dominate the weight as $$\beta \rightarrow 0$$

Higher dimensionality makes this weight-distorting issue exponentially worse. Consider the situation with $$w_1=w_2$$ but $$\varSigma _1=I_d$$ and $$\varSigma _2=\sigma ^2 I_d$$ where $$I_d$$ is the $$d \times d$$ identity matrix. Then,
\begin{aligned} \frac{W_{(2,\beta )} }{W_{(1,\beta )}}\approx \sigma ^{d(1-\beta )}, \end{aligned}
(8)
so the ratio of the weights degenerates exponentially fast in the dimensionality of the problem for a fixed $$\beta$$. This heuristic result in (8) shows that between levels there can be a “phase-transition” in the location of probability mass, which becomes critical as dimensionality increases.
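To make the scale of this distortion concrete, the approximate weights in (7) can be evaluated directly. The following sketch works on the log scale to avoid overflow in high dimensions; the function name and the example values ($$d=10$$, $$\sigma =3$$, $$\beta =0.1$$) are chosen purely for illustration.

```python
import math


def tempered_weights(weights, log_dets, beta):
    """Approximate power-tempered weights W_i, proportional to
    w_i^beta * |Sigma_i|^((1 - beta) / 2), normalised to sum to one.

    log_dets[i] is log|Sigma_i|.
    """
    logs = [beta * math.log(w) + 0.5 * (1.0 - beta) * ld
            for w, ld in zip(weights, log_dets)]
    top = max(logs)  # subtract the maximum before exponentiating
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    d, sigma, beta = 10, 3.0, 0.1
    # Sigma_1 = I_d and Sigma_2 = sigma^2 I_d, so log|Sigma_2| = 2 d log(sigma).
    w_beta = tempered_weights([0.5, 0.5], [0.0, 2 * d * math.log(sigma)], beta)
    print("W_2 / W_1 =", w_beta[1] / w_beta[0])
    print("sigma^(d(1 - beta)) =", sigma ** (d * (1 - beta)))
```

With equal weights at $$\beta =1$$, the two printed quantities agree, in line with (8).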

### 3.1 Weight-stabilised Gaussian mixture targets

Consider targeting a Gaussian mixture given by
\begin{aligned} \pi (x) \propto \sum _{j=1}^J w_j \phi (x,\mu _j,\varSigma _j) \end{aligned}
(9)
in the (practically unrealistic) setting where the target is a Gaussian mixture with known parameters, including the weights. By only tempering the variance component of the modes, one can spread out the modes while preserving the component weights. We formalise this notion as follows:

### Definition 1

(Weight-Stabilised Gaussian Mixture (WSGM))

For a Gaussian mixture target distribution $$\pi (\cdot )$$, as in (9), the weight-stabilised Gaussian mixture (WSGM) target at inverse-temperature level $$\beta$$ is defined as
\begin{aligned} \pi _\beta ^{WS}(x) \propto \sum _{j=1}^J w_j \phi \left( x,\mu _j,\frac{\varSigma _j}{\beta }\right) . \end{aligned}
(10)
Figure 2 shows the comparison between the target distributions used when using power-based targets vs the WSGM targets for the example from Fig. 1.

Fig. 2 For a bimodal Gaussian mixture target, plots of the normalised tempered target densities generated by both powering the target (solid black line) and the WSGM targets (dashed red line) at inverse-temperature levels $$\beta =\{ 1,0.1,0.05,0.005 \}$$. (Color figure online)

Using these WSGM targets in the PT scheme can give substantially better performance than when using the standard power-based targets. This is very clearly illustrated in Sect. 4.1. Henceforth, when the term “WSGM ST/PT Algorithm” is used, it refers to the implementation of the standard ST/PT algorithm but now uses the WSGM targets from (10).
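The difference between the two families of targets can be seen numerically in one dimension. The sketch below compares the mass that lies to the left of the midpoint between two well-separated modes under the power-tempered target and under the WSGM target in (10). The mixture parameters, the integration range and the trapezoidal rule are illustrative assumptions and are not taken from the simulation study.

```python
import math

# Hypothetical 1-D mixture: a narrow, heavy mode and a wide, light mode.
W = (0.8, 0.2)
MU = (-20.0, 20.0)
SD = (1.0, 3.0)


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def power_tempered(x, beta):
    """Unnormalised power-tempered target pi(x)^beta."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(W, MU, SD)) ** beta


def wsgm(x, beta):
    """WSGM target: variances are divided by beta, weights are left unchanged."""
    return sum(w * normal_pdf(x, m, s / math.sqrt(beta)) for w, m, s in zip(W, MU, SD))


def mass_left_of(density, beta, split=0.0, lo=-80.0, hi=80.0, n=16000):
    """Fraction of the (unnormalised) mass below `split`, by the trapezoidal rule."""
    h = (hi - lo) / n
    total = left = 0.0
    for k in range(n + 1):
        x = lo + k * h
        val = (0.5 if k in (0, n) else 1.0) * density(x, beta)
        total += val
        if x < split:
            left += val
    return left / total


if __name__ == "__main__":
    for beta in (1.0, 0.25):
        print(beta, mass_left_of(power_tempered, beta), mass_left_of(wsgm, beta))
```

At $$\beta =1$$ the two targets coincide. At $$\beta =0.25$$ the power-tempered target moves a large share of the mass to the wide mode, whereas the WSGM target keeps the split close to the original weights.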

### 3.2 Approximating the WSGM targets

In practice, the actual target distribution would be non-Gaussian and only approximated by a Gaussian mixture target. On the other hand, due to the improved performance gained from using the WSGM over just targeting the respective power-tempered mixture, there is motivation to approximate the WSGM in the practical setting where parameters are unknown. To this end, we present a theorem establishing useful equivalent forms of the WSGM; these alternative equivalent forms give rise to a practically applicable approximation to the WSGM.

To establish the notation, let the target be a mixture distribution given by
\begin{aligned} \pi (x) \propto \sum _{j=1}^J h_j(x) = \sum _{j=1}^J w_j g_j(x) \end{aligned}
(11)
where $$g_j(x)$$ is a normalised target density. Then, set
\begin{aligned} \pi _\beta (x) \propto \sum _{j=1}^J f_j(x,\beta ) = \sum _{j=1}^J W_{(j,\beta )} \frac{[g_j(x)]^\beta }{\int [g_j(x)]^\beta \mathrm{d}x}. \end{aligned}
(12)
Then, we have the following result, proved in the Appendix.

### Theorem 1

(WSGM Equivalences) Consider the setting of (11) and (12) where the mixture components consist of multivariate Gaussian distributions i.e. $$g_j(x) = \phi (x;\mu _j,\varSigma _j)$$. Then, $$\forall j \in \{1,\ldots , J\}$$:

(a) [Standard, non-weight-preserving tempering] If $$f_j(x,\beta ) = [h_j(x)]^\beta$$ then
\begin{aligned} W_{(j,\beta )} \propto w_j^\beta |\varSigma _j|^{\frac{1-\beta }{2}}. \end{aligned}

(b) [Weight-preserving tempering, version #1] Denoting $$\nabla _j = \nabla \log {h_j(x)}$$ and $$\nabla _j^2=\nabla ^2\log {h_j(x)}$$; if $$f_j(x,\beta )$$ takes the form
\begin{aligned} h_j(x)\exp \left\{ \left( \frac{1-\beta }{2}\right) (\nabla _j(x))^T \left[ \nabla ^2_j(x)\right] ^{-1} \nabla _j (x) \right\} \end{aligned}
then $$W_{(j,\beta )} \propto w_j$$.

(c) [Weight-preserving tempering, version #2] If
\begin{aligned} f_j(x,\beta )= h_j(x)^\beta h_j(\mu _j)^{(1-\beta )} \end{aligned}
then $$W_{(j,\beta )} \propto w_j$$.

Remark 1 In Theorem 1, statement (b) shows that second-order gradient information of the $$h_j$$’s can be used to preserve the component weight in this setting.

Remark 2 Statement (c) extends statement (b) to no longer require the gradient information about the $$h_j$$ but simply the mode/mean point $$\mu _j$$. Essentially, this shows that by appropriately rescaling according to the height of the component as the components are “powered up,” then component weights are preserved in this setting.

Remark 3 A simple calculation shows that statement (c) holds for a more general mixture setting when all components of the mixture share a common distribution but different location and scale parameters.
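
As a brief check of statement (c), offered as a sketch under the Gaussian assumption of Theorem 1 rather than as the Appendix proof, write $$h_j(x)=w_j\phi (x;\mu _j,\varSigma _j)$$, so that $$h_j(\mu _j)=w_j(2\pi )^{-d/2}|\varSigma _j|^{-1/2}$$. Then
\begin{aligned} f_j(x,\beta )&= h_j(\mu _j)\left[ \frac{h_j(x)}{h_j(\mu _j)}\right] ^{\beta } = \frac{w_j}{(2\pi )^{d/2}|\varSigma _j|^{1/2}} \exp \left\{ -\frac{\beta }{2}(x-\mu _j)^T\varSigma _j^{-1}(x-\mu _j)\right\} , \\ \int f_j(x,\beta )\,\mathrm{d}x&= \frac{w_j}{(2\pi )^{d/2}|\varSigma _j|^{1/2}}\,(2\pi )^{d/2}\left| \frac{\varSigma _j}{\beta }\right| ^{1/2} = w_j\,\beta ^{-d/2}. \end{aligned}
Since $$f_j(x,\beta )$$ is proportional to $$[g_j(x)]^\beta$$ as a function of x, the weight $$W_{(j,\beta )}$$ in (12) is proportional to this integral, and the factor $$\beta ^{-d/2}$$ is common to every component. Hence $$W_{(j,\beta )} \propto w_j$$.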

### 3.3 Hessian adjusted tempering

The results of Theorem 1 are derived under the impractical setting that the components are all known and that $$\pi (\cdot )$$ is indeed a mixture target. One would like to exploit the results of (b) and (c) from Theorem 1 to aid mixing in a practical setting where the target form is unknown but may be well approximated by a mixture.

Suppose herein that the modes of the multimodal target of interest, $$\pi (\cdot )$$, are well separated. Thus, an approximating mixture of the form given in (11) would approximately satisfy
\begin{aligned} \pi (x) \propto h_M(x) \end{aligned}
where $$M=\mathop {\hbox {arg max}}\limits _{j}\left\{ h_j(x) \right\}$$, i.e. the index of the component that dominates at x. Hence, it is tempting to apply a version of Theorem 1(b) to $$\pi (\cdot )$$ directly as opposed to the $$h_j$$. So at inverse temperature $$\beta$$, the point-wise target would be proportional to
\begin{aligned} \pi (x)\exp \left\{ \left( \frac{1-\beta }{2}\right) (\nabla _\pi (x))^T \left[ \nabla ^2 _\pi (x)\right] ^{-1} \nabla _\pi (x) \right\} . \end{aligned}
where $$\nabla _\pi =\nabla \log {\pi (x)}$$ and $$\nabla ^2_\pi =\nabla ^2 \left( \log {\pi (x)}\right)$$. This only uses point-wise gradient information up to second order. At many locations in the state space, provided that $$\beta$$ is at a temperature level that is sufficiently cool that the tail overlap is insignificant, and if the target was indeed a Gaussian mixture, then this approach would give almost exactly the same evaluations as $$\pi _\beta (\cdot )$$ from (12) in the setting of (b). However, at locations between modes when the Hessian of $$\log (\pi (x))$$ is positive semi-definite, this target behaves very badly, with explosion points that make it improper.
Instead, under the setting of well-separated modes, one can appeal to the weight-preserving characterisation in Theorem 1(c). Assume that one can assign each location in the state space to a “mode point” via some function $$x \rightarrow \mu _{x,\beta }$$, with a corresponding tempered target given by
\begin{aligned} \pi _{\beta }(x) \propto \pi (x)^{\beta } \pi (\mu _{x,\beta })^{1-\beta }. \end{aligned}
Note the mode assignment function’s dependence on $$\beta$$. This can be understood to be necessary by appealing to Fig. 2 where it is clear that the narrow mode in the WSGM target has a “basin of attraction” that expands as the temperature increases.

### Definition 2

(Basic Hessian Adjusted Tempering (BHAT) Target) For a target distribution $$\pi (\cdot )$$ on $${\mathbb {R}}^d$$ with a corresponding “mode point assigning function” $$\mu _{x,\beta }: \mathbb {R}^d \rightarrow \mathbb {R}^d$$; the BHAT target at inverse temperature level $$\beta \in (0,\infty )$$ is defined as
\begin{aligned} \pi _{\beta }^{BH}(x) \propto \pi (x)^{\beta } \pi (\mu _{x,\beta })^{1-\beta }. \end{aligned}
(13)

However, in this basic form there is an issue with this target distribution at hot temperatures when $$\beta \rightarrow 0$$. The problem is that it leaves discontinuities that can grow exponentially large, and this can make the hot state temperature level mixing exponentially slow if using standard MCMC methods for the within-temperature moves.

This problem can be mitigated if one assumes more knowledge about the target distribution. Suppose that the mode points are known and so there is a collection of K mode points $$M=\{ \mu _1,\ldots ,\mu _K \}$$. This assumption seems quite strong, but if one cannot find the mode points, then one essentially cannot find the basins of attraction either, in which case obtaining the relative masses of the modes (as MCMC is trying to do) must be close to impossible. Indeed, finding mode points either prior to or online during the run of the algorithm is possible, e.g. Tjelmeland and Hegstad (2001), Behrens (2008) and Tawn et al. (2018). Furthermore, assume that the target, $$\pi (\cdot )$$, is $$C^2$$ in a neighbourhood of the K mode locations and so there is an associated collection of positive definite covariance matrices $$S=\{ \varSigma _1,\ldots ,\varSigma _K \}$$ where $$\varSigma _j= -\left( \nabla ^2 \log \pi (\mu _j) \right) ^{-1}$$. From this and knowing the evaluations of $$\pi (\cdot )$$ at the mode points, one can approximate the weights in the regions to attain a collection $${\hat{\mathbf{W}}}=\{\hat{w}_1,\ldots ,\hat{w}_K\}$$ where
\begin{aligned} \hat{w}_j= \frac{\pi (\mu _j) |\varSigma _j|^{1/2}}{\sum _{k=1}^K \pi (\mu _k) |\varSigma _k|^{1/2}}. \end{aligned}
With $$\phi (\cdot |\mu _j,\varSigma _j)$$ denoting the pdf of a $$N(\mu _j,\varSigma _j)$$, we define the following modal assignment function motivated by the WSGM:

### Definition 3

(WSGM mode assignment function) With collections M, S and $${\hat{\mathbf{W}}}$$ specified above then for a location $$x\in \mathbb {R}^d$$ and inverse temperature $$\beta$$ define the WSGM mode assignment function as
\begin{aligned} A(x,\beta ) = \mathop {\hbox {arg max}}\limits _{j}\left\{ \hat{w}_j \phi \left( x|\mu _j,\frac{\varSigma _j}{\beta }\right) \right\} . \end{aligned}
(14)
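
The weight approximation and the assignment function in (14) are straightforward to compute once the mode points and local covariances are available. The following one-dimensional sketch assumes that they have already been found; the function names and the example mixture are hypothetical, and the scores are compared on the log scale to avoid underflow far from the modes.

```python
import math


def laplace_weights(pi_at_modes, variances):
    """Approximate modal weights, proportional to pi(mu_j) * |Sigma_j|^(1/2).

    In one dimension |Sigma_j|^(1/2) is the square root of the variance.
    """
    raw = [p * math.sqrt(v) for p, v in zip(pi_at_modes, variances)]
    total = sum(raw)
    return [r / total for r in raw]


def mode_assignment(x, beta, mus, variances, w_hat):
    """Index j maximising w_hat[j] * phi(x | mu_j, Sigma_j / beta)."""
    def log_score(j):
        var = variances[j] / beta
        return (math.log(w_hat[j]) - 0.5 * math.log(2.0 * math.pi * var)
                - 0.5 * (x - mus[j]) ** 2 / var)
    return max(range(len(mus)), key=log_score)


if __name__ == "__main__":
    # Hypothetical target: 0.8 N(-20, 1) + 0.2 N(20, 9), evaluated at its modes.
    mus, variances = [-20.0, 20.0], [1.0, 9.0]
    pi_at_modes = [0.8 / math.sqrt(2 * math.pi), 0.2 / (3.0 * math.sqrt(2 * math.pi))]
    w_hat = laplace_weights(pi_at_modes, variances)
    print(w_hat)
    print(mode_assignment(0.0, 1.0, mus, variances, w_hat),
          mode_assignment(0.0, 0.01, mus, variances, w_hat))
```

In this example the point x = 0 is assigned to the wide mode at $$\beta =1$$ but to the narrow, heavier mode at $$\beta =0.01$$, which illustrates how the assignment depends on $$\beta$$.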

Under the assumption that there are collections M, S and $${\hat{\mathbf{W}}}$$ that have either been found through prior optimisation or through an adaptive online approach, we define the following.

### Definition 4

(Hessian Adjusted Tempering (HAT) Target) For a target distribution $$\pi (\cdot )$$ on $$\mathbb {R}^d$$ with collections M, S and $${\hat{\mathbf{W}}}$$ defined above along with the associated mode assignment function given in (14), the Hessian adjusted tempering (HAT) target is defined as
\begin{aligned} \pi _{\beta }^{H}(x) \propto {\left\{ \begin{array}{ll} \pi (x)^{\beta } \pi (\mu _{A(x,\beta )})^{1-\beta } &{} \hbox { if}\ A(x,\beta )=A(x,1) \\ G(x,\beta ) &{} \hbox { if}\ A(x,\beta ) \ne A(x,1) \end{array}\right. } \nonumber \\ \end{aligned}
(15)
where with $$\hat{A}:=A(x,\beta )$$
\begin{aligned} G(x,\beta )= \frac{\pi (\mu _{\hat{A}})\left( (2\pi )^{d}|\varSigma _{\hat{A}}|\right) ^{1/2}\phi \left( x|\mu _{\hat{A}},\frac{\varSigma _{\hat{A}}}{\beta }\right) }{\beta ^{d/2}}. \end{aligned}
Essentially, the function “G” specifies the target distribution when the chain’s location, x, is in a part of the state space where the narrower modes expand their basins of attraction as the temperature gets hotter. Both the choice of G and the mode assignment function used in Definition 4 are somewhat canonical to the Gaussian mixture setting. With the same assignment function specified in Definition 3, an alternative and seemingly robust “G” that one could use is given by
\begin{aligned} G(x,\beta )= & {} \pi (x,1,A) \\&+ \left( \frac{2P(A(x,\beta ))}{P(A(x,\beta ))+P(A(x,1))}-1\right) \\&\quad \times [\pi (x,\beta ,A) -\pi (x,1,A) ] \end{aligned}
where $$\pi (x,\beta ,A)= \pi (x)^{\beta } \pi (\mu _{A(x,\beta )})^{1-\beta }$$ and $$P(j)=\hat{w}_j \phi \left( x|\mu _j,\frac{\varSigma _j}{\beta }\right)$$.
With either of the suggested forms of the function G, and under the assumption that the target is continuous and bounded on $$\mathbb {R}^d$$ and that, for all $$\beta \in (0,\infty )$$,
\begin{aligned} \int _{\mathcal {X}} \pi ^\beta (x) \mathrm{d} x <\infty , \end{aligned}
then $$\pi _{\beta }^{H}(x)$$ is a well-defined probability density, i.e. Definition 4 makes sense.
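
To show how the pieces of Definition 4 fit together, here is a one-dimensional sketch of the log HAT target using the first form of G, with $$\varSigma _{\hat{A}}$$ a scalar variance and $$d=1$$. The example mixture, the function names and the use of the true mixture weights in place of $${\hat{\mathbf{W}}}$$ are assumptions made for illustration only.

```python
import math

# Hypothetical 1-D target 0.8 N(-20, 1) + 0.2 N(20, 9); mode points and local
# variances are assumed to be known.
MUS = (-20.0, 20.0)
VARS = (1.0, 9.0)
W = (0.8, 0.2)


def log_normal_pdf(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def log_pi(x):
    """Log target, computed with a log-sum-exp for numerical stability."""
    terms = [math.log(w) + log_normal_pdf(x, m, v) for w, m, v in zip(W, MUS, VARS)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def assign(x, beta, w_hat):
    """Mode assignment A(x, beta) from (14), on the log scale."""
    return max(range(len(MUS)),
               key=lambda j: math.log(w_hat[j]) + log_normal_pdf(x, MUS[j], VARS[j] / beta))


def log_hat_target(x, beta, w_hat):
    """Unnormalised log HAT target at inverse temperature beta (d = 1)."""
    a_beta, a_cold = assign(x, beta, w_hat), assign(x, 1.0, w_hat)
    mu, var = MUS[a_beta], VARS[a_beta]
    if a_beta == a_cold:
        # pi(x)^beta * pi(mu_A)^(1 - beta)
        return beta * log_pi(x) + (1.0 - beta) * log_pi(mu)
    # G(x, beta): rescaled Gaussian centred on the mode whose basin has expanded
    return (log_pi(mu) + 0.5 * math.log(2.0 * math.pi * var)
            + log_normal_pdf(x, mu, var / beta) - 0.5 * math.log(beta))


if __name__ == "__main__":
    for x in (-20.0, 0.0, 20.0):
        print(x, log_hat_target(x, 0.01, W))
```

In this sketch both branches return $$\log \pi (\mu _j)$$ at a mode point, so the height of each mode is unchanged across temperature levels.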
For a bimodal Gaussian mixture example, Fig. 3 compares the HAT target with the WSGM target, showing that the HAT targets are a very good approximation to the WSGM targets, even at the hotter temperature levels.

Fig. 3 For the same bimodal Gaussian target from Fig. 2, here is a comparison of the HAT vs WSGM targets at inverse temperatures $$\beta =0.05$$ and $$\beta =0.005$$, respectively. Note they are almost identical at the colder temperature, but they do differ slightly in the interval $$(-25 , 10)$$ at the hotter temperature where the “G” function is allowing the footprint of the narrow mode to expand

We propose to use the HAT targets in place of the power-based targets for the tempering algorithms given in Sect. 2. We thus define the following algorithms, which are explored in the following sections.

### Definition 5

(Hessian Adjusted Simulated Tempering (HAST) Algorithm) The HAST algorithm is an implementation of the ST algorithm (Sect. 2, Algorithm 1) where the target distribution at inverse-temperature level $$\beta$$ is given by $$\pi _{\beta }^{H}(\cdot )$$ from Definition 4.

### Definition 6

(Hessian Adjusted (Parallel) Tempering (HAT) Algorithm) The HAT algorithm is an implementation of the PT algorithm (Sect. 2, Algorithm 2) where the target distribution at inverse-temperature level $$\beta$$ is given by $$\pi _{\beta }^{H}(\cdot )$$ from Definition 4.

## 4 Simulation studies

### 4.1 WSGM Algorithm simulation study

We begin by comparing the performances of an ST algorithm targeting both the power-based and WSGM targets for a simple but challenging bimodal Gaussian mixture target example. The example will illustrate that the traditional ST algorithm, using power-based targets, struggles to mix effectively through the temperature levels due to a bottleneck effect caused by the lack of regional weight preservation.

The example considered is the ten-dimensional target distribution given by the bimodal Gaussian mixture
\begin{aligned} \pi (x)= w_1 \phi _{(\mu _1,\varSigma _1)}(x)+w_2 \phi _{(\mu _2,\varSigma _2)}(x) \end{aligned}
(16)
where $$w_1=0.2$$, $$w_2=0.8$$, $$\mu _1=(-10,-10,\ldots ,-10)$$, $$\mu _2=(10,10,\ldots ,10)$$, $$\varSigma _1= 9\text{ I }_{10}$$ and $$\varSigma _2= \text{ I }_{10}$$. When power-based tempering is used, then mode 1 accounts for only 20% of the mass at the cold level, but at the hotter temperature levels becomes the dominant mode containing almost all the mass.
For both runs, the same geometric temperature schedule was used:
\begin{aligned} \varDelta =\left\{ 1, \, 0.32, \, 0.32^2, \, \ldots , \, 0.32^6\right\} . \end{aligned}
This geometric schedule is justified by Corollary 1 of Tawn and Roberts (2018), which suggests this is an optimal setup in the case of a regionally weight-preserved PT setting. Indeed, this schedule induces swap move acceptance rates of around 0.22 for the WSGM algorithm, close to the suggested 0.234 optimal value.

This temperature schedule gave swap acceptance rates of approximately 0.23 between all levels of the power-based ST algorithm except for the coldest level swap where this degenerated to 0.17. That shows that the power-based ST algorithm was set up essentially optimally according to the results in Atchadé et al. (2011).

In order to ensure that the within-mode mixing is not influencing the temperature space mixing, a local modal independence sampler was used for the within-mode moves. This essentially means that once a mode has been found, the mixing is infinitely fast. We use the modal assignment function $$\mu _{x,\beta }$$ which specifies that the location x is in mode 1 if $${\bar{x}}<0$$ and in mode 2 otherwise. Then, the within-mode proposal distribution for a move at inverse temperature level $$\beta$$ is given by
\begin{aligned} q_{\beta }(x,y)=\phi _{\left( \mu _1,\frac{\varSigma _1}{\beta }\right) }(y)\mathbb {1}_{{\bar{x}}<0}+\phi _{\left( \mu _2,\frac{\varSigma _2}{\beta }\right) }(y)\mathbb {1}_{{\bar{x}}\ge 0}, \end{aligned}
(17)
where $$\phi _{(\mu ,\varSigma )}(\cdot )$$ is the density function of a Gaussian random variable with mean $$\mu$$ and variance matrix $$\varSigma$$.

Fig. 4 Top: trace plots of the functional of the simulated tempering chains given in (18). On the left is the version using the WSGM targets, which mixes well through the temperature schedule and finds both modal regions. On the right is the version using the standard power-based targets, which fails to ever find one of the modes. Bottom: trace plots of $${\bar{x}}_t$$ in each of the cases, respectively

Figure 4 plots a functional of the inverse temperature at each iteration of the algorithm. The functional is
\begin{aligned} h(\beta _t,x_t) :=\frac{\log \left( \frac{\beta _t}{\beta _{\text {min}}} \right) }{\log \left( \frac{1}{\beta _{\text {min}}}\right) }\text{ sgn }\left( {\bar{x}}_t \right) \end{aligned}
(18)
where $$\text{ sgn }(.)$$ is the usual sign function and $$\beta _{\text {min}}$$ is the minimum of the inverse temperatures. The significance of this functional will become evident from the results of the core theoretical contributions made in this paper in Theorems 2 and 3 in Sect. 5. Essentially, it is taking a transformation of the current inverse temperature at iteration t of the Markov chain, such that when $$\beta _t=1$$, the magnitude of h is 1 and when the temperature is at its hottest level, i.e. $$\beta _t = \beta _{\text {min}}$$, h is zero. Furthermore, in this example, the sign of $${\bar{x}}_t$$ is a reasonable proxy for identifying the mode that the chain is in: a negative value suggests the mode centred on $$\mu _1$$, and a non-negative value suggests the mode centred on $$\mu _2$$.
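
For reference, the functional in (18) can be computed as in the following small sketch; the function name and argument names are illustrative.

```python
import math


def tempering_functional(beta_t, x_bar_t, beta_min):
    """h(beta_t, x_t) from (18): a log-scaled temperature level in [0, 1],
    signed by the mean of the current state's components."""
    sign = (x_bar_t > 0) - (x_bar_t < 0)  # usual sign function, sgn(0) = 0
    level = math.log(beta_t / beta_min) / math.log(1.0 / beta_min)
    return level * sign


if __name__ == "__main__":
    beta_min = 0.32 ** 6  # hottest level of the schedule used in this section
    print(tempering_functional(1.0, 3.0, beta_min))         # coldest level
    print(tempering_functional(0.32 ** 3, -1.0, beta_min))  # middle of the ladder
    print(tempering_functional(beta_min, -2.0, beta_min))   # hottest level
```

On a geometric schedule the magnitude of h moves in equal steps between consecutive levels, which is what makes the trace plots easy to read.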

Figure 4 clearly illustrates that, for the power-based ST algorithm, the hot state modal weight inconsistency leads the chain down a poor trajectory since at hot temperatures nearly all the mass is in modal region 1. This results in the chain never reaching the other mode in the entire (finite) run of the algorithm. Indeed, the trace plots in Fig. 4 show that the chain is effectively trapped in mode 1, which although it only has 20% of the mass in the cold state, is completely dominant at the hotter states.

### 4.2 Simulation study for HAT

To demonstrate the capabilities of the HAT algorithm in a non-Gaussian setting where the modes exhibit skew, a five-dimensional four-mode skew normal mixture target example is presented. Albeit a mixture, this example can be seen to give similar target distribution geometries to non-mixture settings due to the skew nature of the modes.
\begin{aligned} \pi (x) \propto \sum _{k=1}^4 w_k \prod _{i=1}^5 f(x_i|\mu _k,\sigma _k,\alpha ) \end{aligned}
(19)
where the skew normal density is given by
\begin{aligned} f(z|\mu ,\sigma ,\alpha )= \frac{2}{\sigma }\phi \left( \frac{z-\mu }{\sigma }\right) \varPhi \left( \frac{\alpha (z-\mu )}{\sigma }\right) \end{aligned}
and where $$w_1=w_2=w_3=w_4=0.25$$, $$\mu _1=-15$$, $$\mu _2=15$$, $$\mu _3=45$$, $$\mu _4=-45$$, $$\sigma _1=1$$, $$\sigma _2=1$$, $$\sigma _3=3$$, $$\sigma _4=3$$ and $$\alpha =2$$.

As will be seen in the forthcoming simulation results, the imbalance of scales within each modal region ensures that this is a very challenging problem for the PT algorithm.

Since this target fits into the setting of Corollary 1 of Tawn and Roberts (2018), a geometric inverse-temperature schedule is approximately optimal for the HAT target in this setting. Indeed, Tawn and Roberts (2018) suggest that the geometric ratio should be tuned so that the acceptance rate for swap moves between consecutive temperatures is approximately 0.234. In this case, eight tempering levels were used to obtain effective mixing. These were geometrically spaced and given by $$\{1,0.31,0.31^2,\ldots , 0.31^7 \}$$; this schedule was found to be approximately optimal and gave an average acceptance rate of 0.22 for the swaps between consecutive levels for the HAT algorithm.

Using this temperature schedule along with appropriately tuned RWM proposals for the within-temperature moves, ten runs of both the PT and HAT algorithms were performed. In each individual run, each temperature marginal was updated with $$m=5$$ RWM proposals followed by a temperature swap move proposal and this was repeated with $$s=100{,}000$$ sweeps. This results in a sample output of 600,001 of the cold state chain prior to any burn-in removal. Herein, for this example, denote $$N=600{,}001$$.
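
One reading that reproduces this count, assuming (the text does not state it explicitly) that each of the m RWM updates and each swap proposal records one cold-state value and that the initial state is also stored, is
\begin{aligned} N = s\,(m+1) + 1 = 100{,}000 \times (5+1) + 1 = 600{,}001 . \end{aligned}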

As expected, the scale imbalance between the modes resulted in the PT algorithm performing poorly and with significant bias in the sample output. In contrast, the HAT approach was highly successful in converging relatively rapidly to the target distribution, exhibiting far more frequent intermodal jumps at the cold state.

Figure 5 shows one representative example of a run of the PT and HAT algorithms by plotting the first component of the five-dimensional marginal chain at the coldest target state. It illustrates the impressive intermodal mixing of HAT across all four modal regions as opposed to the very sticky mixing exhibited by the PT algorithm.

Fig. 5 Two trace plots of the first marginal component cold state chain targeting the distribution in (19) using the HAT and PT algorithms, respectively. Note the HAT algorithm run illustrates a chain that is performing rapid exploration between all four modes, whereas the PT algorithm exhibits significant sticky patches

Fig. 6 Running estimate of $$\mathbb {P}_{\pi }(-30<X^1_i<0)$$

Figure 6 shows the running approximation of $$\mathbb {P}_{\pi }(-30<X^1_i<0)$$ (which is approximately the weight of the first mode, i.e. $$w_1=0.25$$) after the kth iteration of the cold state chains, after removing a burn-in period of 10,000 initial iterations, for the ten runs of the PT and HAT algorithms, respectively. The approximation after iteration $$k \le N$$ is given by
\begin{aligned} \hat{W}^k_1 := \frac{1}{k-10000}\sum _{i=10001}^k \mathbb {1}_{\left( -30<X^1_i<0\right) } \end{aligned}
(20)
where $$X^1_i$$ is the location of the first component of the five-dimensional chain at the coldest temperature level after the ith iteration. This figure indicates that the PT algorithm fails to provide a stable estimate for $$\mathbb {P}_{\pi }(-30<X^1_i<0)$$ with the running weight approximations far from stable at the end of the runs; in stark contrast, the HAT algorithm exhibits very stable performance in this case. In fact, the final estimates for $$\hat{W}^N_1$$ are given in Table 1.
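Table 1

End point estimates, $$\hat{W}^N_1$$, of $$\mathbb {P}_{\pi }(-30<X^1_i<0)$$ from the ten runs of the PT and HAT algorithms. The true value of 0.25 appears to be well approximated by HAT but not by PT

| Algorithm | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 | Run 9 | Run 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PT | 0.23 | 0.36 | 0.19 | 0.31 | 0.10 | 0.12 | 0.18 | 0.39 | 0.51 | 0 |
| HAT | 0.27 | 0.24 | 0.26 | 0.22 | 0.22 | 0.27 | 0.23 | 0.28 | 0.25 | 0.26 |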
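
The running estimate in (20) can be computed in a single pass over the cold-state output. The sketch below is illustrative: the function name and the list-based storage are assumptions, with `xs[i - 1]` holding $$X^1_i$$.

```python
def running_weight_estimates(xs, burn_in, lower=-30.0, upper=0.0):
    """Running estimates of P(lower < X^1 < upper) as in (20).

    Returns the estimates for k = burn_in + 1, ..., len(xs), where the
    first `burn_in` iterations are discarded.
    """
    estimates = []
    hits = 0
    for k in range(burn_in + 1, len(xs) + 1):
        if lower < xs[k - 1] < upper:
            hits += 1
        estimates.append(hits / (k - burn_in))
    return estimates
```

For the runs described in the text one would call this with `burn_in=10000`; the last entry of the returned list then plays the role of $$\hat{W}^N_1$$.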

Table 2 presents the results of using the ten runs of each algorithm in a batch-means approach to estimate the Monte Carlo variance of the estimator $$\hat{W}^N_1$$. The results in Table 2 show that the Monte Carlo error is approximately a factor of 10 higher for the PT algorithm than for the HAT approach.

However, it is also important to analyse this inferential gain jointly with the increase in computational cost. Table 2 also shows that the average run time for the ten HAT runs was 451 s, which is a little more than two times slower than the average run time of the PT algorithm (217 s) in this example. The major extra expense is due to the cost of computing the WSGM mode assignment function in (14) at both the cold and current temperature of interest at each evaluation of the HAT target. Nevertheless, this is very promising: for a little more than twice the computational cost, the inferential accuracy appears to be ten times better in this instance.
Table 2

Using the ten runs of each algorithm in a batch-means approach to estimate the Monte Carlo variance of the pooled estimator $$\hat{W}^{10N}_1$$ i.e. $$\text {SD}(\hat{W}^N_1)$$. Also displayed is the average run time (RT, measured in seconds) of a single one of the ten repeated runs for both methods, respectively

| Algorithm | $$\hat{W}^{10 N}_1$$ | $$\hat{\text {SD}}(\hat{W}^N_1)$$ | $$\hat{\text {SD}}(\hat{W}^{10 N}_1)$$ | RT (secs) |
|---|---|---|---|---|
| PT | 0.288 | 0.187 | 0.0593 | 217 |
| HAT | 0.249 | 0.019 | 0.0063 | 451 |
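
The trade-off between accuracy and run time in Table 2 can be summarised with a few lines of arithmetic. The sketch below uses only the reported values. The efficiency measure (variance ratio divided by run-time ratio) is a common convention adopted here as an assumption, not a quantity reported by the authors.

```python
import math

# Values copied from Table 2.
sd_single = {"PT": 0.187, "HAT": 0.019}    # estimated SD of a single-run estimate
sd_pooled = {"PT": 0.0593, "HAT": 0.0063}  # estimated SD of the pooled estimate
run_time = {"PT": 217.0, "HAT": 451.0}     # average run time in seconds
n_runs = 10

# Pooling ten independent runs should shrink the SD by sqrt(10).
implied_pooled = {alg: sd / math.sqrt(n_runs) for alg, sd in sd_single.items()}

sd_ratio = sd_single["PT"] / sd_single["HAT"]    # Monte Carlo error ratio
time_ratio = run_time["HAT"] / run_time["PT"]    # extra cost of HAT
variance_ratio = sd_ratio ** 2
cost_adjusted_gain = variance_ratio / time_ratio  # assumed efficiency measure
```

Under this convention the SD ratio is about 9.8 and the run-time ratio about 2.1, so the variance reduction per unit of computing time is roughly 47-fold in this example.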

## 5 Diffusion limit and computational complexity

In this section, we provide some theoretical analysis for our algorithm. We shall prove in Theorems 2 and 3 that as the dimension goes to infinity, a simplified and speeded-up version of our weight-preserving simulated tempering algorithm (i.e. the HAST Algorithm from Definition 5, equivalent to the ST Algorithm 1 with the adjusted target from Definition 4) converges to a certain specific diffusion limit. This limit will allow us to draw some conclusions about the computational complexity of our algorithm.

### 5.1 Assumptions

We assume for simplicity (though see below) that our target density $$\pi$$ is a mixture of form (11) with just $$J=2$$ modes, of weights $$w_1=p$$ and $$w_2=1-p$$, respectively, with each mixture component a special i.i.d. product $$g_j(x) = \prod _{i=1}^d f_j(x_i)$$ as in (4). We further assume that a weight-preserving transformation (perhaps inspired by Theorem 1(b) or (c)) has already been done, so that
\begin{aligned}&\pi _\beta (x) \propto \ p {[g_1(x)]^\beta \over \int [g_1(x)]^\beta dx} + (1-p) {[g_2(x)]^\beta \over \int [g_2(x)]^\beta dx} \\ \\&\quad \equiv \ p g_1^\beta (x) + (1-p) g_2^\beta (x) \end{aligned}
for each $$\beta$$. We consider a simplified version of the weight-preserving process, in which the chain always mixes immediately within each mode, but the chain can only jump between modes when at the hottest temperature $$\beta _\mathrm{min}$$, at which point it jumps to one of the two modes with probabilities p and $$1-p$$, respectively. Let I denote the indicator of which mode the process is in, taking value 1 or 2.

We shall sometimes concentrate on the Exponential Power Family special case in which each of the two mixture component factors is of the form $$f_j(x) \propto e^{-\lambda _j|x|^{r_j}}$$ for some $$\lambda _j,r_j>0$$. This includes the Gaussian case for which $$r_1=r_2=2$$ and $$\lambda _j = 1/\sigma _j^2$$. (Note that the HAT target in (15) requires the existence of second derivatives about the mode points, corresponding to $$r_j \ge 2$$.)

As in Atchadé et al. (2011) and Roberts and Rosenthal (2014), following Predescu et al. (2004) and Kone and Kofke (2005), we assume that the inverse temperatures are given by $$1=\beta _0^{(d)},\beta _1^{(d)}, \ldots ,\beta ^{(d)}_{k(d)} \approx \beta _{\text {min}}$$, with
\begin{aligned} \beta _i = \beta _{i-1} - \ell (\beta _{i-1})/d^{1/2} \end{aligned}
(21)
for some fixed $$C^1$$ function $$\ell$$. In many cases, including the Exponential Power Family case, the optimal choice of $$\ell$$ is $$\ell (\beta ) = \beta \ell _0$$ for a constant $$\ell _0 \doteq 2.38$$.
We let $$\beta _t^{(d)}$$ be the inverse temperature at time t for the d-dimensional process. To study weak convergence, we let $$\beta ^{(d)}_{N(dt)}$$ be a continuous-time version of the $$\beta _t^{(d)}$$ process, speeded up by a factor of d, where $$\{N(t)\}$$ is an independent standard rate 1 Poisson process. To combine the two modes into one single process, we further augment this process by multiplying it by $$-1$$ when the algorithm’s state is closer to the second mode, while leaving it positive (unchanged) when the state is closer to the first mode. Thus, define
\begin{aligned} X_t^{(d)}= (3 - 2I) \, \beta ^{(d)}_{N(dt)} . \end{aligned}
(22)
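
To make the spacing rule (21) and the signed state in (22) concrete, the sketch below builds the inverse-temperature ladder under the assumption $$\ell (\beta ) = \beta \ell _0$$ with $$\ell _0 = 2.38$$. The function names are hypothetical.

```python
import math

def inverse_temperature_ladder(d, beta_min, ell0=2.38):
    """Inverse temperatures from (21), assuming ell(beta) = ell0 * beta.

    Under this assumption beta_i = beta_{i-1} * (1 - ell0 / sqrt(d)), so
    the ladder is geometric. Levels are added until beta_min is reached.
    """
    ratio = 1.0 - ell0 / math.sqrt(d)
    if not 0.0 < ratio < 1.0:
        raise ValueError("d is too small for this ell0: need ell0 < sqrt(d)")
    betas = [1.0]
    while betas[-1] > beta_min:
        betas.append(betas[-1] * ratio)
    return betas

def signed_state(mode_indicator, beta):
    """The signed inverse temperature (3 - 2I) * beta from (22), I in {1, 2}."""
    return (3 - 2 * mode_indicator) * beta
```

Because the ratio approaches 1 at rate $$d^{-1/2}$$, the number of levels needed to reach a fixed $$\beta _{\text {min}}$$ grows like $$d^{1/2}$$.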

### 5.2 Main results

Our first diffusion limit result (proved in the Appendix), following Roberts and Rosenthal (2014), states that when we are at an inverse temperature greater than $$\beta _{\text {min}}$$, the inverse temperature process behaves identically to the case where there is only one mode (i.e. $$J=1$$).

### Theorem 2

Assume the target $$\pi$$ is of form (11), with $$J=2$$ modes of weights $$w_1=p$$ and $$w_2=1-p$$, with inverse temperatures chosen as in (21). Then, up until the first time the process $$X^{(d)}$$ hits $$\pm \beta _{\text {min}}$$, as $$d\rightarrow \infty$$, $$\{X_t^{(d)}\}$$ converges weakly to a fixed diffusion process X given by (22).

Theorem 2 describes what happens away from $$\beta _\mathrm{min}$$. However, it says nothing about what happens at $$\beta _{\text {min}}$$. Moreover, its state space $$[-1,-\beta _{\text {min}})\cup (\beta _{\text {min}}, 1]$$ is not connected, and we have not even properly defined h at $$\pm \beta _{\text {min}}$$. To resolve these issues, we define
\begin{aligned} h(x) = {\left\{ \begin{array}{ll} \int _{\beta _{\text {min}}}^x {1\over \ell (u)} du, &{}\hbox { when } x>0\\ -\int _{\beta _{\text {min}}}^{-x} {1\over \ell (u)} du, &{}\hbox { when } x<0\\ 0, &{}\hbox { when } x=0\\ \end{array}\right. } \end{aligned}
and set $$H_t= h(X_t)$$, thus making the process H continuous at 0.
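
For intuition, h has a closed form when $$\ell (u) = \ell _0 u$$, since then $$\int _{\beta _{\text {min}}}^{x} \mathrm{d}u/\ell (u) = \log (x/\beta _{\text {min}})/\ell _0$$. The sketch below implements that special case; the linear form of $$\ell$$ is an assumption here, and the function name is illustrative.

```python
import math

def h(x, beta_min, ell0=2.38):
    """The transformation h, assuming ell(u) = ell0 * u.

    Valid for x = 0 or beta_min <= |x| <= 1, the range of the signed
    inverse-temperature process.
    """
    if x == 0.0:
        return 0.0
    if abs(x) < beta_min:
        raise ValueError("|x| must be at least beta_min")
    sign = 1.0 if x > 0 else -1.0
    return sign * math.log(abs(x) / beta_min) / ell0
```

Both $$\beta _{\text {min}}$$ and $$-\beta _{\text {min}}$$ are mapped to 0, which is how the two disconnected pieces of the state space are joined.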

### Remark 1

The process H leaves constant densities locally invariant, $${{\tilde{G}}}^*g(v)=0$$ for all $$v\ne 0$$ where $${{\tilde{G}}}^*$$ is the adjoint of the infinitesimal generator of H, as will be shown in the Appendix. This suggests that the density of the invariant distribution of H (if it exists) should be piecewise uniform; i.e. it should be constant for $$v>0$$ and also constant for $$v<0$$, though these two constants might not be equal.

To make further progress, we require a proportionality condition. Namely, we assume that the quantities corresponding to $$I(\beta )=\text {Var}_{\pi ^\beta }\big ( \log f(x) \big )$$ are proportional to each other in the two modes. More precisely, we extend the definition of I to $$I(\beta ) = \text {Var}_{x\sim f_1^\beta }(\log f_1(x))$$ for $$\beta >0$$ (corresponding to the first mode), and $$I(\beta ) = \text {Var}_{x\sim f_2^{|\beta |}}(\log f_2(x))$$ for $$\beta <0$$ (corresponding to the second mode), and assume there is a fixed function $$I_0:\mathbb {R}_{+}\rightarrow \mathbb {R}_{+}$$ and positive constants $$r_1$$ and $$r_2$$ such that we have $$I(\beta ) = I_0(\beta )/r_1$$ for $$\beta >0$$ (in the first mode), while $$I(\beta ) = I_0(|\beta |)/r_2$$ for $$\beta <0$$ (in the second mode). For example, it follows from Section 2.4 of Atchadé et al. (2011) that in the Exponential Power Family case, $$I(\beta ) = 1/(r_1\beta ^2)$$ for $$\beta >0$$ and $$I(\beta ) = 1/(r_2\beta ^2)$$ for $$\beta <0$$, so that this proportionality condition holds in that case.

Corresponding to this, we choose the inverse-temperature spacing function as follows (following Atchadé et al. 2011 and Roberts and Rosenthal 2014):
\begin{aligned} \ell (\beta ) \ = \ I_0^{-1/2}(\beta ) \, \ell _0 \end{aligned}
(23)
for some fixed constant $$\ell _0>0$$.
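
As a worked check of (23) in the Exponential Power Family case, take $$I_0(\beta ) = 1/\beta ^2$$, which is what the expressions for $$I(\beta )$$ above give under the proportionality condition. Then
\begin{aligned} \ell (\beta ) = I_0^{-1/2}(\beta )\, \ell _0 = \left( \beta ^{-2}\right) ^{-1/2} \ell _0 = \beta \, \ell _0 , \qquad \beta _i = \beta _{i-1}\left( 1 - \ell _0 / d^{1/2}\right) , \end{aligned}
so (23) recovers the choice $$\ell (\beta ) = \beta \ell _0$$ mentioned after (21), and the schedule (21) becomes geometric with ratio $$1-\ell _0/d^{1/2}$$.
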
To state our next result, we require the notion of skew Brownian motion, a generalisation of usual Brownian motion. Informally, this is a process that behaves just like a Brownian motion, except that the sign of each excursion from 0 is chosen using an independent Bernoulli random variable; for further details and constructions and discussion, see, e.g. Lejay (2006). We also require the function
\begin{aligned} z(h) \ = \ h\, \left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r(h)}} \right) \right] ^{-1/2} . \end{aligned}
where $$r(h)=r_1$$ for $$h>0$$ and $$r(h)=r_2$$ for $$h<0$$. We then have the following result (also proved in the Appendix).

### Theorem 3

Under the set-up and assumptions of Theorem 2, assuming the above proportionality condition and the choice (23), then as $$d\rightarrow \infty$$, the process $$\{X_t^{(d)}\}$$ converges weakly in the Skorokhod topology to a limit process X. Furthermore, the limit process has the property that if
\begin{aligned} Z_t = z\big ( h(X_t) \big ) , \end{aligned}
then Z is skew Brownian motion $$B^*_t$$ with reflection at
\begin{aligned} (3-2i) \left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r_i}} \right) \right] ^{-1/2} \int _{\beta _{\text {min}}}^1{1\over \ell (u) }du, \ \ \ i=1, 2.\nonumber \\ \end{aligned}
(24)
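
The rescaling z and the reflection points in (24) can be evaluated numerically. The sketch below assumes $$\ell (u) = \ell _0 u$$, so that $$\int _{\beta _{\text {min}}}^1 \mathrm{d}u/\ell (u) = \log (1/\beta _{\text {min}})/\ell _0$$. The function names are illustrative, not taken from the paper.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scale_factor(r, ell0=2.38):
    """The factor [2 * Phi(-ell0 / (2 * sqrt(r)))]^(-1/2) in z and (24)."""
    return (2.0 * std_normal_cdf(-ell0 / (2.0 * math.sqrt(r)))) ** -0.5

def z(h_value, r1, r2, ell0=2.38):
    """The map z(h): rescale h by the factor of the mode it lies in."""
    r = r1 if h_value > 0 else r2
    return h_value * scale_factor(r, ell0)

def reflection_points(beta_min, r1, r2, ell0=2.38):
    """Upper (i = 1) and lower (i = 2) reflection points from (24)."""
    integral = math.log(1.0 / beta_min) / ell0  # assumes ell(u) = ell0 * u
    upper = scale_factor(r1, ell0) * integral
    lower = -scale_factor(r2, ell0) * integral
    return upper, lower
```

For $$r=1$$ and $$\ell _0 = 2.38$$ the bracketed term $$2\varPhi (-\ell _0/2)$$ is about 0.234, so the scale factor is a little above 2.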

### Remark 2

It follows from the proof of Theorem 3 that the specific version of skew Brownian motion $$B^*_t$$ that arises in the limit is one with excursion weights proportional to
\begin{aligned}a = p \left[ 2\varPhi \left( {- \ell _0 \over 2 \sqrt{r_1}} \right) \right] ^{1/2} \text {and}~ b = (1-p) \left[ 2 \varPhi \left( {- \ell _0 \over 2 \sqrt{r_2}} \right) \right] ^{1/2}. \end{aligned}
That means that the stationary density for $$B^*_t$$ on the positive and negative values is proportional to a and b, respectively. This might seem surprising since the limiting weights of the modes should be equal to p and $$1-p$$, not proportional to a and b (unless $$r_1=r_2$$). The explanation is that, by (24), the lengths of the positive and negative parts of the domain are proportional to $$\left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r_1}} \right) \right] ^{-1/2}$$ and $$\left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r_2}} \right) \right] ^{-1/2}$$, respectively. Hence, the total stationary masses of the positive and negative parts (and hence also the limiting mode weights) are still p and $$1-p$$, as they should be.
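
The bookkeeping in this remark can be checked numerically: multiplying the excursion weights a and b by the corresponding interval lengths from (24) should return the mode weights. This is a sketch with hypothetical function names; the common factor $$\int _{\beta _{\text {min}}}^1 \mathrm{d}u/\ell (u)$$ is omitted because it cancels on normalisation.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def accept_term(r, ell0=2.38):
    """The quantity 2 * Phi(-ell0 / (2 * sqrt(r)))."""
    return 2.0 * std_normal_cdf(-ell0 / (2.0 * math.sqrt(r)))

def limiting_mode_masses(p, r1, r2, ell0=2.38):
    """Stationary mass of the positive and negative parts of the domain."""
    a = p * accept_term(r1, ell0) ** 0.5            # density level, positive side
    b = (1.0 - p) * accept_term(r2, ell0) ** 0.5    # density level, negative side
    length_pos = accept_term(r1, ell0) ** -0.5      # relative length from (24)
    length_neg = accept_term(r2, ell0) ** -0.5
    mass_pos = a * length_pos
    mass_neg = b * length_neg
    total = mass_pos + mass_neg
    return mass_pos / total, mass_neg / total
```

For any positive $$r_1$$ and $$r_2$$ the returned pair equals $$(p, 1-p)$$ up to rounding error.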

### 5.3 Complexity order

Theorems 2 and 3 have implications for the computational complexity of our algorithm.

In Theorem 2, the limiting diffusion process $$H_t$$ is a fixed process, not depending on dimension except through the value of $$\beta _{\text {min}}$$. It follows that if $$\beta _{\text {min}}$$ is kept fixed, then $$H_t$$ reaches 0 (and hence mixes modes) in time O(1). Since $$H_t$$ is derived (via $$X_t$$) from the $$\beta _t$$ process speeded up by a factor of d, it thus follows that for fixed $$\beta _{\text {min}}$$, $$\beta _t$$ reaches $$\beta _{\text {min}}$$ (and hence mixes modes) in time O(d). So, if $$\beta _{\text {min}}$$ is kept fixed, then the mixing time of the weight-preserving tempering algorithm is O(d), which is very fast. However, this does not take into account the dependence on $$\beta _{\text {min}}$$, which might also change as a function of d.

Theorem 3 allows for control of the dependence of mixing time on the values of $$\beta _{\text {min}}$$. The limiting skew Brownian motion process $$B^*_t$$ is a fixed process, not depending on dimension nor on $$\beta _{\text {min}}$$, with range given by the reflection points in (24). It follows that $$Z_t$$ reaches 0 (and hence mixes modes) in time of order the square of the total length of the interval, i.e. of order
\begin{aligned} \left( \sum _{i=1}^2\left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r_i}} \right) \right] ^{-1/2} \int _{\beta _\mathrm{min}}^1{1\over \ell (u) }\mathrm{d}u \right) ^2 \end{aligned}
In the Exponential Power Family case, this is easily computed to be $$O\big (d \, [\log \beta _{\text {min}}]^2\big )$$.
This raises the question of how small $$\beta _{\text {min}}$$ needs to be, as a function of dimension d. If the proposal scaling is optimal within each mode at the cold temperature, then the proposal scaling is $$O(d^{-1/2})$$. Then, at an inverse temperature $$\beta$$, the proposal scaling is $$O((\beta d)^{-1/2})$$. Hence, at an inverse temperature $$\beta$$, the probability of jumping from one mode to the other (a distance $$O(\sqrt{d})$$ away) is roughly of order $$e^{-\beta d^2}$$, since a Gaussian proposal of scale $$\sigma$$ covers a distance D with probability of order $$e^{-D^2/(2\sigma ^2)}$$. This is exponentially small unless $$\beta = O(1/d^2)$$. This indicates that for our algorithm to perform well, we need to choose $$\beta _{\text {min}} = O(1/d^2)$$. With this choice, the mixing time order becomes
\begin{aligned} \left( \sum _{i=1}^2\left[ 2 \, \varPhi \left( {- \ell _0 \over 2 \sqrt{r_i}} \right) \right] ^{-1/2} \int _{1/d^2}^1{1\over \ell (u) }\mathrm{d}u \right) ^2 \end{aligned}
In the Exponential Power Family case, this corresponds to $$O\big (d \, [\log d]^2\big )$$. That is, for the inverse-temperature process to hit $$\beta _{\text {min}}$$ and hence mix modes, it takes $$O\big (d \, [\log d]^2\big )$$ iterations. This is a fairly modest complexity order and compares very favourably to the exponentially large convergence times which arise for traditional simulated tempering as discussed in Sect. 2.2.
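
For concreteness, the Exponential Power Family calculation can be sketched as follows, assuming $$\ell (u) = \ell _0 u$$ and writing $$c_i = \left[ 2 \, \varPhi \left( - \ell _0 / (2 \sqrt{r_i}) \right) \right] ^{-1/2}$$ (a shorthand introduced here for brevity):
\begin{aligned} \int _{1/d^2}^1 \frac{1}{\ell _0 u}\, \mathrm{d}u = \frac{2\log d}{\ell _0}, \qquad \left( \sum _{i=1}^2 c_i \, \frac{2\log d}{\ell _0} \right) ^2 = \frac{4 (c_1+c_2)^2}{\ell _0^2}\, [\log d]^2 . \end{aligned}
This is $$O([\log d]^2)$$ on the time scale that has been speeded up by a factor of d; undoing the speed-up multiplies by d, which gives $$O\big (d \, [\log d]^2\big )$$ iterations.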

### 5.4 More than two modes

Finally, we note that for simplicity, the above analysis was all done for just two modes. However, a similar analysis works more generally. Indeed, suppose now that we have k modes, of general weights $$p_1,p_2,\ldots ,p_k \ge 0$$ with $$\sum _i p_i = 1$$. Then, when $$\beta$$ gets to $$\beta _{\text {min}}$$, the process chooses one of the k modes with probability $$p_i$$. This corresponds to $$\{Y_t\}$$ being replaced by a Brownian motion not on $$[-1,1]$$, but rather on a “star” shape with k different length-1 line segments all meeting at the origin (corresponding, in the original scaling, to $$\beta _{\text {min}}$$), where each time the Brownian motion hits the origin it chooses one of the k line segments with probability $$p_i$$ each. This process is called Walsh’s Brownian motion; see e.g. Barlow et al. (1989). (The case $$k=2$$ but $$p_1\not =1/2$$ corresponds to skew Brownian motion as above.) For this generalised process, a theorem similar to Theorem 2 can then be stated and proved by similar methods, leading to the same complexity bound of $$O\big (d \, [\log d]^2\big )$$ iterations in the multimodal case as well.
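
A discrete analogue of this star-shaped process is easy to simulate and gives some intuition for why the mode weights are preserved. The sketch below is a simple random walk on k legs, not Walsh's Brownian motion itself. The leg length, step rule and function name are assumptions made for illustration.

```python
import random

def star_walk_occupation(probs, leg_length, n_steps, seed=0):
    """Fraction of (non-origin) time a random walk spends on each leg.

    At the origin the walk picks leg i with probability probs[i] and steps
    onto it; on a leg it moves +1 or -1 with equal probability and is
    reflected at the far end of the leg.
    """
    rng = random.Random(seed)
    legs = list(range(len(probs)))
    leg, pos = 0, 0                      # start at the origin
    time_on_leg = [0] * len(probs)
    for _ in range(n_steps):
        if pos == 0:
            leg = rng.choices(legs, weights=probs)[0]
            pos = 1
        elif pos == leg_length:
            pos -= 1                     # reflect at the end of the leg
        else:
            pos += rng.choice((-1, 1))
        if pos > 0:
            time_on_leg[leg] += 1
    total = sum(time_on_leg)
    return [t / total for t in time_on_leg]
```

With equal leg lengths, the long-run occupation fractions of this walk are proportional to the selection probabilities, mirroring the role of $$p_1,\ldots ,p_k$$ in the text.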

## 6 Conclusion and further work

This article has introduced the HAT algorithm to mitigate the lack of regional weight preservation in standard power-based tempered targets. Our simulation studies show promising mixing results, and our theorems indicate that the mixing times can become polynomial rather than exponential functions of the dimension d, and indeed of order $$O(d[\log d]^2)$$ under appropriate assumptions.

Various questions remain before our HAT approach becomes more practically applicable. The “modal assignment function” needs to be specified in an appropriate way, and more exploration into the robustness of the current assignment mechanism is needed to understand its performance on heavier and lighter tailed distributions. The suggested HAT target assumes knowledge of the mode points, which one typically will not have to begin with; one would instead rely on effective optimisation methods to seek these out either during or prior to the run of the algorithm. Indeed, this has been partially explored by the authors in Tawn et al. (2018). The performance of HAT is heavily reliant on the mixing at the hottest temperature level; the use of RWM here can be problematic for HAT, since the heights of the more dispersed modes can be far lower than those of the narrower modes. As such, more advanced sampling schemes such as discretised tempered Langevin could give accelerated mixing at the hot state, the effects of which would be transferred to an improvement in the mixing at the coldest state.

In the theoretical analysis of Sect. 5, the spacing between consecutive inverse-temperature levels was taken to be $$O(d^{-1/2})$$ to induce a non-trivial diffusion limit. However, this result required strong assumptions. Accompanying work in Tawn and Roberts (2018) suggests that for the HAT algorithm under more general conditions, the consecutive optimal spacing should still be $$O(d^{-1/2})$$, with an associated optimal acceptance rate in the interval [0, 0.234].

## References

1. Atchadé, Y.F., Liu, J.S.: The Wang–Landau algorithm for Monte Carlo computation in general state spaces. Stat. Sin. 20, 209–33 (2004)
2. Atchadé, Y.F., Roberts, G.O., Rosenthal, J.S.: Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Stat. Comput. 21(4), 555–568 (2011)
3. Barlow, M.T., Pitman, J., Yor, M.: On Walsh’s Brownian motions. Séminaire de probabilités (Strasbourg) 23, 275–293 (1989)
4. Bédard, M., Rosenthal, J.S.: Optimal scaling of Metropolis algorithms: heading toward general target distributions. Canad. J. Stat. 36, 483–503 (2008)
5. Behrens, G.R.: Mode jumping in MCMC. Ph.D. thesis, University of Bath (2008)
6. Bhatnagar, N., Randall, D.: Simulated tempering and swapping on mean-field models. J. Stat. Phys. 164(3), 495–530 (2016)
7. Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley, London (1986)
8. Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. Comput. Sci. Stat. 23, 156–163 (1991)
9. Kone, A., Kofke, D.A.: Selection of temperature intervals for parallel-tempering simulations. J Chem Phys 122(20), 206101 (2005)
10. Kou, S., Zhou, Q., Wong, W.H.: Equi-energy sampler with applications in statistical inference and statistical mechanics. Ann. Stat. 34, 1581–1619 (2006)
11. Lejay, A.: On the constructions of the skew Brownian motion. Probab. Surv. 3, 413–466 (2006)
12. Liggett, T.M.: Continuous Time Markov Processes: An Introduction. American Mathematical Society, Providence (2010)
13. Marinari, E., Parisi, G.: Simulated tempering: a new Monte Carlo scheme. EPL (Europhys. Lett.) 19(6), 451 (1992)
14. Miasojedow, B., Moulines, E., Vihola, M.: An adaptive parallel tempering algorithm. J. Comput. Graph. Stat. 22(3), 649–664 (2013)
15. Neal, R.M.: Sampling from multimodal distributions using tempered transitions. Stat. Comput. 6(4), 353–366 (1996)
16. Nemeth, C., Lindsten, F., Filippone, M., Hensman, J.: Pseudo-extended Markov Chain Monte Carlo (2017). ArXiv e-prints arXiv:1708.05239
17. Predescu, C., Predescu, M., Ciobanu, C.V.: The incomplete beta function law for parallel tempering sampling of classical canonical systems. J. Chem. Phys. 120(9), 4119–4128 (2004)
18. Revuz, D., Yor, M.: Continuous Martingales and Brownian Motion, 3rd edn. Springer, Berlin (2004)
19. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 400–407 (1951)
20. Roberts, G.O., Rosenthal, J.S.: Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 60(1), 255–268 (1998)
21. Roberts, G.O., Rosenthal, J.S.: Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
22. Roberts, G.O., Rosenthal, J.S.: Examples of adaptive MCMC. J. Comput. Graph. Stat. 18(2), 349–367 (2009)
23. Roberts, G.O., Rosenthal, J.S.: Minimising MCMC variance via diffusion limits, with an application to simulated tempering. Ann. Appl. Probab. 24(1), 131–149 (2014)
24. Roberts, G.O., Gelman, A., Gilks, W.R.: Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
25. Tawn, N.: Towards optimality of the parallel tempering algorithm. Ph.D. thesis, University of Warwick (2017)
26. Tawn, N.G., Roberts, G.O.: Optimal Temperature Spacing for Regionally Weight-preserving Tempering (2018). arxiv:1810.05845v1
27. Tawn, N.G., Roberts, G.O., Moores, M., Assing, S.: Annealed leap point sampler. Manuscript in preparation (2018)
28. Tjelmeland, H., Hegstad, B.K.: Mode jumping proposals in MCMC. Scand. J. Stat. 28(1), 205–223 (2001)
29. Wang, F., Landau, D.: Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram. Phys. Rev. E 64(5), 056101 (2001)
30. Wang, J.S., Swendsen, R.H.: Cluster Monte Carlo algorithms. Phys. A Stat. Mech. Appl. 167(3), 565–579 (1990)
31. Woodard, D.B., Schmidler, S.C., Huber, M.: Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Ann. Appl. Probab. 19, 617–640 (2009a)
32. Woodard, D.B., Schmidler, S.C., Huber, M.: Sufficient conditions for torpid mixing of parallel and simulated tempering. Electr. J. Probab. 14, 780–804 (2009b)

© Springer Science+Business Media, LLC, part of Springer Nature 2019

## Authors and Affiliations

• Nicholas G. Tawn (1)
• Gareth O. Roberts (1)
• Jeffrey S. Rosenthal (2)

1. Department of Statistics, University of Warwick, Coventry, UK
2. Department of Statistical Sciences, University of Toronto, Toronto, Canada