Weight-preserving simulated tempering

Simulated tempering is a popular method of allowing MCMC algorithms to move between modes of a multimodal target density π\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi $$\end{document}. One problem with simulated tempering for multimodal targets is that the weights of the various modes change for different inverse-temperature values, sometimes dramatically so. In this paper, we provide a fix to overcome this problem, by adjusting the mode weights to be preserved (i.e. constant) over different inverse-temperature settings. We then apply simulated tempering algorithms to multimodal targets using our mode weight correction. We present simulations in which our weight-preserving algorithm mixes between modes much more successfully than traditional tempering algorithms. We also prove a diffusion limit for an version of our algorithm, which shows that under appropriate assumptions, our algorithm mixes in time O(d[logd]2)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(d [\log d]^2)$$\end{document}.


Introduction
Consider the problem of drawing samples from a target distribution, π(x) on a d-dimensional state space X where π(•) is only known up to a scaling constant.A popular approach is to use Markov chain Monte Carlo (MCMC) which uses a Markov chain that is designed in such a way that the invariant distribution of the chain is π(•).
However, if π(•) exhibits multimodality, then the majority of MCMC algorithms which use tuned localised proposal mechanisms, e.g.Roberts et al. [1997] and Roberts and Rosenthal [2001], fail to explore the state space, which leads to biased samples.Two approaches to overcome this multimodality issue are the simulated and parallel tempering algorithms.These methods augment the state space with auxiliary target distributions that enable the chain to rapidly traverse the entire state space.
The major problem with these auxiliary targets is that in general they don't preserve regional mass, see Woodard et al. [2009a], Woodard et al. [2009b] and Bhatnagar and Randall [2016].This problem can result in the required run-time of the simulated and parallel tempering algorithms growing exponentially with the dimensionality of the problem.
In this paper, we provide a fix to overcome this problem, by adjusting the mode weights to be preserved (i.e., constant) over different inverse-temperatures.We apply our mode weight correction to produce new simulated and parallel tempering algorithms for multimodal target distributions.We show that, assuming the chain mixes at the hottest temperature, our mode-preserving algorithm will mix well for the original target as well.
This paper is organised as follows.Section 2 reviews the simulated and parallel tempering algorithms and the existing literature for their optimal setup.Section 3 describes the problems with modal weight preservation that are inherent with the traditional approaches to tempering, and introduces a prototype solution called the HAT algorithm that is similar to the parallel tempering algorithm but uses novel auxiliary targets.Section 4 presents some simulation studies of the new algorithms.Section 5 provides a theoretical analysis of a diffusion limit and the resulting computational complexity of the HAT algorithm in high dimensions.Section 6 concludes and provides a discussion of further work.

Tempering Algorithms
There is an array of methodology available to overcome the issues of multimodality in MCMC, the majority of which use state space augmentation e.g.Wang and Swendsen [1990], Geyer [1991], Marinari and Parisi [1992], Neal [1996], Kou et al. [2006], Nemeth et al. [2017].Auxiliary distributions that allow a Markov chain to explore the entirety of the state space are targeted, and their mixing information is then passed on to aid inter-modal mixing in the desired target.A convenient approach for augmentation methods, such as the popular simulated tempering (ST) and parallel tempering (PT) algorithms introduced in Geyer [1991] and Marinari and Parisi [1992], is to use power-tempered target distributions, for which the target distribution at inverse temperature level β is defined as π β (x) ∝ [π(x)] β for β ∈ (0, 1].For each algorithm one needs to choose a sequence of n + 1 "inverse temperatures" such that ∆ := {β 0 , . . ., β n } where 0 ≤ β n < β n−1 < . . .< β 1 < β 0 = 1, so that π β0 equals the original target density π, and hopefully the hottest distribution π βn (x) is sufficiently flat that it can be easily sampled.
The ST algorithm augments the original state space with a single dimensional component indicating the current inverse temperature level creating a (d+ 1) -dimensional chain, (β, X), defined on the extended state space {β 0 , . . ., β n }× X that targets π(β, x) ∝ K(β)π(x) β (1) where ideally K(β) = x π(x) β dx −1 , resulting in a uniform marginal distribution over the temperature component of the chain.Techniques to learn K(β) exist in the literature, e.g.Wang and Landau [2001] and Atchadé and Liu [2004], but these techniques can be misleading unless used with care.The ST algorithm procedure is given in Algorithm 1.
for i ← 1 to s do (2)

13:
Do m updates to the X -component according to q β t+1 (x, •).

14:
end for 15: return {(β 0 , x 0 ), (β 1 , x 1 ), . . ., (β s+sm , x s+sm )} 16: end function The PT approach is designed to overcome the issues arising due to the typically unknown marginal normalisation constants.The PT algorithm runs a Markov chain on an augmented state space X (n+1) with target distribution defined as The PT algorithm procedure is given in Algorithm 2.
2.1 Optimal Scaling for the ST and PT Algorithms Atchadé et al. [2011] and Roberts and Rosenthal [2014] investigated the problem of selecting optimal inverse-temperature spacings for the ST and PT algorithms.Specifically, if a move between two consecutive temperature levels, β and β = β + , is to be proposed, then what is the optimal choice of ?Too large, and the move will probably be rejected; too small, and the move will accomplish little (similar to the situation for the Metropolis algorithm, cf.Roberts et al. [1997] and Roberts and Rosenthal [2001]).
for i ← 1 to s do

6:
Sample U ∼ Unif(0, 1) for p ← 0 to n do 13: Perform m updates to the p th chain according to q βp (x, •).

14:
end for 15: end for 16: return {X 0 , X 1 , . . ., X s+sm } 17: end function For ease of analysis, Atchadé et al. [2011] and Roberts and Rosenthal [2014] restrict to d-dimensional target distributions of the iid form: (4) They assume that the process mixes immediately (i.e., infinitely quickly) within each temperature, to allow them to concentrate solely on the mixing of the inverse-temperature process itself.To achieve non-degeneracy of the limiting behaviour of the inverse-temperature process as d → ∞, the spacings are scaled as O(d −1/2 ), i.e. = /d 1/2 where = (β) a positive value to be chosen optimally.Under these assumptions, Atchadé et al. [2011] and Roberts and Rosenthal [2014] prove that the inverse-temperature processes of the ST and PT algorithms converge, when speeded up by a factor of d, to a specific diffusion limit, independent of dimension, which thus mixes in time O(1), implying that the original ST and PT algorithms mix in time O(d) as d → ∞.They also prove that the mixing times of the ST and PT algorithms are optimised when the value of is chosen to maximise the quantity where I(β) = Var π β f (x) .Furthermore, this optimal choice of corresponds to an acceptance rate of inverse-temperature moves of 0.234 (to three decimal places), similar to the earlier Metropolis algorithm results of Roberts et al. [1997] and Roberts and Rosenthal [2001].
From a practical perspective, setting up the temperature levels to achieve optimality can be done via a stochastic approximation approach (Robbins and Monro [1951]), similarly to Miasojedow et al. [2013] who use an adaptive MCMC framework (see e.g.Roberts and Rosenthal [2009]).

Torpid Mixing of ST and PT Algorithms
The above optimal scaling results suggest that the mixing time of the ST and PT algorithms through the temperature schedule is O(d), i.e. grows only linearly with the dimension of the problem, which is very promising.However, this optimal, non-degenerate scaling was derived under the assumption of immediate, infinitely fast within-temperature mixing, which is almost certainly violated in any real application.Indeed, this assumption appears to be overly strong once one considers the contrasting results regarding the scalability of the ST algorithm from Woodard et al. [2009a] and Woodard et al. [2009b].Their approach instead relies on a detailed analysis of the spectral gap of the ST Markov chain and how it behaves asymptotically in dimension.They show that in cases where the different modal structures/scalings are distinct, this can lead to mixing times that grow exponentially in dimension, and one can only hope to attain polynomial mixing times in special cases where the modes are all symmetric.
The fundamental issue with the ST/PT approaches are that in cases where the modes are not symmetric, the tempered targets do not preserve the regional/modal weights.That motivates the current work, which is designed to preserve the modal weights even when performing tempering transformations, as we discuss next.
Interestingly, a lack of modal symmetry in the multimodal target will affect essentially all the standard multimodal focused methods: the Equi-Energy Sampler of Kou et al. [2006], the Tempered Transitions of Neal [1996], and the Mode Jumping Proposals of Tjelmeland and Hegstad [2001], all suffer in this setting.Hence, the work in this paper is applicable beyond the immediate setting of the ST/PT approaches.

Weight Stabilised Tempering
In this section, we present our modifications which preserve the weights of the different modes when performing tempering transformations.We first motivate our algorithm by considering mixtures of Gaussian distributions.
Consider a d-dimensional bimodal Gaussian target distribution with means, covariance matrices and weights given by µ i , Σ i , w i for i = 1, 2 respectively.Hence the target density is given by: where φ(x, µ, Σ) is the pdf of a multivariate Gaussian with mean µ and covariance matrix Σ.Assuming the modes are well separated, then for β not too small, the power tempered target from (1) can be approximated by a bimodal Gaussian mixture (cf.Woodard et al. [2009b], Tawn [2017]): where the weights are approximated as A one-dimensional example of this is illustrated in Figure 1, which shows plots of a bimodal Gaussian mixture distribution as β → 0. Clearly the second mode, which was originally wide but very short and hence of low weight, takes on larger and larger weight as β → 0, thus distorting the problem and making it very difficult for a tempering algorithm to move from the second mode to the first when β is small.Higher dimensionality makes this weight-distorting issue exponentially worse.Consider the situation with w 1 = w 2 but Σ 1 = I d and Σ 2 = σ 2 I d where I d is the d × d identity matrix.Then so the ratio of the weights degenerates exponentially fast in the dimensionality of the problem for a fixed β.This heuristic result in (8) shows that between levels there can be a "phase-transition" in the location of probability mass, which becomes critical as dimensionality increases.

Weight Stabilised Gaussian Mixture Targets
Consider targeting a Gaussian mixture given by in the (practically unrealistic) setting where the target is a Gaussian mixture with known parameters, including the weights.By only tempering the variance component of the modes, one can spread out the modes whilst preserving the component weights.We formalise this notion as follows: Definition 1 (Weight-Stabilised Gaussian Mixture (WSGM)).For a Gaussian mixture target distribution π(•), as in (9), the weight-stabilised Gaussian mixture (WSGM) target at inverse temperature level β is defined as Figure 2 shows the comparison between the target distributions used when using power-based targets vs the WSGM targets for the example from Figure 1.Using these WSGM targets in the PT scheme can give substantially better performance than when using the standard power based targets.This is very clearly illustrated in the simulation study section below in Section 4.1.From herein, when the term "WSGM ST/PT Algorithm" is used it refers to the implementation of the standard ST/PT algorithm but now using the WSGM targets from (10).

Approximating the WSGM Targets
In practice, the actual target distribution would be non-Gaussian, and only approximated by a Gaussian mixture target.On the other hand, due to the improved performance gained from using the WSGM over just targeting the respective power-tempered mixture, there is motivation to approximate the WSGM in the practical setting where parameters are unknown.To this end, we present a theorem establishing useful equivalent forms of the WSGM; these alternative equivalent forms give rise to a practically applicable approximation to the WSGM.
To establish the notation, let the target be a mixture distribution given by where g j (x) is a normalised target density.Then set Then we have the following result, proved in the Appendix.
Remark 1: In Theorem 1, statement (b) shows that second order gradient information of the h j 's can be used to preserve the component weight in this setting.
Remark 2: Statement (c) extends statement (b) to no longer require the gradient information about the h j but simply the mode/mean point µ j .
Essentially this shows that by appropriately rescaling according to the height of the component as the components are "powered up" then component weights are preserved in this setting.Remark 3: A simple calculation shows that statement (c) holds for a more general mixture setting when all components of the mixture share a common distribution but different location and scale parameters.

Hessian Adjusted Tempering
The results of Theorem 1 are derived under the impractical setting that the components are all known and that π(•) is indeed a mixture target.One would like to exploit the results of (b) and (c) from Theorem 1 to aid mixing in a practical setting where the target form is unknown but may be well approximated by a mixture.Suppose herein that the modes of the multimodal target of interest, π(•), are well separated.Thus an approximating mixture of the form given in ( 11) would approximately satisfy where M = sup j {h j (x)}.Hence it is tempting to apply a version of Theorem 1(b) to π(•) directly as opposed to the h j .So at inverse temperature β, the point-wise target would be proportional to where ).This only uses point-wise gradient information up to second order.At most locations, if the target was indeed a Gaussian mixture, and the inverse temperature β is not too hot so there is no significant tail overlap between modes, then this approach would give almost exactly the same value as π β (•) from ( 12) in the setting of (b).However, at locations between modes when the Hessian of log(π(x)) is positive semi-definite, this target behaves very badly, with explosion points that make it improper.
Instead, under this setting of well separated modes then one can appeal instead to the weight preserving characterisation in Theorem 1(c).Assume that one can assign each location in the state space to a "mode point" via some function x → µ x , with a corresponding tempered target given by In the setting where the modes are well separated and the target is really a Gaussian mixture, and the inverse temperature β is not too hot so there is no significant tail overlap between modes, this approach would again give almost exactly the same value as π β (•) from ( 12).This is the motivation for the following general target: Definition 2 (Hessian Adjusted Tempering (HAT) Target).For a target distribution π(x) on R d with a corresponding "mode point assigning function" We note that, assuming that π(x) = Cf (x) is continuous and bounded on R d , and that for all β ∈ (0, ∞), is a well defined probability density, i.e.Definition 2 makes sense.For a bimodal Gaussian mixture example Figure 3 compares the HAT target relative to the WSGM target; showing that until the hottest temperatures levels are reached the numerical version is a good approximation to the WSGM target.We propose to use these HAT targets in place of the power-based targets for the tempering algorithms given in Section 2. We thus define the following algorithms, which are explored in the following sections.
Definition 3 (Hessian Adjusted Simulated Tempering (HAST) Algorithm).The HAST algorithm is an implementation of the ST algorithm (Section 2, Algorithm 1) where the target distribution at inverse temperature level β is given by π H β (•) from Definition 2.
Definition 4 (Hessian Adjusted (Parallel) Tempering (HAT) Algorithm).The HAT algorithm is an implementation of the PT algorithm (Section 2, Algorithm 2) where the target distribution at inverse temperature level β is given by π H β (•) from Definition 2. Finally, we note that it can be non-trivial to set up the mode point assignment function µ x .In the setting where the mode points and their curvature information were known a priori one sensible option would be to use the Mahalonobis distance i.e. under the notation of the Gaussian mixture setting: However, in the setting where the mode points are unknown, a local optimisation could be performed, with an initialisation from the point x that the chain is currently located at.Indeed, if the local mode to the point x is Gaussian, then a single step of the Newton optimisation scheme would approximately locate the mode point i.e.
However, this step could be expensive and potentially non-robust, which could restrict the implementation of our algorithm on certain target distributions.
4 Simulation Studies

WSGM Algorithm Simulation Study
We begin by comparing the performances of a ST algorithm targeting both the power-based and WSGM targets for a simple bi-modal Gaussian target example.The example will illustrate that the traditional ST algorithm, using power-based targets, struggles to mix effectively through the temperature levels due to a bottleneck effect caused by the lack of regional weight preservation.The example considered is the 10-dimensional target distribution given by the bi-modal Gaussian mixture where w 1 = 0.2, w 2 = 0.8, µ 1 = (−10, −10, . . ., −10), µ 2 = (10, 10, . . ., 10), Σ 1 = 9I 10 and Σ 2 = I 10 .When power based tempering is used, then mode 1 accounts for only 20% of the mass at the cold level, but at the hotter temperature levels becomes the dominant mode containing almost all the mass.For both runs the same geometric temperature schedule was used: ∆ = {1, 0.32, 0.32 2 , . . ., 0.32 6 }.
This geometric schedule is justified by Corollary 5.2.1 of Tawn [2017], which suggests this is an optimal setup in the case of a regionally weight preserved PT setting.Indeed, this schedule induces a swap move acceptance rates around 0.22 for the WSGM algorithm; close to the suggested 0.234 optimal value.This temperature schedule gave swap acceptance rates of approximately 0.23 between all levels of the power-based ST algorithm except for the coldest level swap where this degenerated to 0.17.That shows that the power-based ST algorithm was set up essentially optimally according to the results in Atchadé et al. [2011].
In order to ensure that the within-mode mixing isn't influencing the temperature space mixing, a local modal independence sampler was used for the within-mode moves.This essentially means that once a mode has been found, the mixing is infinitely fast.We use the modal assignment function µ x which specifies that the location x is in mode 1 if x < 0 and in mode 2 otherwise.
Then the within-move proposal distribution for a move at inverse temperature level β is given by where φ µ,Σ (.) is the density function of a Gaussian random variable with mean µ and variance matrix Σ.
Figure 4 plots a functional of the inverse temperature at each iteration of the algorithm.The functional is where sgn(.) is the usual sign function and β min is the minimum of the inverse temperatures.This figure clearly illustrates that the hot state modal weight inconsistency leads the chain down a poor trajectory since at hot temperatures nearly all the mass is in modal region 1.This results in the chain never reaching the other mode in the entire (finite) run of the algorithm.Indeed, the trace plots in Figure 4 show that the chain is effectively trapped in mode 1, which although it only has 20% of the mass in the cold state, is completely dominant at the hotter states.

Simulation study for HAT
Consider a five-dimensional bi-modal Gaussian mixture target, with target density given by: where φ (µ,Σ) (.) is the density function of a 5-dimensional Gaussian with mean µ and covariance matrix Σ.The weights of the modes are equal, with w 1 = w 2 = 0.5, µ 1 = (−15, . . ., −15), µ 2 = (15, . . ., 15), Σ 1 = I 5 and Σ 2 = 3 2 × I 5 .A geometric inverse tempering schedule is justified for this setting by Corollary 5.2.1 of Tawn [2017].In this case, seven tempering levels, which are geometrically spaced and given by {1, 0.35, 0.35 2 , . . ., 0.35 6 }, was found to be optimal.With this setup, 10 repeated runs of the algorithms gave stable empirical estimates of the swap ratios at around 0.26 for the HAT algorithm (and only defer from this value in the hottest levels, when the weight preserving approximation becomes less valid).The algorithm is run using three sequential withintemperature level move proposals for every one temperature-swap proposal.
In a few of the PT algorithm runs on this setup, the cold state chain completely fails to discover the mode centred on µ 1 .These runs are evident in the weight estimation plots found below in Figure 6.In contrast, the HAT approach was highly successful in converging to the target distribution in a smaller number of iterations, with more frequent inter-modal jumps between modes in the cold state.17).On the left is the version using the WSGM targets, which mixes well through the temperature schedule and finds both modal regions.On the right is the version using the standard power-based targets, which fails to ever find one of the modes.Bottom: Trace plots of xt in each of the cases respectively.Figure 5 shows one example of a run of each of the algorithms in this setup, i.e. the PT, HAT and WSGM, illustrating that there is comparability in the trace plots of the performance of the HAT algorithm to the performance of the WSGM algorithm.
Figure 6 shows the running modal weight approximation of w 1 after the k th iteration of the cold state chains, after removing a burn-in period of 10,000 initial iterations, for the ten examples of the PT and HAT runs respectively.The weight approximation after iteration k is given by ŵk where X 1 i is the location of the first component of the five-dimensional chain at the coldest temperature level after the i th iteration.This figure indicates that the PT algorithm fails to provide a stable estimate for w 1 , and does not escape from its initialisation mode even after the 10,000 iteration burn-in.
Instead, consider a slight increase in the ambitiousness of the spacing, by decreasing the common ratio of the geometric spacings from 0.35 to 0.25.Figure 7 shows one representative run from each.Although slightly on the low side, the acceptance rate for swap moves throughout the temperature schedule is approximately 0.16 and would not be suggestive of any major mixing issue to the practitioner.However, the PT algorithm fails in these runs never finds the other mode at all.By contrast, the HAT algorithm (despite acceptance rates around 0.16 which are a little on the low side from that suggested for optimality) manages to make reasonably regular inter-modal jumps and thus mix successfully.

Diffusion Limit and Computational Complexity
In this section, we provide some theoretical analysis for our algorithm.We shall prove in Theorems 2 and 3 below that as the dimension goes to infinity, a simplified and speeded-up version of our weight-preserving simulated tempering algorithm (i.e., the HAST Algorithm from Definition 3, equivalent to the ST Algorithm 1 with the adjusted target from Definition 2) converges to a certain specific diffusion limit.This limit will allow us to draw some conclusions about the computational complexity of our algorithm.

Assumptions
We assume for simplicity (though see below) that our target density π is a mixture of the form (11) has just J = 2 modes, of weights w 1 = p and w 2 = 1 − p respectively, with each mixture component a special i.i.d.product g j (x) = d i=1 f j (x i ) as in (4).We further assume that a weight-preserving transformation (perhaps inspired by Theorem 1(b) or (c)) has already been  19), for 10 runs of the PT (left) and HAT (right) algorithms.In both cases a burn-in of 10,000 iterations was removed.Observe the lack of convergence of the weight estimates for the PT runs compared to the relatively strong estimates from the HAT runs.done, so that for each β.We consider a simplified version of the weight-preserving process, in which the chain always mixes immediately within each mode, but the chain can only jump between modes when at the hottest temperature β min , at which point it jumps to one of the two modes with probabilities p and 1 − p respectively.We shall sometimes concentrate on the Exponential Power Family special case in which each of the two mixture component factors is of the form f j (x) ∝ e −λj |x| r j for some λ j , r j > 0. This includes the Gaussian case for which r 1 = r 2 = 2 and λ j = 1/σ 2 j .(Note that the ( 14) modification requires the existence of second derivatives, corresponding to r j ≥ 2.) As in Atchadé et al. [2011] and Roberts and Rosenthal [2014], following Predescu et al. [2004] and Kone and Kofke [2005], we assume that the inverse temperatures are given by 1 = β for some fixed C 1 function .In many cases, including the Exponential Power Family case, the optimal choice of is (β) = β 0 for a constant 0 .= 2.38.18) using the PT, HAT and WSGM algorithms respectively.However, unlike the versions of the runs in Figure 5, the temperature schedules were on a more ambitious, sub-optimal, spacing.All cold level swap move acceptance rates were approximately 0.16; even for the PT runs.
We let β (d) t be the inverse temperature at time t for the d-dimensional process.To study weak convergence, we let X process, speeded up by a factor of d, where {N (t)} is an independent standard rate 1 Poisson process.To combine the two modes into one single process, we further augment X t by multiplying it by −1 when the algorithm's state is closer to the first mode, while leaving it positive (unchanged) when state is closer to the second mode.That is, the combined process {X t } does mode hopping with probability p or 1 − p whenever it reaches β min .

Main Results
Our first diffusion limit result is the following (proved in the Appendix).
Theorem 2. Assume the target density π is of the form (11) with J = 2 modes of weights w 1 = p and w 2 = 1 − p, with inverse weights chosen as in (20).Then as d → ∞, a rescaled version of the above process {X t } converges weakly in the Skorokhod topology to a fixed diffusion process H t , which depends on the target density factors f 1 and f 2 and the value of β min but is independent of the dimension d, and which has uniform stationary distribution, i.e. leaves constant densities invariant.
To make further progress, we require a proportionality condition.Namely, we assume that the quantities corresponding to I(β) = Var (log f 2 (x)) for β < 0 (corresponding to the second mode), and assume there is a fixed function I 0 : R + → R + and positive constants r 1 and r 2 such that we have I(β) = I 0 (β)/r 1 for β > 0 (in the first mode), while I(β) = I 0 (|β|)/r 2 for β < 0 (in the second mode).For example, it follows from Section 2.4 of Atchadé et al. [2011] that in the Exponential Power Family case, I(β) = 1/r 1 β 2 for β > 0 and I(β) = 1/r 2 β 2 for β < 0, so that this proportionality condition holds in that case.
Corresponding to this, we choose the inverse-temperature spacing function as follows (following Atchadé et al. [2011] and Roberts and Rosenthal [2014]): for some fixed constant 0 > 0.
To state our next result, we require the notion of skew Brownian motion, a generalisation of usual Brownian motion.Informally, this is a process that behaves just like a Brownian motion, except that the sign of each excursion from 0 is chosen using an independent Bernoulli random variable; for further details and constructions and discussion, see e.g.Lejay [2006].We then have the following result (also proved in the Appendix).
Theorem 3.Under the set-up and assumptions of Theorem 2, assume the above proportionality condition, and the choice (21).Then as d → ∞, a rescaled version of the above process {X t } converges weakly in the Skorokhod topology to a reflected skew Brownian motion B * t .That is, there is a fixed strictly increasing function U : R → R, depending on the target density factors f 1 and f 2 but independent of β min and the dimension d, with U (β min ) = 0, such that if Z t = U (X t ), then Z t ⇒ B * t as d → ∞, where B * t is skew Brownian motion with reflecting boundaries at U (±β min ).The possible values of Z t are thus in the range [U (−β min ), U (β min )].Furthermore, in the Exponential Power Family case, the function U is given by U (s) = r 1 log(s) for s > 0 and U (s) = r 2 log |s| for s < 0, so the range of Remark.It follows from the proof of Theorem 3 that the specific version of skew Brownian motion B * t that arises in the limit is one with excursion weights proportional to That means that the stationary density for B * t on the positive and negative values is proportional to a and b respectively.This might seem surprising since the limiting weights of the modes should be equal to p and 1−p, not proportional to a and b (unless r 1 = r 2 ).The explanation is that the lengths of the positive and negative parts of the domain are given by 2 respectively.Hence, the total stationary mass of the positive and negative parts -and hence also the limiting modes weights -are still p and 1−p as they should be.

Complexity Order
Theorems 2 and 3 have implications for the computational complexity of our algorithm.
In Theorem 2, the limiting diffusion process H t is a fixed process, not depending on dimension except through the value of β min .It follows that if β min is kept fixed, then H t reaches 0 (and hence mixes modes) in time O(1).Since H t is derived (via X t ) from the β t process speeded up by a factor of d, it thus follows that for fixed β min , β t reaches β min (and hence mixes modes) in time O(d).So, if β min is kept fixed, then the mixing time of the weight-preserving tempering algorithm is O(d), which is very fast.However, this does not take into account the dependence on β min , which might also change as a function of d.
Theorem 3 allows for control of the dependence of mixing time on the values of β min .The limiting skew Brownian motion process B * t is a fixed process, not depending on dimension nor on β min , with range U (±β min ).It follows that Z t reaches 0 (and hence mixes modes Since Z t is also derived (via X t ) from the β t process speeded up by a factor of d, it follows that β t reaches β min (and hence mixes modes) in time This raises the question of how large β min needs to be, as a function of dimension d.If the proposal scaling is optimal for within each mode at the cold temperature, then the proposal scaling is O(d −1/2 ).Then, at an inversetemperature β, the proposal scaling is O((βd) −1/2 ).Hence, at an inversetemperature β, the probability of jumping from one mode to the other (a distance O( √ d) away) is roughly of order e −βd 2 .This is exponentially small unless β = O(1/d 2 ).This indicates that, for our algorithm to perform well, we need to choose With this choice, the mixing time becomes ).In the Exponential Power Family case, this corresponds to O(d[log d] 2 ).That is, for the inverse-temperature process to hit β min and hence mix modes, takes O(d[log d] 2 ) iterations.This is a fairly modest complexity order, and compares very favorably to the exponentially long convergence times which arise with traditional simulated tempering as discussed in Subsection 2.2.

More than Two Modes
Finally, we note that for simplicity, the above analysis was all done for just two modes.However, a similar analysis works more generally.Indeed, suppose now that we have k modes, of general weights p 1 , p 2 , . . ., p k ≥ 0 with i p i = 1.Then when β gets to β min , the process chooses one of the k modes with probability p i .This corresponds to {Y t } being replaced by a Brownian motion not on [−1, 1], but rather on a "star" shape with k different length-1 line segments all meeting at the origin (corresponding, in the original scaling, to β min ), where each time the Brownian motion hits the origin it chooses one of the k line segments with probability p i each.This process is called Walsh's Brownian motion, see e.g.Barlow et al. [1989].(The case k = 2 but p 1 = 1/2 corresponds to skew Brownian motion as above.)For this generalised process, a theorem similar to Theorem 2 can be then stated and proved by similar methods, leading to the same complexity bound of O(d(log d) 2 ) iterations in the multimodal case as well.

Conclusion and Further Work
This article has introduced the HAT algorithm to mitigate the lack of regional weight preservation in standard power-based tempered targets.Our simulation studies show promising mixing results, and our theorems indicate the mixing times can become polynomial rather than exponential functions of the dimension d, and indeed of time O(d[log d] 2 ) under appropriate assumptions.
Various questions remain to make our HAT approach more practically applicable.For example, the "modal assignment function" needs to be specified in an appropriate way given the problem, and localised optimisation methods add significant additional computational expense.Also, the HAT targets may be poor approximations of the theoretically promising WSGM targets when the temperature reaches the hottest states, which could affect the algorithm's mixing performance.
In the theoretical analysis of Section 5, the spacing between consecutive inverse-temperature levels was taken to be O(d −1/2 ) to induce a non trivial diffusion limit.However, this result required strong assumptions.Accompanying work in Tawn [2017], in particular Theorem 5.1.1,suggests that for the HAT algorithm under more general conditions, the consecutive optimal spacing should still be O(d −1/2 ), with an associated acceptance rate ≤ 0.234.

Appendix
In this Appendix, we prove the theorems stated in the paper.

Proof of Theorem 1
Herein, assume the mixture distribution setting of ( 11) and ( 12) where the mixture components consist of multivariate Gaussian distributions i.e. g j (x) = φ(x; µ j , Σ j ).We prove each of the three parts of Theorem 1 in turn.
Proof of Theorem 1(a).Recall that h j (x) = w j φ(x; µ j , Σ j ) where ∃C ∈ R such that C J j=1 w j = 1.Hence, taking f j (x, β) = [h j (x)] β gives Proof of Theorem 1(b).Recall the result of Theorem 1(a).To adjust for the weight discrepancy from the cold state target a multiplicative adjustment factor, α j (x) is used such that where . An identical argument to Theorem 1(a) shows that this immediately gives W (j,β) ∝ w j .
In a Gaussian setting, up to a proportionality constant and at any point Substituting these gradient terms ( 23) and ( 24) into ( 22) and then using this form of ( 22) to create the adjustment factor α Proof of Theorem 1(c).Since h j (x) = w j φ(x; µ j , Σ j ) then and so W (j,β) ∝ w j .
Remark: It is possible to extend the weight adjusted target result of Theorem 1(c) to a setting where the target consists of a mixture of a general but common distribution, with each component having a different shape and scale factor; we plan to pursue this result elsewhere.

Proof of Theorem 2
It follows from Theorem 6 of Roberts and Rosenthal [2014] that as d → ∞, the process {X t } converges weakly, at least on X t > 0, to a diffusion limit {X t } t≥0 satisfying where (log f 2 (x)) for β < 0, so that positive values correspond to the first mode while negative values correspond to the second mode, then (25) also holds for X t < 0, except with the sign of the drift reversed.
To continue, we let h(x) = x βmin Re-writing everything in terms of H t = h(X t ), this becomes Now, in general, a diffusion of the form dH t = σ(H t )dB t + µ(H t )dt has stationary density π provided that 1 2 (log π) σ 2 + σσ = µ.In particular, it has uniform stationary distribution, i.e. with π constant, provided that µ = σσ , i.e. that 2µ = (σ 2 ) .In this specific case, we verify that which is indeed equal to 2µ since in the above equation This shows that a uniform density is stationary for H t .

Proof of Theorem 3
We now assume that I(β) = I 0 (β)/r 1 for β > 0, while I(β) = I 0 (|β|)/r 2 for β < 0, and that (β) = I −1/2 0 (β) 0 .This makes (x)I 1/2 (x) = 0 / √ r 1 for x > 0, and (x)I 1/2 (x) = 0 / √ r 2 for x < 0. In either case, (x)I 1/2 (x) is constant, i.e. has derivative zero.That in turn collapses (25), at least for H t = 0, into the simpler where r(H) = r 1 for H > 0 and r(H) = r 2 for H < 0. Finally, we set That is, Z t is a version of H t which is stretched by a piecewise linear spatial function, which is linear on each of the positive and negative values respectively.It then follows immediately from the above that dZ t = dB t , i.e. that Z t behaves like Brownian motion on each of its two branches (positive and negative).It remains only to prove that at Z t = 0, the convergence still holds.We complete the proof similarly to previous proofs of diffusion limits of MCMC algorithms (e.g.Roberts et al. [1997]; Roberts and Rosenthal [1998]; Bédard and Rosenthal [2008]), following the approach indicated in Chapter 4 of Ethier and Kurtz [1986], by proving that the generator G (d) of the original process under these combined transformations (i.e., jumping according to a rate d Poisson process, then transformed by the h function, and then stretched by the piecewise linear function) converges uniformly to the generator G * of skew Brownian motion, when applied to a core of functionals.
We let D be the set of all functions f : [−1, 1] → R which are continuous on [−1, 1], and twice-continuously-differentiable on [−1, 0] and also on [0, 1], with matching one-sided second derivatives f + (0) = f − (0), and skewed one- .Thus, C 2 functions are not contained in D due the enforced discontinuity of the first derivative at 0, but e.g.f ∈ D if f (x) = x 2 + ax1 x<0 + bx1 x>0 .In particular, D is dense (in the sup norm) in the set of all C 2 [−1, 1] functions, so in the language of Ethier and Kurtz [1986], D serves as a core of functions for which it suffices to prove that the generators converge.Furthermore, it follows from e.g.Liggett [2010] and Exercise 1.23 of Chapter VII of Revuz and Yor [2004]) that the generator of skew Brownian motion (with excursion weights proportional to a and b respectively) satisfies that G * f (x) = 1 2 f (0) for all f ∈ D, using the convention that f (0) represents the common value f + (0) = f − (0).Now, it follows from the previous discussion that for any fixed f ∈ D, That is, the generators do converge uniformly to the G * , as required, at least for z = 0, i.e. avoiding the mode-hopping value β min .To complete the proof, it suffices to prove that ( 26) also holds at z = 0, i.e. to prove: Lemma 1.We have that: Proof.Note first that if the original inverse-temperature process proposes to move the inverse-temperature from β min to β min + (β min )d −1/2 , then the H t process proposes to move from 0 to ±d −1/2 , and the Z t process proposes to move from 0 to ±d −1/2 2 Φ . Furthermore, the Z t process, like the X t process, is sped up by a factor of d, which multiplies its generator by d.Hence, we conclude that where α + is the acceptance probability for the original process to accept a proposal to increase the inverse-temperature from β min to β min + (β min )d −1/2 in mode 1, and α − is the acceptance probability for the same proposal in mode 2. Next, note that the process Z t has expected squared jumping distance equal to the square of its volatility, which is just equal to 1. On the other hand, the expected squared jumping distance must be equal to the squared distance of its proposed move times the acceptance probability.Hence, in mode 1, we must have Then, taking a Taylor series expansion, we obtain that for f ∈ D, Then, by definition of f ∈ D, the terms involving f + (0) and f − (0) cancel, and the terms involving f + (0) and f − (0) combine.Recalling the convention f (0) = f + (0) = f − (0), we obtain finally that as claimed.

Figure 1 :
Figure1: Power-tempered target densities of a bimodal Gaussian mixture using inverse temperature levels β = {1, 0.1, 0.05, 0.005} respectively.At the hot state it is evident that the mode centred on 40 begins to dominate the weight as β → ∞.

Figure 3 :
Figure3: For the same bimodal Gaussian target from Figure2, here is a comparison of the HAT vs WSGM targets at inverse temperatures β = 0.05 and β = 0.005 respectively.Note they are almost identical at the colder temperature; but they do differ at the hotter temperature with the appearance of a step jump in the HAT target, an open problem with the HAT target distributions.

Figure 4 :
Figure 4: Top: Trace plots of the functional of the simulated tempering chains given in (17).On the left is the version using the WSGM targets, which mixes well through the temperature schedule and finds both modal regions.On the right is the version using the standard power-based targets, which fails to ever find one of the modes.Bottom: Trace plots of xt in each of the cases respectively.

Figure 5 :
Figure5: Three trace plots of the cold state chain targeting the distribution in (18) using the PT, HAT and WSGM algorithms respectively.

Figure 6 :
Figure 6: Running estimate of the weight, w k 1 given in (19), for 10 runs of the PT (left) and HAT (right) algorithms.In both cases a burn-in of 10,000 iterations was removed.Observe the lack of convergence of the weight estimates for the PT runs compared to the relatively strong estimates from the HAT runs.

Figure 7 :
Figure7: Three trace plots of the cold state chain targeting the distribution in (18) using the PT, HAT and WSGM algorithms respectively.However, unlike the versions of the runs in Figure5, the temperature schedules were on a more ambitious, sub-optimal, spacing.All cold level swap move acceptance rates were approximately 0.16; even for the PT runs.