A Metropolis-class sampler for targets with non-convex support

We aim to improve upon the exploration of the general-purpose random walk Metropolis algorithm when the target has non-convex support $A \subset \mathbb{R}^d$, by reusing proposals in $A^c$ which would otherwise be rejected. The algorithm is Metropolis-class and under standard conditions the chain satisfies a strong law of large numbers and central limit theorem. Theoretical and numerical evidence of improved performance relative to random walk Metropolis is provided. Issues of implementation are discussed and numerical examples, including applications to global optimisation and rare event sampling, are presented.


Introduction
A key challenge for Markov chain Monte Carlo (MCMC) algorithms is the balance between global "exploration" and local "exploitation". In this paper we present the skipping sampler, a general-purpose and easily implemented Metropolis-class algorithm which is capable of improving exploration of targets π with nontrivial support A, by reusing proposals lying outside A. For this to be useful, we make the following standing assumption:

Assumption 1. π is a probability density function on R^d whose support A = supp(π) := {x ∈ R^d : π(x) > 0} satisfies Leb(A^c) > 0, where A^c is the complement of A and Leb denotes Lebesgue measure on R^d.
Such targets can arise for example in sampling from the superlevel sets of a density in the hybrid slice sampler [23], or when sampling from rare events.
Proposals in A c would be automatically rejected by standard algorithms such as random walk Metropolis (RWM), which exploits only local proposals for the next state of the chain. If a proposal lies in A c , the skipping sampler uses this information by attempting to cross A c in a sequence of linear steps, much as a flat stone can jump repeatedly across the surface of water, and offer a relevant proposal. Since this can be seen as a tunnelling effect through the zero-mass region A c , it is advantageous when A is non-convex and, in particular, disconnected. The resulting Markov chain satisfies a strong law of large numbers and central limit theorem under essentially the same conditions as for RWM, to which we provide theoretical and numerical performance comparisons.
To accelerate global exploration of the state space in MCMC algorithms, several approaches have by now been developed including tempering, Hamiltonian Monte Carlo and piecewise deterministic methods (see [28] for a recent review). However these methods are best suited to target densities with connected support, since the chain cannot cross regions where the target has zero density. A disconnected support would thus imply reducibility of the chain and its failure to converge to the target.
While RWM can be applied to targets with regions of zero density, its balance between exploration and exploitation can be problematic. If any state in A^c is proposed it is discarded and the chain does not progress. When A is non-convex, in particular, examples may be constructed where exploration is slow even when RWM is well tuned, making the chain sensitive to its initial state. This is illustrated in Figure 1, where red dots show the trace of a tuned RWM applied to a target with non-convex support, with four different initial states of the chain (the blue traces illustrate the increased exploration achieved by the skipping sampler). One solution is to use knowledge of the target to design a more advanced proposal, such as those reviewed in [28]. However this approach is unavailable if the target density is unknown, or is known but insufficiently regular. In this case, a general-purpose method is instead required.

Figure 1: Traces of the proposed skipping sampler (blue) and RWM algorithm (red) when the target has disconnected support. Both samplers are started at the same initial point X_0 and use the same underlying Gaussian proposal, whose standard deviation is tuned for a RWM acceptance ratio of 25%. The RWM typically localises around its initial state.
Theorem 3.2 establishes that the performance of the skipping sampler is at least as good as that of RWM according to the Peskun ordering. However the strengths of the proposed method lie primarily in applications to difficult low-dimensional problems. Conversely, in high dimensional problems the method generally offers similar performance to RWM. The aim of this paper is to present the method and illustrate its benefits via numerical examples, rather than to study any particular application exhaustively.
Although it is not random walk-based, the skipping sampler is Metropolis class. The symmetry of the skipping proposal can be seen intuitively, provided that the direction of the first proposal is chosen symmetrically and the sequence of jump lengths has the same distribution when reversed. Thus although the proposal density typically does not have a convenient closed form, it need not be evaluated in order to access the Metropolis acceptance probabilities. Another advantage is that the sampler is general-purpose, in the sense that no knowledge is required of the target density beyond the ability to evaluate it pointwise. In particular, it is not necessary to know the target's support a priori.
Beyond the context of random sampling, our work has applications to probabilistic methods for deterministic non-convex optimisation such as multistart [7,16] and basin-hopping [14,42]. These methods combine deterministic local search, such as given by a gradient method, with random perturbations or re-initialisations which may be performed using the skipping sampler to improve exploration. Section 5 provides numerical examples of these applications.

Related work
Many methods for accelerating the exploration of MCMC algorithms use prior knowledge of the target. For example, known mode locations may be used to design global moves for the sampler [1,26,12,34,35,40], or moves may be guided by the known derivatives of a differentiable target density [12,37].
Some exceptions are methods that generate multiple proposals, such as Multipoint MCMC [27] and Multiple-try Metropolis [15], which, like the skipping sampler, do not require additional information about the target. A fixed number of potentially correlated trial points are generated and one is selected at random, using a weight function which may be chosen to encourage exploration. Its random-grid implementation, in particular, has similarities with the skipping sampler. However, instead of fixing the number of draws, our proposal attempts to continue projecting further sequentially until it reaches A. Another advantage of our method is that it is Metropolis class, which simplifies both implementation and theoretical analysis.
During the review process our attention was brought to the very interesting sequential proposals of [24], which also introduces a Metropolis-class sampler that modifies the proposal sequentially. In the wider class of algorithms introduced there, it is possible to recognise methods close in spirit to the skipping sampler. When skipping is applied to the hybrid slice sampler as in Section 5.2, for example, the resulting algorithm is a particular instance of the sequential proposal. While in [24] the authors are motivated by the efficient implementation of Hamiltonian Monte Carlo, our own motivation is the efficient sampling of rare events. Together, these studies are suggestive of further potential to use sequences of proposals to accelerate MCMC methods in a range of situations, for instance within the framework introduced in [2].
Like the hit-and-run sampler [36] and related algorithms (see Section 6.3 of [4]), the skipping sampler splits a Markovian transition into the random generation of a direction followed by a move in that direction. When the target conditioned on any line in the space is available in closed form, the hit-and-run algorithm is of course preferable, for the reasons provided in [32]. Otherwise (which is more typical in applications), the skipping sampler offers a simple alternative and has the potential to increase exploration in the case of a non-convex support.
While also designed for targets with non-convex support, the ghost sampler introduced by the present authors in [22] is not general-purpose since it uses knowledge of the geometry of the set A, assuming it is polyhedral.
The rest of the paper is structured as follows. We introduce the skipping sampler in Section 2 and state our main results in Section 3. Implementation and extensions are discussed in Section 4. Numerical applications to slice sampling and rare event sampling are given in Section 5, together with an application to global optimisation. Section 6 is devoted to the proof of the main results.

Skipping sampler
In this section we introduce the skipping sampler on R^d, which is a modification of the RWM algorithm [18]. It is Metropolis-class although, unlike RWM, it does not perform a random walk.
Assumption 2. q is a symmetric (q(x) = q(−x) for all x ∈ R^d) continuous probability density function with q(0) > 0. We refer to q as the underlying proposal density.
Recall that given the state X_n of the chain, the RWM proposes a state Y_{n+1} sampled from the density y ↦ q(y − X_n) and accepts it as the next state X_{n+1} with probability

α(X_n, Y_{n+1}) := min{1, π(Y_{n+1})/π(X_n)},    (1)

else it is rejected by setting X_{n+1} = X_n. Here π is the target density, although we do not take care to distinguish between π and the corresponding distribution as this will not cause confusion. For convenience we use the common shorthand MH(π, q) (after the more general Metropolis-Hastings algorithm, see [6]) to refer to the Metropolis-class algorithm with target π and proposal q. Algorithm 1 presents the skipping sampler, which aims to endow RWM with an improved ability to cross regions in which the target has zero density. Beginning with a RWM proposal Y_{n+1}, it continues jumping in a linear trajectory and accepts or rejects the first state of nonzero target density to be encountered. Thus any RWM proposal Y_{n+1} ∈ A^c, which would be rejected by MH(π, q), is instead reused by adding jumps of random size in the direction Y_{n+1} − X_n until either A is entered, or skipping is halted.
Algorithm 1 can be interpreted as follows. The halting index K is an independent random variable with distribution K on Z_{>0} ∪ {∞}. If K = 1 then Y, the usual RWM proposal, is taken as the proposal. However if K > 1, the proposal is constructed using the skipping chain {Z_k}_{k≥0} on R^d defined by Z_0 := X, with X = X_n being the current state of the chain, and the update rule

Z_{k+1} := Z_k + R_{k+1} Φ,    (3)

where ‖·‖ denotes the Euclidean norm, Φ = (Y − X)/‖Y − X‖, R_1 = ‖Y − X‖, and the distance increments {R_k}_{k≥2} are independent draws from the distribution of the radial part ‖Y − X‖ conditional on the angular part Φ. Let T_A be the first entry time of the skipping chain into A:

T_A := min{k ≥ 1 : Z_k ∈ A},

with min ∅ := ∞. Writing T_A ∧ K for the smaller of the two indices T_A and K, we also require:

Assumption 3. The support A = supp(π) and distribution K are such that E[T_A ∧ K] < ∞.
Relevant considerations for the choice of K and q are discussed in Section 4. Note that almost surely we have both Y ≠ X (since q is a density) and T_A ∧ K < ∞ (Assumption 3), so the skipping proposal Z := Z_{T_A ∧ K} output by Algorithm 1 is well defined.
Algorithm 1: Skipping sampler (n-th iteration)
Input: the n-th sample X_n ∈ R^d
1  Set X := X_n and Z_0 := X;
2  Generate the initial proposal Y distributed according to the density u ↦ q(u − X);
3  Calculate the direction Φ := (Y − X)/‖Y − X‖;
4  Generate an independent random halting index K ∼ K;
5  Set k = 1 and Z_1 := Y;
6  while Z_k ∈ A^c and k < K do
7    Generate an independent distance increment R distributed as ‖Y − X‖ given Φ;
8    Set Z_{k+1} := Z_k + RΦ;
9    Increase k by one;
10 end
11 Set Z := Z_k;
12 Evaluate the acceptance probability:
     α(X, Z) := min{1, π(Z)/π(X)} if π(X) > 0, and α(X, Z) := 1 otherwise;    (2)
13 Generate a uniform random variable U on (0, 1);
14 If U ≤ α(X, Z), set X_{n+1} := Z; otherwise set X_{n+1} := X.

Proposition 2.1. The following statements hold: (i) Algorithm 1 is a symmetric Metropolis-class algorithm on the domain A. That is, there exists a transition density q_K (which depends on the halting index distribution K) satisfying q_K(x, z) = q_K(z, x) for all x, z ∈ A, such that Algorithm 1 is MH(π, q_K).
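A single iteration of the algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a spherically symmetric Gaussian underlying proposal (so the radial increments need no conditioning on the direction Φ) and a deterministic halting index `k_max`, and the target is supplied as a log-density returning `-inf` outside A.

```python
import numpy as np

def skipping_step(x, log_pi, sigma=1.0, k_max=10, rng=None):
    """One iteration of the skipping sampler (sketch of Algorithm 1).

    Assumes a spherically symmetric Gaussian proposal N(0, sigma^2 I),
    so each radial increment is an independent sigma * chi(d) draw.
    `log_pi` must return -inf outside the support A.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = len(x)
    y = x + sigma * rng.standard_normal(d)            # initial RWM proposal
    r = np.linalg.norm(y - x)
    phi = (y - x) / r                                 # direction of travel
    z, k = y, 1
    while log_pi(z) == -np.inf and k < k_max:         # skip while in A^c
        r_inc = sigma * np.sqrt(rng.chisquare(d))     # increment ~ ||Y - X||
        z = z + r_inc * phi
        k += 1
    # Metropolis accept/reject; acceptance probability 1 if pi(x) = 0
    if log_pi(x) > -np.inf:
        log_alpha = min(0.0, log_pi(z) - log_pi(x))
    else:
        log_alpha = 0.0
    if np.log(rng.uniform()) <= log_alpha:
        return z
    return x
```

For instance, with a target uniform on two disjoint intervals, the chain started in one component regularly skips across the gap to the other, which is exactly the behaviour RWM lacks.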
(ii) The inequality q K (x, z) ≥ q(z − x) holds for every x, z ∈ A.
Proof. (i) We now make rigorous the intuitive argument which was provided earlier for the symmetry of the skipping proposal. Conditional on the direction Φ, the skipping chain (3) is one-dimensional. We therefore analyse this one-dimensional chain, before integrating over Φ to obtain the unconditional transition density. Consider transitions of the skipping chain (3) between the states x and z in exactly k ∈ Z >0 steps. The intermediate states z 1 , . . . , z k−1 satisfy z i ∈ A c for i = 1, . . . , k − 1. The (sub-Markovian) density z → ξ k (x, z) of these transitions is given by the Chapman-Kolmogorov equation and the density ξ(r) of the distance increment R, which can in d-dimensional spherical coordinates be seen to be proportional to q(rΦ)r d−1 . Since the distance increments are i.i.d. and have symmetric densities (q(−rΦ) = q(rΦ)), simple manipulations of the Chapman-Kolmogorov integral confirm that it is unchanged when the start and end point, the order of the jumps, and the direction of each jump are all reversed. This establishes that the density ξ k is symmetric.
Next, note that Assumption 3 yields a decomposition of the law of the proposal over the possible values of T_A ∧ K. Hence, Z given x and Φ has a (sub-Markovian) density ξ_K, which simplifies when z ∈ A. Using d-dimensional spherical coordinates, the unconditional transition density is then the product of the density of Φ with the transition density conditional on Φ. As the proposal q and the densities ξ_k (for all k) are symmetric, so is ξ_K, and so is the skipping proposal q_K, whenever x, z ∈ A.
Since any proposal Z ∈ A c is almost surely rejected if x ∈ A, Algorithm 1 is a well defined Metropolis-class algorithm on A, i.e. it is equivalent to MH(π, q K ) on the domain A.
(ii) As noted above, if K = 1 then Algorithm 1 reduces to MH(π, q). In the decomposition of q_K over the halting index, the term corresponding to k = 1 alone contributes q(z − x), and the remaining terms are non-negative; this translates to the desired statement about proposal densities.

Theoretical results
For completeness, the discussion below uses the standard notions of π-irreducibility, Harris recurrence and absorbing sets; their definitions and further details may be found in [19].
Note that an absorbing set B gives rise to a Markov chain evolving on B whose transition kernel is simply P restricted to B (see [19, Theorem 4.2.4]).
It is clear from (1) that if x ∈ supp(π) then P (x, supp(π) c ) = 0, so that supp(π) is an absorbing set for the Metropolis algorithm with target π, and is a natural space of realisations of the chain. In what follows we therefore always consider the chain to evolve on the set A.
Regarding initialisation of the skipping sampler, note from (2) that if X_0 ∉ supp(π) in Algorithm 1 then Z is automatically accepted. In this case the skipping sampler first enters supp(π) at a random step N and, for 0 ≤ n ≤ N − 2, we have X_{n+1} = Z_K, that is, the maximum allowed number of skips is performed at each stage. This implies that the skipping procedure is also capable of improving exploration in this initialisation stage. Theorem 3.1 assumes that π(X_0) > 0, or equivalently that initialisation has already been performed.

Theorem 3.1 (SLLN). Suppose that MH(π, q) restricted to supp(π) is π-irreducible. Then MH(π, q_K) restricted to A = supp(π) is also π-irreducible and Harris recurrent. Moreover, the Strong Law of Large Numbers holds: if {X_i}_{i∈Z>0} is the skipping sampler (generated by Algorithm 1) initiated at X_0 = x ∈ A, then for every π-integrable function f we have

(1/n) Σ_{i=1}^n f(X_i) → ∫ f(x) π(x) dx    almost surely as n → ∞.

The conditions of Theorem 3.1, which are mild, are discussed in Section 4. There are also cases where MH(π, q) is not irreducible but MH(π, q_K) is, for instance when the dimension d = 1, q is a random walk proposal with finite support, and A^c is an interval too wide to be crossed by a single random walk step, but which can be skipped across.
The statement of the second main result uses some additional notation (for further details see [29]). Consider the Hilbert space L²(π) of square-integrable functions with respect to π, equipped with the inner product (for f, g ∈ L²(π))

⟨f, g⟩ := ∫ f(x) g(x) π(x) dx.

Since all Metropolis-class chains are time reversible, the Markov kernel of MH(π, q) defines a bounded self-adjoint linear operator P on L²(π), defined for f ∈ L²(π) via

(P f)(x) := ∫ f(y) P(x, dy).

If P is irreducible then its operator norm is ‖P‖ = 1, with f ≡ 1 as the unique eigenfunction for the eigenvalue 1, and the spectral gap of P is defined to be

λ := inf { ⟨(I − P)f, f⟩ : f ∈ L²(π), π(f) = 0, π(f²) = 1 }.

Theorem 3.2. Under the conditions of Theorem 3.1, denoting respectively by P and P_K the Markov kernels of MH(π, q) and MH(π, q_K) restricted to A = supp(π), the following statements hold:

(i) ⟨(I − P_K)f, f⟩ ≥ ⟨(I − P)f, f⟩ for every f ∈ L²(π);

(ii) If MH(π, q) has a non-zero spectral gap λ, then MH(π, q_K) also has a non-zero spectral gap λ_K that satisfies λ_K ≥ λ;

(iii) If the central limit theorem (CLT) holds for MH(π, q) and function f with asymptotic variance σ²(f), that is

√n ( (1/n) Σ_{i=1}^n f(X_i) − π(f) ) → N(0, σ²(f))

in distribution, then the CLT also holds for MH(π, q_K) and the same function f, with asymptotic variance σ_K²(f) ≤ σ²(f).

The inequality at point (i) of Theorem 3.2 gives a useful way to compare performance and mixing of different Markov kernels. Indeed, one can consider the Peskun-Tierney partial ordering (see [25] and [20,21,39]) on the family of bounded self-adjoint linear operators on L²(π) by setting P_1 ≥ P_2 whenever ⟨(I − P_1)f, f⟩ ≥ ⟨(I − P_2)f, f⟩ for all f ∈ L²(π). Intuitively, point (ii) of Theorem 3.2 means that the skipping sampler has the potential to mix faster than the classical random walk Metropolis, i.e., converge to stationarity in fewer steps. As explained in Section 2.1 of [32] the speed of convergence to stationarity can also be measured by other analytical quantities of the form inf_{f∈M} ⟨(I − P)f, f⟩ for some subset M of L²(π); it is straightforward to modify Theorem 3.2 accordingly. In the case of the spectral gap presented above we have M = {f ∈ L²(π) : π(f) = 0 and π(f²) = 1}.
It follows from point (iii) that in stationarity, the samples produced by the skipping sampler are at least as good for estimating π(f ) as those generated by RWM.
These theoretical benefits are balanced by increased computational complexity. The exploration added by the skipping sampler relative to RWM carries a computational cost, and the tradeoff between cost and benefit depends on the target density. In particular, this tradeoff could become disadvantageous if evaluating the target density (and thus assessing the event {Z k ∈ A}) in Algorithm 1 carries high cost. In the absence of global knowledge of the target, a pragmatic approach would be to run both methods and try to judge between their output. In Section 5.2, for example, we have compared the mean squared error of the coordinate projections against the increased number of evaluations of the target density. As noted in Section 4.1, the evaluations of the target density can also be vectorised with the aim of decreasing computation time.
Sufficient conditions for parts (ii) and (iii) of Theorem 3.2 have been studied in the literature. An aperiodic reversible Markov chain has non-zero spectral gap if and only if it is geometrically ergodic (see [29]), a property which is explored in [9,17,31] for random walk Metropolis algorithms. The CLT holds essentially for all f ∈ L 2 (π) under the assumption of geometric ergodicity (see [30,Section 5]), but also holds more generally (see [10]).

Implementation and extensions
Implementing Algorithm 1 involves two choices, an underlying proposal density q and a halting index K, which are discussed respectively in Sections 4.1 and 4.2. An alternative to Algorithm 1 using a 'doubling trick' for greater computational efficiency is given in Section 4.3.

Choice of q
In addition to Assumptions 2-3, to ensure that the SLLN holds (Theorem 3.1) we require MH(π, q) to be π-irreducible. This holds, for example, when π is continuous and bounded and q is everywhere positive. More generally, MH(π, q) is also irreducible if the interior of A is non-empty and connected and there exist δ, ε > 0 such that q(x) > ε > 0 whenever ‖x‖ < δ (see [38, Section 2.3.2]).
Since skipping can be seen as a way of endowing RWM with an improved ability to cross regions of zero density, a minimal approach would be to tune q as if it were to be employed in the RWM algorithm, for example achieving an acceptance ratio around 25% when q is employed in RWM. However we have observed empirically that a lower acceptance ratio, for example 15%, may further stimulate skipping.

Computational aspects
For sampling of the i.i.d. radial increments R_1, R_2, ..., it is desirable to choose q such that samples may be drawn efficiently from the conditional distribution of

R := ‖Y − X‖ given Φ = ϕ,    (6)

for all ϕ ∈ S^{d−1}. Convenient cases include when q is radially symmetric, so that conditioning is not required, or when q ∼ N(0, Σ) for some d × d covariance matrix Σ, so that, given direction ϕ, each increment R_i follows a generalised gamma distribution with density proportional to

r^{d−1} exp( −r² (ϕᵀ Σ^{−1} ϕ)/2 ),    r > 0.

Alternatively one may specify q indirectly by choosing the unconditional distribution of Φ and the conditional distribution of R := ‖Y − X‖ given Φ, then checking that the conditions of Theorem 3.1 are satisfied. If sampling from the distribution (6) is computationally expensive, however, the sampler may be modified by setting all R_k equal to R_1, so that only a single sample is required to generate a proposal and the skipping chain keeps moving in the direction Φ with jumps of equal size. While this modification would not change the mean of the distance ‖Z_m − Z_0‖ = Σ_{i=1}^m R_i covered by m steps of the skipping chain, it would increase its variance to m² Var(R).
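In the Gaussian case the generalised gamma density above is that of a chi(d) variable scaled by (ϕᵀΣ⁻¹ϕ)^{−1/2}, so it can be sampled directly. The sketch below illustrates this; the function name and interface are assumptions for illustration.

```python
import numpy as np

def radial_increment(phi, Sigma_inv, rng):
    """Sample R with density proportional to
    r^{d-1} * exp(-r^2 * (phi^T Sigma^{-1} phi) / 2), r > 0.

    For q ~ N(0, Sigma) this is the law of ||Y - X|| given the direction
    Phi = phi: a chi(d) variable scaled by (phi^T Sigma^{-1} phi)^{-1/2}.
    """
    d = len(phi)
    s = np.sqrt(phi @ Sigma_inv @ phi)       # inverse scale in direction phi
    return np.sqrt(rng.chisquare(d)) / s     # chi(d) draw, rescaled
```

Directions in which Σ assigns more variance thus produce stochastically larger increments, as expected.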

Anisotropy
If A has a known anisotropy, the angular part of the underlying proposal may be chosen to favour certain directions in comparison to others, for example by tailoring the covariance matrix in a normal proposal q ∼ N (0, Σ). This may be useful in high dimensional problems where otherwise, with high probability, the skipping chain may fail to re-enter A. It is not difficult to show that if the distance increment retains the properties used in the proof of Proposition 2.1, then the acceptance ratio (1) depends additionally on the ratio of the angular densities. Denoting by q ϕ (x, φ) the density of direction φ at the location x, for Φ = Yn+1−Xn Yn+1−Xn the acceptance probability then equals Although beyond the scope of this paper, in the absence of geometric knowledge of A other information, for instance the history of the chain, may be used in an online fashion to make the angular part of the underlying proposal density dependent on the chain's current location.

Choice of K
The simplest choice is a nonrandom halting index K ≡ k_s ∈ Z_{>1}. Under this choice the k_s skips can be vectorised and stored in memory along with the corresponding states Z_i for i = 1, ..., k_s, and the evaluations of whether Z_i ∈ A for i = 1, ..., k_s can then be performed in parallel. This increases computational speed at the expense of a k_s-times higher memory requirement plus the coordination cost of parallelisation, and the balance between benefit and cost is not explored here. However, if the additional computational costs are low, and if the costs of evaluating whether Z_i ∈ A are bounded, then the skipping sampler may be run at a speed approaching that of RWM.
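A minimal sketch of this vectorisation, under the assumption of a spherically symmetric Gaussian underlying proposal with unit scale (names and interface are illustrative):

```python
import numpy as np

def skip_states(x, y, k_s, rng, d):
    """Generate all k_s candidate skip states Z_1, ..., Z_{k_s} at once.

    Assumes a spherically symmetric Gaussian underlying proposal with unit
    scale, so the radial increments are i.i.d. chi(d) draws. Returns an
    array of shape (k_s, d); the caller selects the first row lying in A.
    """
    r1 = np.linalg.norm(y - x)
    phi = (y - x) / r1
    # first increment is ||Y - X|| itself; the rest are fresh chi(d) draws
    incs = np.concatenate([[r1], np.sqrt(rng.chisquare(d, size=k_s - 1))])
    return x + np.cumsum(incs)[:, None] * phi

# The membership tests can then be evaluated in one vectorised call, e.g.
# in_A = pi_vectorised(Z) > 0, taking Z[np.argmax(in_A)] if in_A.any()
# else Z[-1] as the proposal.
```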
There is of course interplay between the choices for K and q. For example, if an upper bound D is available for the diameter of A^c then we may use k_s = ⌈D / sup_ϕ σ_ϕ⌉, where σ_ϕ denotes the standard deviation of the conditional jump density in the direction ϕ. In the anisotropic case of Section 4.1.2, mutatis mutandis the halting index may also be made direction-dependent using a parametric family of constants (or distributions) K_ϕ, ϕ ∈ S^{d−1}. To preserve symmetry it is then necessary that K_ϕ = K_{−ϕ} for each ϕ ∈ S^{d−1}. Similar tradeoffs between K and q may also be made when K is chosen to be random with finite mean.
If skipping cannot be efficiently parallelised as suggested above then, clearly, large realisations of K can result in high computational costs if A is not re-entered. In the extreme, bearing in mind Assumption 3, an unbounded distribution K should only be taken if A^c is known to be bounded. If K cannot be chosen based on a known diameter D as above, then the absolute length of skipping trajectories may alternatively be controlled probabilistically using a large deviations estimate, as follows. If the conditional jump distribution is that of R, then the probability that a distance mr can be traversed in m skips is approximately (see for example [3])

P( R_1 + · · · + R_m ≥ mr ) ≈ exp( −m I(r) ),

where I(r) = sup_{θ>0} [θr − λ(θ)] is the Legendre-Fenchel transform of the logarithmic moment generating function of R, provided the latter is finite, i.e. λ(θ) = ln E[exp(θR)] < ∞ for all θ ∈ R.
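The Legendre-Fenchel transform can be evaluated numerically on a grid of θ values. The sketch below uses Exp(1) increments as an illustrative assumption (their log-moment generating function is finite only for θ < 1, relaxing the "all θ ∈ R" condition above, but the transform is still well defined), for which the closed form I(r) = r − 1 − ln r is available to check against.

```python
import numpy as np

def rate_function(r, lam, theta_grid):
    """Numerical Legendre-Fenchel transform I(r) = sup_theta [theta*r - lambda(theta)],
    maximised over a supplied grid of theta values."""
    return np.max(theta_grid * r - lam(theta_grid))

# Illustrative assumption: Exp(1) increments, for which
# lambda(theta) = -ln(1 - theta) on theta < 1 and I(r) = r - 1 - ln(r).
theta = np.linspace(1e-6, 0.999, 200_000)
lam_exp = lambda t: -np.log(1.0 - t)
r = 2.0
I_num = rate_function(r, lam_exp, theta)
I_exact = r - 1.0 - np.log(r)
```

The estimate exp(−m I(r)) then bounds, to logarithmic order, the chance that m skips of mean length 1 cover a distance mr, guiding the choice of mass to place on large values of K.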
Based on the above, if K is random and mass is to be placed on large values of K then this could lead to large computational costs. In this case the doubling trick of Section 4.3 may be applied.

The doubling trick
For clarity of exposition we first assume that A^c is convex. From (3), the state Z_k of the skipping chain is the partial sum

Z_k = Z_0 + ( Σ_{i=1}^k R_i ) Φ,

with T_A = min{k ≥ 1 : Z_k ∈ A}. The convexity of A^c induces an ordering on the skipping chain, in the sense that the indices k for which Z_k ∈ A^c form the discrete interval {1, ..., T_A − 1}. If T_A < K then Algorithm 1 evaluates T_A by sampling the partial sums {Z_k}_{k≥1} sequentially. The following alternative implementation evaluates T_A significantly faster, in order log₂ T_A steps. It requires that for any k, the sum Σ_{i=1}^k R_i may be sampled directly, both unconditionally and given the value of Σ_{i=1}^{2k} R_i, at a comparable cost to sampling R_1. This is possible, for example, when the R_i are exponentially distributed.
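For exponentially distributed increments the doubling and bisection phases can be sketched as follows, using two standard facts: S_{2k} − S_k ∼ Gamma(k, β), and given S_a and S_b the intermediate partial sum S_m is distributed as S_a + (S_b − S_a)·Beta(m − a, b − m). This is a hypothetical illustration in which the convex set A^c to be crossed is replaced by the interval [0, c).

```python
import numpy as np

def first_crossing_doubling(c, beta, rng):
    """Find T_A = min{k >= 1 : S_k >= c}, where S_k is a sum of i.i.d.
    Exp(beta) increments, using O(log T_A) samples rather than T_A.

    Doubling: S_{2k} = S_k + Gamma(k, beta).
    Bisection: given S_a < c <= S_b, sample
    S_m = S_a + (S_b - S_a) * Beta(m - a, b - m) for m = (a + b) // 2.
    Valid because the indices with S_k < c form an interval (the 'gap'
    [0, c) is convex). Returns (T_A, dict of sampled partial sums).
    """
    sums = {0: 0.0, 1: rng.exponential(beta)}
    k = 1
    while sums[k] < c:                       # doubling phase
        sums[2 * k] = sums[k] + rng.gamma(k, beta)
        k *= 2
    a, b = k // 2, k                         # now S_a < c <= S_b
    while b - a > 1:                         # bisection phase
        m = (a + b) // 2
        frac = rng.beta(m - a, b - m)
        sums[m] = sums[a] + (sums[b] - sums[a]) * frac
        if sums[m] < c:
            a = m
        else:
            b = m
    return b, sums
```

The Beta interpolation step is exact for exponential increments by the Dirichlet property of their spacings, so the returned T_A has the correct distribution while only O(log T_A) partial sums are ever instantiated.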
For generalisations of this trick, note first that the doubling trick can be used only to accelerate skipping over a convex subset B ⊂ A c , so that we only add a single distance increment at a time while the skipping chain is in A c \ B, and use the doubling trick while in B. The idea may then be applied to a maximal convex subset of A c , provided that such a subset is known. Then note that if B 1 , . . . , B n B are all convex subsets of A c , the doubling trick may be used to traverse each convex subset B i in turn, if needed. Thus the idea may be applied to an inner approximation of A c by a union of balls, for example.

Numerical examples
In order to motivate some applications, Section 5.1 begins with a general discussion of targets for which the skipping sampler offers an advantage over RWM. The numerical example of Section 5.3, in the context of rare event sampling, illustrates an improvement in exploration achieved by our method. Then, in an application to optimisation, Section 5.4 provides quantitative examples of performance improvements obtained when the skipping sampler is used as a subroutine in probabilistic methods for non-convex optimisation. The Python code used to create all these numerical examples and figures is available at [43].

General considerations
Note firstly that if the initial proposal Y lies in A c then it would be rejected by the RWM algorithm. Instead, in Algorithm 1 it is reused. Thus skipping offers an advantage over RWM if the initial proposal Y regularly lies in A c . Secondly, when Y ∈ A c the skipping proposal Z of Algorithm 1 needs regularly to be accepted (which in turn necessitates Z ∈ A). By construction (since Z lies beyond Y on the straight line between the current state X n ∈ A of the chain and Y ∈ A c ), this requires the support A of the target to be non-convex.
The dimension d also plays a key role. Considering an example where the support A is the union of two disjoint balls in R d , by increasing d we reduce the probability that Z ∈ A. Hence, the benefit of skipping is greatest in low dimensions and then gradually decreases. Nevertheless, in Section 5.3 we show that in special cases the sampler can be beneficial even in high dimensions.
We also note the following tradeoff. Due to the increased exploration offered by the skipping sampler, the density encountered upon landing at Z ∈ A after crossing A c may be significantly different from that at the current state X n of the Markov chain. In particular, if the target density does not vary slowly then the acceptance ratio α(X, Z) may be so low that such skips are not regularly accepted. Although this tradeoff is problem dependent, it does not apply in the rare event example of Section 5.3.

Hybrid Slice Sampler
The slice sampler may be used to sample from a density ρ on R^d as follows. Given the current sample X_n, the following two steps generate the next sample X_{n+1}: (i) pick t uniformly at random from the interval [0, ρ(X_n)], (ii) sample X_{n+1} uniformly from the 'slice' or superlevel set

L(t) := {x ∈ R^d : ρ(x) ≥ t}.

We refer the reader to [13,23] and references therein for more information on the slice sampler and its convergence properties.
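For intuition, steps (i)-(ii) can be carried out exactly when the slice is available in closed form. The sketch below is an illustrative assumption (not the paper's example): it targets ρ(x) = e^{−x} on x ≥ 0, where the slice {x : ρ(x) ≥ t} is simply the interval [0, −ln t].

```python
import numpy as np

def slice_sampler_exp(n, rng):
    """Idealised slice sampler (steps (i)-(ii)) for rho(x) = exp(-x),
    x >= 0, whose slice {x : rho(x) >= t} = [0, -ln t] is known exactly.
    In general step (ii) is infeasible and is replaced by an MCMC move,
    giving the hybrid slice sampler."""
    x = 1.0
    out = np.empty(n)
    for i in range(n):
        t = rng.uniform(0.0, np.exp(-x))    # step (i): vertical level
        x = rng.uniform(0.0, -np.log(t))    # step (ii): uniform on the slice
        out[i] = x
    return out
```

When the slice is non-convex, replacing the exact step (ii) by RWM risks confinement to one component of the slice, which is precisely where the skipping sampler helps.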
Step (ii) is typically infeasible in multidimensional settings. Instead, in the Hybrid Slice Sampler (HSS) a Markov chain is used to approximately sample the uniform distribution on the slice. The following example illustrates the potential advantage of using the skipping sampler rather than RWM to generate this chain, since the slice may not be convex.
For ρ we take a uniform mixture of m = 7 standard normal densities in d = 5 dimensions, whose means are drawn uniformly at random from a box B = [−12, 12]^5. The underlying RWM proposal is a spherically symmetric Gaussian, with variance tuned to achieve an acceptance ratio of 23.5% in RWM. Independent trajectories (started in stationarity) of n = 2 · 10^5 steps were generated for the HSS algorithm with, respectively, the RWM and the skipping sampler used to sample from the superlevel sets. The halting index is taken to be deterministic, P[K_ϕ = 15] = 1 for all ϕ ∈ S^{d−1}. As can be seen from Figure 2, the RWM implementation remains in the mode in which it was initiated. In contrast, the skipping sampler version transitions regularly between the seven modes. The experiment was run 100 times (on the same Gaussian mixture), during which skipping transitions happened on average 16 times per run. While the number of evaluations of the target density increased 11.61-fold on average, skipping greatly reduced the mean squared error (MSE) for the estimators of the coordinate-wise means. The MSE for RWM and the reduction estimates (MSE for RWM divided by MSE for skipping) for each of the five coordinates are reported in Table 1. Hence, in this example the skipping sampler is roughly 12 times more expensive to compute, but produces samples with much greater effective sample size.

Rare event sampling
The aim in this example is to sample rare points under a complex density ρ on R^d, by sampling from its intrinsic tail or sublevel set A = {x ∈ R^d : ρ(x) ≤ a} for some a > 0. As an illustration let ρ be a mixture of m = 20 Gaussian distributions, with randomly drawn means, covariances and mixture coefficients. We use the tails given by the levels a = e^{−15} and a = e^{−350} respectively for dimensions d = 2 and d = 50. In the case d = 2, a visual illustration of Theorem 3.2 is provided by plotting comparisons of the exploration achieved in 10^5 steps of RWM and the skipping sampler respectively. Since the superlevel sets of a finite Gaussian mixture are bounded, in this example we may take the halting index K = ∞. The fact that proposals are often re-used rather than rejected is well illustrated by the acceptance rates, which are 23.7% and 43.3% for RWM and the skipping sampler respectively. Further, ∂A is disconnected. While the 'inner' component is not visited by RWM in this sample, the skipping sampler regularly passes through A^c to visit both components, thus exploring ∂A more quickly. Despite the 3.45 times more target evaluations required for the skipping sampler, the benefits in this example are clearly worthwhile. Figure 4 shows the evolution of the chain's first coordinate in the case d = 50. While the boundary of A cannot be easily visualised here, the faster mixing of the skipping sampler is again apparent. The successful re-use of proposals by skipping across A^c again constituted approximately 18% of the chain's steps, suggesting superdiffusive exploration. The respective acceptance rates were 22.2% for RWM and 48.1% for the skipping sampler. The benefits of skipping are again seen to be worth the computational cost, since this time the skipping sampler required only 1.44 times more target evaluations than RWM.

Applications to optimisation
The challenging problem of finding the global minimum of a non-convex function has attracted much attention, and several probabilistic methods and heuristics have been developed, including simulated annealing [11], multistart [7,16], basin-hopping [14,42], and random search [33]. In this section we illustrate how the skipping sampler can be used in difficult low-dimensional examples, either to bias the choice of initial points of such methods or as a subroutine, in order to improve exploration. Below we consider an optimisation problem in R^d of the form
$$\min_{x \in D} f(x), \tag{11}$$
and take as the target density the Boltzmann distribution with temperature T ≥ 0 and energy function f, conditioned on the region D, that is
$$\pi(x) \propto e^{-f(x)/T}\,\mathbf{1}_D(x). \tag{12}$$
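The conditioned Boltzmann target in (12) is straightforward to evaluate up to its normalising constant; a minimal sketch, assuming f and the membership test for D are supplied as callables (the names below are ours):

```python
import numpy as np

def boltzmann_unnorm(x, f, T, in_D):
    """Unnormalised Boltzmann density exp(-f(x)/T), conditioned on D."""
    return float(np.exp(-f(x) / T)) if in_D(x) else 0.0

# toy energy and constraint region for illustration
f = lambda x: float(np.sum(np.asarray(x) ** 2))
in_D = lambda x: bool(np.all(np.abs(np.asarray(x)) <= 5))
```

Only ratios of this unnormalised density enter the Metropolis acceptance step, so the normalising constant is never needed; lowering T concentrates the mass ever more sharply around the minima of f on D.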

Monotonic skipping sampler
While outside the scope of our theoretical analysis, a variation on Algorithm 1 is one in which the support A is not constant. In particular, defining the sublevel sets S(X_n) = {x ∈ R^d : f(x) ≤ f(X_n)}, a monotonic skipping sampler (MSS) may be defined in which the support at the n-th step of Algorithm 1 is A_n := S(X_n) ∩ D (setting A_0 := D), and the target density π = π_n is uniform on A_n. That is, only downward moves (Markov chain transitions with f(X_{n+1}) ≤ f(X_n)) are accepted. By construction we have X_n ∈ A_n for each n ∈ N. Also, since the random subsets {A_n}_{n≥0} are decreasing, with A_{n+1} ⊆ A_n for every n, they always contain the global minima of f while containing progressively fewer of its non-global local minima. In common with the skipping sampler with fixed support A, the n-th step of the MSS requires no information about the sublevel set S(X_n) beyond the ability to check whether the proposal Z lies in A_n.
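A single MSS step can be sketched by specialising the skipping move to the current sublevel set. This is an illustrative sketch under our own naming (`mss_step`), with membership in A_n = S(X_n) ∩ D checked purely through evaluations of f and the constraint test, as described above.

```python
import numpy as np

rng = np.random.default_rng(2)

def mss_step(x, f, in_D, step_scale=1.0, K=50):
    """One monotonic-skipping move: the support at this step is
    A_n = {z : f(z) <= f(x)} ∩ D, so only downward moves are accepted."""
    level = f(x)
    in_A = lambda z: in_D(z) and f(z) <= level
    w = rng.normal(scale=step_scale, size=x.shape)
    z = x + w
    if in_A(z):
        return z
    phi = w / np.linalg.norm(w)          # skip along a fixed direction
    for _ in range(K):
        z = z + abs(rng.normal(scale=step_scale)) * phi
        if in_A(z):
            return z
    return x                              # halted: stay at x

f = lambda x: float(np.sum(x ** 2))                  # toy objective
in_D = lambda x: bool(np.all(np.abs(x) <= 10))       # box constraint
x = np.array([5.0, 5.0])
vals = [f(x)]
for _ in range(200):
    x = mss_step(x, f, in_D)
    vals.append(f(x))
```

The trajectory of objective values is non-increasing by construction, while the skipping part lets the chain pass through regions where f exceeds the current level to reach other connected components of the sublevel set.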
To illustrate a trajectory of the MSS, take f to be the so-called eggholder function in dimension d = 2, i.e. an optimisation test function often used in the literature [8], with D = [−512, 512] 2 . Figure 5 shows some snapshots from a trajectory of the MSS, also indicating the progressively shrinking sublevel sets A n = S(X n ) ∩ D.
In this subsample the state of the chain (starred marker) is seen to jump four times between different connected components of the sublevel sets (in the subfigures for n = 67, 84, and 108), which happens by means of the skipping mechanism. In Sections 5.4.2 and 5.4.3 we provide numerical examples of performance improvements achieved when the MSS is used as a subroutine in the multistart and basin-hopping optimisation procedures respectively.

Augmented multistart method
Given a nonconvex optimisation problem of the form (11) with possibly several local minima, a classical strategy for finding its global minimum is to restart the local optimisation method of choice from several different points. The multistart method produces the desired number N of initial points by sampling them uniformly at random in D. Note that in the above setup, f may be set equal to positive infinity outside an arbitrary constraint set. If the set f^{−1}(R) of feasible points has low volume compared to D, then many of the randomly sampled points may lie outside it, making this multistart initialisation procedure inefficient. In this case, recalling the remark on the initialisation of Algorithm 1 from Section 2, the MSS is capable of accelerating the search for feasible starting points x ∈ f^{−1}(R).
Equally, sampling starting points uniformly at random may not be helpful if the basin of attraction of the global minimum has low volume. We can mitigate both of these issues by "improving" each of the points proposed by the multistart method as follows: the N initial points produced by the multistart method are each evolved for a number of MSS steps before the local optimisation begins. To illustrate this MSS-augmented multistart method, we present an example again using the eggholder function. We first consider the unconstrained optimisation problem, which has the optimal solution x* = (512, 404.2319), attaining the value f_eggholder(x*) = −959.6407. Averaging over N = 1000 runs, in Table 2 we summarise the "goodness" of the N starting points given by the following three methods: (i) the multistart method, i.e., initial points uniformly distributed on [−512, 512]^2; (ii) the initial points obtained in (i), evolved for m = 100 steps of a RWM with Gaussian proposals with covariance matrix 2 · I and the Boltzmann distribution π in (12) with T = 1.0 as target distribution; (iii) the initial points obtained in (i), evolved for m = 100 steps of the MSS using a deterministic halting index K = 200 and the same Gaussian proposal density as in (ii).
The MSS-augmented multistart method effectively biases the initial points towards the global minimum x*, bringing 65.7% of them into the correct basin of attraction, although at the expense of more function evaluations than the other two methods.
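The augmentation step can be sketched as follows. For brevity this sketch uses downward-only random-walk moves, a simplified stand-in for the MSS without the skipping part, and a toy multimodal objective; the name `improve_start` and all parameter values here are ours, not those of the experiment above.

```python
import numpy as np

rng = np.random.default_rng(3)

def improve_start(x0, f, steps=100, scale=2.0):
    """Evolve one multistart initial point with downward-only random-walk
    moves (a simplified stand-in for the MSS, omitting skipping)."""
    x, fx = x0, f(x0)
    for _ in range(steps):
        z = x + rng.normal(scale=scale, size=x.shape)
        fz = f(z)
        if fz <= fx:              # monotone acceptance: only downward moves
            x, fx = z, fz
    return x

f = lambda x: float(np.sum(x ** 2) + 10 * np.cos(x).sum())   # multimodal toy
starts = rng.uniform(-512, 512, size=(20, 2))                # plain multistart
improved = np.array([improve_start(s, f) for s in starts])
```

Each improved point is then handed to the local optimiser in place of the raw uniform draw, biasing the restarts towards low-lying basins at the cost of the extra function evaluations incurred in the improvement phase.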

Skipping sampler as basin-hopping subroutine
Besides improving the multistart method, the skipping sampler can also be used to improve stochastic techniques for non-convex optimisation, in particular the so-called basin-hopping method. In this subsection we explore this novel idea, although the implementation details and a systematic comparison with other global optimisation routines are left for future work.

Basin-hopping is a global optimisation technique proposed in [42], which at each stage combines a random perturbation of the current point, local optimisation, and an acceptance/rejection step. The random perturbation consists of i.i.d. uniform simultaneous perturbations in each of the coordinates, that is, a random walk step. The stopping criterion for this iterative procedure is often a maximum number of function evaluations, or no improvement being observed for a certain number of consecutive iterations.
The random walk step may be replaced by a step of the MSS. That is, at step n, given the current point X_{n−1}, we first sample a new point Y_n from the sublevel set S(X_{n−1}) ∩ D using the MSS, and then perform a local optimisation procedure starting from Y_n to obtain the new point X_n. This idea is summarised in the pseudo-code presented in Algorithm 3. The MSS variant of the basin-hopping method is related to monotonic sequence basin-hopping (MSBH), proposed in [14], which also accepts only new points in S(X_{n−1}) ∩ D. However, MSBH uses only local uniform perturbations and thus faces the same exploration challenges as RWM when S(X_{n−1}) ∩ D is disconnected.
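The structure of the monotone basin-hopping loop can be sketched as below. This is an illustrative skeleton, not Algorithm 3 itself: the perturbation is plain Gaussian noise where the full algorithm would draw from the MSS, the local optimiser is a crude finite-difference gradient descent standing in for the paper's local optimisation subroutine, and all names and parameters are ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def local_min(x, f, lr=0.01, iters=500, eps=1e-5):
    """Crude finite-difference gradient descent (stand-in for the
    local optimisation subroutine)."""
    for _ in range(iters):
        g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])
        x = x - lr * g
    return x

def basin_hop_monotone(x0, f, n_hops=30, scale=2.0):
    """Basin-hopping with monotone acceptance: perturb, locally optimise,
    and accept only if the objective does not increase. In the full
    algorithm the perturbation would be an MSS draw from the current
    sublevel set; plain Gaussian noise is used here for brevity."""
    x = local_min(np.asarray(x0, dtype=float), f)
    fx = f(x)
    for _ in range(n_hops):
        y = local_min(x + rng.normal(scale=scale, size=x.shape), f)
        fy = f(y)
        if fy <= fx:
            x, fx = y, fy
    return x, fx

f = lambda x: float(np.sum(x ** 2) + 4 * np.sin(3 * x).sum())  # toy objective
x_best, f_best = basin_hop_monotone([3.0], f)
```

Swapping the Gaussian perturbation for an MSS draw is exactly what lets the method hop between disconnected components of the current sublevel set, rather than only between neighbouring basins.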
In Table 3 we compare the performance of basin-hopping and of MSBH with the proposed basin-hopping with skipping. The MSS subroutine leads to 54.4% of the initial (uniformly distributed) points converging to the basin of attraction of the global minimum x*. This sharp improvement over basin-hopping (which has a corresponding success rate of only 2.2%) requires only ten times more function evaluations. In [5] we present additional performance metrics for the basin-hopping method with skipping and more extensive numerical results over a large collection of test functions.

Proof of Theorem 3.1
For each x ∈ supp(π) let P_x be a probability measure carrying all random variables used in Algorithm 1, such that X_0 = x almost surely under P_x. Denote respectively by {Y_m}_{m≥1} and {X_n}_{n≥1} the proposals generated by the MH(π, q) algorithm and the Markov chain returned by the algorithm. Writing $A_n := \bigcap_{i=1}^{n} \{X_i = Y_i\}$ for the event that the first n proposals of MH(π, q) are all accepted, we have the following.

Lemma 6.1. If the chain MH(π, q) restricted to supp(π) is π-irreducible, then P_x(A_m) > 0 for all x ∈ supp(π) and all m ≥ 1.
Proof. Fix x ∈ supp(π) and suppose otherwise for a contradiction, letting n be the smallest integer such that P_x(A_n) = 0. Clearly n ≥ 2, since otherwise, P_x-almost surely, we would have X_k = X_0 for all k ≥ 1, contradicting the assumption of π-irreducibility. Therefore P_x(A_{n−1}) > 0, and we may write p for the density of X_{n−1} conditional on the event A_{n−1}. Then by the Markov property we have
$$0 = \mathbb{P}_x(A_n) = \mathbb{P}_x(A_{n-1}) \int p(y)\,\mathbb{P}_y(A_1)\,dy,$$
so that P_y(A_1) = 0 for some y ∈ supp(p). Arguing as above, this contradicts the assumption of π-irreducibility.
Denote the Markov kernels of the chains generated by MH(π, q) and MH(π, q_K) by P and P_K respectively. Also let $\{\check X_n\}_{n \ge 1}$ denote the jump chain associated with {X_n} (that is, the subsequence of {X_n}_{n≥1} obtained by excluding all X_m satisfying X_m = X_{m−1}).

Lemma 6.2.
For all x ∈ A = supp(π), n ∈ Z_{>0} and all B ⊆ A the following inequality holds:
$$P_K^n(x,B) \;\ge\; \mathbb{P}_x\big(X_n \in B,\, A_n\big) \;=\; \mathbb{P}_x\big(\check X_n \in B,\, A_n\big). \tag{14}$$

Proof. Note first that the last equality in (14) follows by definition of the jump chain, since on the event A_n no proposal is rejected and the chain coincides with its jump chain. We prove the inequality in (14) by induction on n. Since supp(π) = A, the base case n = 1 follows from Proposition 2.1 (ii). Assuming now that the statement holds for some n ∈ Z_{>0}, the statement for n + 1 follows by applying the induction hypothesis together with Proposition 2.1 (ii) once more.

Proof of Theorem 3.1. Take B ⊆ A = supp(π) such that π(B) > 0, take x ∈ A, and let {X_n}_{n≥1} be MH(π, q) started at X_0 = x. Since MH(π, q) is π-irreducible, there exists an integer n ∈ Z_{>0} such that P_x(X_n ∈ B) > 0. Let S_n be the number of rejections occurring in the generation of {X_m}_{1≤m≤n}. Then
$$0 < \mathbb{P}_x(X_n \in B) = \sum_{i=0}^{n} \mathbb{P}_x\big(X_n \in B,\, S_n = i\big).$$
Consequently $\mathbb{P}_x(X_{n-j} \in B \mid A_{n-j}) > 0$ for some $0 \le j \le n$, so that
$$P_K^{\,n-j}(x, B) \;\ge\; \mathbb{P}_x\big(X_{n-j} \in B \mid A_{n-j}\big)\,\mathbb{P}_x\big(A_{n-j}\big) \;>\; 0,$$
where we used the above together with Lemma 6.2 and Lemma 6.1. The skipping chain MH(π, q_K) is therefore π-irreducible, and thus Harris recurrent by [38, Corollary 2]. Furthermore, [19, Theorem 10.0.1] yields that π is its unique invariant probability measure. Finally, the SLLN holds for all π-integrable functions by Harris recurrence and [19, Theorem 17.1.7].

Proof of Theorem 3.2
To prove Theorem 3.2, we will make use of the following lemma, whose proof is omitted.

Lemma 6.3 (Integration with respect to a symmetric joint density). Consider a symmetric density ∆ : R^d × R^d → [0, +∞) and a subset B ⊆ R^d. For every f ∈ L^2(∆) the following identity holds:
$$\int_{B}\int_{B} f(x,y)\,\Delta(x,y)\,dy\,dx \;=\; \int_{B}\int_{B} f(y,x)\,\Delta(x,y)\,dy\,dx.$$
The first identity is an immediate consequence of the definition (2) of α, while the second one follows from Assumption 2 and Proposition 2.1 (i).