NuZZ: numerical Zig-Zag for general models

Introduction
Markov Chain Monte Carlo (MCMC) methods are a mainstay of modern statistics (Brooks et al., 2011), and are employed in a wide variety of applications, such as astrophysics (e.g. The Dark Energy Survey Collaboration et al. (2017)), epidemiology (e.g. O'Neill and Roberts (1999), House et al. (2016)), chemistry (e.g. C. Cotter et al. (2019) and S. Cotter et al. (2020)) and finance (e.g. Kim et al. (1998)). MCMC algorithms can be hindered by a diverse range of challenges, including multimodality of the posterior, complex non-linear dependencies between parameters, a large number of parameters (e.g. S. Cotter et al. (2013)), or a large number of observations (e.g. Bardenet et al. (2017)). Common MCMC methods such as the Random Walk Metropolis (RWM) algorithm (Metropolis et al., 1953) offer simplicity and flexibility, but also have several limitations.
Throughout the history of MCMC there have been many attempts to improve algorithmic efficiency (Robert & Casella, 2011), including the work of Hastings (1970) in generalising the RWM approach to non-symmetric proposals, and the use of gradient information to explore the state space according to a discretised Langevin diffusion in the Metropolis-Adjusted Langevin Algorithm (MALA) (Roberts & Tweedie, 1996). Both of these methods rely on reversible dynamics that satisfy a detailed balance condition, used to ensure convergence to the appropriate limiting distribution.
Lifting the process simulated to a larger state space can aid exploration of the original state space, with one of the most successful such approaches being the Hamiltonian Monte Carlo (HMC) algorithm in Euclidean space (Duane et al., 1987), and on Riemannian manifolds (RMHMC) (Girolami et al., 2011). HMC relies on adding momentum variables to the state space, while maintaining detailed balance and reversibility thanks to a Metropolis acceptance step.
Recently, there has been growing interest in irreversible algorithms, which have been shown to converge faster and have lower asymptotic variance than their reversible counterparts (Chen & Hwang, 2013; Ottobre, 2016). This was initially investigated by Diaconis et al. (2000) and Turitsyn et al. (2011), where the authors achieve irreversibility via "lifting" and injecting vortices in the state space. In Ma et al. (2016) the authors provide a framework for developing irreversible algorithms. Not all irreversible algorithms are built from reversible ones. In the physics literature, Rapaport (2009) and Peters and de With (2012) laid the foundations for the Bouncy Particle Sampler (BPS), which was then thoroughly studied by Bouchard-Côté et al. (2018). The BPS abandons diffusions in favour of completely irreversible, piecewise deterministic dynamics based on the theory developed in Davis (1984, 1993). The close relationship between the BPS and HMC is explored in Deligiannidis et al. (2021).
The Zig-Zag (ZZ) process (Bierkens, 2016; Bierkens & Duncan, 2017; Bierkens & Roberts, 2017), a variant of the Telegraph process (Kolesnik & Ratanov, 2013), is another Piecewise Deterministic Markov Process (PDMP) closely related to the BPS, with which it coincides in one dimension. Its properties were first explored in work by Bierkens and Duncan (2017) and Bierkens, Fearnhead, et al. (2019). The ZZ sampler is particularly appealing because, when properly modified with upper bounds for the switching rate, it has a computational cost per iteration that need not grow with the size of the data. This feature makes it particularly well suited to 'tall, thin' data, e.g. multivariate datasets with a large number of observations, n, and a relatively small number of parameters, d.
The BPS sampler and the ZZ sampler are both PDMPs that change direction at random times according to the interarrival times of a Poisson process whose rate depends on the gradient of the target density. The main difference between them is the way in which the new direction is chosen. Conditions for ergodicity have been established for both algorithms (Bierkens, Roberts, & Zitt, 2019; Durmus et al., 2020), and the Generalised Bouncy Particle Sampler (Wu & Robert, 2017) partially bridges the gap between the two.
Given their desirable properties, there has been a wealth of theoretical analysis of PDMPs in MCMC developed in recent years (Andrieu, Dobson, et al., 2021; Andrieu, Durmus, et al., 2021; Bierkens et al., 2018; Bierkens & Verduyn Lunel, 2019; Chevallier et al., 2020; Chevallier et al., 2021; Zhao & Bouchard-Côté, 2019). In Vanetti et al. (2018), the authors provide insights into a general framework for the construction of continuous and discrete time algorithms based on PDMPs, and discuss improvements that ameliorate some of the current challenges that PDMP samplers face. There remains, however, a lack of guidance about which classes of statistical problems are well-suited to the use of PDMP algorithms, and how those algorithms are best implemented.
In this work we aim to study the Zig-Zag dynamics and how well they explore distributions with specific features, often found in practical Bayesian statistics.In order to do this, we develop a numerical approximation to calculate the time to the next switch for general distributions, without requiring analytical solutions or bounds for the switching rate.
Moreover, we obtain results in controlling the numerical error induced by our algorithm. A first result in this direction was achieved by Riedler (2013), who shows that, under general assumptions, sample paths resulting from numerical approximations of PDMPs converge almost surely to the paths followed by the exact process as the numerical tolerances approach zero. A more closely related result can be found in Bertazzi et al. (2021), where the authors study a general class of Euler-type integration routines, which yield a Markov process at each point of the mesh defined by the integrator. Their framework is sufficiently general to accommodate PDMPs with trajectories that are not linear and must be approximated numerically. In contrast, the process derived using the adaptive integration methods examined in this work is Markov only at points of the skeleton chain, not at points on the deterministic portion of the trajectory.
The paper is organised as follows. In Section 2 we review the technical details of the Zig-Zag process and fix the notation for the rest of this work. In Section 3 we present an alternative algorithm, the Numerical Zig-Zag (NuZZ), which uses numerical methodology to compute switching times of the Zig-Zag process. In Section 4 we study how the numerical tolerances affect the quality of the NuZZ MCMC sample. In Section 5 we present numerical experiments testing the performance of a range of MCMC algorithms, including NuZZ, on test problems with features often found in real-world models. Finally, in Section 6 we conclude with a discussion of the results.

The Zig-Zag sampler
The Zig-Zag sampler is an MCMC algorithm designed for drawing samples from a probability measure π with a smooth density π(x) on X = R^d. In the same vein as methods like HMC, the Zig-Zag sampler operates on an extended state space where the position variable x is augmented by a velocity variable v.
The Zig-Zag sampler is driven by the Zig-Zag process, a continuous time PDMP which describes the motion of a particle x, with a piecewise constant velocity which is subject to intermittent jumps.The dynamics and jumps are constructed in such a way that the process converges in law to a unique invariant measure which admits π as its marginal.
Let the Zig-Zag process be defined as Z_t = (X_t, V_t) on the extended space E = X × V, with X = R^d and V = {−1, 1}^d. From an initial state (x, v), the process evolves in time following the deterministic linear dynamics

    ẋ_t = v_t,  v̇_t = 0,    (1)

where the dots denote total differentiation with respect to time. The probability of changing direction (i.e. the sign of one component of the vector v) increases as the process heads into the tails of π(x). The event of changing the sign of v_i is modelled as an inhomogeneous Poisson process with component-wise rate

    λ_i(z) = max{0, v_i ∂_i U(x)},  U(x) = −log π(x),    (2)

where ∂_i is the partial derivative w.r.t. the ith component. Hence, the probability of an event for coordinate i in [t, t + dt] is λ_i(z_t)dt + o(dt). This induces the ZZ process to spend more time in regions of high density of π(x). When a velocity switch is triggered at a switching time τ, the process moves forward from x to x' = x + τv, at which point the transition kernel Q(v'|x', v) selects the velocity component to switch and flips its sign. From (x', v') the dynamics proceed following Equation (1) until another velocity switch occurs. The trajectories of the process produce 'zig-zag' patterns such as those seen in Figure 1, motivating its name.
Figure 1: Example Zig-Zag trajectory. The target is a 2-dimensional Rosenbrock distribution.
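For a one-dimensional standard Gaussian target, the dynamics above can be simulated exactly, since the switching-time equation is solvable in closed form. The following sketch (our own illustration, not the paper's implementation) simulates the skeleton chain and accumulates time-weighted moments along the linear segments:

```python
import math, random

def zigzag_gaussian(n_events, seed=1):
    """Exact 1D Zig-Zag for pi = N(0,1), with rate lambda(x,v) = max(0, v*x)."""
    rng = random.Random(seed)
    x, v = 0.0, 1.0
    s1 = s2 = total_t = 0.0           # accumulators for time-averaged moments
    for _ in range(n_events):
        r = -math.log(rng.random())   # R ~ Exp(1)
        a = v * x                     # signed rate at the current position
        # solve  int_0^tau max(0, a + s) ds = R  in closed form
        tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * r)
        # accumulate int x dt and int x^2 dt along the linear segment
        s1 += tau * (x + v * tau / 2.0)
        s2 += ((x + v * tau) ** 3 - x ** 3) / (3.0 * v)
        total_t += tau
        x += v * tau
        v = -v                        # flip the (single) velocity component
    return s1 / total_t, s2 / total_t

mean, second = zigzag_gaussian(200_000)
```

The time-weighted first and second moments should approach 0 and 1, the moments of the N(0,1) target.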
The linear dynamics in Equation (1), the jump rates λ_i(x, v), and the transition kernel Q(v'|x', v) uniquely define the Zig-Zag process. Its infinitesimal generator is known, and can be given in terms of the three above quantities as

    (Lf)(x, v) = Σ_{i=1}^d v_i ∂_i f(x, v) + Σ_{i=1}^d λ_i(x, v) [f(x, F_i(v)) − f(x, v)],    (3)

where F_i(v) represents the operation of flipping the sign of the ith component of v.
What is described above is the Zig-Zag process with canonical rates, which has been proven to be ergodic for a large class of targets (Bierkens, Roberts, & Zitt, 2019). Taking the rate

    λ_i(x, v) = max{0, v_i ∂_i U(x)} + γ_i(x, v)    (4)

(compare with Equation (2)), with γ_i satisfying

    γ_i(x, v) = γ_i(x, F_i(v)) ≥ 0,    (5)

also yields a process that targets the desired distribution, and the process is ergodic on a larger class of models. However, increasing γ_i induces diffusive behaviour, in that the process tends to change direction more often than it otherwise would, and to cover less distance in the same amount of time. Hence, if the functions γ_i are taken to be strictly positive, they are generally set to be as small as possible.
In this work we take γ_i(x, v) = γ, which gives an overall baseline switching rate of Γ = Σ_i γ_i = dγ. For z = (x, v), we will often refer to the switching rate as a function of time, λ_i(t; z) := λ_i(x + tv, v).

Sampling the switching time
In this section we outline the approach followed in Bierkens, Fearnhead, et al., 2019. Switches in v occur as the interarrival times of an inhomogeneous Poisson process with rate λ_i(x, v). Starting the process at t = 0, each τ_i, the time to the next switch in the ith component conditional on no other switches occurring, has cumulative distribution function (cdf)

    G_{T_i}(τ) = 1 − exp( −∫_0^τ λ_i(x + sv, v) ds ).    (6)

A switching time τ_i is sampled separately for each component with law given by (6), and the final switching time τ is selected as τ = τ_{i*} := min_{i=1,...,d} τ_i. Once τ has been calculated, the system follows the deterministic dynamics from x(0) to x(τ) = x(0) + τv, where a velocity flip occurs in component i*, and the process starts following the new dynamics with velocity F_{i*}(v). All the larger {τ_i}_{i ≠ i*} are usually discarded, as they refer to a state of the system that can no longer be achieved. The same procedure is repeated at each iteration of the algorithm. This method for sampling the switching times requires realisations of d random variables with densities given by (6), where d is the number of components of the Zig-Zag process, i.e. the number of parameters to be estimated via MCMC.

Poisson sampling via thinning
Sampling τ_i from Equation (6) requires solving the integral in the exponential. This is often a difficult task, as the rates λ_i depend on the posterior in a complex way, as shown in Equation (4). Analytic solutions are available only for simple targets and are problem-dependent.
In Bierkens, Fearnhead, et al., 2019 the authors propose the method of thinning to sample the waiting times to the next switch, τ_i. If the Jacobian or Hessian of the target π(x) is bounded, then there exists a bound λ̄_i(x, v) ≥ λ_i(x, v) for all x, v, i. The bound λ̄_i can be used instead of λ_i in Equation (6), so that G_{T_i}(τ_i) can be replaced with a simpler function. Once the τ_i are sampled and the time τ to the next switch is found, the proposed velocity switch occurs with probability λ_{i*}/λ̄_{i*}, to correct for the fact that τ was not sampled from the correct distribution. If the switch is rejected, there is no change in v, the process moves linearly to x + τv, and new switching times are sampled.
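As a concrete sketch of thinning (our own illustration, for a hypothetical one-dimensional target π(x) ∝ 1/cosh(x), chosen because its rate is globally bounded by λ̄ = 1):

```python
import math, random

def zigzag_thinning(n_props, seed=2):
    """1D Zig-Zag via thinning for pi(x) ~ 1/cosh(x):
    lambda(x, v) = max(0, v * tanh(x)) <= 1 = lam_bar."""
    rng = random.Random(seed)
    x, v = 0.0, 1.0
    s1 = total_t = 0.0
    lam_bar = 1.0                       # global upper bound on the rate
    for _ in range(n_props):
        tau = rng.expovariate(lam_bar)  # proposed event time from the bound
        s1 += tau * (x + v * tau / 2.0) # int x dt over the linear segment
        total_t += tau
        x += v * tau
        # accept the flip with probability lambda(x, v) / lam_bar
        if rng.random() < max(0.0, v * math.tanh(x)):
            v = -v
    return s1 / total_t

mean = zigzag_thinning(200_000)
```

Rejected proposals leave the velocity unchanged, so the particle simply continues along the same line, exactly as described above.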
Sampling the switching times via thinning extends the applicability of the Zig-Zag sampler to a wider class of models than those where τ can be sampled by inverting the cdf, and forms the basis for the ZZCV algorithm (Bierkens, Fearnhead, et al., 2019). Additionally, in the recent work of Sutton and Fearnhead, 2021, the authors propose a method for systematically identifying numerical bounds which are valid on a given time horizon. Their method significantly extends the applicability of PDMPs, but again retains the cost of needing to construct a numerical bound at each iteration, and requires a priori knowledge of the target measure.
As such, despite these developments, the Zig-Zag sampler is still limited in its applicability to a narrow class of models, which has so far hampered the study of PDMP-based MCMC from a practical point of view.

Numerical Zig-Zag
The Zig-Zag sampler described in Section 2 relies on the availability of an easy way to sample from Equation (6), or on having knowledge of a bound for λ_i. In this section we extend the Zig-Zag algorithm to sample from arbitrary targets via numerical approximation of the switching times.

The Sellke Zig-Zag process
In order to ameliorate the issues described in Sections 1 and 2, it is possible to reformulate the problem of finding the time to the next velocity switch in a more convenient way. This approach is based on the Sellke construction (Sellke, 1983), and we note that picking τ_i with law (6) is equivalent to finding τ_i > 0 such that

    g(τ_i) := ∫_0^{τ_i} λ_i(x + sv, v) ds − R_i = 0,    (7)

where R_i ∼ Exp(1). Ergo, finding τ_i is equivalent to finding the root of the function g(τ_i).
We demonstrate this equivalence in the appendix in Section A, in which we describe SeZZ, the Zig-Zag process where the switching times are given by roots of (7), and where we show that the SeZZ generator is stationary with respect to the target measure. The Numerical Zig-Zag, or NuZZ, arises from numerical approximations of the roots of (7).
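A minimal numerical sketch of this root-finding formulation, using SciPy's quadrature and root finder in place of the GSL routines used later in the paper (the Gaussian rate and the bracket-doubling loop are our own illustrative choices):

```python
import math
from scipy.integrate import quad
from scipy.optimize import brentq

def switch_time(x, v, r):
    """Solve g(tau) = int_0^tau lambda(x + s v, v) ds - r = 0 for a N(0,1) target."""
    lam = lambda s: max(0.0, v * (x + s * v))   # rate along the trajectory
    g = lambda t: quad(lam, 0.0, t)[0] - r
    hi = 1.0
    while g(hi) < 0.0:                          # expand bracket until g(hi) > 0
        hi *= 2.0
    return brentq(g, 0.0, hi)                   # g(0) = -r < 0, so root is bracketed

# closed form for comparison: tau = -a + sqrt(max(a, 0)^2 + 2 r), with a = v * x
x, v, r = 0.5, 1.0, 1.3
tau_num = switch_time(x, v, r)
tau_exact = -(v * x) + math.sqrt(max(v * x, 0.0) ** 2 + 2.0 * r)
```

For this target the root is available analytically, which gives a direct check on the numerical solution.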

Numerical integration
Unlike other Zig-Zag processes, which rely on bounds that may be unknown or may not exist, numerical approximation of the roots of (7) is possible for a very wide class of models. There are, however, a number of problems to overcome. The function g belongs to C^1(R) \ C^2(R), and in practice it often increases very steeply when moving into the tails of the distribution. The numerical method used to solve the integral in (7) up to some τ should be chosen for its efficient and accurate approximation of such a problem. Since g is only once differentiable, numerical integrators like Runge-Kutta schemes may perform unreliably. In particular, the first derivative of λ_i is only piecewise continuous, which may also create problems for common root-finding methods like Newton-Raphson.
The method we use to find the time to the next switch combines the QAGS (Quadrature Adaptive Gauss-Kronrod Singularities) integration routine from the GSL library (Galassi, M., 2017), with Brent's method (Press et al., 2007) for finding the root.

Brent's Algorithm
Common root-finding algorithms such as Newton-Raphson and the secant method have good theoretical properties, such as superlinear convergence to the root. However, if the target function is not extremely well behaved, these algorithms may converge very slowly, or not converge at all. On the other hand, iterative bisection is a more reliable way to find the root: it is guaranteed to converge, albeit only linearly, even when the function is badly behaved.
Brent's method (Press et al., 2007) was originally developed by Van Wijngaarden and Dekker in the 1960s, and was subsequently improved by Brent in 1973. Dekker's root-finding method combined the secant method with bisection, converging superlinearly when the function is well behaved while maintaining linear convergence when the target is particularly difficult. Unfortunately, it is possible for Dekker's algorithm to choose to apply the secant method even when the change to the current guess of the root is arbitrarily small, leading to very slow convergence.
Brent's method introduces an additional check on the change that the secant method would apply to the current root guess.If the update is too small, the algorithm then performs a bisection step instead, which guarantees linear convergence in the worst case scenario.
Moreover, Brent's method improves on its predecessors in that, when determining the new root guess, it uses inverse quadratic interpolation instead of the secant method's linear interpolation whenever possible.
Brent's method deals with numerical instabilities caused by the denominators of the secant update being too small by bracketing the root, that is, by storing and updating the interval in which the root is known to lie by the intermediate value theorem as the iterations proceed.
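The safeguarding logic can be illustrated with a simplified Dekker-style scheme (a toy version of the idea, not Brent's full algorithm, which adds inverse quadratic interpolation and subtler step-acceptance checks):

```python
def hybrid_root(f, a, b, tol=1e-12, max_iter=500):
    """Bracketed secant with bisection fallback: fast when f is well behaved,
    guaranteed convergence otherwise (the bracket at least halves every 2 steps)."""
    fa, fb = f(a), f(b)
    assert fa * fb <= 0.0, "root must be bracketed"
    bisect_next = False
    for _ in range(max_iter):
        if b - a < tol:
            break
        if bisect_next:
            x = 0.5 * (a + b)                     # bisection: guaranteed progress
        else:
            x = (a * fb - b * fa) / (fb - fa)     # secant (regula falsi) step
        fx = f(x)
        if fa * fx <= 0.0:                        # keep the sub-interval that
            b, fb = x, fx                         # still brackets the root
        else:
            a, fa = x, fx
        bisect_next = not bisect_next             # alternate as a safeguard
    return 0.5 * (a + b)

# f is C^1 but not C^2 at 0, like the objective in Equation (7); root at x = 1
root = hybrid_root(lambda x: x + max(0.0, x) ** 2 - 2.0, -1.0, 3.0)
```

Alternating with bisection is a cruder safeguard than Brent's "step too small" test, but it conveys the same principle: never let the secant step stall the bracketed search.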
In this work, for a Zig-Zag process as defined in Section 2.1, the root τ̃_i returned by Brent's method for each of the components satisfies the inequality

    | ∫_0^{τ̃_i} λ̂_i(x + sv, v) ds − R_i | ≤ ε_Bre,

where ε_Bre is the user-specified numerical tolerance, and λ̂_i is the piecewise linear approximation to λ_i induced by the numerical integration routine.
In Section 4 we discuss how we model the numerical tolerances for the NuZZ process, and what effect they have on the resulting MCMC sample.In the next sub-section, we discuss our actual implementation, where we reduce the number of integrals to be computed from d to 1.

Efficient implementation
The Zig-Zag sampler discussed in Section 2.1 relies on sampling a switching time τ_i for each component, then taking the minimum τ = min_{i=1,...,d} τ_i. This approach requires solving d integrals for each iteration. However, it is possible to reduce the cost to only one integral per iteration, by using the following procedure. First, sample τ such that

    ∫_0^τ Λ(x(s), v) ds = R,  R ∼ Exp(1),    (9)

where Λ(x(s), v) := Σ_{i=1}^d λ_i(x(s), v) is the total rate of switching in any component. Then sample the index of the first component to switch, i*, from a multinomial distribution with ith cell probability λ_i(x(τ), v)/Λ(x(τ), v). The last step can be achieved efficiently by drawing u ∼ U(0, 1), taking i* such that

    Σ_{j=1}^{i*−1} λ_j(x(τ), v) < u Λ(x(τ), v) ≤ Σ_{j=1}^{i*} λ_j(x(τ), v),

and finding i* by bisection search. This sampling method is used in the Gillespie algorithm for Monte Carlo simulation of chemical reaction networks (Gillespie, 1977), and is a standard result for Poisson processes.
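The index-selection step can be sketched with the standard inverse-CDF trick on the cumulative rates (the rate values below are hypothetical; in the sampler they would be the λ_i evaluated at x(τ)):

```python
import bisect, itertools, random

def pick_component(rates, u):
    """Return i* such that the cumulative rates first exceed u * Lambda."""
    cum = list(itertools.accumulate(rates))      # partial sums of lambda_i
    return bisect.bisect_left(cum, u * cum[-1])  # binary search, O(log d)

rng = random.Random(3)
rates = [0.5, 2.0, 0.25, 1.25]                   # hypothetical lambda_i(x(tau), v)
n = 100_000
counts = [0] * len(rates)
for _ in range(n):
    counts[pick_component(rates, rng.random())] += 1
freqs = [c / n for c in counts]                  # should approach lambda_i / Lambda
```

With total rate Λ = 4, the empirical frequencies should approach (0.125, 0.5, 0.0625, 0.3125).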
Proposition 1. A Zig-Zag process where switches happen at rate Λ(x(s), v) = Σ_{i=1}^d λ_i(x(s), v), and where the index i* of the velocity component to switch is distributed as P(i* = i) = λ_i(x(τ), v)/Λ(x(τ), v), targets the same invariant distribution π(x) as the Zig-Zag process with generator given in (3).
Proof. This follows from standard results on Poisson processes. For a proof tailored to this case see Appendix B. Importantly, sampling τ in this way reduces the number of integrals to be computed numerically from d to 1, at the price of a slight increase in the complexity of the integrand. Moreover, there may be more efficient implementations than bisection to find i*, which may reduce the computational complexity further.
It is worth pointing out that the benefits of this implementation of the Zig-Zag sampler, i.e. working with the sum of the rates, are not limited to the NuZZ algorithm. The approach can be applied to other Zig-Zag-based algorithms and PDMPs, and in fact it synergises particularly well with the Zig-Zag algorithm with control variates, as shown in Appendix C.
Even though the efficient implementation has the same theoretical cost as the original (without taking numerical approximations into consideration), our experiments suggest that it is significantly faster in practice.

Reusing information
As integration over a similar range is repeated at every root finding iteration, we have modified the integration routine so that information can be shared through subsequent iterations of Brent's method.
The QAGS routine performs a 10-21 point Gauss-Kronrod numerical integration on the interval [0, τ] at each root-finding iteration, where τ is the current proposed root. If the estimated integration error is greater than the threshold ε_int, the interval [0, τ] is split into subintervals, on which new 10-21 point numerical integrals are computed, and so on recursively.
Note that the value of the numerically approximated integral of the true rate Λ on [0, τ] is equal to the value of the exact integral of Λ̂, the piecewise linear approximation of the rate Λ obtained by joining the nodes of the numerical integral.
If at a given iteration of the root-finding algorithm the new proposed root τ' exceeds the current proposal τ, the quantities computed between 0 and τ can be saved and reused during the next iteration of the root finder, as the objective function decomposes as

    ∫_0^{τ'} Λ(x(s), v) ds − R = ∫_0^{τ} Λ(x(s), v) ds + ∫_{τ}^{τ'} Λ(x(s), v) ds − R,

where now we only have to compute the second integral, which typically requires fewer resources to achieve the desired level of precision. Because of this, the number of Gauss-Kronrod points was decreased from the original 10-21 to 7-15.
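The reuse step amounts to caching the prefix integral; a small sketch with SciPy quadrature (the rate function here is an arbitrary stand-in, not the paper's rate):

```python
from scipy.integrate import quad

def lam(s):
    """Stand-in total rate Lambda(x(s), v) along the current segment."""
    return 1.0 + max(0.0, s - 0.5) ** 2

tau_old, tau_new = 0.8, 1.7
prefix, _ = quad(lam, 0.0, tau_old)     # computed at the previous iteration
tail, _ = quad(lam, tau_old, tau_new)   # only this piece is new work
full, _ = quad(lam, 0.0, tau_new)       # reference: recomputing from scratch
```

By additivity of the integral, the cached prefix plus the new tail matches the full recomputation, so only the shorter (and typically cheaper) tail integral is needed per iteration.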

Error and convergence analysis
In this section we study how the error induced by numerical procedures impacts the quality of the samples which are produced by NuZZ.In order to do this, we will consider NuZZ as a stochastic process that is a perturbation of the original process, the Zig-Zag sampler.
Our aim is to show that the ergodic averages which we calculate via NuZZ are close to the true equilibrium expectations of interest, with an error that can be quantified, and whose dependence on the numerical tolerances can be made transparent.
However, the continuous-time NuZZ process is not Markov, as numerical integration induces a mesh on the deterministic part of the trajectory that depends on both end-points. It is possible to enlarge the state space to make the NuZZ process Markov, but the calculations quickly become unwieldy. Therefore we follow an indirect route using the skeleton chain.
Recall that the full Zig-Zag process Z_t = (X_t, V_t) is continuous-time; however, the differential equations (1) describing its behaviour between velocity switches are trivially solvable, so the full process can be reconstructed from the skeleton chain of points at which velocities change, since for t ∈ [T_k, T_{k+1})

    Z_t = ( X_{T_k} + (t − T_k) V_{T_k}, V_{T_k} ).    (12)

Our strategy for bounding the impact of numerical errors on ergodic averages therefore involves bounding the discrepancy between the skeleton chain of the exact process and the skeleton chain of the NuZZ process.
One can then reconstruct ergodic averages via interpolation using (12).

Results
Our final result, contained in Theorem 1, can be summarised with the following corollary. Let ε_int and ε_Bre be the user-defined error tolerances for the numerical (QAGS) integration routine and for Brent's method, respectively.
Let us assume now (we will motivate these assumptions later) that:
1. η_int ≤ ε_int, i.e. the true numerical integration error is always smaller than the tolerance;
2. every element of the Hessian of log π(x) is globally bounded from above by a constant M;
3. the skeleton chain Y_k of the Zig-Zag process without numerical errors, Z_t, is Wasserstein geometrically ergodic;
4. the skeleton chain Ỹ_k of the Zig-Zag process with numerical errors (the NuZZ process), Z̃_t, is Wasserstein geometrically ergodic;
5. the total refreshment rate Γ is strictly positive.
Then we will be able to obtain the following result.
Corollary 1. Let Z_t be the exact Zig-Zag process with stationary measure µ, and let Z̃_t be the NuZZ process. There exists a probability measure µ̃ on E such that for all bounded Lipschitz functions f, the limit

    lim_{T→∞} (1/T) ∫_0^T f(Z̃_t) dt

exists and takes the value µ̃(f). Moreover, defining the Kantorovich-Rubinstein distance as

    d_KR(µ, µ̃) = sup { |µ(f) − µ̃(f)| : ||f||_∞ ≤ 1, Lip(f) ≤ 1 },

it holds that

    d_KR(µ, µ̃) ≤ C d M ε_Tot / Γ    (16)

for some constant C, where ε_Tot is the sum of the numerical tolerance for the integration routine, ε_int, and the tolerance for Brent's method, ε_Bre.
This bound motivates the following considerations.
Remark 1. We note that the bound in Lemma 1, which then determines the bound in Corollary 1, may not be tight, and so it may be possible to use other techniques to obtain similar stability results in the degenerate case in which Γ = 0. It would also be natural to devise a specialised root-finding routine which is adapted to this setting. We leave such specialisations for future work.
Remark 2. Consider the bound in Equation (16), and a sequence of model problems in dimension d, with Hessian bounds growing as M ∼ d^a, numerical tolerances tightening as ε_Tot ∼ d^{−b}, and aggregate refreshment rates shrinking as Γ ∼ d^{−c}, with a, b, c ≥ 0. We choose to consider bounded Γ as it represents a minimal perturbation to the non-reversible character of the original process.
In order to have the error bound on the ergodic averages decay to 0 as the dimension grows, we must have b > a + c + 1. Noting that the cost per unit time of simulating the process grows with b, efficient algorithms will seek to make this quantity as small as possible. Since b > a + c + 1, it suffices to take c = 0 and b > a + 1, i.e. let Γ be of constant order and take ε_Tot decaying just faster than d^{−(a+1)}. One outcome of this argument is to confirm the intuition that for more irregular targets (i.e. with larger M), a tighter numerical tolerance ε_Tot is required in order to faithfully resolve the event times and jump decisions of the process numerically, and hence to accurately reproduce equilibrium expectations.
In order to produce these results, we start by recalling some facts about Markov chains and Wasserstein distances that will be of use later.

Wasserstein distance
In Rudolf and Schweizer, 2018 the authors derive perturbation results for Markov chains which are formulated in terms of the 1-Wasserstein distance between transition kernels. We found that, in our calculations, it is more convenient to work instead with the 2-Wasserstein distance. Hence, in this section we recall some basic results on Markov chains and Wasserstein distances, and prove that the relevant results in Rudolf and Schweizer, 2018 do indeed extend to the 2-Wasserstein case. For brevity, many of the results of this section have been moved to Appendix D, as they are technical conditions not used directly in the results included in the main body of this work. Let us now define the following.
Let (E, B(E)) be a Polish space with its associated Borel σ-algebra, and P the set of Borel probability measures on E. The 2-Wasserstein distance between µ_1, µ_2 ∈ P is

    d_W(µ_1, µ_2) = ( inf_{ξ ∈ Ξ} ∫_{E×E} ||z_1 − z_2||_2^2 ξ(dz_1, dz_2) )^{1/2},

where Ξ is the set of all couplings of µ_1 and µ_2 on E × E, i.e. joint measures that have µ_1 and µ_2 as marginals, and ||·||_2 is the 2-norm. We will use the following notation for the expectation of a function f with respect to the measure µ_1: µ_1(f) := ∫_E f(z) µ_1(dz). Let P : P → P be a linear operator and a transition kernel on (E, B(E)) such that measures of interest (µ_1, µ_2, etc.) are elements of P. Then

    (µ_1 P)(A) := ∫_E P(x, A) µ_1(dx).

This implies that δ_{x_1} P(A) = P(x_1, A), and for a measurable function f,

    (Pf)(x) := ∫_E f(y) P(x, dy).

Definition 1 (Wasserstein Geometric Ergodicity of Markov kernels). We say that the transition kernel P is Wasserstein geometrically ergodic with constants C > 0 and ρ ∈ (0, 1) if

    d_W(µ_1 P^n, µ_2 P^n) ≤ C ρ^n d_W(µ_1, µ_2)  for all µ_1, µ_2 ∈ P and n ∈ N.

Proposition 2. Let P and P̃ be Markov kernels such that P is Wasserstein geometrically ergodic, and d_W(δ_x P, δ_x P̃) ≤ ε for all x. Let π_n and π̃_n be the measures defined by π_n = π_0 P^n and π̃_n = π̃_0 P̃^n. It then holds that

    d_W(π_n, π̃_n) ≤ C ρ^n d_W(π_0, π̃_0) + ε Σ_{k=0}^{n−1} C ρ^k.

Proof. See Appendix D.
Proposition 3. If a Markov kernel P is Wasserstein geometrically ergodic, then it has an ergodic invariant measure π with finite second moment, which is therefore unique.Furthermore, for any measure ν with finite second moment, it holds that νP n → π in the Wasserstein distance.
Proof.See Appendix D.
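In one dimension, the 2-Wasserstein distance between two equal-size empirical measures has a closed form via the monotone coupling (sort both samples and pair in order), which gives a quick way to check such distances numerically (our own illustration, not part of the proofs):

```python
import math

def w2_empirical(xs, ys):
    """2-Wasserstein distance between equal-weight empirical measures on R.
    In 1D the optimal coupling is the monotone one, so sorted pairing is exact."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

# two point masses at 0 and 1: d_W(delta_0, delta_1) = 1
d = w2_empirical([0.0], [1.0])
```

For example, shifting an empirical measure by a constant c changes the distance by exactly |c|, matching the intuition that d_W metrises both location and spread.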
We are now ready to discuss results specific to the Zig-Zag process.

Zig-Zag notation
Recall that Z_t = (X_t, V_t) is the exact (Zig-Zag) continuous-time Markov process defined on the state space E = X × V, with stationary measure µ(dx, dv) = π(x) ⊗ ψ(v). Let T_k be the kth switching time of the Zig-Zag process, and let Y_k = (X_{T_k}, V_{T_k}), k = 0, 1, ..., n, be the skeleton Markov chain, with stationary measure ν(dx, dv).
Let P : E → E denote the Markov transition kernel of the skeleton chain Y_k. Intuitively, applying P to a state (x, v) of the process involves sampling a new τ, pushing the dynamics forward from (x, v) to (x + τv, v), and sampling a new velocity from the distribution associated with the transition kernel Q(v'|x + τv, v). Let us also define U_k := (Y_k, Y_{k+1}) as the 'extended' skeleton chain, which includes two consecutive states of the Y_k chain, with invariant measure ζ. As mentioned above, our proof strategy is based on the interpolation of the skeleton chain, and the process U_k will be a key component.
Lastly, we place tildes above the quantities defined above to denote their perturbed versions under numerical approximation. In this spirit, if Z_t is the exact Zig-Zag process, Z̃_t is the NuZZ process. If Y_k is the skeleton chain of the Zig-Zag process, then Ỹ_k is the skeleton chain of NuZZ, and if P is the Markov transition kernel that moves the skeleton chain of the exact process one step forward, then P̃ is its numerical equivalent. The same applies to the other relevant objects. We assume that all of these quantities exist, and we will address their properties below. In particular, in the next section we describe the analytical form that numerical errors take for the NuZZ algorithm, which will be useful in bounding discrepancies between the exact and numerically approximated versions of the objects described above.
For notational simplicity, we use the unlabelled norm ||·|| to represent the 2-norm; norms on elements of the extended state space E and of E × E (denoted E^2) are taken to be the 2-norms of the concatenated coordinate vectors.

Perturbation bound for the skeleton chain
Under appropriate conditions that we detail below, a sufficient condition for the processes Z_t and Z̃_t to produce comparable ergodic behaviour is that Z_t contracts towards equilibrium sufficiently rapidly, and that the transition kernels P and P̃ are close to each other in a uniform sense. By construction, the latter property holds for NuZZ if, on average, (i) the new locations x', x̃' are close to one another, and (ii) the new velocities v', ṽ' are close to one another. By analysing the numerical error, we show in Proposition 4 that this holds.
Recall that in Section 3.2 we described how NuZZ relies on two different numerical methods to solve Equation (7): an adaptive integration routine and Brent's method, requiring user-set tolerances ε_int and ε_Bre, respectively.
In our specific setting, fixing R at the beginning of each MCMC iteration, the error at each iteration of Brent's method can be written as

    η_Bre = ∫_0^{τ̃} Λ̂(x + sv, v) ds − R,    (26)

where Λ̂ is the piecewise polynomial approximation to Λ induced by the integration routine, and τ̃ (≠ τ) is the root currently returned by the algorithm. Using Equation (9) and adding and subtracting terms in Equation (26), we obtain the error decomposition

    η_Bre = ∫_0^{τ̃} (Λ̂ − Λ)(x + sv, v) ds + ∫_0^{τ̃} Λ(x + sv, v) ds − R =: η_int + η_root.

The term η_int represents the component of the total Brent error deriving from the numerical integration routine, while the term η_root represents the component deriving from the fact that, conditional on integrating the exact rate, the integral is calculated up to τ̃ instead of τ.
Let us now make an assumption on the smoothness of the event rates of the process.
Assumption 1. The function Λ is such that the numerical error η_int is always bounded by the given tolerance ε_int.
In our specific case, the integration routine is adaptive, and it stops improving the result when the size of the estimated η_int, the integration error between the two layers of Gauss-Kronrod nodes, is below the tolerance threshold ε_int. An irregular Λ with very tall and thin spikes can in principle fool the integrator, but the QAGS routine places more nodes around areas of high variation of the target, to minimise the impact of the irregularities on the final value of the integral. Therefore, only a very pathological Λ, e.g. one with very large or infinite values of certain derivatives, is likely to cause Assumption 1 to fail, and hence we expect it to hold for most densities arising in statistical applications, including all targets studied in this work. Let us now proceed with the discussion.
Brent's method will continue iterating until |η_Bre| ≤ ε_Bre. If Assumption 1 holds, the root error η_root at the last iteration of Brent's method can be bounded as Effectively, as ε_Bre, ε_int → 0, then η_root → 0. This does not by itself guarantee that τ̃ → τ, which is the subject of Lemma 1.
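As an illustration of the two-tolerance scheme, the following sketch finds the switching time τ̃ solving ∫₀^τ λ(x + sv, v) ds = R for a simple one-dimensional rate along a fixed trajectory. It is a minimal stand-in for the NuZZ inner loop: composite Simpson quadrature plays the role of the adaptive Gauss–Kronrod integrator, and bisection stands in for Brent's method. The function names, the toy rate, and the tolerances are illustrative, not the authors' implementation.

```python
def integrated_rate(lam, t, n=200):
    # Composite Simpson approximation of the integrated rate ∫_0^t λ(s) ds
    if t == 0.0:
        return 0.0
    h = t / n
    s = lam(0.0) + lam(t)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * lam(i * h)
    return s * h / 3.0

def switching_time(lam, R, eps_bre=1e-10, t_max=100.0):
    # Find τ̃ with ∫_0^τ̃ λ(s) ds = R; bisection stands in for Brent's method
    lo, hi = 0.0, t_max
    while integrated_rate(lam, hi) < R:
        lo, hi = hi, 2.0 * hi
    while hi - lo > eps_bre:
        mid = 0.5 * (lo + hi)
        if integrated_rate(lam, mid) < R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy rate along the trajectory: λ(s) = max(0, s - 1), e.g. a standard
# normal potential with position x = -1 and velocity v = +1.
lam = lambda s: max(0.0, s - 1.0)
R = 0.5  # exponential threshold, fixed for the illustration
tau = switching_time(lam, R)  # exact root: (τ - 1)^2 / 2 = R, so τ = 2
```

For this rate the equation (τ − 1)²/2 = R has the closed-form root τ = 1 + √(2R) = 2, so the numerical τ̃ can be checked directly against it.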
Lemma 1. Let τ be the exact switching time found by solving Equation (7), and τ̃ the approximate switching time obtained by solving Equation (26). The difference between the exact and the numerically approximated switching time can be bounded (deterministically) as Proof. See Appendix E.
Note that when implemented without a refreshment procedure, the bound in Equation (27) can diverge, as Λ_min may degenerate to 0. The degradation of this bound corresponds to the possibility of serious discrepancies between the inter-jump times of the exact and numerical chains, which is a concrete possibility for PDMPs whose event rates vanish over nontrivial parts of the state space. This problem can be circumvented by taking a baseline position-independent refreshment rate γ_i(z) ≡ γ_i > 0, so that Λ_min = Γ = Σ_i γ_i = dγ.
One practical interpretation of this bound is that if the numerical tolerances are large, then a large value of Γ is necessary to avoid large discrepancies in the computation of the inter-jump times. Likewise, as the numerical tolerances become smaller, the baseline refreshment rate Γ can also be taken smaller.
Let us now prove the second component needed for Proposition 4. Intuitively, if τ and τ̃ are close to each other, then the Zig-Zag process and NuZZ, both starting from (x, v), will remain close at (x + τv, v) and (x + τ̃v, v) respectively. With the following lemma, we ensure that when the two processes reach their respective new switching points, the new velocities, sampled according to the transition kernel Q(v'|x, v), are likely to coincide.
Lemma 2. Assume that the entries of the Hessian of log π(x) are uniformly bounded in absolute value by a constant M. Assume also that the process is implemented with coordinate-wise refreshment rates γ_i > 0, with sum Γ = Σ_i γ_i = dγ. Then, for all (x, v, v'), it holds that Proof. See Appendix E.
We can now use Lemmas 1 and 2 to show that the transition kernels P and P̃ are close to each other.
Proposition 4 (Wasserstein closeness of P and P̃). Let P̃ be the Markov transition kernel for NuZZ, parametrised by the numerical tolerances ε_Bre and ε_int. Let Γ > 0 be the baseline switching rate of both the ZZS and NuZZ. For where d_W is the 2-Wasserstein distance as defined in Equation (18), and Proof. See Appendix E.

Invariant Measures and Convergence
It is challenging to make statements about the asymptotic law μ̃ of Z̃_t, as NuZZ is not a Markov process. While there are ways of augmenting the state space so that Z̃_t is a sub-component of a Markov process, we have found that in our case these approaches are somewhat opaque and algebraically complicated. Moreover, establishing the a priori existence of a stationary distribution for the numerical process is in fact not crucial for Monte Carlo purposes. With this in mind, we chose to pursue results directly in terms of ergodic averages along the trajectories of our processes. A byproduct of our approach is that we are able to describe these ergodic averages as spatial averages against a probability measure μ̃, which encapsulates the long-run behaviour of the process Z̃_t without requiring the strict property of being an invariant measure for the process.
In this section we use Proposition 4 to derive results about the stationary measures of Y_k, Ỹ_k, U_k and Ũ_k. We then define the interpolation operator J and use it, together with the results on the stationary measures, to produce Theorem 1, the main result of this section.
Recall that P is defined as the Markov kernel corresponding to the skeleton chain of the exact Zig-Zag Process.
If the Zig-Zag process Z_t converges geometrically to its stationary distribution, then one would generally expect this assumption to hold for the skeleton chain Y_k as well. However, we do not know of existing results which guarantee this implication.
Moreover, let us make the following additional assumption.
Assumption 3 (Wasserstein geometric ergodicity for Ỹ_k). The transition kernel P̃ is Wasserstein geometrically ergodic.
Once again, as P̃ relates to the skeleton chain rather than the continuous-time process, this assumption is not particularly strict. Note that Assumptions 2 and 3 imply that ν and ν̃ are unique.
Under these assumptions, the following bound on the invariant measures of the skeleton chains holds.
Proposition 5 (Wasserstein distance between skeleton chains). Let ν and ν̃ be the unique invariant distributions of Y_k and Ỹ_k. If P satisfies Assumption 2, then: where C and ρ are the constants defined in Assumption 2.
Proof. See Appendix E.
Proposition 5 above makes use of the bound established in Proposition 4, and thus already provides a bound on the Wasserstein distance. However, this bound applies only to the stationary distribution of the skeleton chain. In what follows we extend these results from the skeleton chain Y_k to the continuous-time Zig-Zag process Z_t by making use of the extended skeleton chain U_k and the interpolation operator J.
Let us now state a couple of technical results on the extended skeleton chains U_k and Ũ_k.
Lemma 3. The Markov chains U_k and Ũ_k each have a unique invariant measure.
Proof. See Appendix F.
We can now bound the distance between U_k and Ũ_k in terms of the chains Y_k and Ỹ_k as follows.
Proposition 6 (Wasserstein distance between U_k and Ũ_k). Let ζ and ζ̃ be the invariant distributions of the augmented chains U_k and Ũ_k. Then: Proof. See Appendix F.
The remainder of this section shows that if the numerical error is small, the ergodic averages obtained through the measure μ̃ will be close to those obtained through the exact stationary measure μ. Moreover, the discrepancy is bounded from above by a monotone function of the numerical tolerances.
Let us now define the interpolation operator J, which is of central importance in our derivation, and prove two technical lemmas that will be used in Theorem 1.
Definition 2 (Interpolation operator). Let (x, v) and (x', v') be two consecutive points in the skeleton chain Y_k. Define the interpolation operator J : C(E) → C(E × E) by its action on functions as The significance of this operator is that, given a pair of consecutive points in the skeleton chain and a function f, the interpolated function Jf returns the integral of f over the period of time during which the continuous-time process moves from the first point to the second. As such, it is a natural tool for expressing ergodic averages of the continuous-time processes in terms of their skeleton chains.
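A hedged numerical sketch of this operator, assuming (as the description suggests, the display itself being elided here) that Jf((x, v), (x', v')) = ∫₀^τ f(x + sv, v) ds, where τ is the travel time between the two skeleton points. The quadrature rule and helper names are ours, for illustration only.

```python
def interpolate_J(f, x, v, x_next, n=1000):
    """Approximate (J f)((x, v), (x', v')): the integral of f along the
    straight segment from x to x_next travelled at constant velocity v."""
    # Travel time: x_next = x + tau * v (any coordinate with v_i != 0 works)
    i = next(j for j, vj in enumerate(v) if vj != 0.0)
    tau = (x_next[i] - x[i]) / v[i]
    # Midpoint rule along the trajectory s -> x + s v
    h = tau / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        point = [xj + s * vj for xj, vj in zip(x, v)]
        total += f(point, v)
    return total * h

# Example: f(x, v) = x_1 along the segment from (0, 0) to (2, 2) with
# v = (1, 1): the integral of s over s in [0, 2] equals 2.
val = interpolate_J(lambda x, v: x[0], [0.0, 0.0], (1.0, 1.0), [2.0, 2.0])
```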
Then, the ergodic averages for the continuous processes Z_t and Z̃_t can be given in terms of the chains U_k and Ũ_k and their stationary distributions as where i.e. the average switching time for both the exact and the numerical process along a trajectory is finite and constant.
Proof. See Appendix F.
Remark 3. Note now that the right-hand side of Equation (32) can be rewritten as where ϕ(dy, dy', dy'') = ν(dy) P(y, dy') K(dy''|y, y'), where K(·|y, y') is the Markov kernel which selects a point uniformly at random from the line segment joining y and y', and μ is obtained by integrating y, y' out of ϕ.
In particular, it follows that μ is the occupation (and, in this case, stationary) measure of the exact process Z_t on the original state space E. Likewise, we can define with in order to characterise the occupation measure of the approximating process.
We can now prove the main result of this section, which provides an error bound on the ergodic averages obtained from the paths of the continuous-time approximate process.
Theorem 1 (Bound on discrepancy of ergodic averages). Let f : E → R be a bounded and Lipschitz function, and let Then, there exists a probability measure μ̃ such that the limit exists and satisfies where μ is the invariant distribution of the exact Zig-Zag process, and the expectation is the expected square jump distance.
Proof. See Appendix F.
Intuitively, Theorem 1 shows that, despite Z̃_t being non-Markovian, there exists a probability measure that describes its sojourn in the state space, and that integration against this measure reproduces the ergodic averages of the process.
The second, and perhaps most important, consequence of Theorem 1 is a bound on the difference between ergodic averages of the exact process Z_t and those of the approximate process Z̃_t, a bound that depends on how close the skeleton chains (Y_k, Ỹ_k) are, via the distance between the extended skeleton chains (U_k, Ũ_k). Hence ergodic averages computed with the occupation measure μ̃ of the numerical process Z̃_t will be close to those computed via the stationary measure μ of the exact process, and as the numerical tolerances ε_Bre and ε_int go to zero, the two ergodic averages coincide.
The following corollary is a somewhat restrictive but more explicit summary of the results presented in this section.
Corollary 2. Let f : E → R be a bounded and Lipschitz function with compact support. Let η_int ≤ ε_int, let every element of the Hessian of log π(x) be bounded by the constant M, let Y_k be Wasserstein ergodic with constants ρ and C, and let Ỹ_k be Wasserstein ergodic. Then, by Theorem 1 and Propositions 6, 5 and 4, the discrepancy between ergodic averages from the Zig-Zag process Z_t and ergodic averages from NuZZ, Z̃_t, can be bounded as with (46) Remark 4. Note that while the left-hand side of Equation (45) is invariant under constant shifts of the form f → f + c, the right-hand side is not. As such, one may sharpen the bound by taking the infimum of the existing bound over all such shifts.

Numerical experiments
We now study how the Zig-Zag process compares to other popular MCMC algorithms on test problems with features that are often found in practice: high linear correlation, differing length scales, high dimension, fat tails, and position-dependent correlation structure. Our results are shown in Figure 2.
In the following sections we compare the NuZZ method against three popular algorithms. In the figures that follow, the Random Walk Metropolis method is represented by the black line labelled RWM, Hamiltonian Monte Carlo by the green + line labelled HMC, and the simplified Manifold MALA (Girolami et al., 2011) with SoftAbs metric (Betancourt, 2013) by the red line labelled sMMALA. We also include a direct Monte Carlo sample from the target (blue line) as a term of comparison. Every example we examine has been chosen so that it is easy to obtain a direct Monte Carlo sample, and so that the marginal distributions are easily accessible.
For each example, we test two versions of NuZZ. The yellow ♦ line, labelled NuZZroot, represents the performance of NuZZ as if only one gradient evaluation were necessary to obtain each τ̃, which would be the performance of the algorithm if an analytical solution to Equation (9) were available. The yellow line therefore represents a best case for the performance of NuZZ, and indeed of any numerical approximation of the Zig-Zag process. The blue × line, labelled NuZZepoch, represents the performance of the NuZZ algorithm when every gradient evaluation necessary to obtain τ̃ is counted. In this accounting, the current implementation of NuZZ is expensive, as for each root, Λ(x(s), v) has to be evaluated at each node of the numerical integral, for each root-finding iteration.
Lastly, the icdfZZ algorithm mentioned in Section 5.3 is the exact process where τ is obtained via cdf inversion, while the NuZZtol algorithm from Section 5.7 is simply NuZZepoch run with increased tolerances, as explained below.

Tuning
The RWM, MALA and HMC algorithms used in this work are tuned to achieve optimal acceptance rates of roughly 25%, 50% (Roberts & Rosenthal, 2001) and 65% (Beskos et al., 2013) respectively, with exceptions described in each relevant section.
We run all the NuZZ-based algorithms with absolute tolerances of ε_int = 10^-10 for the integration routine and ε_Bre = 10^-10 for Brent's method.
The functions γ_i(x, v) can be set to zero if the target distribution satisfies certain properties (Bierkens, Roberts, & Zitt, 2019); otherwise they can be made small enough to have a negligible effect on the dynamics. Small values of γ_i reduce diffusivity and guarantee better results in terms of asymptotic variance (Andrieu & Livingstone, 2021). As the effect of setting Γ = 0 in a numerical approximation to the Zig-Zag process is unknown, we set Γ = 0.001 in all of our examples, an empirical value small enough (in our case) that the impact on the switching rate is negligible.
The mixing of Zig-Zag-based algorithms is affected by the velocity vector v in the same way that the mixing of Metropolis-based MCMC algorithms is influenced by the covariance of the transition kernel (or the momentum variable). Tuning v_init is discussed in the relevant examples in Sections 5.4 and 5.5. Wherever possible, we used the values of the theoretical covariance matrix to adapt the proposal covariance, mass matrix, and velocity vectors of all the algorithms. We leave the study of how fast the methods adapt to each example using partial MCMC samples for future work.

Performance comparison
As a measure of algorithm performance we chose the largest Kolmogorov–Smirnov (KS) distance between the MCMC sample and the true distribution amongst all the marginal distributions. The KS distance is particularly well suited to our situation, as the marginal distributions of our test problems are known or easily obtainable. We chose not to use the Effective Sample Size (ESS) or other ESS-based metrics, as NuZZ is not an exact algorithm and a high ESS value does not necessarily reflect the quality of the sample (NuZZ with very high tolerances can have excellent ESS but be arbitrarily far from the target distribution). Moreover, the KS distance remains well defined even when the moments of the target are infinite, as in Section 5.7, unlike the ESS. Choosing the KS distance also reflects the requirements of practical Bayesian statistics, where each parameter will typically have a point estimate and credible region reported, meaning that a supremum distance such as the KS is appropriate for assessing the worst-case accuracy of such figures.
Formally, the Kolmogorov–Smirnov distance for the ith parameter is defined as i.e. the supremum of the absolute value of the difference between the exact cdf, F_i, and the empirical cdf constructed from the MCMC sample, F̂_i. The KS distance also has the advantage of being relatively easy to calculate, and it is an integral probability metric with good theoretical properties (Sriperumbudur et al., 2009). The largest KS distance over the marginal distributions, which we refer to throughout as the "D-statistic", is simply Since the empirical cdf is calculated via Monte Carlo simulation, we ran each algorithm 40 times and report the mean and two-sided 95% region of the KS distance in plots such as Figure 2.
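A minimal sketch of this diagnostic (the function names and the standard-normal check are ours, not the paper's): the D-statistic is the maximum, over marginals, of the sup-distance between the exact and empirical cdfs, with the supremum attained at the jump points of the empirical cdf.

```python
import math

def ks_distance(sample, exact_cdf):
    # sup_x |F(x) - F_hat(x)|, evaluated at the jumps of the empirical cdf
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        F = exact_cdf(x)
        # the empirical cdf jumps from k/n to (k+1)/n at x
        d = max(d, abs(F - k / n), abs(F - (k + 1) / n))
    return d

def d_statistic(samples, exact_cdfs):
    # samples: list of points, each a d-vector; exact_cdfs: one cdf per marginal
    dims = len(exact_cdfs)
    return max(
        ks_distance([pt[i] for pt in samples], exact_cdfs[i]) for i in range(dims)
    )

# Deterministic check: a grid of standard-normal quantile mid-points has
# KS distance 0.5/n to the standard normal cdf (not an MCMC sample).
normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_quantile(p, lo=-10.0, hi=10.0):
    # inverse cdf via bisection, purely for the illustration
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 1000
grid = [normal_quantile((k + 0.5) / n) for k in range(n)]
D = d_statistic([(g,) for g in grid], [normal_cdf])
```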
In order to meaningfully compare the algorithms studied in this work, we gave ourselves a fixed computational budget. We define one epoch as a complete passage of the algorithm through the whole dataset. For each model, we allowed a computational budget of 6 × 10^6 epochs, and counted one epoch every time a full gradient or likelihood evaluation is completed. For sMMALA, we ignored the computational cost of computing the Hessian matrix, since the actual cost of this is highly problem-dependent.
As each algorithm that we studied has a different computational cost, each returns a sample of different size. In order to meaningfully compare the D-statistic across samples of different sizes, we divided the sample from each algorithm into 40 equal batches, a number large enough to give a detailed account of convergence, but not too large for our computational resources. We then produced the lines in Figure 2 by calculating the D-statistic on subsamples formed by an increasing number of batches, with the first value of each line calculated on a single batch and the last calculated on the whole sample. The Zig-Zag-based algorithms that we study return a trajectory represented by a series of switching points. These trajectories then have to be uniformly subsampled to obtain a sample from the distribution of interest, π(x). In our experiments we extracted 6 × 10^6 samples from each trajectory.
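The cumulative-batch scheme can be sketched as follows; the helper names are hypothetical, and a toy uniform-cdf KS statistic on a deterministic low-discrepancy sequence stands in for the marginal D-statistic of an MCMC sample.

```python
def cumulative_batch_statistic(sample, n_batches, statistic):
    """Split `sample` into n_batches equal batches and evaluate `statistic`
    on the subsamples formed by the first 1, 2, ..., n_batches batches."""
    size = len(sample) // n_batches
    return [statistic(sample[: (b + 1) * size]) for b in range(n_batches)]

def ks_uniform(xs):
    # KS distance of a sample on [0, 1] to the uniform cdf F(x) = x
    ys = sorted(xs)
    n = len(ys)
    return max(max(abs(y - k / n), abs(y - (k + 1) / n)) for k, y in enumerate(ys))

# Deterministic golden-ratio (Weyl) sequence: the statistic shrinks as
# more batches are included, mimicking the convergence lines of Figure 2.
sample = [(i * 0.6180339887498949) % 1.0 for i in range(4000)]
curve = cumulative_batch_statistic(sample, 40, ks_uniform)
```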

Standard normal
The first model we study is a 10-dimensional standard normal distribution, which lacks any of the challenging features discussed later and as such provides a baseline for algorithm performance. It is a model with analytic solutions for the Zig-Zag Sampler, so it allows us to compare the performance of the Zig-Zag with cdf inversion (pink line labelled icdfZZ in Figure 2a) with that of the NuZZroot algorithm. As one would expect, direct sampling from the target density shows the fastest convergence.
The three popular algorithms RWM, sMMALA and HMC are all quite close to each other in terms of performance. HMC, which was tuned to take 3 leapfrog steps with step size 0.6 to achieve alternate autocorrelation (identity mass matrix), performs slightly worse than sMMALA. Note that for normal distributions the Hessian is constant, so there is no difference between sMMALA and standard MALA. The RWM converges quite well, but falls behind sMMALA and HMC in terms of performance.
The Zig-Zag-based algorithms are represented by the blue ×, yellow ♦, and pink lines. As expected, NuZZepoch has the slowest convergence, as root finding is expensive in terms of epochs. This leaves the algorithm with a budget of around 8 × 10^3 switching points, which is not sufficient to explore the density as well as the other algorithms. If this algorithm represents a lower bound for the performance of numerically approximated Zig-Zag Samplers, the yellow ♦ line of NuZZroot represents the upper bound; it significantly outperforms every other MCMC algorithm tested in efficiency.
Of particular interest is the convergence of the analytical version of the Zig-Zag Sampler (pink line labelled icdfZZ) on this model. Its performance is practically indistinguishable from that of NuZZroot, providing further evidence in support of the discussion of numerical errors in this work.

Different length scales
Having measured the performance of the algorithms on a standard normal, we now introduce the features of interest which commonly appear in Bayesian inverse problems, starting with differing length scales. Some algorithms struggle on these targets, as the length of their trajectory is tuned to one particular scale. The trajectory will therefore be too long for variables with smaller scales, doubling back and wasting computational resources, while it will be too short for variables with larger scales, giving a more correlated sample.
A common test problem with these features is Neal's Normal (Neal, 2010), an uncorrelated normal distribution (we take the dimension to be 10, to be consistent with the rest of the test problems) with variances [1², 2², . . ., 10²]. The results are shown in Figure 2b.
Tuning the components of the velocity vector v for Zig-Zag-based processes helps mixing on this problem. The simplest choice for the initial velocities is v_i = 1, i ∈ {1, . . ., d}; however, it makes sense to have different velocities for components with different scales. When we tune the velocities of the NuZZ algorithm, we use the theoretical standard deviations σ_i for each variable, and normalise them to have fixed speed ||v||_2 = √d, as for the original process. Thus the individual values of the v_i change, but the overall speed of the process remains the same. Explicitly, if we have a set of standard deviations σ_i, then the velocity update we use is The velocities can also be tuned adaptively (Roberts & Rosenthal, 2007), using the history of the process to calculate standard deviations σ_i for each variable on the basis of observed sample properties during the adaptation phase. Our experiments suggest that the adaptively estimated velocities converge quickly to their final values, as they only need information about the relative scale of the components.
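The normalisation described above amounts to scaling each velocity component by the corresponding marginal standard deviation and rescaling so that ||v||_2 = √d; a minimal sketch with illustrative names, using the Neal's Normal scales as an example:

```python
import math

def tune_velocities(sigmas):
    """Set v_i proportional to sigma_i, normalised so that ||v||_2 = sqrt(d)."""
    d = len(sigmas)
    norm = math.sqrt(sum(s * s for s in sigmas))
    return [math.sqrt(d) * s / norm for s in sigmas]

# Neal's Normal scales: sigma_i = i for i = 1, ..., 10
v = tune_velocities([float(i) for i in range(1, 11)])
speed = math.sqrt(sum(vi * vi for vi in v))  # equals sqrt(10) by construction
```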
As before, direct sampling is the most efficient method, while NuZZepoch is too expensive to provide a good sample. NuZZroot, on the other hand, performs well, surpassing the Metropolis-based algorithms. RWM and sMMALA again perform at a similar level, as they apply the same kind of global conditioning to their proposals. HMC outperforms both of them, with step size 0.6, 3 leapfrog steps, and mass matrix M = Σ^-1.

Linear correlation
Another common issue that arises in practice is posteriors whose variables have strong linear correlation. In this section we compare the performance of the algorithms on a 10-dimensional multivariate normal with covariance matrix where we pick α = 0.9. The results can be seen in Figure 2c.
Unsurprisingly, the performance of RWM and sMMALA is quite similar, as they both use rotation matrices to improve their proposals. HMC was tuned with mass matrix M = Σ^-1, step size 0.6 and 3 leapfrog steps, with a trajectory long enough to achieve alternate autocorrelation. HMC outperforms both the RWM and MALA, though the difference is not substantial. This is because MALA is theoretically equivalent to HMC with one leapfrog step, so an HMC trajectory consisting of only three leapfrog steps will naturally yield similar results.
The Zig-Zag algorithms do not perform well in this setting. NuZZroot performs worse than a well-tuned Random Walk. While the RWM can take advantage of the whole matrix Σ, the strategy for tuning v applied in the previous section only utilises the marginal standard deviations, ignoring information on the correlation of the components. This is largely due to the fact that the process can only travel in certain directions, and if these directions are not conducive to travelling along a ridge in the target density, then the process does not mix well. This highlights the importance of the work in Bertazzi and Bierkens (2020), where the authors obtain results on using the whole covariance matrix of the target to tune the dynamics of the process and achieve better mixing. The linearly correlated target considered here is the closest example to the real-data example in the Supplementary Material, Section S.1, in terms of posterior properties and hence algorithm performance.

High dimension
The next example we study is a standard normal in 100 dimensions. Some algorithms, e.g. RWM, are known to scale particularly poorly with dimension, while HMC is better adapted to this scenario. For this particular experiment, we reduced the computational budget to 1 × 10^6 epochs, in order to run the repeats in reasonable time. As a consequence, the KS distance is generally one order of magnitude larger than for the other models. The results of our tests are shown in Figure 2d.
The performance of HMC and sMMALA is quite similar, suggesting that the use of gradient information has a positive effect on this problem. Conversely, the RWM performs significantly worse.
No preconditioning was used for RWM, MALA or HMC, and HMC was tuned to have step size 0.73 and 3 leapfrog steps (≈ d^{1/4}, Beskos et al. (2013)). The performance of HMC and sMMALA is still close, as HMC only needs 3 leapfrog steps to reach alternate autocorrelation. Moreover, it should be pointed out that even though MALA is theoretically equivalent to HMC with one leapfrog step, in practice, due to the leapfrog scheme, one HMC iteration with a single leapfrog step is more expensive than one MALA iteration.
Interestingly, the position of the yellow ♦ NuZZroot line relative to the sMMALA and HMC lines suggests that the Zig-Zag dynamics do not scale as well in high dimension, though they remain superior to RWM.

Fat tails
In this example we look at how the algorithms perform when the target has fat tails. This feature is particularly common in particle physics, where posteriors are often Cauchy-like, with power-law decay in their tails. The computational budget is again set to 6 × 10^6 epochs, and the target distribution is a 10-dimensional Student-t with one degree of freedom, so that all moments are infinite.
In Figure 2e the curves are tightly packed, with overlapping confidence intervals. RWM, HMC and sMMALA display similar convergence speeds, with sMMALA the fastest. This may be due to the fact that the Hessian is not constant in this example, which may provide some helpful local information to the dynamics. HMC was tuned to take 20 leapfrog steps with a step size of 0.3. The mass matrix was taken to be the identity, as there is no correlation in this example and the marginals are all equal. As the moments of the distribution are infinite, we used a fixed identity matrix for the proposal.
The Zig-Zag-like algorithms perform quite well. The blue × line representing NuZZepoch is near the others. The yellow ♦ line corresponding to NuZZroot is the second lowest on the chart, outperformed only by direct sampling from the density. The good performance of the Zig-Zag-based algorithms is due to the fact that the process is able to make longer excursions into the tails more frequently than the other methods, thanks to the low gradient of the potential away from the mode.
In the Supplementary Material, Section S.2, we show numerical experiments to find the largest numerical tolerances such that the difference in the posterior sample is not detectable in this example. While one iteration of the adaptive integration routine is usually enough to fall within the tolerance ε_int, preventing us from detecting its effect, we found that in this example the numerical error overtakes the Monte Carlo error at a value of about ε_Bre = 10^-2. The NuZZtol process simply corresponds to NuZZepoch run with an increased tolerance of ε_Bre = 10^-2.

Non-linear correlation
We move on to study how the algorithms behave in the regime of curved (non-constant, non-linear) correlation structure. A distribution belonging to this class cannot be characterised by a global covariance matrix, as global conditioning does not capture the local characteristics of the target. Such targets commonly look like curved ridges, or 'bananas', in the state space. These shapes are common in many fields of science, often in hierarchical models or models affected by some sort of degeneracy in their parameter space (e.g. House et al., 2016; The Dark Energy Survey Collaboration et al., 2017).
In this section we use the Hybrid Rosenbrock distribution from Pagani et al., 2021, and more specifically the distribution where x ∈ R^d, a = 2.5, b = 50. The parameters a, b have been chosen so that the shape of the distribution does not pose an extreme challenge to the algorithms we are studying. The contours of this distribution generally look like those shown in Figure 1. The results of our experiments are shown in Figure 2f.
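As a hedged illustration of the curved 'banana' geometry involved, the following sketch evaluates the classic two-dimensional Rosenbrock-type log-density with the a, b values quoted above. This is our illustrative building block of the kind used in the hybrid construction, not the full Hybrid Rosenbrock density itself, and the names are ours.

```python
def log_rosenbrock_2d(x1, x2, a=2.5, b=50.0, mu=1.0):
    # Unnormalised log-density of a 2-d Rosenbrock-type "banana":
    # the ridge of high density follows the parabola x2 = x1^2.
    return -a * (x1 - mu) ** 2 - b * (x2 - x1 ** 2) ** 2

# Along the ridge x2 = x1^2 the second term vanishes, so the density
# decays only through the (comparatively weak) first term.
on_ridge = log_rosenbrock_2d(2.0, 4.0)   # -2.5
off_ridge = log_rosenbrock_2d(2.0, 3.0)  # -52.5
```

The sharp drop away from the ridge (here, 50 log-units for a unit step off the parabola) is what defeats samplers that rely on a single global covariance.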
It should be noted that the marginals of the Hybrid Rosenbrock are not known analytically, but it is possible to sample from them directly. To produce convergence plots based on the D-statistic, we built empirical cumulative distribution functions for each marginal of our target, based on 2 × 10^10 samples. The Monte Carlo error introduced into our analysis by this step is therefore negligible.
Even though the curvature is not extreme, the RWM performs significantly worse than the other standard MCMC algorithms. The sMMALA algorithm performs well, as it is designed to deal with this class of densities, and the gap with the other algorithms would be larger if the shape of the target were narrower, with longer tails. HMC works quite well, even though it is difficult to tune for this target. HMC was run with step size 0.03, 20 leapfrog steps and identity mass matrix, aiming for an acceptance rate of 80% (Betancourt et al., 2015). The HMC tuning was obtained by repeating the analysis with several different combinations of step size and number of leapfrog steps, and it corresponds to the best ESS we were able to obtain, with speed of convergence close to sMMALA.
The first part of the NuZZepoch line has KS distance close to one, which means that the process has not had enough time to escape the local region where it started. However, after ≈ 10^6 epochs it starts converging like the other algorithms. The yellow ♦ line corresponding to NuZZroot is above the sMMALA and HMC lines, but close to them. This is quite a promising result, as HMC is notoriously difficult to tune, especially on this class of problems, while NuZZ requires little to no tuning at all. The velocity vector was not adapted from the initial values v = (1, . . ., 1), as in Section 5.4, because the scales of the components are roughly the same.

Conclusions
In this work we discussed how the standard Zig-Zag Sampler of Bierkens, Fearnhead, et al., 2019 is limited in applicability to models where an analytical solution is available, or where an efficient bound based on the Jacobian or Hessian is available. We then presented the NuZZ algorithm, which carries out numerically the steps that would otherwise require analytical solutions or efficient bounds. As opposed to the standard Zig-Zag sampler, NuZZ can be applied to general models, allowing us to study the properties of the Zig-Zag dynamics from the practitioner's point of view without having to derive potentially complicated thinning bounds beforehand.
Numerical errors in the sampling of the switching times in NuZZ lead to a perturbation of the occupation measure of the process. We showed that a bound on the Wasserstein distance between the target and the invariant distribution depends linearly on the tolerances of both the numerical integration and root-finding routines. This demonstrates that NuZZ can be used to sample from any given target distribution up to a prescribed degree of accuracy.
We tested Zig-Zag-based algorithms against some popular MCMC algorithms, i.e. Random Walk Metropolis, Hamiltonian Monte Carlo and simplified Manifold MALA, on benchmark problems that display important features often occurring in practice. NuZZ is an expensive algorithm to run, as the rate functions λ_i have to be evaluated at each node of the numerical integrator, for each iteration of the root-finding method. As such, NuZZ per se is not competitive for practical use with respect to other MCMC algorithms. However, NuZZ allowed us to study the Zig-Zag dynamics on important test problems. Our conclusion is that, overall, the Zig-Zag dynamics are competitive with other popular MCMC dynamics, and in some cases outperform them even in the absence of super-efficiency through sub-sampling, under the assumption that at most one epoch is used per switching event. Importantly, the approach we develop to quantify numerical errors in NuZZ is quite general, and can be applied to other PDMP-based MCMC algorithms.
While there are many questions remaining about the Zig-Zag process and PDMPs in MCMC in general, several directions naturally lead on from our work. A natural question is whether the numerical efficiency of NuZZ can be improved, for example by using more tightly integrated, bespoke integration and root-finding routines. Finally, the technique used to obtain bounds on the ergodic averages between the exact and numerically approximated processes can be extended to other PDMPs.

A The Sellke ZigZag process
In this section we describe a method for augmenting the state space to relate NuZZ to the construction (Sellke, 1983) discussed in Section 3.1. This can be done by changing the structure of the Zig-Zag sampler from a PDMP that jumps in conjunction with an arrival of a Poisson process to a PDMP that jumps when one of its components touches an active boundary (Davis, 1993). We call this Markov process the Sellke Zig-Zag (SeZZ), and it can be shown, using methods from Davis (1993), to target the correct invariant measure. All the processes considered in this section are one-dimensional (d = 1).
Let us define a discrepancy variable q(t) as

q(t) = q(T_k^+) − ∫_{T_k}^{t} λ(x(s), v(s)) ds,  t ∈ [T_k, T_{k+1}].

The evolution of q(t) in time can be seen as a jump process in which the value of q(t) decreases deterministically according to

dq(t)/dt = −λ(x(t), v(t)),   (52)

until it reaches q(τ−) = 0. At that point it jumps to q(τ+) = R ∼ Exp(1), and continues decreasing according to Equation (52), analogously to the way in which the differential equation (1) determines the dynamics of the state x. By adding the discrepancy variable q(t) to the existing position x(t) and velocity v(t) variables, we can now define the SeZZ process on (x, v, q) as a PDMP in which a velocity flip is triggered by the variable q(t) hitting the active boundary at 0.
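As an illustration, the SeZZ dynamics above can be simulated directly by discretising the joint evolution of (x, v, q). The following sketch (not part of the paper's implementation; it assumes a standard Gaussian target, for which λ(x, v) = (v x)_+, and uses a simple Euler scheme) flips the velocity whenever the discrepancy variable hits zero and refreshes q with a fresh Exp(1) draw:

```python
import numpy as np

rng = np.random.default_rng(1)

def switching_rate(x, v):
    # Rate for a standard Gaussian target, U(x) = x**2 / 2: lambda = (v x)_+
    return max(0.0, v * x)

def sellke_zigzag(T=1_000.0, dt=1e-3):
    """Euler discretisation of the 1-d SeZZ dynamics on (x, v, q)."""
    x, v = 0.0, 1.0
    q = rng.exponential()                 # q starts at R ~ Exp(1)
    xs = np.empty(int(T / dt))
    for k in range(xs.size):
        x += v * dt                       # dx/dt = v
        q -= switching_rate(x, v) * dt    # dq/dt = -lambda(x, v)
        if q <= 0.0:                      # active boundary at q = 0:
            v = -v                        #   flip the velocity and
            q = rng.exponential()         #   refresh the discrepancy variable
        xs[k] = x
    return xs

xs = sellke_zigzag()
print(float(xs.mean()), float(xs.var()))  # should be close to 0 and 1
```

The trajectory's empirical mean and variance approximate those of the N(0, 1) target, up to discretisation and Monte Carlo error.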
In order to prove that this process targets the correct stationary distribution, we will show that the forward Kolmogorov equation is equal to zero. However, as this process has an active boundary in its state space, its generator differs from that of Equation (3); we follow the methods of Davis (1993, p. 118).
Let the variable q be defined on the space Q = [0, ∞), which can be partitioned into the subsets Q_0 = (0, ∞) and {0}. Let E = X × V × Q_0 be the interior of the state space, with X = R and V = {−1, +1}, and let B = X × V × {0} be an active boundary of the state space. While travelling through E, the process is described by the drift operator

A f(x, v, q) = v ∂f/∂x − λ(x, v) ∂f/∂q,

meaning that motion is completely deterministic and there are no jumps. When the variable q(t) reaches zero, the process touches the active boundary B, causing the velocity to switch, and q is refreshed with a new exponential random variable. This change is expressed through the boundary operator

B f(x, v, 0) = ∫_0^∞ f(x, −v, q) e^{−q} dq.

The SeZZ process targets the stationary measure

µ(dx × dv × dq) = π(x) ψ(v) ϕ(q) dx dv dq,   (55)

where ψ(v) = 1/2 is the uniform density on V, ϕ(q) = e^{−q} is the Exp(1) density, and where the measure ρ is induced on the boundary by µ and the process.
Theorem 2. The one-dimensional SeZZ process, with extended generator given by the drift and boundary operators above, targets the marginal stationary distribution π(x).
Proof. Let the function f : E → R be absolutely continuous and measurable (Davis, 1993, pp. 82, 118). Following Davis (1993, p. 118), we will show that SeZZ has π(x) as its marginal stationary distribution by proving that Equation (56) is satisfied for the quantities in (60). Substituting these quantities into Equation (56), and assuming that π(x) = 0 at x = ±∞, we obtain Equation (61), where the common constant 1/(2Z) has been cancelled. The equality follows by integration by parts and rearranging terms.
Our starting point is Equation (61) above, which we re-write as the sum of three terms, I1 + I2 + I3; we will show that each of these terms equals zero.
To verify this equation, we proceed by analysing each of the three terms individually. We will often exchange the order of integration, making use of the Fubini-Tonelli theorem. Rearranging the factors and applying integration by parts, the first term I1 vanishes, where the boundary term in square brackets is zero because we work with target densities π(x) ∝ e^{−U(x)} that tend to zero as x → ±∞. Again, we rearrange the factors and apply integration by parts to show that the second term I2 vanishes. Finally, a direct computation shows that the third term I3 also reduces to zero.
The SeZZ process forms the basis for NuZZ, which uses numerical approximations to find the roots of (51). Since these approximations depend on the last switching location through the generation of quadrature rules, NuZZ is not Markov. Expanding the state space to include the last switching point does induce a Markov process, but the calculations that this gives rise to proved unwieldy; hence the approach taken in this work.

B Proof of Proposition 1
Following Davis (1984, 1993), the generator of the process described in the statement of the result is given by Equation (70), where the first term represents the drift and the second the velocity switch. The process switches velocities with overall rate Λ, and the index r of the switching component is sampled according to the multinomial transition kernel in (11). After some standard manipulations, Equation (70) reduces to the generator of the original Zig-Zag process, and therefore the two processes are equivalent.

C Linear bound for the ZZCV algorithm
In the case of the ZZCV algorithm on a logit model, the bound λ̄_i is linear, i.e. λ̄_i(x(s), v) = a_i(x, v) + s b_i(x, v), and can be obtained as in Equation (75). There, we have applied control variates, added and subtracted the term v_i ∂_i U^j(x), and applied the Lipschitz condition. We moved the terms a_i := C_i ‖x − x*‖_2 and b_i := C_i √d out of the (·)_+ function, as they are both always positive, the Lipschitz constant C_i being computed in the 2-norm. We then cancelled the term (v_i ∂_i U(x*))_+ remaining in the bracket, as the derivative at the MLE vanishes. Now Equation (9) can be rewritten as Equation (76), which can be solved analytically, and i* can be sampled from a multinomial with cell probabilities λ̄_i(x(τ), v)/Λ̄(x(τ), v) = (a_i + τ b_i)/(A + τ B). Finally, the proposed τ is accepted as a switching time with probability λ_i(x(τ), v)/λ̄_i(x(τ), v); otherwise the dynamics moves on to x + τ v, v is unchanged, and a new potential switching time is sampled.
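The accept/reject scheme just described can be sketched as follows for a single rate. The code below is an illustration, not the paper's logit example: it assumes a one-dimensional rate λ(x, v) = (v x)_+ bounded along the trajectory by a valid linear bound a + b s, inverts the cumulative bound analytically to propose candidate times, and accepts each with probability λ/λ̄:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_rate(a, b):
    """First arrival of a Poisson process with rate a + b*s (a, b >= 0),
    found by inverting a*tau + b*tau**2/2 = E with E ~ Exp(1)."""
    e = rng.exponential()
    if b == 0.0:
        return e / a
    return (-a + np.sqrt(a * a + 2.0 * b * e)) / b

def thinned_switch_time(rate, a, b, x0, v):
    """Propose from the linear bound a + b*t and accept with
    probability rate / bound (Poisson thinning)."""
    t = 0.0
    while True:
        # next arrival of the bounding process, started from time t
        t += sample_linear_rate(a + b * t, b)
        if rng.uniform() * (a + b * t) <= rate(x0 + t * v, v):
            return t

# True rate along the ray from x0 = 2 with v = 1: lambda(t) = 2 + t,
# validly bounded by 2.5 + 1.2 t for all t >= 0.
rate = lambda x, v: max(0.0, v * x)
times = [thinned_switch_time(rate, a=2.5, b=1.2, x0=2.0, v=1.0)
         for _ in range(20_000)]
print(float(np.mean(times)))  # exact mean is about 0.4214
```

Thinning leaves the law of the switching time exactly that of the inhomogeneous Poisson process with the true rate, regardless of how loose the bound is; a looser bound only costs more proposals.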

D Wasserstein distance, preliminary results
Definition 3. The generalised Wasserstein ergodicity coefficient of a kernel P, in terms of the 2-Wasserstein distance, is defined as

κ(P) = sup_{µ1 ≠ µ2} d_W(µ1 P, µ2 P) / d_W(µ1, µ2),

where the supremum runs over probability measures with finite second moments. In this work we assume that both κ(P) < +∞ and κ(P̃) < +∞.
Proof of Proposition 7. Consider the two statements separately.
1. Let (X_{µ1}, X_{µ2}) be a realisation of the optimal coupling between µ1 and µ2 for d_W. Let Y, Y′ be such that (Y, Y′) | (X_{µ1}, X_{µ2}) is a realisation of the optimal coupling between δ_{X_{µ1}}P and δ_{X_{µ2}}P, so that Y ∼ µ1 P and Y′ ∼ µ2 P. The claimed bound then follows, where we use the fact that d_W is an infimum, the tower property of the expected value, and Definition 3; on the last line, we use the fact that X_{µ1} and X_{µ2} are optimally coupled.
2. Using the contraction property on P̃, one obtains κ(P P̃) ≤ κ(P) κ(P̃) by repeated application of Definition 3.

D.1 Proof of Proposition 2
Proof of Proposition 2. We follow the proof of Rudolf and Schweizer (2018). Recall that, as P is Wasserstein geometrically ergodic, κ(P^n) ≤ C ρ^n with ρ < 1. The stated bound then follows by induction, a coupling argument, and the contraction property from Proposition 7.

Proof of Proposition 3. Let E be a complete metric space. By the Banach fixed point theorem (and Definition 1), there exists a measure π with finite second moment such that P^n(x, ·) → π in Wasserstein distance. By a coupling argument similar to the one found in the proof of Proposition 7, we find that πP = π, and hence that π is invariant.
We will now prove that π is ergodic.Let A be an invariant set, that is for all x ∈ A, P (x, A) = 1.Clearly, for any x ∈ A and for all n > 0, we have P n (x, A) = 1.Since P n (x, •) → π, this implies that π(A) = 1, which implies that π is an ergodic measure.
Finally, by using again a coupling argument similar to the one found in the proof of Proposition 7, we find that νP n → π in Wasserstein distance for ν with finite second order moments.

E.1 Proof of Lemma 1
Proof of Lemma 1. Let the error in the integral of the switching rate, due to having a numerically approximate switching time τ̃ in place of τ, be defined as in Equation (82). Applying the Mean Value Theorem to f(t) = ∫_0^t Λ_z(s) ds, we obtain that, for some c ∈ [τ, τ̃], f(τ̃) − f(τ) = Λ_z(c)(τ̃ − τ), and hence that this expression is bounded from below. Combining Equations (82), (83) and (84) yields the claimed bound. This result can be combined with Proposition 4 to show the dependence of the error bound on the numerical tolerances of the inner solvers.

E.2 Proof of Lemma 2
Proof of Lemma 2. Let λ_i(x, v) = max(0, −v_i ∂_i log π(x)) + γ_i be the unnormalised probability of switching from velocity v to velocity v′ = F_i(v) at the time of an event at position x. The terms of this expression can be bounded as follows. Combining these bounds, one can bound the term on the left-hand side of Equation (86); it follows that the same bound holds along the same trajectory, which concludes the proof.
F Invariant measure results
The laws of (Y_1, Y_2) and (Ỹ_1, Ỹ_2) are ζ and ζ̃ respectively, which gives us a coupling between ζ and ζ̃. We can therefore deduce the claimed bound.

F.3 Proof of Lemma 4
Proof of Lemma 4. The ergodic averages of a function f of the process Z_t up to time T_n can be written in terms of the interpolation operator J and a set of velocities u_k, as in Equation (32). This expression is obtained by starting with the integral on its left-hand side, applying the transformation t = (s − T_{k−1})/(T_k − T_{k−1}), defining each velocity as displacement divided by time, and multiplying both sides by T_n^{−1}. Since by Lemma 3 the chain U_k has a unique invariant measure, and J f ∈ L^p(ζ) for p ∈ [1, ∞), we use the Birkhoff theorem (Sandrić, 2017) to take the limit and write Equation (33). The same calculations apply mutatis mutandis to Z̃_t and Ũ_k.
Proof of Lemma 5. Write the difference directly, where x_{k,t} = (1 − t)x_k + t x̃_k, using that ‖v‖_2 ≡ √d. Now define f_k(t) = f(x_{k,t}, v_k), and decompose the difference of the integrands above into two terms. One can then bound each of these for t ∈ [0, 1]; additionally, the triangle inequality bounds the remaining term. The result follows by integration.

F.5 Proof of Theorem 1
Proof of Theorem 1. The measure µ as defined in Equation (43) exists, as shown in Lemma 4 and Remark 3.
Our argument for deriving the bound in Inequality (44) consists in proving that the right-hand sides of Equations (32) and (33) from Lemma 4 are close to each other. In order to do so, we need to prove that H and H̃ are close, and that ζ and ζ̃ are close in Wasserstein distance. From Lemma 5, we recall that, for all u_1, u_2 ∈ E² and f bounded and Lipschitz, the stated bound holds. We can thus conclude, where the final step uses that E_φ[‖u_1 − u_2‖²]^{1/2} = d_W(ζ, ζ̃), with φ the optimal coupling.

S.2 Effects of tolerances
As mentioned in the text, in most experiments we set both the integration routine's absolute tolerance and Brent's method's absolute tolerance to 10^−10.
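For concreteness, the role these two tolerances play can be sketched as follows: the switching time solves ∫_0^τ Λ(x + s v, v) ds = E for E ∼ Exp(1), which a NuZZ-style implementation tackles with a quadrature routine inside a root finder. The sketch below is an illustration, not the paper's code; it uses SciPy's QUADPACK-based quad and brentq on a one-dimensional Gaussian target, for which the integral has a closed form that lets us check the answer:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

rng = np.random.default_rng(3)

def switch_time(rate, x, v, E, eps_int=1e-10, eps_bre=1e-10):
    """Numerically solve  int_0^tau rate(x + s v, v) ds = E  for tau,
    with quadrature tolerance eps_int and root-finding tolerance eps_bre."""
    def G(tau):
        val, _ = quad(lambda s: rate(x + s * v, v), 0.0, tau, epsabs=eps_int)
        return val - E
    hi = 1.0
    while G(hi) < 0.0:   # bracket the root by doubling
        hi *= 2.0
    return brentq(G, 0.0, hi, xtol=eps_bre)

# Gaussian target: rate(x, v) = (v x)_+; starting at x = 1, v = 1 the
# equation  tau + tau**2/2 = E  has exact root  tau = -1 + sqrt(1 + 2E).
E = rng.exponential()
tau = switch_time(lambda x, v: max(0.0, v * x), 1.0, 1.0, E)
exact = -1.0 + np.sqrt(1.0 + 2.0 * E)
print(abs(tau - exact) < 1e-8)  # True
```

With both tolerances at 10^−10, the numerical switching time agrees with the analytical one to well below the Monte Carlo error of any realistic run.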
In the example from Section 5.7 we changed these tolerances systematically and assessed their impact on the accuracy of the posterior samples.
Figure S.2 shows how D, the worst Kolmogorov-Smirnov distance on the marginals, increases as we increase ε_Bre. In this example, with 6 × 10^6 samples, the Monte Carlo error dominates the numerical error for all ε_Bre < 10^−2. This result is model dependent, but it supports the point that, even though NuZZ uses numerical approximations to sample the switching times, if the numerical error is small, the difference in the posterior is not detectable.
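The diagnostic D can be computed as in the following sketch (a stand-alone illustration: a hypothetical 3-dimensional standard Gaussian sample stands in for actual NuZZ output, and the known target marginals are N(0, 1)):

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(7)

# Hypothetical stand-in for an MCMC sample: i.i.d. draws from the target.
sample = rng.standard_normal((6_000, 3))

# D = worst Kolmogorov-Smirnov distance between each empirical marginal
# and the corresponding known target marginal CDF.
D = max(kstest(sample[:, i], norm.cdf).statistic
        for i in range(sample.shape[1]))
print(D)  # of order n**-0.5, i.e. roughly 0.01 here
```

For an exact sampler D shrinks at the Monte Carlo rate n^{−1/2}; a floor in D as the sample size grows signals that numerical error has started to dominate.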
We also tested the influence of the integration error on the quality of the NuZZ sample. However, since the QAGS integration routine performs a minimum of one 15-point Gaussian quadrature step, this is often more than enough to reach the condition η_int ≤ ε_int = 10^−10. We were therefore unable to coarsen this part of the approximation sufficiently to produce the analogue of Figure S.2 for ε_int; we simply note that, for all the examples we considered, one integration step is usually sufficient, and the integration tolerance acts as a fail-safe for when it is not.

Figure 2: Convergence plots. D represents the largest Kolmogorov-Smirnov distance on the marginal distributions between the MCMC sample and the known target.
with T_k representing the jumping times and δ_r being the Dirac measure at r ∈ E × R (Davis, 1984, p. 367). Let the stationary measure be as in Equation (55) and, following Löpker and Palmowski (2013), let the boundary measure be ρ(dx × dv) = Λ(x, v) π(x) ψ(v) ϕ(0) dx dv.

Figure S.2: Dependence between the largest Kolmogorov-Smirnov distance on the marginals and the root-finding tolerance ε_Bre, for a fixed integration tolerance ε_int = 10^−10. The computational budget was 6 × 10^6 epochs, which was processed into 6 × 10^6 NuZZ samples.
F.1 Proof of Lemma 3
Proof of Lemma 3. For any invariant measure of U_k = (Y_{k−1}, Y_k), the projection onto the first coordinate is an invariant measure of Y_k, which is therefore ν since it is unique. In addition, P(Y_{k+1} | Y_k) = P(Y_k, Y_{k+1}); hence an invariant measure of U_k must be of the form ζ(A) = ∫ P(y_1, {y_2 | (y_1, y_2) ∈ A}) dν(y_1) for A measurable. Since ζ is clearly invariant for U_k, it is the unique invariant measure of U_k. The same can be said for Ũ_k and ζ̃.