Wavelet Monte Carlo: a principle for sampling from complex distributions

We present Wavelet Monte Carlo (WMC), a new method for generating independent samples from complex target distributions. The methodology is based on wavelet decomposition of the difference between the target density and a user-specified initial density, and exploits both wavelet theory and survival analysis. In practice, WMC can process only a finite range of wavelet scales. We prove that the resulting L1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L_1$$\end{document} approximation error converges to zero geometrically as the scale range tends to (-∞,+∞)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(-\infty ,+\infty )$$\end{document}. This provides a principled approach to trading off accuracy against computational efficiency. We offer practical suggestions for addressing some issues of implementation, but further development is needed for a computationally efficient methodology. We illustrate the methodology in one- and two-dimensional examples, and discuss challenges and opportunities for application in higher dimensions.


Introduction
Accurately and efficiently calculating functionals of probability distributions is central to the majority of statistical applications. Such calculations, particularly in a Bayesian context, can be intractable, requiring integration over complex and high-dimensional distributions. To address this problem, a variety of methods have been developed, such as Importance Sampling, Rejection Sampling, and Markov Chain Monte Carlo (MCMC); see for example Robert and Casella (2004). These methods, collectively known as Monte-Carlo methods of integration, produce either independent or dependent samples from a distribution of interest. Under regularity conditions specific to each method, theoretical quantities of interest are then approximated by averages over the sampled points. We henceforth refer to a distribution Methods of Monte-Carlo integration each have their advantages and disadvantages in terms of accuracy, computational efficiency and human demands. All methods benefit from a good initial sampling distribution that wellapproximates the target distribution, but such a sampling distribution may be elusive especially when the target is irregular: perhaps multimodal with intricate, convoluted or concentrated density contours. Such problems are exacerbated in high dimensions, but can be challenging even in low-dimensional settings. Large sample sizes or run lengths may be required to compensate for a poor initial sampling distribution, but even then may produce misleading results, in particular when regions of the target are under-represented or entirely missed by the generated samples.
Here we present a new method of Monte-Carlo integration: Wavelet Monte Carlo (WMC) 1 , first introduced by one of us in Cironis (2019). Our WMC methodology is built upon a novel combination of the theories of wavelet decomposition and survival analysis. Particles are independently sampled from the initial distribution and then undergo a sequence of transitions based on a wavelet approximation to the difference between the initial and target distributions. The resulting set of particles is an independent sample approximately from the target. For computational efficiency, the level of this approximation can be explicitly controlled by the user, and can be sequentially improved using particles generated in previous runs. WMC has potential for parallel implementation on computer arrays.
Like other Monte-Carlo samplers, WMC benefits from a well-located initial sampling distribution but, unlike some, it does not rely on ad hoc or bespoke methods to accommodate awkward or unusual features of the target distribution. Rather, it relies only on wavelet approximations, these being well-defined as a consequence of the multiresolution analysis of wavelet decomposition (Meyer 1992).
Wavelets are functions that resemble wave-like oscillations. Subject to regularity conditions, a function of interest may be decomposed into a linear combination of locationand scale-adjusted wavelets. Efficient algorithms for this decomposition have led to their widespread use in applications of signal processing and image analysis; see for example Percival and Walden (2000), Nason (2008). From a probabilistic point of view, wavelets have been used in density estimation and statistical modelling (Donoho et al. 1996;Kerkyacharian and Picard 1992;Percival and Walden 2000). However, to our knowledge, wavelets have not previously been used in random-variate-generation algorithms.
For ease of exposition, we present the theory and methods of WMC mainly in the context of one-dimensional target distributions. In Sect. 2, before describing our methodology in detail, we review relevant elements of the extensive mathematical theory of wavelets. In Sect. 3 we introduce WMC and present a preliminary algorithm, preconditioned WMC (pWMC). In Sect. 4, to remove an awkward and restrictive precondition of pWMC, we present a modified and multistage algorithm, n-step WMC (nWMC). Section 5 presents our final algorithm, survival WMC (sWMC), which finesses nWMC into a simple recursive algorithm by appeal to the theory of survival analysis. We extend the methodology to multiple dimensions in Sect. 6. Sections 7 and 8 discuss practical issues and present some simple low-dimensional examples. Section 9 contains some concluding remarks, and the Appendix contains proofs of assertions made in the main body of the paper.
Our aim is to present only the concepts, theory and basic algorithms of WMC. Further work is required to develop the method to compete with other established methods of Monte-Carlo integration in terms of applicability, computational efficiency and ease of use.

Wavelet theory
Wavelet theory is extensive, developed primarily to approximate signals and images. In this section we present only those elements of the theory relevant to our methodology. There are many texts which present the theory more comprehensively; see for example Daubechies (1992), Meyer (1992) and Mallat (2009).
We first introduce the notion of a mother wavelet ψ, a function on the real line R. The simplest example of a mother wavelet is the Haar wavelet which has compact support and its integral over the real line is zero. We require mother wavelets ψ to have the following properties.
(i) The support of ψ = [0, a], where ψ(0) = ψ(a) = 0, a > 1, and a ∈ Z, where Z denotes the set of all integers. (ii) ψ has a number r ψ ≥ 1 of vanishing moments. That is, Here and below, unless otherwise stated, integration is understood to be over R. (iii) ψ and its first r ψ derivatives exist, are continuous and bounded. That is, there exists a constant C ψ < ∞ such that for all x ∈ R where ψ (k) (x) denotes the k th derivative of ψ at x. (iv) Translated and dilated wavelets ψ ji , defined as are orthonormal, that is These conditions are slightly more restrictive than those generally given. Specifically, it is not generally assumed that wavelet ψ has compact support, as we assume in condition (i) above. Instead, condition (i) is generally replaced by one of fast-decay in ψ.
It is easily verified that the Haar wavelet (1) satisfies conditions (ii)-(iv) with r ψ = 1, but not condition (i) which is needed for the proofs in Appendix A.1. Many other wavelet systems satisfying all four conditions have been developed, in particular the compactly supported wavelet family of Daubechies (1988) which have r ψ ≥ 1 with minimum support length given r ψ and which we exploit in Sect. 8. See also Daubechies (1992), Meyer (1992) and Mallat (2009) for discussion of other systems, including those that relax the orthonormality condition (iv).
Conditions (i)-(iv) are a special case of those given by Meyer (1992), p.72, which admit the Haar wavelet. Meyer (1992) shows that the collection of functions forms an orthonormal basis for L 2 (R), the space of squareintegrable functions on R. That is, any function h ∈ L 2 (R) can be expressed as a homogeneous wavelet expansion: whereh ji is the coefficient of wavelet ψ ji given bỹ Moreover with the same conditions, for suitably large r ψ , decomposition (6) holds also for functions h in the more general homogeneous Triebel-Lizorkin and Besov spaces, F s p,q (R n ) andḂ s p,q (R n ), as shown in Proposition 4.2 of Kyriazis (2003). The definitions of these spaces are quite technical; see for example Triebel (1983). Of particular concern here is the homogeneous Besov spaceḂ 0 1,1 (R). Informally, all functions h ∈Ḃ 0 1,1 (R) are integrable in absolute value over R and have roughness that decays over increasingly large intervals. Proposition 4.2 of Kyriazis (2003) shows that the norm of h ∈Ḃ 0 1,1 (R) has the equivalence For the methodology we describe below, in addition to the conditions (i)-(iv), we require (8) to hold, as we demonstrate in Sect. 3.3 and illustrate in Sect. 3.4. That is, we require functions h of interest to belong toḂ 0 1,1 (R). We use the homogeneous wavelet expansion (6) exclusively, rather than the more commonly encountered decomposition which additionally involves translations and dilations of a father wavelet φ (see for example Meyer (1992), Daubechies (1992), Mallat (2009)). Note that here the finest levels of detail in h are provided by j 0, while the coarsest are provided by j 0. Some authors adopt the opposite convention wherein j is replaced by − j.
The principal practical purpose of wavelet theory is to produce optimal approximations to a function h of interest, for efficient representation of sounds and images, etc. For given j 1 ∈ Z, omitting the finest levels of detail j > j 1 in equation (6) giveŝ The set of approximations {ĥ j 1 , j 1 ∈ Z} is termed a multiresolution analysis (MRA) of h. Regularity conditions (i)-(iv) ensure thatĥ j 1 converges geometrically in L 2 -norm to h at rate r ψ as j 1 → ∞ (Meyer 1992;Daubechies 1992). For practical implementation of the methodology we develop below, we work with approximations that omit both finest and coarsest levels of decomposition (6): For such an approximation to converge geometrically to h we require further regularity conditions, similar to those above, but where the roles of h and ψ are partially interchanged: (v) h has a single vanishing moment (vi) h has rapid decay: that is, there exist real-valued constants C h < ∞ and s h > 1 such that for all x ∈ R We show in Corollary 5 of Appendix A.1, subject to regularity conditions (i)-(vi), thatĥ j 0 , j 1 converges geometrically to h in L 1 norm at rate 2 −r ψ with increasing j 1 and at rate 2 −(1−s −1 h ) with decreasing j 0 .

Wavelet Monte Carlo-preliminaries
In this section we prepare the groundwork for WMC, and present a preliminary algorithm which, for reasons that will become clear, we call pre-conditioned WMC, or pWMC for short.
For any real value z, let z + and z − denote its positive and negative parts Define where the right-hand equality follows from (2),(14). Also define where the right-hand equality follows from (4). Let g 0 be an initial unnormalised density from which independent samples can be easily generated. Let g 1 be an unnormalised target density from which we would like to obtain independently sampled particles. For now, we assume the normalising constant c g of g 0 is equal that of g 1 : that is, g 0 (x) dx = g 1 (x) dx = c g . In Sect. 7.1, we propose a method for estimating the ratio: g 1 (x) dx/ g 0 (x) dx and adjusting g 0 accordingly.
In our preliminary algorithm pWMC, and in each of the algorithms presented subsequently, h will be taken as the difference between the unnormalised target and initial densities Assuming that h satisfies conditions (i)-(vi) of Sect. 2, which ensure that expansion (6) holds, each run of pWMC will independently generate a particle y ∈ R from the target distribution g 1 . Note that condition (v) implies that the normalising constants of g 0 and g 1 are equal. If they are not, they can be adjusted as noted above. Condition (vi) is unlikely to be of practical consequence since it holds even for differences between heavy-tailed densities such as scale-and location-shifted t 3 distributions, and pathological behaviour in the tails of g 0 or g 1 would be unusual. Central to pWMC is the move mass function H , defined as which is assumed to satisfy the following precondition (for which pWMC is named):

The pWMC algorithm
Here we present the algorithm for pWMC. We discuss it in Sect. 3.3.
Step p0: Sample a real value x ∈ R with probability density Step p1: With conditional probability set y = x then stop and return the value y. Otherwise, continue.
Step p2: Sample a pair of integers ( j, i) ∈ Z 2 with conditional probability Step p3: Sample y ∈ R with conditional probability density then stop and return the value y.

Proof of principle of pWMC
Lemma 1 The particle delivered by one run of the pWMC algorithm has the target density, g 1 /c g , provided that regularity conditions (i)-(vi) and precondition (18) all hold.
Proof Let q(x, y) denote the joint probability density of sampling a value x at Step p0 then returning a value y at either Step p1 or Step p3. A returned value of y = x can only be generated at Step p3. So, for y = x, marginalising over the events of Steps p1 and p2, noting that progression to Steps p2 and p3 implies H (x) > 0 a.s., we have by construction from (19)-(22): wheres ji = sign(h ji ). Let p 1 (y) denote the marginal probability density of returning the value y from the pWMC algorithm. The Kolmogorov Forward Equation gives the update: using (15), where we have used Fubini's Theorem to switch the order of integration and summations, since |q(x, y) − q(y, x)| dx < ∞, as shown in Appendix A.2. Hence, using (6), (13), (16), where (6) follows from regularity conditions (i)-(vi), as discussed in Sect. 2.

Remarks on pWMC
Lemma 1 shows that the pWMC algorithm produces a particle from the target distribution g 1 /c g . Repeating the algorithm L times produces L independent particles from g 1 . The intuition behind the pWMC algorithm is as follows. Function h in (16) is proportional to the difference between the target density g 1 /c g and the initially sampled density g 0 /c g . At the initial particle position x, this difference may be decomposed into contributions, positive Fig. 1 The intuition underlying the pWMC algorithm. The jagged line is the Daubechies wavelet ψ 00 with r ψ = 2 vanishing moments at scale j = 0 and location i = 0. Suppose vehicle ψ 00 has been selected for transition. The arrows illustrate two particles moving from "over-" to "under-represented" regions. Particle "A" moves if wavelet coefficient h 00 > 0, otherwise it does not move. Particle "B" moves ifh 00 < 0, otherwise it does not move. Thus, since ψ + 00 (x) dx = ψ − 00 (x) dx, over-and under-representations are neutralised and negative, from each term in wavelet expansion (6). Now consider each contributionh ji ψ ji (x) at x in isolation of all others. A negative-valued contribution suggests that g 0 (x) over-represents g 1 (x), while a positive-valued contribution suggests g 0 (x) under-represents g 1 (x). At an "over-represented" position x, the particle will transition with probability p 2 ( j, i|x) p 3 (y| j, i) to an "under-represented" position y elsewhere. We call ψ ji the vehicle of this transition. At an "under-represented" position x, the particle will not move, so y = x. This is illustrated in Fig. 1. Thus, considering all contributions collectively, in regions of net over-representation (h(x) < 0) particle transitions away from x to elsewhere will occur more frequently than transitions to x from elsewhere, while in regions of net underrepresentation (h(x) > 0) the converse is true.
Note that, with probability p 1 (Stay | x), a particle at x is not moved and the value y = x is returned. Precondition (18) ensures that p 1 (Stay | x) ≥ 0. Integrating (18) over x ∈ R, using (17) and Tonelli's Theorem to exchange the order of integration and summations of the non-negative integrand, gives using (14),(15). Hence, implicit in precondition (18) is the requirement that This is the homogeneous Besov space norm h Ḃ 0 1,1 given in (8). Whether the requirement (25) is satisfied depends on both the wavelet family ψ(x) and the difference function h(x). The following simple example illustrates failure of this requirement.

A pathological example
Suppose that ψ is the Haar wavelet (1) (ignoring for the time being that it violates condition (i) of Sect. 2). Suppose also that h( is the indicator function, taking the value 1 when its argument is true and 0 otherwise. Then it can be shown that Substituting this into precondition (25) gives which violates the precondition. This example was constructed so that the total probability on the positive x-axis differs between distributions g 1 /c g and g 0 /c g . Consequently, for the algorithm to succeed in delivering a particle sampled from g 1 , there needs to be a positive probability that the value y sampled in Step p3 of pWMC lies on the opposite side of the origin to the value x sampled in Step p0. However, there exists no translated and dilated Haar wavelet ψ Haar ji (x) whose support straddles the origin. Thus the system in this example provides no vehicle for transiting x → y across the origin, and therefore no way in which the target distribution can be realised. This manifests itself in the failure of precondition (25). In the multivariate generalisation of pWMC (see Sect. 6), the Haar system would provide no mechanism for transiting between the negative and positive domains of any coordinate axis.

Discrete-time WMC
In general, precondition (18) will be difficult to verify analytically. Violation of this precondition for any x would lead to a negative probability p 1 (Stay | x) in (20). One approach would be simply to ignore the problem and hope that no such x would be encountered while running the algorithm. However, this would not guarantee that the value y produced is a particle from the target density, g 1 /c g . An alternative, multistage, approach might be considered, at each stage applying pWMC to initial and target distributions that differ only slightly, incrementally progressing from g 0 /c g to g 1 /c g . For such an approach, a precondition of the form of (18) would still be required, but would be weaker due to the smaller difference function h encountered at each stage. We present such an algorithm now with n stages, but with an adjusted initial and final distribution that removes completely the need for such a precondition. We term this algorithm nstep WMC, or nWMC for short.
For nWMC, we require that h belongs to the homogeneous Besov spaceḂ 0 1,1 so (8) holds. We introduce an artificial time axis t ∈ [0, 1]; an adjustment constant δ ∈ (0, 1]; and define g t,δ to be a density located between an initial density g 0,δ and a target density g 1,δ : where H is given by (17). It is easily verified using (16) that , from which it is clear that our original initial and target densities are adjusted by δ H (x). For all t ∈ [0, 1], the normalising constant g t,δ (x) dx is given by using (24) and h(x) dx = 0. Note that our assertion of (8) ensures that c δ is finite, which is critical to the validity of nWMC.
We partition the half-open time-interval (0, 1] into n = 1/δ intervals of length δ > 0 indexed by k = 1, . . . , n. Thus For each time-interval k, we will run pWMC taking the initial density as g (k−1)δ, δ and the target density as g kδ, δ . Thus the normalising constant c g will become c δ , and the difference h(x) between the initial and target densities will become g kδ, δ (x) − g (k−1)δ, δ = h(x)δ. Accordingly,h ji given by (7) will becomeh ji δ and H given by (17) will become H δ.
The nWMC algorithm starts by generating a single particle at location x 0 from distribution g 0,δ /c δ . (The practicality of this is not our concern, as nWMC is merely a staging post en route to our final algorithm in Sect. 5.) This particle is then input to Step p1 of pWMC (adapted for k = 1 as described above), and output to location x 1 from Steps p1 or p3. The particle now at x 1 is then re-input to Step p1 of pWMC (adapted for k = 2) and re-output to location x 2 from Steps p1 or p3. This process is repeated for each time-interval k = 1, . . . , n, producing a sequence of particle locations x 1 , . . . , x n , not necessarily all distinct.

Proof of principle of nWMC
Lemma 2 Provided that h belongs to the homogeneous Besov spaceḂ 0 1,1 and the regularity conditions (i)-(vi) of Sect. 2 all hold, a particle delivered by one run of the nWMC algorithm will have density g 1,δ /c δ .
Proof We first show that nWMC does not require a precondition of the form of (18). With the above adaptations for time-interval k, precondition (18) becomes for all x ∈ R, using (26), where 1 ≤ k ≤ n. But (27) is necessarily true for all x ∈ R since g 0,δ and g 1,δ are nonnegative. Thus nWMC requires no precondition of the form of (18). We complete the proof with the following inductive argument, noting that h ∈Ḃ 0 1,1 implies that c δ < ∞, as discussed above.
Assertion: At time t ∈ {0, δ, . . . , nδ = 1}, an nWMC particle has marginal density g t,δ /c δ . Base case: The particle generated at time t = 0 has density g 0,δ /c δ . Thus the assertion holds trivially for t = 0. Induction step: Suppose the assertion holds for a given t ∈ {0, δ, . . . , (n − 1)δ}. Then, on input to Step p1 of the application of pWMC to time-interval (t, t + δ], the particle will have marginal density g t,δ /c δ . By Lemma 1, the particle output from Steps p1-p3 of this application of pWMC will have the marginal density of its target density, g t+δ,δ /c δ . Hence, the assertion is also true at time t + δ. Conclusion: The base case shows that the assertion is true at t = 0. Therefore, by induction, the assertion is true for all times t ∈ {0, δ, . . . , nδ = 1}.

Remarks on nWMC
An nWMC particle may or may not be moved in a given time-interval k. The foregoing provides a complete and exact description of nWMC, but greater computational efficiency might be found by eliding multiple steps together, as we now describe.
Conditional on being at location x at the start of timeinterval k, let π k (x) denote the probability that the particle is somewhere else at the end of that time-interval. From Step p1 of pWMC adapted for time-interval k as described above, the probability of moving in time-interval k, i.e. of not stopping at x, is Let p k (x) denote the conditional probability that this particle survives at x until the start of time-interval whereupon it moves elsewhere, or until time 1, whichever is the sooner. Then, for 1 ≤ k < n, we have for ∈ {k, . . . , n}, where the product −1 m=k is understood to take the value 1 when k = .
The following implementation of nWMC elides all 'nonmove' steps into a single step, where the notation "set u ⇐ v" means "put the value of v into variable u".

The nWMC algorithm
Here we present the algorithm for nWMC, as motivated above.
Step n0: Sample a real value x 0 ∈ R with probability density Set x ⇐ x 0 and k ⇐ 0.
Step n1: Set k ⇐ k + 1. Sample an integer ∈ {k, . . . , n} with conditional probability p k given in (29). If = n, stop and return the value x. Otherwise, set k ⇐ + 1 and continue.
Step n3: Sample a new value x ∈ R with probability density p 3 (x | j, i) given by (22), then return to Step n1.
Note that Steps n2-n3 above are essentially identical to Steps p2-p3 of pWMC.
In summary a run of the nWMC algorithm, as just described, first produces a particle at a location x 0 sampled from the initial distribution g 0,δ at time t = 0. The particle may then jump to a new location x t at any number of subsequent times t, 0 < t ≤ 1. The computational efficiency of the algorithm clearly depends on the number of jumps, which will be small only if π k (x) given in (28) is generally small.
Lemma 2 shows that nWMC delivers a particle x 1 at time t = 1 with marginal density g 1,δ /c δ , where from (26) we have g 1,δ = g 1 + δ H . This only approximates the desired target g 1 /c g , the approximation error being proportional to δ. This error will be small for δ close to zero, but the number of time-intervals n = 1/δ will then be large. The computational demands of nWMC for large n would be burdensome, but we now adapt it to produce an algorithm, for arbitrarily large n, that is simple, practical and accurate.

Continuous-time WMC
For a particle at location x at a time t = kδ, where 0 ≤ k < n, the probability π k (x) of moving away from x in the (k + 1)th time-interval of nWMC is from (26),(28). Note that g 0,δ (x) + th(x) is a linear timeinterpolation between g 0,δ (x) and g 1,δ (x), as already noted in connection with (26), and g 0,δ and g 1,δ are both strictly positive by construction, so the denominator of (31) is necessarily strictly positive. As δ → 0, the initial density g 0,δ → g 0 ; the target density g 1,δ → g 1 ; and the number of intervals n = 1/δ → ∞. Then the length of time that the particle remains at a location x can be characterised as a continuous-time survival process with a position-dependent hazard rate at time t of λ t (x) = lim δ→0 π t/δ (x)/δ. From (31), for t ∈ [0, 1), this is Suppose the particle arrives at location x at a time s ∈ [0, 1). Let t be the subsequent time at which the particle departs from x. From (32), if g 0 (x) = 0 and t = 0 or if g 0 (x) = g 1 (x) = 0, then departure is immediate: t = s. Otherwise, let F(t | x, s) denote the conditional cumulative distribution function (CDF) of departure-time t given s and x. From the theory of survival analysis (see, for example, Cox and Oakes (1984)), for s < t ≤ 1 Thus, the probability of not departing from x by time t = 1 is, from (33), We note in passing that the first case of (33) is the CDF of a scale-location-shifted generalised Pareto distribution, and the second case is the CDF of a location-shifted exponential distribution. In either case, sampling the time t of departure from x is straightforward using the inverse-CDF method, as follows. Sample a random variate u from the Standard Uniform distribution. Then set If the sampled t is greater than 1, the particle does not move from x.
We can now implement the continuous-time survival-time version of nWMC which we call survival WMC(sWMC). The major difference between nWMC and sWMC is in Step s1.

The sWMC algorithm
Step s0: Sample a real value x 0 ∈ R with probability density Set x ⇐ x 0 and t ⇐ 0.
Step s1: If g 0 (x) = 0 and t = 0 or if g 0 (x) = g 1 (x) = 0, then departure from x is immediate: go to Step s2. Otherwise, set s ⇐ t; sample u from the Standard Uniform distribution; then set t as in (35). If t ≥ 1, stop and return the value x. Otherwise, continue.
Step s3: Sample a new value x ∈ R with probability density p 3 (x | j, i) given by (22), then return to Step s1.
Note that Steps s2-s3 above are essentially identical to Steps p2-p3 of pWMC and to Steps n2-n3 of nWMC.
In summary, a run of sWMC first produces a particle at a location x 0 sampled from the initial density g 0 /c g at time t = 0. The particle may then jump to a new location x t at any number of subsequent times t, 0 < t ≤ 1. The computational efficiency of the algorithm clearly depends on the number of jumps, which will be small only if λ t (x) given in (32) is generally small.
As discussed at the end of Sect. 4.3, Lemma 2 shows that nWMC delivers a particle x 1 at time t = 1 with marginal density (g 1 + δ H )/c δ . Since sWMC is the continuous limit of nWMC and g t,δ (x)/c δ → g t (x)/c g as δ → 0 for all x ∈ R, it follows that sWMC delivers a particle x 1 at time t = 1 with marginal density g 1 /c g , the original target density.

Multiple dimensions
As noted above, the set of wavelets B given in (5) forms an orthonormal basis for L 2 (R). Therefore, an orthonormal basis for the d-dimensional space L 2 (R d ) is given by the dfold Cartesian product of B with itself, B d . Accordingly, our d-dimensional generalisation of the homogeneous wavelet expansion (6) is Here the multivariate wavelet ψ ji is simply the product of univariate wavelets ψ j 1 i 1 , . . . , ψ j d i d : andh ji is the coefficient of wavelet ψ ji , given bỹ Note that expansion (37) involves products of wavelets (38) at different scales in different dimensions, unlike the more usual d-dimensional wavelet expansion given for example in Meyer (1992), which additionally involves father wavelets. Let g 0 (x) and g 1 (x) denote the multivariate initial and target densities at location x. Let h(x) = g 1 (x) − g 0 (x) denote the difference function, generalising (16). Then the d-dimensional version of sWMC is exactly as described in Sect. 5.1, replacing functions g 0 , g 1 , h and the univariate quantities x 0 , x, j, i, ψ ji ,h ji , H , A j used and referenced therein in equations (17),(21),(22),(35) with their d-dimensional counterparts, which for A j is d k=1 A j k .

Practical considerations
In this section we discuss several practical issues confronted when running sWMC. For ease of exposition, we return to the univariate setting for most of this discussion, but include some brief comments on the multivariate setting at the end. Applications of Monte-Carlo methods typically generate large sample sizes, sometimes dependently, as in MCMC. However, there are potential advantages for methods such as WMC that generate independent samples, in terms of both statistical and computational efficiency. In particular, they allow parallel and distributed implementation on multiple independent processors. Below, we use the notation S L = {x [ ] , = 1, . . . , L} to denote a sample of L particles independently drawn from the same distribution.

Estimating the normalising-constant ratio
Suppose the normalising constants c g 0 = g 0 (x) dx and c g 1 = g 1 (x) dx are unknown or unequal. We can estimate the ratio ρ = c g 0 /c g 1 as follows. Let p dom denote a convenient probability density which dominates both g 0 and g 1 . Draw a sample S L from p dom , then estimate ρ as follows: .
Usingρ L , we can adjust g 0 to produce an unnormalised densityg 0 for which cg 0 = c g 1 for use in WMC, thus

Estimating wavelet coefficients
Methods of Monte-Carlo integration replace explicit computation of integrals on a target distribution with averages over particles sampled from that distribution. sWMC is one such method, but its implementation involves calculation of many wavelet coefficientsh ji , each of which is itself an integral (7), typically not of closed form. This feature would seem self defeating, but we now show that it is sufficient, for each particle, to replace eachh ji encountered in the algorithm with an independent unbiased estimate of it. From regularity condition (i) of Sect. 2, the mother wavelet ψ has support [0, a] for some integer a > 1 so, from (4), the support of ψ ji is [2 − j i, 2 − j (i + a)]. A simple unbiased estimate ofh ji is obtained by averaging the value of h(u)ψ ji (u) obtained from N ≥ 1 values of u sampled independently from the Uniform distribution with the same support as ψ ji .
Leth [ ] ji N denote such an unbiased independent estimate of h ji for particle . For each particle ∈ {1, . . . , L}, replacing eachh ji encountered in running sWMC by its estimate leads to approximating h by an empirical version of equation (6): which is unbiased forh ji with variance proportional to L −1 N −1 . However, inaccurate estimates ofh ji will likely increase the number of jumps at Step s3 of sWMC.

Range of scales j
As described in Sect. 5.1, the sWMC algorithm assumes an unbounded range of scales j. However, for practical computation, we must restrict scales j to a finite range [ j 0 , j 1 ], as in (10). As noted above, the support of ψ ji is [2 − j i, 2 − j (i + a)). At any given location x and scale j, there are therefore exactly a wavelets supported at x: {ψ ji : 2 j x − a < i ≤ 2 j x}, and Step s2 of sWMC requires a coefficienth ji to be calculated for each of these wavelets. Consequently, for a particle arriving at x, the total number of wavelet coefficients to be calculated is ( j 1 − j 0 + 1)a, which will need to be repeated after every jump (Step s3 of sWMC).
With the regularity conditions (i)-(vi) of Sect. 2, Lemma 3 of Appendix A.1 shows that omitting scales j > j 1 produces an L 1 error in the wavelet approximation (10) of h proportional to 2 −r ψ j 1 . So choosing a wavelet family with compact support and large r ψ would ensure rapid convergence to h as j 1 → ∞. However, orthogonal wavelets with r ψ vanishing moments have support length a ≥ 2r ψ − 1 (Mallat 2009), so increasing r ψ will proportionately increase the amount of computation in sWMC. A computationally efficient choice of wavelet family ψ and upper scale-range limit j 1 will depend on the magnitude of fine-scale details in h, and would need to be decided on a case-by-case basis, possibly by trial and error. Setting the finest level j 1 too low could result in blurring some fine-scale details of the target distribution.
Lemma 3 of Appendix A.1 shows that omitting scales j < j 0 produces an L 1 error in the wavelet approximationĥ j 0 , j 1 of h in (10) proportional to 2 − j 0 (1−s −1 h ) . Thus, unfortunately, the rate of convergence to h as j 0 → −∞ depends on the rate of decay s h in the tails of h, and not on the choice of wavelet family ψ with r ψ vanishing moments, which is under the control of the user. Wavelets in WMC act as vehicles that transition particles from one location to another, but each waveleth ji can only perform such a transition within its support. To reduce the average number of jumps per particle, the coarsest resolution level j 0 should be chosen to ensure that a particle will generally be able to move from its initial location to all regions of high target density in a single jump. Running sWMC restricting j ∈ [ j 0 , j 1 ] will produce a particle with output density g 0 +ĥ j 0 , j 1 . Setting the coarsest level j 0 too high could miss some coarse-scale features of the target distribution and could potentially lead to negative values of this theoretical output density at certain locations, which would clearly invalidate regularity condition (vi) of Sect. 2. If such a problem is suspected or detected, sWMC should be rerun with a reduced setting of j 0 .
As a rule of thumb, the more irregular or complex the target, the wider should be the range [ j 0 , j 1 ]. To improve the approximation to target g 1 of the particle output density g 0 +ĥ j 0 , j 1 from a run of sWMC, the particle's location can be input to Step s1 of a second run of sWMC, this time restricting j ∈ [ j 1 + 1, j 1 ], where j 1 > j 1 . The particle will then on output have density g 0 +ĥ j 0 , j 1 . Inputting the particle to a third run of sWMC, this time restricting j ∈ [ j 0 , j 0 − 1], where j 0 < j 0 , will produce a particle output density of g 0 +ĥ j 0 , j 1 . Repeating this process in parallel on an entire particle-set S L provides a multistage version of sWMC which will improve the approximation to g 1 at each stage. Such an algorithm could be useful in determining iteratively an adequate range for scale j.

Multivariate practical considerations
All of the practical considerations discussed above apply equally in the multivariate setting, replacing the univariate quantities x 0 , x, j, i, ψ ji ,h ji , H , A j with their d-dimensional counterparts, as described in Sect. 6. In particular, an unbiased estimate of multivariate wavelet coefficienth ji can be obtained by averaging h(u)ψ ji (u) over N values of u sampled independently from the d-dimensional Uniform distribution whose support is that of ψ ji . The total number of wavelet coefficientsh ji to be calculated for each particle, initially and after each jump, will be ( j 1 − j 0 + 1) d a d .

Examples
This paper is essentially a proof-of-concept of sWMC. However, as a limited exploration of the strengths and weaknesses of the method, we compare its performance in 1-and 2-dimensional examples with three well-established Monte-Carlo samplers: Rejection Sampling (von Neumann 1951); Importance Sampling (Kahn and Harris 1951); and the Random Walk Metropolis algorithm (Metropolis et al. 1953). Each of these methods involves an initial sampling distribution g 0 and a target distribution g 1 . Let w(x) = g 1 (x)/g 0 (x).
To be practically useful, a method should be highly computationally efficient. For each Monte-Carlo sampler, we define its computational efficiency as where p keep is the proportion of generated particles that are retained, ESS is the effective sample size per retained parti-cle (which compares the informational content of a sample of retained particles with that of an equal-sized i.i.d. sample from the target), andn eval is the average number of calls to the function g 1 per particle. Our focus onn eval rather than CPU time per se is motivated with applications to computationally expensive targets in mind, although the toy examples presented below are not of this nature. For each Monte-Carlo sampler, we also compute its discrepancy. For this, we first construct a finite mesh M covering the main support of the target distribution, and for each cell k ∈ M compare the observed proportion of samples o k falling into that cell with its expected value e k under the target distribution, where both o k and e k are normalised over the mesh. The simple form of the target distributions in the examples below allows e k to be calculated exactly. We then compute discrepancy as: We compare and contrast the following four Monte-Carlo samplers: Wavelet Monte Carlo (sWMC) As described above, all samples are retained and are independent, so p keep = 1 and ESS = 1. Thus E = 1/n eval . The number of target evaluations for a particle is related to its number of its jumps en route, the number of wavelet coefficients that must be computed before each jump, and the method of computing them.
In the examples below, we evaluate wavelet coefficients numerically on a fine grid, but gain a substantial reduction in this overhead by caching and reusing coefficients previously computed during the run. Rejection Sampling (RS) assumes the existence of a finite upper bound M on w(x). Independently sampled proposed particles x from g 0 /c g are retained as samples from the target with probability a = w(x)/M and discarded with probability 1 − a. Here p keep = 1/M; ESS= 1; and n eval = 1. Thus E = 1/M. Importance Sampling (IS) requires the variance Var g 0 [w] of w(x) under g 0 (x) to be finite. Independently sampled particles x from g 0 /c g are retained as weighted samples from the target with weight w(x).
Random-walk Metropolis (RWM) From an initial point x [1] sampled from g 0 , subsequent particles x [i] , i > 1, are generated by first sampling a particle y from a symmetric proposal distribution q prop centered at x [i−1] , then assigning to x [i] either the value y with probability α or the value x [i−1] with probability 1 − α, where α = min(1, g 1 (y)/g 1 (x [i−1] )). After discarding particles generated

Example 1
To lower the computational cost of the algorithm, one would ideally choose a convenient initial distribution g 0 /c g that is similar to the target g 1 /c g . However, the final result of a run of sWMC should change little if the chosen initial distribution is far from the target in location, spread and shape. This is demonstrated in Fig. 2, where the initial univariate Normal distribution g 0 has extremely low probability within the main mass of the target mixture g 1 of Normal, Exponential and Uniform densities. Daubechies wavelets ψ with 4 vanishing moments were used, setting the scale-range [ j 0 , j 1 ] to [−7, 12]. Numerical integration was used to calculate wavelet coefficients. An average of 7.4 jumps per particle was recorded. We see that sWMC has recovered the target distribution well despite its multimodality, with no tendency to get stuck in local modes. In this example, the regularity conditions for RS and IS are not met as g 0 does not dominate the tails of g 1 . Thus M = ∞ and Var g 0 [w] = ∞, implying in (43) that numerical efficiency E RS = E IS = 0. This might seem an easy win for sWMC but perhaps not a fair comparison as prac-  titioners of RS and IS would avoid using this g 0 in this example. To address this point, for sWMC, RS and IS, we performed a single simulation of L = 100 000 iterations progressing through a sequence of five initial distributions g 0 of 20 000 iterations each, using t 5 -distributions located at −2.0 with increasing scale from 5.0 to 80.0 (sd 6.45 to −103.3). For RWM, we set x [1] = −2 and used the same sequence of t 5 -distributions as proposal distributions q prop , centering them at the current point x [i−1] instead of at −2. For sWMC, we used Daubechies wavelets 2 ψ with r ψ = 4 vanishing moments and scale-range [ j 0 , j 1 ] = [−9, 12], and at each scale we estimated the normalisingconstant ratioρ L (40) from L = 20 000 samples drawn from a p dom which comprised an equally weighted mixture of the five initial distributions g 0 . For RS, the upper bound M was determined theoretically; this would not generally be an option for more complex targets. The discrepancy mesh M in (44) covered the interval (−25, 55) with cell width 0.5. The results are reported in Tables 1  and 2. Table 1 shows that optimal efficiency E is obtained for each sampler with an initial distribution (or proposal distribution) scaling of ≈40.0. For sWMC, efficiency is progressively aided by the caching of wavelet coefficient as they are calcu-lated. The efficiency of RS is abetted by the pre-calculation of an exact upper bound M on w(x) at each stage; in general this would not be practical and a much larger M might be adopted, implying smaller E. These efficiency calculations appear to suggest that sWMC is the least desirable of these methods. However, the discrepancy calculations D in Table 2 tell a different story. Theoretically, RS delivers samples exactly from the target g 1 , provided M is suitably large; (iterations 1-20 000 were skipped to avoid an excessive run time due to their extremely large value of M, see Supplementary Table 5). Although D is more discrepant for sWMC than RS, we see that RWM and IS can be much worse depending on scaling, due to poor mixing of RWM and the large standard deviation in weights w for IS (see Supplementary Table 5). Supplementary Figs. 4-7 show, for sWMC and RS, excellent fits of simulated particles to the target distribution, but for IS the fit is rough with a spurious peak, while for RWM the heights of the sharp peaks in the target distribution are poorly estimated.
Supplementary Table 5 gives the mean number of particle moves at each stage of the sWMC simulation, showing that the number of moves increased in the mid-scale range. Supplementary Fig. 8 shows the position-time tracks of eight particles from the mid-scale of the simulation. Some of these particles moved only a few times, while others moved many times, sometimes briefly visiting far-off places. Supplementary Table 5 also shows that the estimated normalising-constant ratioρ L (40) remained close to its true value, in this example ρ = 1.0. Further simulations revealed that settingρ L = 1.0 had little effect on the results (not shown).

Example 2
A two-dimensional application of sWMC is illustrated in Fig. 3, where the target distribution is a mixture of four bivariate Normals, one being highly concentrated, scattered within the main support of the initial bivariate Normal. Daubechies wavelets ψ with 3 vanishing moments were used, setting the scale-range [ j 0 , j 1 ] to [−2, 4]. Numerical integration was used to calculate wavelet coefficients. Again we see that sWMC has recovered the target distribution well. Around the point (-2,3), the black (particle) contours are more spread out than the red (target) contours; this is due to the smoothing effect of the kernel-density estimate of the particles around this highly concentrated point of the target.
Tables 3 and 4 compare the performance of sWMC with that of RS, IS and RWM, for different scalings of the initial distribution. Preliminary runs suggested setting the burn-in for RWM to zero. Scale 4 corresponds to the initial distribution of Fig. 3, for which we see in Table 3 that sWMC is less efficient than the other methods, but in Table 4 that it also has low discrepancy D. However, the discrepancy statistic   does not tell the whole story. Supplementary Fig. 11 shows that both sWMC and RS recover the target well, but there are serious problems with the IS and RWM solutions; both miss the concentrated target peak at (-2,3). At scale 2, a similar picture emerges ( Supplementary Fig. 10). At scale 1, we find that sWMC has not performed well but IS and RWM have performed much worse, the latter due to slow mixing (Supplementary Fig. 9), while RS was not performed at all due to its long predicted run time (≈ 85 000 × that for scale 2, see Supplementary Table 6). Compared to scale 4, at scale 6 the performance of sWMC has deteriorated, while that of IS and RWM has improved ( Supplementary Fig. 12).

Summarising results
Examples 1 and 2 provide a snapshot of how the methodology might perform in practice in comparison to other Monte Carlo methods. In particular we see that sWMC can provide These examples are intended only to illustrate the behaviour of sWMC. They do not provide thorough comparisons with other methods, each of which could benefit from tailored or adaptive initial/proposal distributions depending on the form of target distribution. Any method, including sWMC, will benefit from an initial distribution that is close to the target. Other methods might be more appropriate; indeed, in each of the toy examples presented above, the target distribution can be simulated directly!

Conclusions
We have presented here the theory and methodology of WMC, a new method for Monte-Carlo integration which independently samples particles from a potentially complex target distribution. Independence of the particles opens the possibility of implementation on parallel computing arrays, for computational speed. For computational efficiency, the user can control the accuracy of approximation to the target distribution through the settings of the coarsest and finest scales j 0 and j 1 . As discussed in Sect. 7, the choice of wavelet family ψ, with its attendant number of zero moments r ψ , controls inaccuracy due to omitting scales finer than j 1 , but this has no effect on inaccuracy due to omitting scales coarser than j 0 , which depends on the differential tail decay-rate s h between the initial and target densities, g 0 and g 1 . As discussed in Sect. 7, sWMC can be run to sequentially improve approximations in previously obtained sWMC particles.
Our emphasis in this paper has been on developing the theory and methodology of sWMC. The computational efficiency of sWMC will depend heavily on the number of wavelet coefficients to be calculated before each jump and on the expected number of jumps, which in turn depend on the choice of mother wavelet ψ, the scale range [ j 0 , j 1 ], the initial distribution g 0 , the method of evaluating wavelet coefficients; and the number of dimensions d. The number of jumps will be reduced substantially if an initial distribution is chosen to be close to the target, but in general this will be infeasible. As suggested in Sect. 7, wavelet coefficients may be estimated rather than directly calculated, with potential computational savings. Inaccurate estimates may however increase the number of jumps. Much remains to be done to explore the impact of these factors on computational efficiency and accuracy in different settings. New methods for accurate and computationally efficient evaluation of wavelet coefficients could hugely improve the practicality of sWMC.
The number of wavelet coefficients increases exponentially with the number of dimensions d, as shown at the end of Sect. 7. The 'curse of dimensionality' is common to most if not all methods of Monte Carlo integration, and is unavoidable unless the target distribution has a structure that is decomposable in some way, as in the case of graphical models (see, for example, Jordan (2004)). Further development of sWMC would be required to exploit such target-distribution decomposability.