A non-parametric Bayesian approach to decompounding from high frequency data

Given a sample from a discretely observed compound Poisson process, we consider non-parametric estimation of the density f_0 of its jump sizes, as well as of its intensity λ_0. We take a Bayesian approach to the problem and specify the prior on f_0 as the Dirichlet location mixture of normal densities. An independent prior for λ_0 is assumed to be compactly supported and to possess a positive density with respect to the Lebesgue measure. We show that under suitable assumptions the posterior contracts around the pair (λ_0, f_0) at essentially (up to a logarithmic factor) the √(nΔ)-rate, where n is the number of observations and Δ is the mesh size at which the process is sampled. The emphasis is on high frequency data, Δ → 0, but the obtained results are also valid for fixed Δ. In either case we assume that nΔ → ∞. Our main result implies the existence of Bayesian point estimates converging (in the frequentist sense, in probability) to (λ_0, f_0) at the same rate. We also discuss a practical implementation of our approach. The computational problem is dealt with by the inclusion of auxiliary variables, and we develop a Markov chain Monte Carlo algorithm that samples from the joint distribution of the unknown parameters in the mixture density and the introduced auxiliary variables. Numerical examples illustrate the feasibility of this approach.



Problem formulation
Let N = (N_t, t ≥ 0) be a Poisson process with a constant intensity λ > 0 and let Y_1, Y_2, Y_3, ... be a sequence of independent random variables, independent of N and having a common distribution function F with density f (with respect to the Lebesgue measure). A compound Poisson process (abbreviated CPP) X = (X_t, t ≥ 0) is defined as

X_t = Σ_{j=1}^{N_t} Y_j,    (1)

where the sum over an empty set is by definition equal to zero. CPPs form a basic model in a variety of applied fields, most notably in queueing and risk theory, see Embrechts et al. (1997) and Prabhu (1998) and the references therein, but also in other fields of science; see, e.g., Alexandersson (1985) and Burlando and Rosso (1993) for stochastic models for precipitation, Katz (2002) on modelling of hurricane damage, or Scalas (2006) for applications in economics and finance. Suppose that, corresponding to the 'true' parameter values λ = λ_0 and f = f_0, a discrete time sample X_Δ, X_{2Δ}, ..., X_{nΔ} is available from (1), where Δ > 0. Such a discrete time observation scheme is common in a number of applications of CPPs, e.g., in the precipitation models of the above references. Based on the sample X^n = (X_Δ, X_{2Δ}, ..., X_{nΔ}), we are interested in (non-parametric) estimation of λ_0 and f_0. Before proceeding further, we notice that by the stationary independent increments property of a CPP, the random variables Z_i = X_{iΔ} − X_{(i−1)Δ}, 1 ≤ i ≤ n, are independent and identically distributed. Each Z_i has the same distribution as the random variable

Z = Σ_{j=1}^{T} Y_j,    (2)

where T is independent of the sequence Y_1, Y_2, ... and has a Poisson distribution with parameter λΔ. Hence, our problem is equivalent to estimating (non-parametrically) λ_0 and f_0 based on the sample Z^n = (Z_1, Z_2, ..., Z_n). We will henceforth use this alternative formulation of the problem. Our emphasis is on high frequency data, Δ = Δ_n → 0 as n → ∞, but the obtained results are also valid for low frequency observations, i.e., for fixed Δ.
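To fix ideas, the following minimal sketch (ours, not the authors'; the function names and parameter values are illustrative) simulates the increments Z_i directly, exploiting that each Z_i is a Poisson(λΔ) sum of i.i.d. jumps as in display (2):

```python
import numpy as np

rng = np.random.default_rng(1)

def cpp_increments(n, delta, lam, jump_sampler):
    # Each increment Z_i = X_{i*delta} - X_{(i-1)*delta} is a sum of a
    # Poisson(lam*delta) number of i.i.d. jumps, cf. display (2).
    counts = rng.poisson(lam * delta, size=n)
    return np.array([jump_sampler(k).sum() for k in counts])

def jumps(k):
    # Hypothetical jump density: a two-component normal mixture,
    # matching the numerical examples later in the paper.
    labels = rng.random(k) < 0.8
    return np.where(labels, rng.normal(2.0, 1.0, k), rng.normal(-1.0, 1.0, k))

z = cpp_increments(n=5000, delta=1.0, lam=1.0, jump_sampler=jumps)
print("fraction of zero increments:", np.mean(z == 0))  # approx e^{-lam*delta} = 0.368
```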
Our main result is on the contraction rate of the posterior distribution, which we show to be, up to a logarithmic factor, (nΔ)^{−1/2}. A by now standard approach to obtaining contraction rates in an i.i.d. setting is to verify the assumptions of the fundamental Theorem 2.1 in Ghosal et al. (2000). In the present high frequency setting, however, this theorem is not applicable. One of the model assumptions underlying it, which is satisfied in Gugushvili et al. (2015), is that one deals with samples from a fixed distribution, whereas in our high frequency observation regime the distribution of Z_i varies with n, with the Dirac distribution concentrated at zero as its limit for Δ → 0. We therefore propose an alternative approach, circumventing the use of the cited Theorem 2.1. The theoretical contribution of the present paper thus lies not only in the statement of the main result itself, but also in its proof.
In addition, we discuss a practical implementation of our non-parametric Bayesian approach: a Markov chain Monte Carlo algorithm that samples from the joint distribution of the unknown parameters in the mixture density and certain introduced auxiliary variables.

Literature review and present approach
Because adding a Poisson number of Y_j's amounts to compounding their distributions, the problem of recovering the intensity λ_0 and the density f_0 from the observations Z_i can be referred to as decompounding. Decompounding already has some history: the early contributions (Buchmann and Grübel 2003, 2004) dealt with estimation of the distribution function F_0, paying particular attention to the case when F_0 is discrete, while later contributions (Comte et al. 2014; Duval 2013; van Es et al. 2007) concentrated on estimation of the density f_0 instead. More (frequentist) theory on statistical inference for CPPs (and more generally for Lévy processes) can be found in the volume Belomestny et al. (2015), with the survey paper Comte et al. (2015) devoted to statistical methods for high frequency discrete observations, including a special section on CPPs. Other references on statistics for Lévy processes in the high frequency data setting are Comte and Genon-Catalot (2011), Comte and Genon-Catalot (2010), Comte et al. (2010), Figueroa-López (2008), Figueroa-López (2009), Nickl and Reiß (2012), Nickl et al. (2016), and Ueltzhöfer and Klüppelberg (2011). All these approaches are frequentist in nature. On the other hand, theoretical and computational advances made over recent years have shown that a non-parametric Bayesian approach is feasible in various statistical settings; see, e.g., Hjort et al. (2010) for an overview. This is the approach we take in this work to estimate λ_0 and f_0.
To the best of our knowledge, a non-parametric Bayesian approach to inference for (a class of) Lévy processes was first considered in Gugushvili et al. (2015). That paper, contrary to the present context, dealt with observations at fixed equidistant times, and was strongly based on an application of Theorem 2.1 of Ghosal et al. (2000), as already alluded to in the problem formulation of Sect. 1.1. The present work complements the results of Gugushvili et al. (2015), in the sense that we now allow high frequency observations, which requires a substantially different route to prove our results, as we explain in more detail in Sect. 1.3.
We will study the non-parametric Bayesian approach to decompounding from a frequentist point of view (in the sense specified below), so that one may also think of it as a means of obtaining a frequentist estimator. Advantages of the non-parametric Bayesian approach include automatic quantification of the uncertainty in parameter estimates through Bayesian posterior credible sets, and automatic selection of the degree of smoothing required in non-parametric inferential procedures.

Results
The non-parametric class F of densities f that we consider is that of location mixtures of normal densities. So we consider densities specified by

f_{H,σ}(x) = ∫ φ_σ(x − z) dH(z),    (3)

where φ_σ denotes the density of the normal distribution with mean zero and variance σ² and H is a mixing measure. These mixtures form a rich and flexible class of densities, see Marron and Wand (1992) and McLachlan and Peel (2000), capable of closely approximating many densities that are themselves not representable in this way. The resulting mixture densities are infinitely smooth, which is arguably the case in many, if not most, practical applications. Bayesian estimation requires the specification of prior distributions on λ and f. We propose independent priors on λ and f, which we denote by Π_1 and Π_2, respectively. For f, we take a Dirichlet mixture of normal densities as a prior. This type of prior in the context of Bayesian density estimation was introduced in Ferguson (1983) and Lo (1984); for recent references see, e.g., Ghosal and van der Vaart (2001). The prior for f is defined as the law of the function f_{H,σ} as in (3), with H assumed to follow a Dirichlet process prior D_α with base measure α, and σ a priori independent of H with distribution Π_3. Recall that a Dirichlet process D_α on ℝ with base measure α defined on the Borel σ-algebra B(ℝ) (we assume α to be non-negative and σ-additive) is a random probability measure G on ℝ, such that for every finite measurable partition B_1, B_2, ..., B_k of ℝ, the probability vector (G(B_1), G(B_2), ..., G(B_k)) possesses the Dirichlet distribution on the k-dimensional simplex with parameters (α(B_1), α(B_2), ..., α(B_k)). See, e.g., the original paper Ferguson (1973), or the overview article Ghosal (2010), for more information on Dirichlet process priors.
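For intuition about this prior, the sketch below draws one density f_{H,σ} from an approximation of it; the truncated stick-breaking representation of the Dirichlet process, the standard-normal base measure and the uniform Π_3 are our illustrative assumptions, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_prior_density(x, conc=1.0, trunc=500):
    # Truncated stick-breaking approximation of H ~ D_alpha.
    v = rng.beta(1.0, conc, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # DP weights
    locs = rng.normal(0.0, 1.0, size=trunc)                    # atoms from the base measure
    sigma = rng.uniform(0.5, 2.0)                              # sigma ~ Pi_3 (illustrative)
    # f_{H,sigma}(x) = sum_k w_k * phi_sigma(x - locs_k), cf. display (3)
    kern = np.exp(-0.5 * ((x[:, None] - locs) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return kern @ w

xs = np.linspace(-4.0, 4.0, 201)
f = draw_prior_density(xs)
print("integral over the grid:", f.sum() * (xs[1] - xs[0]))   # close to 1
```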
A non-parametric Bayesian approach to density estimation employing a Dirichlet mixture of normal densities as a prior can, in a very rough sense, be thought of as a Bayesian counterpart of kernel density estimation (with a Gaussian kernel), cf. Ghosal and van der Vaart (2007, p. 697).
With the sample size n tending to infinity, the Bayesian approach should be able to discern the true parameter pair (λ_0, f_0) with increasing accuracy. We can formalise this by requiring, for instance, that for any fixed neighbourhood A (in an appropriate topology) of (λ_0, f_0),

Π(A | Z_1, ..., Z_n) → 1 in Q^{Δ,n}_{λ_0,f_0}-probability.

Here Π(· | Z_1, ..., Z_n) is used as a shorthand notation for the posterior distribution of (λ, f), and we use Q_{λ_0,f_0} to denote the law of the random variable Z in (2) and Q^{Δ,n}_{λ_0,f_0} the law of Z^n. More generally, one may take a sequence of shrinking neighbourhoods A_n of (λ_0, f_0) and try to determine the rate at which the neighbourhoods A_n are allowed to shrink while still capturing most of the posterior mass. This rate is referred to as a posterior contraction rate (we give the precise definition in Sect. 3). Two fundamental references dealing with establishing it in various statistical settings are Ghosal et al. (2000) and Ghosal and van der Vaart (2001). This rate can be thought of as an analogue of the convergence rate of a frequentist estimator. The analogy can be made precise: contraction of the posterior distribution at a certain rate implies the existence of a Bayes point estimate with the same convergence rate (in the frequentist sense); see Theorem 2.5 in Ghosal et al. (2000) and the discussion on pp. 506-507 there.
Obviously, for our programme to be successful, Δ has to satisfy nΔ → ∞, which is a necessary condition for consistent estimation of (λ_0, f_0), as it ensures that asymptotically we observe an infinite number of jumps of the process. We cover both the case of so-called high frequency observation schemes (Δ → 0) and that of low frequency observations (fixed Δ). A sufficient condition covering both observation regimes, and relating Δ to n, is Δ = n^{−α}, where 0 ≤ α < 1.
We note that in Ghosal and Tang (2006) and Tang and Ghosal (2007) non-parametric Bayesian inference for Markov processes is studied, of which CPPs form a particular class; these papers, however, deal with estimation of the transition density of a discretely observed Markov process, which is different from the problem we consider here. A parametric Bayesian approach to inference for CPPs is studied in Insua et al. (2012, Sects. 5.5 and 10.3).
The main result of our paper is Theorem 1, in which we state sufficient conditions on the prior that yield a posterior contraction rate of the order log^κ(nΔ)/√(nΔ), for some constant κ > 0. We argue that this rate is nearly (up to a logarithmic factor) the optimal posterior contraction rate in our problem. Our main result complements the one in Gugushvili et al. (2015), in that it treats the low and high frequency observation schemes simultaneously, with emphasis on the latter. We note (again) a fundamental difference between the present paper and Gugushvili et al. (2015) when it comes to the techniques used to prove the main result. As Theorem 2.1 of Ghosal et al. (2000) cannot immediately be used, we take an alternative route that avoids this theorem, but instead refines a number of technical results involving properties of statistical tests that form essential ingredients of the proof in Ghosal et al. (2000). These refined results are then used as key technical steps in a direct proof of our Theorem 1. Furthermore, Theorem 1 establishes the posterior contraction rate for infinitely smooth jump size densities f_0, a case not covered by Gugushvili et al. (2015). On the other hand, Gugushvili et al. (2015) deals with multi-dimensional CPPs, while in this paper we consider only the one-dimensional case. Finally, in this work we also discuss a practical implementation of our non-parametric Bayesian approach. The computational problem is dealt with by the inclusion of auxiliary variables. More precisely, we show how a Markov chain Monte Carlo algorithm can be devised that samples from the joint distribution of the unknown parameters in the mixture density and the introduced auxiliary variables. Numerical examples illustrate the feasibility of this approach.

Organisation
The remainder of the paper is organised as follows. In the next section we state some preliminaries on the likelihood, the prior and notation. In Sect. 3 we first motivate the use of the scaled Hellinger metric to define the neighbourhoods for which the posterior contraction rate is derived in the case where observations are sampled at high frequency. We then present the main result on the posterior contraction rate (Theorem 1), whose proof is given in Sect. 5. We discuss the numerical implementation of our results in Sect. 4. Technical lemmas used to prove the main theorem, and their proofs, are gathered in the Appendix.

Likelihood, prior and posterior
We are interested in Bayesian inference via Bayes' formula, and therefore need to specify the likelihood in our model. The characteristic function of the Poisson sum Z defined in (2) is given by

φ(t) = E e^{itZ} = e^{λΔ(φ_f(t) − 1)},

where φ_f is the characteristic function of f. This can be rewritten as

φ(t) = e^{−λΔ} + Σ_{m=1}^{∞} e^{−λΔ} ((λΔ)^m / m!) (φ_f(t))^m,

which, using the fact that φ_f vanishes at infinity, shows that the distribution of Z is a mixture of a point mass at zero and an absolutely continuous distribution. Letting t → ∞, we get that φ(t) → e^{−λΔ}. Hence λ is identifiable from the law of Z, and then so is f. The density of the law Q_{λ,f} of Z with respect to the measure μ̄, defined as the sum of the Lebesgue measure and the Dirac measure concentrated at zero, can in fact be written explicitly (cf. van Es et al. 2007, p. 681, and Proposition 2.1 in Duval 2013) as

q_{λ,f}(x) = e^{−λΔ} 1_{x=0} + Σ_{m=1}^{∞} a_m f^{∗m}(x) 1_{x≠0},    (4)

where a_m = e^{−λΔ}(λΔ)^m/m! and f^{∗m} denotes the m-fold convolution of f with itself. However, the expression (4) is unsuitable for Bayesian computations. To work around this problem, we employ a different dominating measure. Consider the law R_{λ,f} of (X_t, t ∈ [0, Δ]). By the Theorem in Skorohod (1964, p. 261), R_{λ,f} is absolutely continuous with respect to R_{λ̃,f̃} if and only if P_f is absolutely continuous with respect to P_{f̃}, where P_f denotes the probability measure with density f (we of course assume that λ, λ̃ > 0). A simple condition ensuring the latter is to assume that f̃ is continuous and does not take the value zero on ℝ.
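The identifiability step can be made fully explicit; the following display is our elaboration (not in the original text) of the argument, with the logarithm taken along a continuous branch of log φ:

```latex
% lambda is pinned down by the atom of Q_{lambda,f} at zero, and phi_f
% is then recovered from the characteristic function phi:
P(Z = 0) = e^{-\lambda\Delta}
\;\Longrightarrow\;
\lambda = -\frac{1}{\Delta}\,\log P(Z = 0),
\qquad
\phi_f(t) = 1 + \frac{1}{\lambda\Delta}\,\log \varphi(t).
```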
Define the random measure μ by

μ(B) = #{t ∈ [0, Δ]: (t, X_t − X_{t−}) ∈ B},

i.e., μ counts the jumps of X together with their times. By Theorem 2 on p. 245 in Skorohod (1964) and Corollary 2 on p. 246 there, the density of R_{λ,f} with respect to R_{λ̃,f̃} is given by

(dR_{λ,f}/dR_{λ̃,f̃})(X) = exp( ∫_{[0,Δ]×ℝ} log( λf(x) / (λ̃f̃(x)) ) μ(dt, dx) − Δ(λ − λ̃) ),

and the density k_{λ,f} of Q_{λ,f} with respect to Q_{λ̃,f̃} can be written as the conditional expectation

k_{λ,f}(x) = E_{λ̃,f̃}[ (dR_{λ,f}/dR_{λ̃,f̃})(X) | X_Δ = x ],

where the subscript in the conditional expectation operator signifies the fact that it is evaluated under the probability R_{λ̃,f̃}. Hence the likelihood [in the parameter pair (λ, f)] associated with the sample Z^n is given by the product

L_n(λ, f) = ∏_{i=1}^{n} k_{λ,f}(Z_i).

An advantage of specifying the likelihood in this manner is that it allows one to reduce some of the difficult computations for the laws Q_{λ,f} to computations for the laws R_{λ,f}, which are simpler.
Observe that the priors on λ and f indirectly induce the prior Π = Π_1 × Π_2 on the collection of densities k_{λ,f}. We will indiscriminately use the symbol Π to signify both the prior on (λ, f) and the prior on the density k_{λ,f}; the posterior in the first case will be understood as the posterior for the pair (λ, f), and in the second case as the posterior for the density k_{λ,f}. We will likewise use the same symbol for the corresponding posterior distributions. This simplifies some of the formulations below notationally.
By Bayes' theorem, the posterior measure of any measurable set A ⊂ (0, ∞) × F is given by

Π(A | Z_1, ..., Z_n) = ∫_A ∏_{i=1}^{n} k_{λ,f}(Z_i) dΠ(λ, f) / ∫ ∏_{i=1}^{n} k_{λ,f}(Z_i) dΠ(λ, f).

Upon setting Ā = {k_{λ,f}: (λ, f) ∈ A} and recalling our conventions above, this can also be written as

Π(Ā | Z_1, ..., Z_n) = ∫_Ā ∏_{i=1}^{n} k(Z_i) dΠ(k) / ∫ ∏_{i=1}^{n} k(Z_i) dΠ(k).

Once the posterior is available, one can proceed with the computation of other quantities of interest in Bayesian statistics, such as Bayes point estimates or credible sets.

Notation
Throughout the paper we use the following notation to compare two sequences {a_n} and {b_n} of positive real numbers: a_n ≲ b_n means that there exists a constant C > 0, independent of n, such that a_n ≤ C b_n, while a_n ≳ b_n signifies that a_n ≥ C b_n.
Next we introduce various notions of distance between probability measures. The Hellinger distance h(Q_0, Q_1) between two probability laws Q_0 and Q_1 on a measurable space (Ω, F) is defined as

h(Q_0, Q_1) = ( ∫ (√(dQ_0/dμ) − √(dQ_1/dμ))² dμ )^{1/2},

where μ is any σ-finite measure dominating both Q_0 and Q_1 (the definition does not depend on the choice of μ). Here is some additional notation. For f, g nonnegative integrable functions, not necessarily densities, we write

h(f, g) = ( ∫ (√f − √g)² )^{1/2},  K(f, g) = ∫ f log(f/g),  V(f, g) = ∫ f log²(f/g).

Note that these 'distances' are all nonnegative and zero only if f = g a.e. If f and g are densities of probability measures Q_0 and Q_1 on (ℝ, B), respectively, then the above 'distances' reduce to the previously introduced ones.
We will also use K(x, y) = x log(x/y) − x + y for x, y > 0. Note that K(x, y) ≥ 0, and K(x, y) = 0 if and only if x = y.

Main result on posterior contraction rate
Denote the true parameter values for the CPP by (λ_0, f_0). Recall that the problem is to estimate λ_0 and f_0 based on the observations Z^n, where Δ → 0 in the high frequency regime. To say that a pair (λ, f) lies in a neighbourhood of (λ_0, f_0), one needs a notion of distance between the corresponding measures Q_{λ,f} and Q_{λ_0,f_0}, the two possible induced laws of Z_i = X_{iΔ} − X_{(i−1)Δ}. The Hellinger distance is a popular and rather reasonable choice to that end in non-parametric Bayesian statistics. However, for Δ → 0 the Hellinger distance h between those laws automatically tends to 0. The first assertion of Lemma 1 below states that h(Q_{λ,f}, Q_{λ_0,f_0}) is of order √Δ as Δ → 0. This motivates replacing the ordinary Hellinger metric h by the scaled metric h_Δ = h/√Δ in our asymptotic analysis for high frequency data. Of course, for fixed Δ (in which case one can take Δ = 1 w.l.o.g.), nothing changes with this replacement. The lemma also shows that the Kullback-Leibler divergence and the V-discrepancy are of order Δ for Δ → 0. Therefore we will also use the scaled distances K_Δ = K/Δ and V_Δ = V/Δ. Lemma 1 The following expressions hold true. The proof will be presented in the appendix.
Remark 1 The Hellinger process (here deterministic) of order 1/2 for continuous observations of X on an interval [0, t] is given in Jacod and Shiryaev (2003, Sects. IV.3 and IV.4a); its derivative at t = 0 is the same as in (9) and thus equal to 2h_1. For the Kullback-Leibler divergence and the discrepancy V similar assertions hold. These observations have the following heuristic explanation: for Δ → 0, there is little difference between observing the path of X over the interval [0, Δ] and observing only X_Δ, as the probability of {N_Δ ≥ 2} is small (of order Δ²).
In order to determine the posterior contraction rate in our problem, we now specify suitable neighbourhoods A_n of (λ_0, f_0). Let M > 0 be a constant and let {ε_n} be a sequence of positive numbers such that ε_n → 0 and nΔε_n² → ∞, and recall that h_Δ = h/√Δ is the rescaled Hellinger distance; Lemma 1 suggests that this is the right scaling to use. Introduce the complements of the Hellinger-type neighbourhoods of (λ_0, f_0),

A(ε_n, M) = {(λ, f): h_Δ(Q_{λ,f}, Q_{λ_0,f_0}) > Mε_n}.

We shall say that ε_n is a posterior contraction rate if there exists a constant M > 0 such that

Π(A(ε_n, M) | Z_1, ..., Z_n) → 0    (12)

in Q^{Δ,n}_{λ_0,f_0}-probability as n → ∞. Our goal in this section is to determine the fastest rate at which ε_n is allowed to tend to zero while (12) still holds.
We will assume that the observations are generated from a CPP that satisfies the following assumption.
Assumption 1 (i) The true intensity λ_0 belongs to the interval [λ̲, λ̄] ⊂ (0, ∞) appearing in Assumption 2(i) below;

(ii) the true density f_0 is a location mixture of normal densities, i.e., f_0 = f_{H_0,σ_0} as in (3), for some fixed distribution H_0 with support contained in a compact interval [−κ_0, κ_0] and some constant σ_0 ∈ [σ̲, σ̄].

The more general location-scale mixtures of normal densities,

f_{H_0,K_0}(x) = ∫∫ φ_σ(x − z) dH_0(z) dK_0(σ),

possess even better approximation properties than the location mixtures of normals (here H_0 and K_0 are distributions) and could also be considered in our setup. However, this would lead to additional technical complications that could obscure the essential contributions of our work.
For obtaining posterior contraction rates we need to make some assumptions on the prior.
Assumption 2 (i) The prior Π_1 on λ has a density π_1 (with respect to the Lebesgue measure) that is supported on a finite interval [λ̲, λ̄] ⊂ (0, ∞) and satisfies

0 < π̲_1 ≤ π_1(λ) ≤ π̄_1 < ∞,  λ ∈ [λ̲, λ̄],    (13)

for some constants π̲_1 and π̄_1;

(ii) the base measure α of the Dirichlet process prior D_α has a continuous density on an interval [−κ_0 − ζ, κ_0 + ζ], with κ_0 as in Assumption 1(ii) and some ζ > 0, is bounded away from zero there, and for all t > 0 satisfies the tail condition

α(|x| > t) ≲ e^{−b t^δ}    (14)

with some constants b > 0 and δ > 0;

(iii) the prior Π_3 on σ is supported on an interval [σ̲, σ̄] ⊂ (0, ∞), and its density π_3 with respect to the Lebesgue measure satisfies

0 < π̲_3 ≤ π_3(σ) ≤ π̄_3 < ∞,  σ ∈ [σ̲, σ̄],

for some constants π̲_3 and π̄_3.
Assumptions 1 and 2 parallel those made in Ghosal and van der Vaart (2001) in the context of non-parametric Bayesian density estimation using the Dirichlet location mixture of normal densities as a prior. We refer to that paper for additional discussion.
The following is our main result. Note that it covers both the case of high frequency observations (Δ → 0) and that of observations with fixed intersampling intervals. We use Π(· | Z_1, ..., Z_n) to denote the posterior on (λ, f).
Theorem 1 Under Assumptions 1 and 2, and provided nΔ → ∞, there exists a constant M > 0 such that, for

ε_n = log^κ(nΔ)/√(nΔ)

with a constant κ > 0 depending only on δ, we have Π(A(ε_n, M) | Z_1, ..., Z_n) → 0 in Q^{Δ,n}_{λ_0,f_0}-probability as n → ∞.

For fixed Δ (w.l.o.g. one may then assume Δ = 1) the posterior contraction rate in Theorem 1 reduces to ε_n = log^κ(n)/√n. We also see that the posterior contraction rate is controlled by the parameter δ governing the tail behaviour in (14). Note that if (14) is satisfied for some δ > 4, it is automatically satisfied for all 0 < δ ≤ 4 as well. The stronger the decay rate in (14), the better the contraction rate, but all δ ≥ 4 give the same value κ = 1; the best possible posterior contraction rate in Theorem 1 is thus already attained at δ = 4. In the proof in Sect. 5 we can therefore assume that δ ≤ 4.
As on p. 1239 of Ghosal and van der Vaart (2001), and similarly to Corollary 5.1 there, Theorem 1 implies the existence of a point estimate of (λ_0, f_0) with the frequentist convergence rate ε_n. The (frequentist) minimax convergence rate for estimation of k_{λ,f} relative to the Hellinger distance is unknown in our problem, but an analogy with Ibragimov and Khas'minskiĭ (1982) suggests that up to a logarithmic factor it should be of order (nΔ)^{−1/2} (cf. Ghosal and van der Vaart 2001, p. 1236). The logarithmic factor is insignificant for all practical purposes. The convergence rate of an estimator of the Lévy density, with the loss measured in the L_2-metric, in a more general Lévy model than the CPP model is (nΔ)^{−β/(2β+1)} whenever the target density is Sobolev smooth of order β (cf. Comte and Genon-Catalot 2011). Our contraction rate is hence, roughly speaking, the limiting case of the rate in Comte and Genon-Catalot (2011) as β → ∞.

Algorithms for drawing from the posterior
In this section we discuss computational methods for drawing from the distribution of the pair (λ, f), conditional on X^n (or, equivalently, conditional on Z^n). In the following there is no specific need for the observation times to be equidistant. We assume observations at times 0 = t_0 < t_1 < · · · < t_n and write Δ_i = t_i − t_{i−1}. Further, for consistency with notation introduced shortly, we set z_i = X_{t_i} − X_{t_{i−1}} and z = (z_1, ..., z_n). We use "Bayesian notation" throughout and write p for a probability density or mass function, and use π similarly for a prior density or mass function.
In general, it is infeasible to generate independent realisations from the posterior distribution of (λ, f). To see this, note that from (4) one obtains that the conditional density of a nonzero increment z on a time interval of length Δ is given by

p(z | λ, f) = (e^{λΔ} − 1)^{−1} Σ_{k=1}^{∞} ((λΔ)^k / k!) f^{∗k}(z),    (15)

which generally is rather intractable due to the infinite weighted sum of convolutions. We specialise to the case where the jump size distribution is a mixture of J ≥ 1 Gaussians. The richness and versatility of the class of finite normal mixtures is convincingly demonstrated in Marron and Wand (1992). Hence, we assume

f(x) = Σ_{j=1}^{J} ρ_j φ(x; μ_j, 1/τ),    (16)

where φ(·; μ, σ²) denotes the density of a random variable with the N(μ, σ²) distribution. Note that in (16) we parametrise the density with the precision τ. In the "simple" case J = 2, the convolution density of k independent jumps is given by

f^{∗k}(x) = Σ_{m=0}^{k} (k choose m) ρ_1^m ρ_2^{k−m} φ(x; mμ_1 + (k − m)μ_2, k/τ).

Plugging this expression into Eq. (15) confirms the intractable form of p(z | λ, f).
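To make the intractability tangible, the sketch below (ours; the truncation level kmax is an arbitrary choice) evaluates p(z | λ, f) for J = 2 by truncating the Poisson sum in (15) and using the convolution formula above:

```python
import numpy as np
from scipy.stats import binom, norm, poisson

def increment_density(z, lam, delta, rho1, mu1, mu2, tau, kmax=30):
    """Density (15) of a nonzero increment for the two-component mixture (16),
    truncating the infinite Poisson sum at kmax jumps."""
    dens = 0.0
    for k in range(1, kmax + 1):
        m = np.arange(k + 1)
        # f^{*k}(z) via the binomial mixing over jump types shown above
        conv_k = np.sum(binom.pmf(m, k, rho1)
                        * norm.pdf(z, m * mu1 + (k - m) * mu2, np.sqrt(k / tau)))
        dens += poisson.pmf(k, lam * delta) * conv_k
    return dens / (1.0 - np.exp(-lam * delta))

print(increment_density(1.5, lam=1.0, delta=1.0, rho1=0.8,
                        mu1=2.0, mu2=-1.0, tau=1.0))
```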
We will introduce auxiliary variables to circumvent the intractable form of the likelihood. In case the CPP is observed continuously, the problem is much easier, as the continuous time likelihood on an interval [0, T] is known to be (Shreve 2008, Theorem 11.6.7) proportional to

e^{−λT} λ^{|V|} ∏_{i∈V} f(J_i),

where the T_i are the jump times of the CPP, the J_i the corresponding jump sizes, and V = {i: T_i ≤ T}. The tractability of the continuous time likelihood naturally suggests the construction of a data augmentation scheme. Denote the values of the CPP in between times t_{i−1} and t_i by x_{(i−1,i)}; we will refer to x_{(i−1,i)} as the missing values on the i-th segment, and set x_mis = (x_{(0,1)}, ..., x_{(n−1,n)}). A data augmentation scheme now consists of augmenting the auxiliary variables x_mis to (λ, f) and constructing a Markov chain that has p(x_mis, λ, f | z) as its invariant distribution. More specifically, a standard implementation of this algorithm consists of the following steps:

(1) Initialise x_mis.
(2) Draw (λ, f) conditional on (x_mis, z).
(3) Draw x_mis conditional on (λ, f, z), and return to step (2).
Under weak conditions, the iterates for (λ, f ) are (dependent) draws from the posterior distribution.
Step 3 entails generating compound Poisson bridges. By the Markov property, bridges on different segments can be drawn independently. Data augmentation has been used in many Bayesian computational problems; see, e.g., Tanner and Wong (1987). The outlined scheme can be applied to the problem at hand, but we will shortly explain that imputation of complete CPP bridges (which is nontrivial) is unnecessary, and that we can do with less imputation, thereby effectively reducing the state space of the Markov chain.
As we assume that the jumps are drawn from a non-atomic distribution, imputation is only necessary on segments with nonzero increments. For this reason we let I = {i ∈ {1, ..., n}: z_i ≠ 0} denote the set of indices of observations with nonzero increments, and define the number of segments with nonzero jumps to be I = |I|.

Auxiliary variables
Note that if Y ∼ f with f as in (16), then Y can be simulated by first drawing its label L, which equals j with probability ρ_j, and next drawing from the N(μ_L, 1/τ) distribution. Knowing the labels, sampling the jumps conditional on their sum being z is much easier than in the case with unknown labels. Adding auxiliary variables as labels is a standard trick for inference in mixture models (see, e.g., Diebolt and Robert 1994; Richardson and Green 1997). For the problem at hand, we can do with even less imputation: all we need to know is the number of jumps of each type on every segment with a nonzero increment. For i ∈ I and j ∈ {1, ..., J}, let n_{ij} denote the number of jumps of type j on segment i. Denote the set of all auxiliary variables by a = {a_i, i ∈ I}, where a_i = (n_{i1}, n_{i2}, ..., n_{iJ}).
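The role of the count vectors can be seen in simulation; the sketch below (ours, with hypothetical function names) draws one segment jointly with its auxiliary vector a_i:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_segment(lam, delta, rho, mu, tau):
    """Simulate one segment jointly with its auxiliary count vector a_i:
    total jumps ~ Poisson(lam*delta), types multinomial with weights rho,
    and then z_i | a_i ~ N(a_i' mu, n_i / tau)."""
    n_i = rng.poisson(lam * delta)
    a_i = rng.multinomial(n_i, rho)
    if n_i == 0:
        return 0.0, a_i
    z_i = rng.normal(a_i @ np.asarray(mu), np.sqrt(n_i / tau))
    return z_i, a_i

print(simulate_segment(1.0, 1.0, rho=[0.8, 0.2], mu=[2.0, -1.0], tau=1.0))
```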
In the following we will use the additional notation: for i = 1, ..., n and j = 1, ..., J we set

n_i = Σ_{j=1}^{J} n_{ij},  s_j = Σ_{i∈I} n_{ij},  s = Σ_{j=1}^{J} s_j.

These are the number of jumps on the i-th segment, the total number of jumps of type j (summed over all segments), and the total number of jumps of all types, respectively.

Reparametrisation and prior specification
Instead of parametrising with (λ, ρ_1, ..., ρ_J), we define

ψ_j = λρ_j,  j = 1, ..., J.

The background of this reparametrisation is the observation that a compound Poisson random variable Z whose jumps are of J types can be decomposed as Z = Σ_{j=1}^{J} Z_j, where the Z_j are independent compound Poisson random variables whose jumps are of type j only, and where the parameter of the Poisson random variable for type j is ψ_j. In what follows we use θ = (ψ, μ, τ) with ψ = (ψ_1, ..., ψ_J) and μ = (μ_1, ..., μ_J).
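In display form, this decomposition rests on Poisson thinning; the following is our summary of the observation just made:

```latex
% Thinning: marking each of N ~ P(lambda*Delta) jumps with type j w.p. rho_j
% yields independent Poisson counts with means psi_j * Delta, psi_j = lambda*rho_j.
N \sim \mathcal{P}(\lambda\Delta), \quad
(n_1,\dots,n_J) \mid N \sim \mathrm{MN}(N;\,\rho_1,\dots,\rho_J)
\;\Longrightarrow\;
n_j \overset{\mathrm{ind}}{\sim} \mathcal{P}(\lambda\rho_j\Delta) = \mathcal{P}(\psi_j\Delta).
```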

Hierarchical model and data augmentation scheme
We construct a Metropolis-Hastings algorithm to draw from the distribution of (θ, a) conditional on z. For an index i ∈ I we set a_{−i} = {a_j, j ∈ I \ {i}}. The two main steps of the algorithm are:

(i) Update segments: for each segment i ∈ I, draw a_i conditional on (θ, z, a_{−i});
(ii) Update parameters: draw θ conditional on (z, a).
Compared to the full data augmentation scheme discussed previously, the present approach is computationally much cheaper, as the amount of imputation scales with the number of segments that need imputation. If the time in between observations is fixed and equal to Δ, then the expected number of segments requiring imputation equals n(1 − e^{−λΔ}), which for small Δ is approximately proportional to nλΔ. Denote the Poisson distribution with mean λ by P(λ). Including the auxiliary variables, we can write the observation model as the hierarchical model

n_{ij} | ψ ∼ P(ψ_j Δ_i) independently,  z_i | a_i, μ, τ ∼ N(Σ_{j=1}^{J} n_{ij} μ_j, n_i/τ)    (17)

(with i ∈ {1, ..., n} and j ∈ {1, ..., J}). This implies that the joint density of (z, a) given θ factorises over the segments.
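The two steps can be organised in a simple wrapper; the sketch below is our illustrative skeleton (the function names and the crude initialisation of a_i are assumptions), with the two kernels supplied by the next subsections:

```python
import numpy as np

def mcmc(z, deltas, J, init_theta, update_segment, update_theta, iters, rng):
    """Skeleton of the two-block sampler: (i) refresh the auxiliary counts a_i
    on segments with nonzero increments, (ii) draw theta | (z, a)."""
    nonzero = np.flatnonzero(np.asarray(z) != 0.0)            # the index set I
    a = {i: np.array([1] + [0] * (J - 1)) for i in nonzero}   # crude initialisation
    theta, chain = init_theta, []
    for _ in range(iters):
        for i in nonzero:
            a[i] = update_segment(z[i], deltas[i], theta, a[i], rng)  # step (i)
        theta = update_theta(z, deltas, a, rng)                       # step (ii)
        chain.append(theta)
    return chain
```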

Updating segments
Updating the i-th segment requires drawing from p(a_i | θ, z, a_{−i}) = p(a_i | θ, z_i). We do this with a Metropolis-Hastings step. First we draw a proposal n•_i (for the total number of jumps n_i) from a P(λΔ_i) distribution, conditioned to have a nonzero outcome. Next, we draw

(n•_{i1}, ..., n•_{iJ}) ∼ MN(n•_i; ρ_1, ..., ρ_J),

where MN denotes the multinomial distribution. Hence the proposal density equals

q(a•_i) = (e^{λΔ_i} − 1)^{−1} ((λΔ_i)^{n•_i} / n•_i!) (n•_i! / (n•_{i1}! · · · n•_{iJ}!)) ∏_{j=1}^{J} ρ_j^{n•_{ij}}.

The acceptance probability for the proposal a•_i equals 1 ∧ A, with

A = φ(z_i; Σ_j n•_{ij} μ_j, n•_i/τ) / φ(z_i; Σ_j n_{ij} μ_j, n_i/τ),

since the proposal coincides with the conditional prior of a_i given θ.
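A sketch of this update (ours; it assumes θ = (ψ, μ, τ) with ψ and μ stored as NumPy arrays, and uses the reduction of the acceptance ratio to a likelihood ratio noted above):

```python
import numpy as np
from scipy.stats import norm

def update_segment(z_i, delta_i, theta, a_i, rng):
    """One MH update of a_i: propose the total number of jumps from a
    zero-truncated Poisson(lambda*delta_i), assign types multinomially,
    and accept with probability 1 ^ A, where A is the ratio of the two
    normal likelihoods of z_i."""
    psi, mu, tau = theta
    lam, rho = psi.sum(), psi / psi.sum()
    n_star = 0
    while n_star == 0:                      # zero-truncated Poisson by rejection
        n_star = rng.poisson(lam * delta_i)
    a_star = rng.multinomial(n_star, rho)
    num = norm.pdf(z_i, a_star @ mu, np.sqrt(n_star / tau))
    den = norm.pdf(z_i, a_i @ mu, np.sqrt(a_i.sum() / tau))
    return a_star if rng.random() < min(1.0, num / den) else a_i
```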

Updating parameters
The proof of the following lemma is given in Appendix 1.
Lemma 2 Conditional on a, the components ψ_1, ..., ψ_J are independent. Here P is the symmetric J × J matrix with elements specified in the proof, q is the J-dimensional vector defined there, and R − q'P^{−1}q > 0.
Remark 2 If for some j ∈ {1, ..., J} we have s_j = 0 (no jumps of type j), then the matrix P is singular. However, adding κ I_{J×J} ensures the invertibility of P.
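For concreteness, one component of the parameter update might look as follows; the independent Gamma(α_0, β_0) priors on the ψ_j are a hypothetical but standard conjugate choice made for this sketch, whereas Lemma 2 gives the exact conditional used in the paper:

```python
import numpy as np

def update_psi(a, deltas, alpha0, beta0, rng):
    """Conjugate draw of psi | a under *assumed* independent Gamma(alpha0, beta0)
    priors on the psi_j.  With s_j the total count of type-j jumps and T the
    total observation time, psi_j | a ~ Gamma(alpha0 + s_j, beta0 + T)."""
    s = np.sum([a_i for a_i in a.values()], axis=0)   # s_j, summed over segments
    T = float(np.sum(deltas))                         # total observation time
    return rng.gamma(alpha0 + s, 1.0 / (beta0 + T))   # numpy takes shape, scale
```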

Numerical illustrations
The first two examples concern mixtures of two normal distributions. We simulated n = 5000 segments with Δ = 1, μ_1 = 2, μ_2 = −1 and τ = 1. For the prior hyperparameters we took. The results for λ = 1, ρ_1 = 0.8, ρ_2 = 0.2, and hence ψ_1 = 0.8 and ψ_2 = 0.2, are shown in Fig. 1. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 2. The average acceptance probability for updating the segments was 51%.

Fig. 2 Results for λ = 1; the first 5000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates
The results for λ = 3, ρ_1 = 0.8, ρ_2 = 0.2, and hence ψ_1 = 2.4 and ψ_2 = 0.6, are shown in Fig. 3. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 4. The average acceptance probability for updating the segments was 41%. Observe that the autocorrelation functions of the iterates of the ψ_i display a much slower decay in this second case.

Discussion
As can be seen from the autocorrelation plots, mixing of the chain deteriorates when λ increases. As the focus of this article is on high frequency data, where there are on average only a few jumps in between observations, we do not go into details on improving the algorithm. We remark that a non-centred parametrisation (see, for instance, Papaspiliopoulos et al. 2007) may give more satisfactory results when λ is large. A non-centred parametrisation can be obtained by changing the hierarchical model in (17). Denote by F^{−1}_λ the inverse cumulative distribution function of the P(λ) distribution. Let u_{ij} (i = 1, ..., n and j = 1, ..., J) be a sequence of independent U(0, 1) random variables and set u = {u_{ij}: i = 1, ..., n, j = 1, ..., J}. By considering the hierarchical model in which n_{ij} = F^{−1}_{ψ_j Δ_i}(u_{ij}) (i ∈ {1, ..., n} and j ∈ {1, ..., J}), ψ can be updated using a Metropolis-Hastings step.
In this way {n_{ij}} and ψ are updated simultaneously.
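A sketch of this non-centred construction (ours; `poisson.ppf` is SciPy's inverse Poisson CDF):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(4)

def counts_from_uniforms(u, psi, deltas):
    """Non-centred parametrisation: n_ij = F^{-1}_{psi_j * Delta_i}(u_ij),
    so the latent counts move deterministically with psi."""
    mean = np.outer(np.asarray(deltas), np.asarray(psi))  # shape (n, J)
    return poisson.ppf(u, mean).astype(int)               # ppf is the inverse CDF

u = rng.random((5, 2))
print(counts_from_uniforms(u, psi=[0.8, 0.2], deltas=np.ones(5)))
```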
Another option is to integrate out (μ, τ) from p(θ, z, a). In this model it is even possible to integrate out ψ as well; in that case only the auxiliary variables a have to be updated. Yet another way to improve the efficiency of the algorithm is to use ideas from parallel tempering (cf. Brooks et al. 2011, Chap. 11).

Proof of Theorem 1
There are a number of general results in Bayesian non-parametric statistics, such as the fundamental Theorem 2.1 in Ghosal et al. (2000) and Theorem 2.1 in Ghosal and van der Vaart (2001), which allow determination of posterior contraction rates through checking certain conditions, but none of these results is easily and directly applicable in our case. The principal bottleneck is that a main assumption underlying these theorems is sampling from a fixed distribution, whereas in our high frequency setting the distributions vary with Δ. Therefore, for clarity of exposition, in the proof of our main theorem we choose an alternative path, which consists in mimicking the main steps of the proof of Theorem 2.1, involving judiciously chosen statistical tests, as in Ghosal et al. (2000), while also employing some results on Dirichlet location mixtures of normal densities from Ghosal and van der Vaart (2001). However, a significant part of the technicalities we encounter is characteristic of the decompounding problem only.
Throughout this section we assume that Assumptions 1 and 2 hold. Furthermore, in view of the discussion following Theorem 1, we assume without loss of generality that 0 < δ ≤ 4. All the technical lemmas used in this section are collected in the appendices.
We start with the decomposition

Π(A(ε_n, M) | Z_1, ..., Z_n) = Π(A(ε_n, M) | Z_1, ..., Z_n) φ_n + Π(A(ε_n, M) | Z_1, ..., Z_n)(1 − φ_n) =: I_n + II_n,    (23)

where 0 ≤ φ_n ≤ 1 is a sequence of tests based on the observations Z^n, with properties to be specified below. The idea is to show that the two terms on the right-hand side of the above display separately converge to zero in probability. The tests φ_n allow one to control the behaviour of the likelihood ratio on the set where it is not well behaved, due to (λ, f) being 'far away' from (λ_0, f_0).

Construction of tests
The next lemma is an adaptation of Theorem 7.1 from Ghosal et al. (2000) to decompounding.
A proof is given in the appendix. We use the notation D(ε, A, d) to denote the ε-packing number of a set A in a metric space with metric d, applied in our case with d the scaled Hellinger metric h_Δ.

Lemma 3 Let Q be an arbitrary set of probability measures Q_{λ,f}. Suppose that for some nonincreasing function D(ε), some sequence {ε_n} of positive numbers and every ε > ε_n,

D(ε/2, {Q ∈ Q: ε ≤ h_Δ(Q, Q_{λ_0,f_0}) ≤ 2ε}, h_Δ) ≤ D(ε).    (24)

Then for every ε > ε_n there exists a sequence of tests {φ_n} (depending on ε > 0) such that

Q^{Δ,n}_{λ_0,f_0} φ_n ≤ D(ε) e^{−KnΔε²} / (1 − e^{−KnΔε²})  and  sup_{Q∈Q: h_Δ(Q,Q_{λ_0,f_0})>ε} Q^{⊗n}(1 − φ_n) ≤ e^{−KnΔε²},

where K > 0 is a universal constant.
In the proofs of Propositions 1 and 2 we need the inequalities below: there exists a constant C ∈ (0, ∞), depending on λ̲ and λ̄ only, such that for all λ_1, λ_2 ∈ [λ̲, λ̄] and all f_1, f_2 the distance h(Q_{λ_1,f_1}, Q_{λ_2,f_2}) is bounded in terms of |λ_1 − λ_2| and h(f_1, f_2); these bounds, (25) and (26), can be proven in the same way as Lemma 1 in Gugushvili et al. (2015). Let ε_n be as in Theorem 1. Throughout, C denotes the above constant. For a constant L > 0, define the sequences {a_n} and {η_n}, where a_n = L log^{2/δ}(4C/ε_n). We will show that inequality (24) holds true for every ε = Mε_n with M > 2 and the set of measures Q of the form Q = {Q_{λ,f}: λ ∈ [λ̲, λ̄], f ∈ F_n}, for a suitable sequence of sieves F_n of mixture densities. As a first step, note that the packing number in (24) is bounded as in (28), where the first inequality in (28) follows from the assumption M > 2. For bounding the right-hand side of (28), we have the following proposition.
Proposition 1 We have

log D(Mε_n/2, Q, h_Δ) ≲ log^{4/δ+1}(1/ε_n).

Proof Let {λ_i} be centres of the balls from a minimal covering of [λ̲, λ̄], and let {f_j} be centres of a minimal η_n-covering of F_n with respect to h; by (25) and (26), any Q_{λ,f} ∈ Q is approximated to the required accuracy in h_Δ by Q_{λ_i,f_j} for appropriate choices of i and j. It follows that

log D(Mε_n/2, Q, h_Δ) ≲ log(1/ε_n) + log N(η_n, F_n, h).

As we assume δ ≤ 4, we can apply the arguments in Ghosal and van der Vaart (2001, pp. 1251-1252), see in particular formulae (5.8)-(5.10) (cf. also Theorem 3.1 and Lemma A.3 there), which yield

log N(η_n, F_n, h) ≲ log^{4/δ+1}(1/ε_n).
Combination of the above three inequalities implies the statement of the proposition.
An application of Proposition 1 to (28) gives a bound of the form exp(c_1 nΔε_n²) for some positive constant c_1; here the final inequality follows from our choice of ε_n. Hence (24) is satisfied for D(ε) = exp(c_1 nΔε_n²). By Lemma 3 there exist tests φ_n such that for all n large enough

Q^{Δ,n}_{λ_0,f_0} φ_n ≤ exp((c_1 − KM²) nΔε_n²) / (1 − exp(−KM² nΔε_n²)),    (30)

sup_{Q∈Q: h_Δ(Q,Q_{λ_0,f_0})>Mε_n} Q^{⊗n}(1 − φ_n) ≤ exp(−KM² nΔε_n²).    (31)

Bound on I_n in (23)
First note that, by Eq. (30), the expectation of I_n is exponentially small; Chebyshev's inequality then implies that I_n converges to zero in Q^{Δ,n}_{λ_0,f_0}-probability as n → ∞, as soon as M is chosen so large that KM² − c_1 > 0.

Bound on II_n
Now we consider II_n. Write II_n as the ratio III_n / IV_n of the numerator and the denominator in the representation of the posterior mass of A(ε_n, M) obtained from Bayes' formula, with the test factor (1 − φ_n) included in the numerator. We will show that the numerator III_n goes to zero exponentially fast, while the denominator IV_n is bounded from below by an exponential function with Q^{Δ,n}_{λ_0,f_0}-probability tending to one, in such a way that the ratio of III_n and IV_n still goes to zero in Q^{Δ,n}_{λ_0,f_0}-probability.

Bounding III_n
Here we applied Fubini's theorem to obtain the second term on the right-hand side, which by (31) is bounded by exp(−KM² nΔε_n²); the last inequality is formula (5.11) in Ghosal and van der Vaart (2001). Hence III_n tends to zero exponentially fast. Note that nΔε_n² → ∞ as n → ∞. We will use the following bound, an adaptation of Lemma 8.1 in Ghosal et al. (2000) to our setting, valid for every ε > 0 and C > 0, where the probability measure involved is a normalised restriction of Π(·) to B_Δ(ε, (λ_0, f_0)). By virtue of (33), with Q^{Δ,n}_{λ_0,f_0}-probability tending to one, for any constant C > 0 we have the lower bound (34) for the denominator IV_n. We will now work out the prior probability on the right-hand side of this inequality.
Proposition 2 For some constant c > 0, the prior mass of the neighbourhood B_Δ(ε, (λ_0, f_0)) satisfies the lower bound used in (35).
Proof Let 0 < c ≤ 1/(5C) be a constant, where C is the constant in (25) and (26). By these inequalities it is readily seen that B_Δ(ε, (λ_0, f_0)) contains a set of pairs (λ, f) with both |λ − λ_0| and h(f, f_0) small in proportion to ε. It then follows from the assumed independence of Π_1 and Π_2 that the prior mass of this set factorises into two factors. For the first factor on the right-hand side we have, by (13), that it is bounded from below by a multiple of ε. As far as the second factor is concerned, for some constants c_1, c_2 it is bounded from below by c_1 exp(−c_2 log²(1/ε)), by the same arguments as in inequality (5.17) in Ghosal and van der Vaart (2001). The result now follows by combining the two lower bounds.
Combining (34) with Proposition 2, with Q^{Δ,n}_{λ_0,f_0}-probability tending to one as n → ∞, for any constant C > 0 we obtain a lower bound on the denominator IV_n. We are now ready for the final steps of proving that II_n tends to zero in Q^{Δ,n}_{λ_0,f_0}-probability. Let G_n denote the set on which inequality (35) is true. Then by (32) we obtain the required bound on II_n. Recall that nΔε_n² = log²(nΔ); hence the exponent in the first factor of this display is of order log²(nΔ). Furthermore, a_n^δ = L^δ log²(4C/ε_n), which is of order log²(nΔ) as well. It follows that, provided the constants L and M are chosen large enough, the right-hand side of the above display converges to zero as n → ∞. Chebyshev's inequality then implies that II_n converges to zero in probability as n → ∞. This completes the proof of Theorem 1.
Acknowledgements We wish to thank Wikash Sewlal from Delft University of Technology for the simulation results of the example with a mixture of four normals and the skewed density.The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Proof of Lemma 3
The proof is an adaptation of Theorem 7.1 from Ghosal et al. (2000) to decompounding. In all that follows it is assumed that Q_{λ,f} ∈ Q, but we suppress this assumption in the notation. From this point on, the arguments from the proof of Theorem 7.1 in Ghosal et al. (2000) are applicable (with ε replaced by ε√Δ) and eventually lead to the desired result. The role of formulae (7.1)-(7.2) in that proof is played in the present context by (36) and (37) below.
For a given (λ_1, f_1) there exists a sequence of tests φ_n based on Z^n such that (36) and (37) hold. These two inequalities follow by rewriting the inequalities proved in Ghosal et al. (2000, pp. 520-521), which rely upon results in Birgé (1984) and Le Cam (1986).

Proof of Lemma 2
As the priors for ψ_1, ..., ψ_J are independent, we obtain that p(ψ | μ, τ, z, a) = p(ψ | a), which yields the first statement of the lemma. For (μ, τ) we get

p(μ, τ | z, a) ∝ ∏_{i∈I} φ(z_i; a_i'μ, n_i/τ) × τ^{α_1−1} e^{−β_1 τ} τ^{J/2} exp(−(κτ/2) μ'μ),

where the last factors come from the prior on (μ, τ). From this expression it is easily seen that we can integrate out μ to obtain the distribution of τ conditional on (z, a). To get this right, write D(μ) as a quadratic form in μ.

Fig. 1 Results for λ = 1 using 15,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 5000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only each fifth iterate was saved. The horizontal yellow lines are obtained from computing the posterior mean of θ based on the true auxiliary variables on all segments

Fig. 3 Results for λ = 3 using 25,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 10,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only each fifth iterate was saved. The horizontal yellow lines are obtained from computing the posterior mean of θ based on the true auxiliary variables on all segments

Fig. 4 Results for λ = 3; the first 10,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates

Fig. 6 Results for the example with a mixture of four normals; the first 20,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates