
A non-parametric Bayesian approach to decompounding from high frequency data


Given a sample from a discretely observed compound Poisson process, we consider non-parametric estimation of the density \(f_0\) of its jump sizes, as well as of its intensity \(\lambda _0.\) We take a Bayesian approach to the problem and specify the prior on \(f_0\) as the Dirichlet location mixture of normal densities. An independent prior for \(\lambda _0\) is assumed to be compactly supported and to possess a positive density with respect to the Lebesgue measure. We show that under suitable assumptions the posterior contracts around the pair \((\lambda _0,\,f_0)\) at essentially (up to a logarithmic factor) the \(\sqrt{n\Delta }\)-rate, where n is the number of observations and \(\Delta \) is the mesh size at which the process is sampled. The emphasis is on high frequency data, \(\Delta \rightarrow 0,\) but the obtained results are also valid for fixed \(\Delta .\) In either case we assume that \(n\Delta \rightarrow \infty .\) Our main result implies existence of Bayesian point estimates converging (in the frequentist sense, in probability) to \((\lambda _0,\,f_0)\) at the same rate. We also discuss a practical implementation of our approach. The computational problem is dealt with by inclusion of auxiliary variables and we develop a Markov chain Monte Carlo algorithm that samples from the joint distribution of the unknown parameters in the mixture density and the introduced auxiliary variables. Numerical examples illustrate the feasibility of this approach.


Problem formulation

Let \(N=(N_{t},\, t\ge 0)\) be a Poisson process with a constant intensity \(\lambda >0\) and let \(Y_1,\,Y_2,\,Y_3,\ldots \) be a sequence of independent random variables independent of N and having a common distribution function F with density f (with respect to the Lebesgue measure). A compound Poisson process (abbreviated CPP) \(X=(X_t,\, t\ge 0)\) is defined as

$$\begin{aligned} X_t=\sum _{j=1}^{N_t} Y_j, \end{aligned}$$

where the sum over an empty set is by definition equal to zero. CPPs form a basic model in a variety of applied fields, most notably in queueing and risk theory, see Embrechts et al. (1997) and Prabhu (1998) and the references therein, but also in other fields of science, see, e.g., Alexandersson (1985) and Burlando and Rosso (1993) for stochastic models for precipitation, Katz (2002) on modelling of hurricane damage, or Scalas (2006) for applications in economics and finance.

Suppose that corresponding to the ‘true’ parameter values \(\lambda =\lambda _0\) and \(f=f_0,\) a discrete time sample \(X_{\Delta },\,X_{2\Delta },\ldots ,X_{n\Delta }\) is available from (1), where \(\Delta >0.\) Such a discrete time observation scheme is common in a number of applications of CPP, e.g., in the precipitation models of the above references. Based on the sample \(\mathcal {X}_n^\Delta =(X_\Delta ,\,X_{2\Delta },\ldots ,X_{n\Delta }),\) we are interested in (non-parametric) estimation of \(\lambda _0\) and \(f_0.\) Before proceeding further, we notice that by the stationary independent increments property of a CPP, the random variables \(Z_i^{\Delta } = X_{i\Delta }-X_{(i-1)\Delta },\,1\le i \le n,\) are independent and identically distributed. Each \(Z_i^{\Delta }\) has the same distribution as the random variable

$$\begin{aligned} Z^{\Delta }=\sum _{j=1}^{T^{\Delta }} Y_j, \end{aligned}$$

where \(T^{\Delta }\) is independent of the sequence \(Y_1,\, Y_2,\ldots \) and has a Poisson distribution with parameter \(\Delta \lambda .\) Hence, our problem is equivalent to estimating (non-parametrically) \(\lambda _0\) and \(f_0\) based on the sample \(\mathcal {Z}_n^{\Delta }=(Z_1^{\Delta },\,Z_2^{\Delta },\ldots ,Z_n^{\Delta }).\) We will henceforth use this alternative formulation of the problem. Our emphasis is on high frequency data, \(\Delta =\Delta _n\rightarrow 0\) as \(n\rightarrow \infty ,\) but the obtained results are also valid for low frequency observations, i.e., for fixed \(\Delta .\)
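This equivalent sampling scheme is easy to implement. The following minimal Python sketch (all numerical values, such as \(\lambda =2,\) f the standard normal density and \(\Delta =0.1,\) are our illustrative choices, not fixed by the paper) draws the increments directly as compound Poisson sums and checks them against the exact moments of \(Z^{\Delta }.\)

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values (not from the paper): lam = 2.0, f = N(0, 1), Delta = 0.1.
lam, Delta, n = 2.0, 0.1, 200_000

# Each increment Z_i^Delta is distributed as a sum of T ~ Poisson(lam * Delta)
# i.i.d. jumps Y_j ~ f, with the empty sum equal to zero.
T = rng.poisson(lam * Delta, size=n)
Z = np.array([rng.normal(size=t).sum() for t in T])

# Moment checks: E[Z] = lam * Delta * E[Y] = 0, Var[Z] = lam * Delta * E[Y^2],
# and P(Z = 0) >= P(T = 0) = exp(-lam * Delta).
print(Z.mean(), Z.var(), (T == 0).mean())
```

Note the non-negligible fraction of increments exactly equal to zero, an atom that reappears in the explicit form of the law of \(Z^{\Delta }\) in Sect. 2.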

Our main result is on the contraction rate of the posterior distribution, which we show to be, up to a logarithmic factor, \((n\Delta )^{-1/2}.\) A by now standard approach to obtain contraction rates in an IID setting is to verify the assumptions of the fundamental Theorem 2.1 in Ghosal et al. (2000). It should be noted that in the present high frequency setting, this theorem is not applicable. One of the model assumptions underlying this theorem, which is satisfied in Gugushvili et al. (2015), is that one deals with samples of a fixed distribution, whereas in our present high frequency observation regime the distribution of \(Z^\Delta \) is varying, with the Dirac distribution concentrated at zero as its limit for \(\Delta \rightarrow 0.\) Therefore we propose an alternative approach, circumventing the use of the cited Theorem 2.1. The theoretical contribution of the present paper is therefore not only the statement of the main result itself, but also its proof. Next to this we also discuss a practical implementation of our non-parametric Bayesian approach, a Markov chain Monte Carlo algorithm that samples from the joint distribution of the unknown parameters in the mixture density and certain introduced auxiliary variables.

Literature review and present approach

Because adding a Poisson number of \(Y_j\)’s amounts to compounding their distributions, the problem of recovering the intensity \(\lambda _0\) and the density \(f_0\) from the observations \(Z_i^{\Delta }\) can be referred to as decompounding. Decompounding already has some history: the early contributions (Buchmann and Grübel 2003, 2004) dealt with estimation of the distribution function \(F_0,\) paying particular attention to the case when \(F_0\) is discrete, while the later contributions (Comte et al. 2014; Duval 2013; van Es et al. 2007) concentrated on estimation of the density \(f_0\) instead. More (frequentist) theory on statistical inference for CPPs (and more generally for Lévy processes) can be found in the volume (Belomestny et al. 2015), with the survey paper (Comte et al. 2015) devoted to statistical methods for high frequency discrete observations, with a special section on CPPs. Other references on statistics for Lévy processes in the high frequency data setting are Comte and Genon-Catalot (2011), Comte and Genon-Catalot (2010), Comte et al. (2010), Figueroa-López (2008, 2009), Nickl and Reiß (2012), Nickl et al. (2016), and Ueltzhöfer and Klüppelberg (2011). All these approaches are frequentist in nature. On the other hand, theoretical and computational advances made over recent years have shown that a non-parametric Bayesian approach is feasible in various statistical settings; see, e.g., Hjort et al. (2010) for an overview. This is the approach we take in this work to estimate \(\lambda _0\) and \(f_0.\)

To the best of our knowledge, a non-parametric Bayesian approach to inference for (a class of) Lévy processes was first considered in Gugushvili et al. (2015). That paper, contrary to the present context, dealt with observations at fixed equidistant times, and relied strongly on an application of Theorem 2.1 of Ghosal et al. (2000), as already alluded to in the problem formulation of Sect. 1.1. The present work complements the results of Gugushvili et al. (2015), in the sense that we now allow high frequency observations, which requires a substantially different route to prove our results, as we explain in more detail in Sect. 1.3.

We will study the non-parametric Bayesian approach to decompounding from a frequentist point of view (in the sense specified below), so that one may also think of it as a means for obtaining a frequentist estimator. Advantages of the non-parametric Bayesian approach include automatic quantification of uncertainty in parameter estimates through Bayesian posterior credible sets and automatic selection of the degree of smoothing required in non-parametric inferential procedures.


The non-parametric class \(\mathcal {F}\) of densities f that we consider is that of location mixtures of normal densities, i.e., densities specified by

$$\begin{aligned} f(x)=f_{H,\sigma } (x)=\int \phi _{\sigma }(x-z)\mathrm {d}H(z), \end{aligned}$$

where \(\phi _{\sigma }\) denotes the density of the normal distribution with mean zero and variance \(\sigma ^2\) and H is a mixing measure. These mixtures form a rich and flexible class of densities, see Marron and Wand (1992) and McLachlan and Peel (2000), that are capable of closely approximating many densities that themselves are not representable in this way. The resulting mixture densities are infinitely smooth, which is arguably adequate in many, if not most, practical applications.
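For a discrete mixing measure H the integral in (3) reduces to a finite sum, so \(f_{H,\sigma }\) is straightforward to evaluate numerically. A small sketch (the two-atom H and the value of \(\sigma \) are hypothetical choices):

```python
import numpy as np

def phi(x, sigma):
    """Density of the N(0, sigma^2) distribution."""
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(x, atoms, weights, sigma):
    """f_{H,sigma}(x) = int phi_sigma(x - z) dH(z) for a discrete H."""
    return sum(w * phi(np.asarray(x, dtype=float) - z, sigma)
               for z, w in zip(atoms, weights))

# Hypothetical two-atom mixing measure: gives a bimodal jump size density.
atoms, weights, sigma = [-1.5, 2.0], [0.4, 0.6], 0.5
xs = np.linspace(-6.0, 8.0, 2801)
f = mixture_density(xs, atoms, weights, sigma)

# f_{H,sigma} is a genuine density: nonnegative and integrating to one.
print(f.min() >= 0.0, f.sum() * (xs[1] - xs[0]))
```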

Bayesian estimation requires specification of prior distributions on \(\lambda \) and f. We propose independent priors on \(\lambda \) and f that we denote by \(\Pi _1\) and \(\Pi _2,\) respectively. For f, we take a Dirichlet mixture of normal densities as a prior. This type of prior in the context of Bayesian density estimation was introduced in Ferguson (1983) and Lo (1984); for more recent references see, e.g., Ghosal and van der Vaart (2001). The prior for f is defined as the law of the function \(f_{ H , \sigma }\) as in (3), with H assumed to follow a Dirichlet process prior \(D_\alpha \) with base measure \(\alpha ,\) and \(\sigma \) a priori independent of H with distribution \(\Pi _3.\) Recall that a Dirichlet process \(D_{\alpha }\) on \(\mathbb {R}\) with base measure \(\alpha \) defined on the Borel \(\sigma \)-algebra \(\mathcal {B}(\mathbb {R})\) (we assume \(\alpha \) to be non-negative and \(\sigma \)-additive) is a random probability measure G on \(\mathbb {R},\) such that for every finite measurable partition \(B_1,\,B_2,\ldots ,B_k\) of \(\mathbb {R},\) the probability vector \((G(B_1),\,G(B_2),\ldots ,G(B_k))\) possesses the Dirichlet distribution on the k-dimensional simplex with parameters \((\alpha (B_1),\,\alpha (B_2),\ldots ,\alpha (B_k)).\) See, e.g., the original paper (Ferguson 1973), or the overview article (Ghosal 2010) for more information on Dirichlet process priors.
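Realisations of a Dirichlet process can be generated via Sethuraman's stick-breaking representation, a standard construction not described above; a truncated sketch (the base measure and truncation level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp(mass, base_sampler, trunc=500):
    """Truncated draw from a Dirichlet process D_alpha via stick-breaking:
    G = sum_k w_k delta_{z_k}, with v_k ~ Beta(1, mass),
    w_k = v_k * prod_{j<k} (1 - v_j), and atoms z_k drawn i.i.d. from the
    normalised base measure (`mass` plays the role of alpha(R))."""
    v = rng.beta(1.0, mass, size=trunc)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return base_sampler(trunc), w

# Illustrative base measure: total mass 1 times a N(0, 2^2) distribution.
z, w = sample_dp(mass=1.0, base_sampler=lambda k: rng.normal(0.0, 2.0, k))

# The stick-breaking weights are nonnegative and, for a large truncation
# level, sum to (essentially) one.
print(w.min() >= 0.0, w.sum())
```

A draw of f from \(\Pi _2\) is then obtained by convolving the realised G with \(\phi _\sigma ,\) with \(\sigma \) drawn from \(\Pi _3,\) as in (3).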

A nonparametric Bayesian approach to density estimation employing a Dirichlet mixture of normal densities as a prior can in a very rough sense be thought of as a Bayesian counterpart of kernel density estimation (with a Gaussian kernel), cf. Ghosal and van der Vaart (2007, p. 697).

With the sample size n tending to infinity, the Bayesian approach should be able to discern the true parameter pair \((\lambda _0,\,f_0)\) with increasing accuracy. We can formalise this by requiring, for instance, that for any fixed neighbourhood A (in an appropriate topology) of \((\lambda _0,\,f_0),\, \Pi (A^c|\mathcal {Z}_{n}^\Delta )\rightarrow 0\) in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability. Here \(\Pi \) is used as a shorthand notation for the posterior distribution of \((\lambda ,\,f),\) and we use \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta }\) to denote the law of the random variable \(Z^{\Delta }\) in (2) and \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\) the law of \(\mathcal {Z}_n^{\Delta }.\) More generally, one may take a sequence of shrinking neighbourhoods \(A_n\) of \((\lambda _0,\,f_0)\) and try to determine the rate at which the neighbourhoods \(A_n\) are allowed to shrink, while still capturing most of the posterior mass. This rate is referred to as a posterior convergence rate (we give the precise definition in Sect. 3). Two fundamental references on establishing such rates in various statistical settings are Ghosal et al. (2000) and Ghosal and van der Vaart (2001). This convergence rate can be thought of as an analogue of the convergence rate of a frequentist estimator. The analogy can be made precise: contraction of the posterior distribution at a certain rate implies existence of a Bayes point estimate with the same convergence rate (in the frequentist sense); see Theorem 2.5 in Ghosal et al. (2000) and the discussion on pp. 506–507 there.

Obviously, for our programme to be successful, \(\Delta \) has to satisfy the assumption \(n\Delta \rightarrow \infty ,\) which is a necessary condition for consistent estimation of \((\lambda _0,\,f_0),\) as it ensures that asymptotically we observe an infinite number of jumps in the process. We cover both the case of so-called high frequency observation schemes (\(\Delta \rightarrow 0\)) as well as low frequency observations (fixed \(\Delta \)). A sufficient condition, which covers both observation regimes and which relates \(\Delta \) to n, is \(\Delta =n^{-\alpha },\) where \(0\le \alpha <1.\)

We note that in Ghosal and Tang (2006) and Tang and Ghosal (2007) non-parametric Bayesian inference for Markov processes is studied, of which CPPs form a particular class, but these papers deal with estimation of the transition density of a discretely observed Markov process, which is different from the problem we consider here. A parametric Bayesian approach to inference for CPPs is studied in Insua et al. (2012, Sects. 5.5 and 10.3).

The main result of our paper is Theorem 1, in which we state sufficient conditions on the prior that yield a posterior rate of contraction of the order \((\log ^{\kappa }(n\Delta ))/ \sqrt{n\Delta },\) for some constant \(\kappa >0.\) We argue that this rate is a nearly (up to a logarithmic factor) optimal posterior contraction rate in our problem. Our main result complements the one in Gugushvili et al. (2015), in that it treats both the low and high frequency observation schemes simultaneously, with emphasis on the latter. We note (again) a fundamental difference between the present paper and Gugushvili et al. (2015) when it comes to the techniques used to prove the main result. As Theorem 2.1 of Ghosal et al. (2000) cannot immediately be used, we take an alternative route that avoids this theorem, but instead refines a number of technical results involving properties of statistical tests that form essential ingredients of the proof in Ghosal et al. (2000). These refined results are then used as key technical steps in a direct proof of our Theorem 1. Furthermore, the present paper establishes the posterior contraction rate for infinitely smooth jump size densities \(f_0,\) a case not covered by Gugushvili et al. (2015). On the other hand, Gugushvili et al. (2015) deals with multi-dimensional CPPs, while in this paper we consider only the one-dimensional case. Finally, in this work we also discuss a practical implementation of our non-parametric Bayesian approach. The computational problem is dealt with by the inclusion of auxiliary variables. More precisely, we show how a Markov chain Monte Carlo algorithm can be devised that samples from the joint distribution of the unknown parameters in the mixture density and the introduced auxiliary variables. Numerical examples illustrate the feasibility of this approach.


The remainder of the paper is organised as follows. In the next section we state some preliminaries on the likelihood, prior and notation. In Sect. 3 we first motivate the use of the scaled Hellinger metric to define the neighbourhoods for which the posterior contraction rate is derived in the case where the observations are sampled at high frequency. Then we present the main result on the posterior contraction rate (Theorem 1), whose proof is given in Sect. 5. We discuss the numerical implementation of our results in Sect. 4. Technical lemmas used in the proof of the main theorem, together with their proofs, are gathered in the Appendix.

Preliminaries and notation

Likelihood, prior and posterior

We are interested in Bayesian inference via Bayes’ formula. Therefore we first need to specify the likelihood in our model.


The characteristic function of the Poisson sum \(Z^{\Delta }\) defined in (2) is given by

$$\begin{aligned} \phi (t)=e^{-\lambda \Delta +\lambda \Delta \phi _{f}(t)}, \end{aligned}$$

where \(\phi _{f}\) is the characteristic function of f. This can be rewritten as

$$\begin{aligned} \phi (t)=e^{-\lambda \Delta }+\left( 1-e^{-\lambda \Delta }\right) \frac{1}{e^{\lambda \Delta }-1}\left( e^{\lambda \Delta \phi _{f}(t)} -1 \right) , \end{aligned}$$

which, using the fact that \(\phi _{f}\) vanishes at infinity, shows that the distribution of \(Z^{\Delta }\) is a mixture of a point mass at zero and an absolutely continuous distribution. Letting \(t\rightarrow \infty ,\) we get that \(\phi (t)\rightarrow e^{-\lambda \Delta }.\) Hence \(\lambda \) is identifiable from the law of \(Z^{\Delta },\) and then so is f. The density of the law \(\mathbb {Q}_{\lambda , f}^{\Delta }\) of \(Z^{\Delta }\) with respect to the measure \(\mu ,\) which is the sum of Lebesgue measure and the Dirac measure concentrated at zero, can in fact be written explicitly as (cf. van Es et al. 2007, p. 681 and Proposition 2.1 in Duval 2013)

$$\begin{aligned} \frac{\mathrm {d}\mathbb {Q}_{\lambda , f}^{\Delta }}{\mathrm {d}\mu }(x)=e^{-\lambda \Delta }\mathbf {1}_{\{0\}}(x)+\left( 1-e^{-\lambda \Delta }\right) \sum _{m=1}^{\infty } a_m(\lambda \Delta ) f^{*m}(x)\mathbf {1}_{\mathbb {R}\setminus \{0\}}(x), \end{aligned}$$

where \(\mathbf {1}_A\) denotes the indicator of a set A,

$$\begin{aligned} a_m(\lambda \Delta )=\frac{1}{ e^{\lambda \Delta } -1 } \frac{(\lambda \Delta )^m}{m!}, \end{aligned}$$

and \(f^{*m}\) denotes the m-fold convolution of f with itself. However, the expression (4) is useless for Bayesian computations. To work around this problem, we will employ a different dominating measure. Consider the law \(\mathbb {R}_{\lambda ,f}^{\Delta }\) of \((X_t,\, t\in [0,\,\Delta ]).\) By the Theorem in Skorohod (1964, p. 261) \(\mathbb {R}_{\lambda ,f}^{\Delta }\) is absolutely continuous with respect to \(\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta }\) if and only if \(\mathbb {P}_{f}\) is absolutely continuous with respect to \(\mathbb {P}_{\widetilde{f}}\) (we of course assume that \(\lambda ,\,\widetilde{\lambda }>0\)). A simple condition to ensure the latter is to assume that \(\widetilde{f}\) is continuous and does not take the value zero on \(\mathbb {R}.\)
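Despite this remark, (4) can still be evaluated numerically whenever the convolutions \(f^{*m}\) are explicit. A sketch for the illustrative choice \(f=N(0,\sigma ^2)\) (so that \(f^{*m}=N(0,m\sigma ^2)\); all parameter values here are ours), using the simplification \((1-e^{-\lambda \Delta })a_m(\lambda \Delta )=e^{-\lambda \Delta }(\lambda \Delta )^m/m!\):

```python
import numpy as np
from math import exp, factorial, sqrt, pi

def q_density(x, lam, Delta, sigma, mmax=50):
    """Density (4) of Z^Delta w.r.t. mu = Lebesgue + Dirac at 0, for the
    illustrative choice f = N(0, sigma^2), so that f^{*m} = N(0, m sigma^2).
    Uses (1 - e^{-lam Delta}) a_m(lam Delta) = e^{-lam Delta}(lam Delta)^m/m!."""
    if x == 0.0:
        return exp(-lam * Delta)          # the point mass at zero
    s = sum((lam * Delta)**m / factorial(m)
            * exp(-x**2 / (2 * m * sigma**2)) / (sigma * sqrt(2 * pi * m))
            for m in range(1, mmax + 1))
    return exp(-lam * Delta) * s

lam, Delta, sigma = 2.0, 0.5, 1.0
dx = 0.005
xs = np.arange(dx / 2, 15.0, dx)          # positive half-grid; density is even
cont = 2.0 * sum(q_density(x, lam, Delta, sigma) for x in xs) * dx

# Point mass plus continuous mass should equal one.
print(exp(-lam * Delta) + cont)           # ~ 1.0
```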

Define the random measure \(\mu \) by

$$\begin{aligned} \mu (B) = \#\left\{ t {\text {:}}\,\left( t,\,X_t-X_{t-}\right) \in B\right\} , \quad B \in \mathcal {B}([0,\,\Delta ])\otimes \mathcal {B}(\mathbb {R}\setminus \{0\}). \end{aligned}$$

Under \(\mathbb {R}_{{\lambda },{f}},\) the random measure \(\mu \) is a Poisson point process on \([0,\,\Delta ]\times (\mathbb {R}\setminus \{0\})\) with intensity measure \( {\Lambda }(dt,\,dx)={\lambda } dt {f}(x)dx,\) which follows, e.g., from Theorem 1 on p. 69 and Corollary on p. 64 in Skorohod (1964). By formula (46.1) on p. 262 in Skorohod (1964), we have

$$\begin{aligned} \frac{\mathrm {d} \mathbb {R}_{\lambda ,f}^{\Delta }}{ \mathrm {d}\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta } }(X)=\exp \left( \int _0^{\Delta } \int _{\mathbb {R}} \log \left( \frac{\lambda f(x)}{{\widetilde{\lambda }} \widetilde{f}(x)} \right) \mu (dt,\,dx) - \Delta (\lambda -{\widetilde{\lambda }}) \right) . \end{aligned}$$

By Theorem 2 on p. 245 in Skorohod (1964) and Corollary 2 on p. 246 there, the density \(k_{\lambda ,f}^{\Delta }\) of \(\mathbb {Q}_{\lambda ,f}^{\Delta }\) with respect to \(\mathbb {Q}_{\widetilde{\lambda },\widetilde{f}}^{\Delta }\) is given by the conditional expectation

$$\begin{aligned} k_{\lambda ,f}^{\Delta }(x)=\mathrm{{\mathbb E\,}}_{\widetilde{\lambda },\widetilde{f}}\left( \frac{\mathrm {d} \mathbb {R}_{\lambda ,f}^{\Delta }}{ \mathrm {d}\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta } }(X)\Bigg | X_{\Delta }=x \right) , \end{aligned}$$

where the subscript in the conditional expectation operator signifies the fact that it is evaluated under the probability \(\mathbb {R}_{ \widetilde{\lambda },\widetilde{f} }^{\Delta }.\) Hence the likelihood [in the parameter pair \((\lambda ,\,f)\)] associated with the sample \(\mathcal {Z}_n^{\Delta }\) is given by the product

$$\begin{aligned} L_n^{\Delta }(\lambda ,\,f)=\prod _{i=1}^n k_{\lambda ,f}^{\Delta }\left( Z_i^{\Delta }\right) . \end{aligned}$$

An advantage of specifying the likelihood in this manner is that it allows one to reduce some of the difficult computations for the laws \(\mathbb {Q}_{\lambda ,f}^{\Delta }\) to those for the laws \(\mathbb {R}_{\lambda ,f}^{\Delta },\) which are simpler.
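A quick Monte Carlo sanity check of the displayed Radon–Nikodym derivative is to verify that its expectation under \(\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta }\) equals one; note that only the jump sizes enter, since the integrand does not depend on t. All concrete parameter choices in the sketch below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def log_rn(jumps, lam, f, lam_t, f_t, Delta):
    """log of dR^Delta_{lam,f}/dR^Delta_{lam_t,f_t} on a path with the given
    jump sizes: sum_j log(lam f(y_j) / (lam_t f_t(y_j))) - Delta (lam - lam_t)."""
    s = float(np.sum(np.log(lam * f(jumps) / (lam_t * f_t(jumps)))))
    return s - Delta * (lam - lam_t)

def npdf(mu, sig):
    return lambda y: np.exp(-(y - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))

# Illustrative pairs; the dominating density f_t is continuous and positive on R.
lam, f = 2.0, npdf(0.5, 1.0)
lam_t, f_t = 1.0, npdf(0.0, 1.5)
Delta = 1.0

# Monte Carlo under R_{lam_t, f_t}: the mean of dR_{lam,f}/dR_{lam_t,f_t} is 1.
vals = [np.exp(log_rn(rng.normal(0.0, 1.5, size=rng.poisson(lam_t * Delta)),
                      lam, f, lam_t, f_t, Delta))
        for _ in range(100_000)]
print(np.mean(vals))   # close to 1
```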

Observe that the priors on \(\lambda \) and f induce the prior \(\Pi = \Pi _1\times \Pi _2\) on the collection of densities \(k_{\lambda ,f}^{\Delta }.\) We will use the same symbol \(\Pi \) both for the prior on the pair \((\lambda ,\,f)\) and for the induced prior on the density \(k_{\lambda ,f}^{\Delta },\) and likewise for the corresponding posteriors. This convention simplifies notation in some of the formulations below.

By Bayes’ theorem, the posterior measure of any measurable set \(A\subset (0,\,\infty )\times \mathcal {F}\) is given by

$$\begin{aligned} \Pi \left( A|\mathcal {Z}_n^{\Delta }\right) =\frac{ \iint _A L_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }{ \iint L_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }. \end{aligned}$$

Upon setting \(\overline{A}=\{ k_{\lambda ,f}^{\Delta }{\text {:}}\,(\lambda ,\,f)\in A \}\) and recalling our conventions above, this can also be written as

$$\begin{aligned} \Pi \left( \overline{A}|\mathcal {Z}_n^{\Delta }\right) =\frac{ \int _{\overline{A}} L_n^{\Delta }(k) \mathrm {d}\Pi (k) }{ \int L_n^{\Delta }(k) \mathrm {d}\Pi (k) }. \end{aligned}$$

Once the posterior is available, one can next proceed with computation of other quantities of interest in Bayesian statistics, such as Bayes point estimates or credible sets.
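As a toy illustration of Bayes' theorem above (and emphatically not of the MCMC algorithm discussed later), the posterior for \(\lambda \) alone can be computed on a grid when f is known and the convolutions in (4) are explicit. The choices below (\(f=N(0,1),\) a uniform prior on \(\lambda ,\) synthetic data) are ours.

```python
import numpy as np
from math import exp, factorial, sqrt, pi

rng = np.random.default_rng(3)

def q_dens(z, lam, Delta, mmax=30):
    """Density of Z^Delta w.r.t. Lebesgue + Dirac at 0 when f = N(0, 1),
    so that f^{*m} = N(0, m); cf. (4)."""
    if z == 0.0:
        return exp(-lam * Delta)
    s = sum((lam * Delta)**m / factorial(m)
            * exp(-z**2 / (2 * m)) / sqrt(2 * pi * m)
            for m in range(1, mmax + 1))
    return exp(-lam * Delta) * s

# Synthetic increments from lam0 = 2.0 and f0 = N(0, 1).
lam0, Delta, n = 2.0, 0.5, 400
Z = [rng.normal(size=t).sum() for t in rng.poisson(lam0 * Delta, size=n)]

# Grid posterior for lam under a uniform prior Pi_1 on [0.1, 5].
grid = np.linspace(0.1, 5.0, 100)
loglik = np.array([sum(np.log(q_dens(z, lam, Delta)) for z in Z)
                   for lam in grid])
post = np.exp(loglik - loglik.max())
post /= post.sum()
print(grid[np.argmax(post)])   # posterior mode, near lam0 = 2.0
```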


Throughout the paper we will use the following notation to compare two sequences \(\{a_n\}\) and \(\{b_n\}\) of positive real numbers: \(a_n\lesssim b_n\) will mean that there exists a constant \(C>0,\) independent of n, such that \(a_n\le C b_n,\) while \(a_n\gtrsim b_n\) will mean that \(a_n\ge c b_n\) for some constant \(c>0\) independent of n.

Next we introduce various notions of distances between probability measures. The Hellinger distance \(h(\mathbb {Q}_{0},\,\mathbb {Q}_{1})\) between two probability laws \(\mathbb {Q}_{0}\) and \(\mathbb {Q}_{1}\) on a measurable space \((\Omega ,\,\mathfrak {F})\) is defined as

$$\begin{aligned} h\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) =\left( \int \left( \mathrm {d}\mathbb {Q}_{0}^{1/2} - \mathrm {d}\mathbb {Q}_{1}^{1/2} \right) ^2 \right) ^{1/2}. \end{aligned}$$

Assume further \(\mathbb {Q}_0\ll \mathbb {Q}_1.\) The Kullback–Leibler (or informational) divergence \(\mathrm {K}(\mathbb {Q}_{0},\,\mathbb {Q}_{1})\) is defined as

$$\begin{aligned} \mathrm {K}\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) = \int \log \left( \frac{ \mathrm {d}\mathbb {Q}_{0}}{ \mathrm {d}\mathbb {Q}_{1} } \right) \mathrm {d}\mathbb {Q}_{0}, \end{aligned}$$

while the \(\mathrm {V}\)-discrepancy is defined through

$$\begin{aligned} \mathrm {V}\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) =\int \log ^2 \left( \frac{ \mathrm {d}\mathbb {Q}_{0}}{ \mathrm {d}\mathbb {Q}_{1} } \right) \mathrm {d}\mathbb {Q}_{0}. \end{aligned}$$

Here is some additional notation. For \(f,\,g\) nonnegative integrable functions, not necessarily densities, we write

$$\begin{aligned} h^2(f,\,g)&=\int (\sqrt{f}-\sqrt{g})^2, \\ \mathrm {K}(f,\,g)&= \int \log \frac{f}{g}f -\int f + \int g,\\ \mathrm {V}(f,\,g)&= \int \log ^2\frac{f}{g}f. \end{aligned}$$

Note that these ‘distances’ are all nonnegative and only zero if \(f=g\) a.e. If f and g are densities of probability measures \(\mathbb {Q}_0\) and \(\mathbb {Q}_1\) on \((\mathbb {R},\,\mathcal {B}),\) respectively, then the above ‘distances’ reduce to the previously introduced ones.

We will also use \(\mathrm {K}(x,\,y)=x\log \frac{x}{y}-x+y\) for \(x,\,y>0.\) Note that also \(\mathrm {K}(x,\,y)\ge 0\) and \(\mathrm {K}(x,\,y)=0\) if and only if \(x=y.\)
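These quantities are easily computed numerically. The sketch below checks Riemann-sum approximations against the closed-form values for two unit-variance normal densities a distance one apart (our example, not from the paper): \(h^2=2-2e^{-1/8},\, \mathrm {K}=1/2\) and \(\mathrm {V}=\mathrm {K}^2+1=5/4.\)

```python
import numpy as np

def hellinger2(f, g, xs, dx):
    """h^2(f, g) = int (sqrt(f) - sqrt(g))^2, by a Riemann sum."""
    return np.sum((np.sqrt(f(xs)) - np.sqrt(g(xs)))**2) * dx

def kl(f, g, xs, dx):
    """K(f, g) = int f log(f/g) - int f + int g (for densities the last
    two terms cancel)."""
    fx, gx = f(xs), g(xs)
    return np.sum(fx * np.log(fx / gx) - fx + gx) * dx

def v_disc(f, g, xs, dx):
    """V(f, g) = int f log^2(f/g)."""
    fx, gx = f(xs), g(xs)
    return np.sum(fx * np.log(fx / gx)**2) * dx

npdf = lambda mu, s: (lambda x: np.exp(-(x - mu)**2 / (2 * s**2))
                      / (s * np.sqrt(2 * np.pi)))
dx = 0.0005
xs = np.arange(-12.0, 13.0, dx)

# Two unit-variance normals a distance 1 apart:
# h^2 = 2 - 2*exp(-1/8), K = 1/2, V = 1/4 + 1 = 5/4.
f0, f1 = npdf(0.0, 1.0), npdf(1.0, 1.0)
print(hellinger2(f0, f1, xs, dx), kl(f0, f1, xs, dx), v_disc(f0, f1, xs, dx))
```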

Main result on posterior contraction rate

Denote the true parameter values for the CPP by \((\lambda _0,\, f_0).\) Recall that the problem is to estimate \(f_0\) and \(\lambda _0\) based on the observations \(\mathcal {Z}^\Delta _n,\) and that \(\Delta \rightarrow 0\) in a high frequency regime. To say that a pair \((f,\,\lambda )\) lies in a neighbourhood of \((f_0,\,\lambda _0),\) one needs a notion of distance on the corresponding measures \(\mathbb {Q}^{\Delta }_{\lambda ,f}\) and \(\mathbb {Q}^{\Delta }_{\lambda _0,f_0},\) the two possible induced laws of \(Z^\Delta _i=X_{i\Delta }-X_{(i-1)\Delta }.\) The Hellinger distance is a popular and rather reasonable choice to that end in non-parametric Bayesian statistics. However, for \(\Delta \rightarrow 0\) the Hellinger metric h between those laws automatically tends to 0. The first assertion of Lemma 1 below states that \(h(\mathbb {Q}^{\Delta }_{\lambda ,f},\,\mathbb {Q}^{\Delta }_{\lambda _0,f_0})\) is of order \(\sqrt{\Delta }\) as \(\Delta \rightarrow 0.\) This motivates replacing the ordinary Hellinger metric h with the scaled metric \(h^\Delta =h/\sqrt{\Delta }\) in our asymptotic analysis for high frequency data. Of course, for fixed \(\Delta \) (in which case one can take \(\Delta =1\) w.l.o.g.), nothing changes with this replacement. The lemma also shows that the Kullback–Leibler divergence and the \(\mathrm {V}\)-discrepancy are of order \(\Delta \) for \(\Delta \rightarrow 0.\) Therefore we will also use the scaled distances \(\mathrm {K}^\Delta =\mathrm {K}/\Delta \) and \(\mathrm {V}^\Delta =\mathrm {V}/\Delta .\)

Lemma 1

The following expressions hold true:

$$\begin{aligned} \lim _{\Delta \rightarrow 0}\frac{1}{\Delta } h^2\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= h^2\left( \lambda f,\,\lambda _0 f_0\right) =\int \left( \sqrt{\lambda f(x)}-\sqrt{\lambda _0f_0(x)}\right) ^2\mathrm {d}x, \\ \lim _{\Delta \rightarrow 0}\frac{1}{\Delta }\mathrm {K}\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= \mathrm {K}\left( \lambda f,\,\lambda _0 f_0\right) =\lambda \mathrm {K}\left( f,\,f_0\right) +\mathrm {K}\left( \lambda ,\,\lambda _0\right) , \\ \lim _{\Delta \rightarrow 0}\frac{1}{\Delta }\mathrm {V}\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= \mathrm {V}\left( \lambda f,\,\lambda _0 f_0\right) =\int \log ^2\frac{\lambda f(x)}{\lambda _0 f_0(x)}\lambda f(x)\mathrm {d}x. \end{aligned}$$

The proof will be presented in the Appendix.
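The first limit in Lemma 1 can also be checked numerically when f and \(f_0\) are normal, so that the density of \(Z^{\Delta }\) is available in closed form; the parameter values in the sketch below are illustrative.

```python
import numpy as np
from math import exp, factorial, sqrt, pi

def q_dens(x, lam, Delta, sig, mmax=25):
    """Density of Z^Delta w.r.t. Lebesgue + Dirac at 0 for f = N(0, sig^2),
    so that f^{*m} = N(0, m sig^2)."""
    if x == 0.0:
        return exp(-lam * Delta)
    s = sum((lam * Delta)**m / factorial(m)
            * exp(-x**2 / (2 * m * sig**2)) / (sig * sqrt(2 * pi * m))
            for m in range(1, mmax + 1))
    return exp(-lam * Delta) * s

lam, sig, lam0, sig0, Delta = 1.0, 1.0, 2.0, 1.5, 1e-3
dx = 0.01
xs = np.arange(dx / 2, 25.0, dx)    # positive half-grid; densities are even

# h^2 between the two laws of Z^Delta: atom term plus continuous term.
atom = (exp(-lam * Delta / 2) - exp(-lam0 * Delta / 2))**2
cont = 2.0 * sum((sqrt(q_dens(x, lam, Delta, sig))
                  - sqrt(q_dens(x, lam0, Delta, sig0)))**2 for x in xs) * dx
lhs = (atom + cont) / Delta

# The limit h^2(lam f, lam0 f0) from the first assertion of Lemma 1.
npdf = lambda s: (lambda x: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi)))
rhs = 2.0 * np.sum((np.sqrt(lam * npdf(sig)(xs))
                    - np.sqrt(lam0 * npdf(sig0)(xs)))**2) * dx
print(lhs, rhs)   # nearly equal for small Delta
```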

Remark 1

The Hellinger process (here deterministic) of order \(\frac{1}{2}\) for continuous observations of X on an interval \([0,\,t]\) is given by (see Jacod and Shiryaev 2003, Sects. IV.3 and IV.4a)

$$\begin{aligned} h_t=\frac{t}{2}\int \left( \sqrt{\lambda f(x)}-\sqrt{\lambda _0f_0(x)}\right) ^2\mathrm {d}x= h_1t, \end{aligned}$$

from which it follows that \(h^2(\mathbb {R}^t_{\lambda ,f},\,\mathbb {R}^t_{\lambda _0,f_0})=2-2\exp (-h_t),\) whose derivative at \(t=0\) is the same as in (9) and thus equal to \(2h_1.\) For the Kullback–Leibler divergence and the discrepancy \(\mathrm {V}\) similar assertions hold. These observations have the following heuristic explanation: for \(\Delta \rightarrow 0,\) there is little difference between observing the whole path of X over the interval \([0,\,\Delta ]\) and observing only \(X_\Delta ,\) as the probability of \(\{N_\Delta \ge 2\}\) is small (of order \(\Delta ^2\)).
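The order claim is elementary, since \(1-(1+\lambda \Delta )e^{-\lambda \Delta }=(\lambda \Delta )^2/2+O(\Delta ^3);\) a two-line check (with an arbitrary \(\lambda \)):

```python
from math import exp

# P(N_Delta >= 2) = 1 - (1 + lam*Delta) * exp(-lam*Delta) ~ (lam*Delta)^2 / 2.
lam = 2.0
for Delta in (1e-1, 1e-2, 1e-3):
    p2 = 1.0 - (1.0 + lam * Delta) * exp(-lam * Delta)
    print(Delta, p2, p2 / (0.5 * (lam * Delta)**2))   # last ratio tends to 1
```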

In order to determine the posterior contraction rate in our problem, we now specify suitable neighbourhoods of \((\lambda _0,\,f_0).\) Let \(M>0\) be a constant and let \(\{\varepsilon _n\}\) be a sequence of positive numbers, such that \(\varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty .\) Let

$$\begin{aligned} h^\Delta \left( \mathbb {Q}_0,\,\mathbb {Q}_1\right) =\frac{1}{\sqrt{\Delta }}h\left( \mathbb {Q}_0,\,\mathbb {Q}_1\right) , \end{aligned}$$

be a rescaled Hellinger distance. Lemma 1 suggests that this is the right scaling to use. Introduce the complements of the Hellinger-type neighbourhoods of \((\lambda _0,\,f_0),\)

$$\begin{aligned} A\left( \varepsilon _n,\,M\right) =\left\{ (\lambda ,\,f) {\text {:}}\, h^\Delta \left( \mathbb {Q}_{\lambda _0,f_0}^{\Delta },\,\mathbb {Q}_{\lambda ,f}^{\Delta } \right) > M \varepsilon _n\right\} . \end{aligned}$$

We shall say that \(\varepsilon _n\) is a posterior contraction rate, if there exists a constant \(M>0,\) such that

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta }\right) \rightarrow 0, \end{aligned}$$

in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty .\) Our goal in this section is to determine the ‘fastest’ rate at which \(\varepsilon _n\) is allowed to tend to zero, while not violating (12).

We will assume that the observations are generated from a CPP that satisfies the following assumption.

Assumption 1

  (i)

    \(\lambda _0\) is in a compact set \([\underline{\lambda },\, \overline{\lambda }]\subset (0,\,\infty );\)

  (ii)

    The true density \(f_0\) is a location mixture of normal densities, i.e.,

    $$\begin{aligned} f_0(x)=f_{ H_0 , \sigma _0 } (x)=\int \phi _{\sigma _0 }(x-z)\mathrm {d}H_0(z), \end{aligned}$$

    for some fixed distribution \(H_0\) and a constant \(\sigma _0 \in [\underline{\sigma },\, \overline{\sigma }]\subset (0,\,\infty ).\) Furthermore, for some \(0<\kappa _0<\infty ,\, H_0[{-}\kappa _0,\,\kappa _0]=1,\) i.e., \(H_0\) has compact support.

The more general location-scale mixtures of normal densities,

$$\begin{aligned} f_0(x)=f_{ H_0 , K_0 } (x)=\iint \phi _{\sigma }(x-z)\mathrm {d}H_0(z)\mathrm {d}K_0(\sigma ), \end{aligned}$$

possess even better approximation properties than the location mixtures of normals (here \(H_0\) and \(K_0\) are distributions) and could also be considered in our setup. However, this would lead to additional technical complications, which would obscure the essential contributions of our work.

For obtaining posterior contraction rates we need to make some assumptions on the prior.

Assumption 2

  (i)

    The prior on \(\lambda ,\, \Pi _1,\) has a density \(\pi _1\) (with respect to the Lebesgue measure) that is supported on the finite interval \([\underline{\lambda },\,\overline{\lambda }]\subset (0,\,\infty )\) and is such that

    $$\begin{aligned} 0<\underline{\pi }_1 \le \pi _1(\lambda )\le \overline{\pi }_1<\infty , \quad \lambda \in [\underline{\lambda },\,\overline{\lambda }], \end{aligned}$$

    for some constants \(\underline{\pi }_1\) and \(\overline{\pi }_1;\)

  (ii)

    The base measure \(\alpha \) of the Dirichlet process prior \(D_{\alpha }\) has a continuous density on an interval \([{-}\kappa _0-\zeta ,\,\kappa _0+\zeta ],\) with \(\kappa _0\) as in Assumption 1(ii), for some \(\zeta >0,\) is bounded away from zero there, and for all \(t>0\) satisfies the tail condition

    $$\begin{aligned} \alpha (|z|>t)\lesssim e^{-b|t|^{\delta }}, \end{aligned}$$

    with some constants \(b>0\) and \(\delta >0;\)

  (iii)

    The prior on \(\sigma ,\,\Pi _3,\) is supported on the interval \([\underline{\sigma },\,\overline{\sigma }]\subset (0,\,\infty )\) and is such that its density \(\pi _3\) with respect to the Lebesgue measure satisfies

    $$\begin{aligned} 0<\underline{\pi }_3 \le \pi _3(\sigma )\le \overline{\pi }_3<\infty , \quad \sigma \in [\underline{\sigma },\,\overline{\sigma }], \end{aligned}$$

    for some constants \(\underline{\pi }_3\) and \(\overline{\pi }_3.\)

Assumptions 1 and 2 parallel those given in Ghosal and van der Vaart (2001) in the context of non-parametric Bayesian density estimation using the Dirichlet location mixture of normal densities as a prior. We refer to that paper for further discussion.

The following is our main result. Note that it covers both the case of high frequency observations (\(\Delta \rightarrow 0\)) and observations with fixed intersampling intervals. We use \(\Pi \) to denote the posterior on \((\lambda ,\,f).\)

Theorem 1

Under Assumptions 1 and 2, provided \(n\Delta \rightarrow \infty ,\) there exists a constant \(M>0,\) such that for

$$\begin{aligned} \varepsilon _n=\frac{\log ^{\kappa }(n\Delta )}{\sqrt{n\Delta }}, \quad \kappa =\max \left( \frac{2}{\delta },\,\frac{1}{2}\right) +\frac{1}{2}, \end{aligned}$$

we have

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) \Big | \mathcal {Z}_n^{\Delta }\right) \rightarrow 0, \end{aligned}$$

in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty .\)

For fixed \(\Delta \) (w.l.o.g. one may then take \(\Delta =1\)) the posterior contraction rate in Theorem 1 reduces to \(\varepsilon _n=\frac{\log ^{\kappa }(n)}{\sqrt{n}}.\) The contraction rate is controlled by the tail parameter \(\delta \) in (14): the stronger the decay in (14), the better the rate. Note that if (14) is satisfied for some \(\delta >4,\) it is automatically satisfied for all \(0<\delta \le 4,\) and that all \(\delta \ge 4\) give the same value \(\kappa =1.\) The best possible posterior contraction rate in Theorem 1 is therefore already attained at \(\delta =4,\) and in the proof in Sect. 5 we may assume without loss of generality that \(\delta \le 4.\)

As on p. 1239 in Ghosal and van der Vaart (2001), and similarly to Corollary 5.1 there, Theorem 1 implies the existence of a point estimate of \((\lambda _0,\,f_0)\) with the frequentist convergence rate \(\varepsilon _n.\) The (frequentist) minimax convergence rate for estimation of \(k_{\lambda ,f}^{\Delta }\) relative to the Hellinger distance is unknown in our problem, but an analogy with Ibragimov and Khas’minskiĭ (1982) suggests that up to a logarithmic factor it should be of order \((n\Delta )^{-1/2}\) (cf. Ghosal and van der Vaart 2001, p. 1236). The logarithmic factor is insignificant for all practical purposes. In a more general Lévy model than the CPP model, the convergence rate of an estimator of the Lévy density with loss measured in the \(L_2\)-metric is \((n\Delta )^{-\beta /(2\beta +1)},\) whenever the target density is Sobolev smooth of order \(\beta \) (cf. Comte and Genon-Catalot 2011). Our contraction rate is hence, roughly speaking, the limiting case of the rate in Comte and Genon-Catalot (2011) as \(\beta \rightarrow \infty .\)

Algorithms for drawing from the posterior

In this section we discuss computational methods for drawing from the distribution of the pair \((\lambda ,\, f),\) conditional on \(\mathcal {X}_n^\Delta \) (or equivalently: conditional on \(\mathcal {Z}_n^\Delta \)). In this section there is no need for the observation times to be equidistant. We assume observations at times \(0<t_1<\cdots <t_n\) and set \(\Delta _i =t_i-t_{i-1}\, (1\le i \le n).\) Further, for consistency with notation introduced shortly, we set \(z_i=X_{t_i}-X_{t_{i-1}}\) and \(z=(z_1,\ldots , z_n).\) We use “Bayesian notation” throughout and write p for a probability density or mass function and \(\pi \) similarly for a prior density or mass function.

In general, it is infeasible to generate independent realisations of the posterior distribution of \((\lambda ,\,f).\) To see this: from (4) one obtains that the conditional density of a nonzero increment z on a time interval of length \(\Delta \) is given by

$$\begin{aligned} p(z\mid \lambda ,\, f) = \frac{e^{-\lambda \Delta }}{1-e^{-\lambda \Delta }} \sum _{k=1}^\infty \frac{(\lambda \Delta )^k}{k!} f^{*k}(z), \end{aligned}$$

which generally is rather intractable due to the infinite weighted sum of convolutions. We specialise to the case where the jump size distribution is a mixture of \(J\ge 1\) Gaussians. The richness and versatility of the class of finite normal mixtures is convincingly demonstrated in Marron and Wand (1992).

Hence, we assume

$$\begin{aligned} f(\cdot )=\sum _{j=1}^J \rho _j \phi \left( \cdot ;\, \mu _j,\, 1/\tau \right) , \quad \sum _{j=1}^J \rho _j=1, \end{aligned}$$

where \(\phi (\cdot ;\, \mu ,\, \sigma ^2)\) denotes the density of a random variable with \(\mathcal {N}(\mu ,\, \sigma ^2)\) distribution. Note that in (16) we parametrise the density with the precision \(\tau .\) In the “simple” case \(J=2\) the convolution density of k independent jumps is given by

$$\begin{aligned} f^{*k}(\cdot ) = \sum _{\ell =0}^k {k \atopwithdelims ()\ell } \rho _1^\ell \rho _2^{k-\ell } \phi \left( \cdot ;\, \ell \mu _1 +(k-\ell ) \mu _2,\, k/\tau \right) . \end{aligned}$$

Plugging this expression into Eq. (15) confirms the intractable form of \(p(z\mid \lambda ,\, f).\)
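
To make the intractability concrete, the sum in (15) can at best be evaluated by truncation. The following sketch (in Python; the function names and the truncation level kmax are ours) evaluates the truncated density for a two-component mixture, using the convolution formula above:

```python
import math

def convolution_density(z, k, rho1, mu1, mu2, tau):
    """f^{*k}(z) for a two-component normal mixture, via the binomial formula."""
    rho2 = 1.0 - rho1
    total = 0.0
    for l in range(k + 1):
        mean = l * mu1 + (k - l) * mu2
        var = k / tau
        weight = math.comb(k, l) * rho1**l * rho2**(k - l)
        total += weight * math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

def p_nonzero_increment(z, lam, Delta, rho1, mu1, mu2, tau, kmax=40):
    """Truncated version of (15): conditional density of a nonzero increment."""
    prefactor = math.exp(-lam * Delta) / (1.0 - math.exp(-lam * Delta))
    total = 0.0
    for k in range(1, kmax + 1):
        total += (lam * Delta) ** k / math.factorial(k) \
                 * convolution_density(z, k, rho1, mu1, mu2, tau)
    return prefactor * total
```

Each evaluation already costs of the order \(k_{\max }^2\) normal-density evaluations for \(J=2,\) and for larger J the k-fold convolutions involve sums over all compositions of k into J parts; this is the computational burden that the auxiliary variables introduced below avoid.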

We will introduce auxiliary variables to circumvent the intractable form of the likelihood. In case the CPP is observed continuously, the problem is much easier as now the continuous time likelihood on an interval \([0,\,T]\) is known to be (Shreve 2008, Theorem 11.6.7)

$$\begin{aligned} \lambda ^{|V|} e^{-\lambda T} \prod _{i \in V} f\left( J_i\right) , \end{aligned}$$

where the \(T_i\) are the jump times of the CPP, \(J_i\) the corresponding jump sizes and \(V=\{i{\text {:}}\, T_i \le T\}.\) The tractability of the continuous time likelihood naturally suggests the construction of a data augmentation scheme. Denote the values of the CPP in between times \(t_{i-1}\) and \(t_i\) by \(x_{(i-1,i)}.\) We will refer to \(x_{(i-1,i)}\) as the missing values on the ith segment. Set

$$\begin{aligned} x^{mis} = \left\{ x_{(i-1,i)},\, 1\le i \le n\right\} . \end{aligned}$$

A data augmentation scheme now consists of augmenting auxiliary variables \(x^{mis}\) to \((\lambda ,\,f)\) and constructing a Markov chain that has \(p(x^{mis},\, \lambda ,\, f \mid z)\) as invariant distribution. More specifically, a standard implementation of this algorithm consists of the following steps:

  1. (1)

    Initialise \(x^{mis}.\)

  2. (2)

    Draw \((\lambda ,\, f) \mid (x^{mis},\, z).\)

  3. (3)

    Draw \(x^{mis} \mid (\lambda ,\, f,\, z).\)

  4. (4)

    Repeat steps 2 and 3 many times.

Under weak conditions, the iterates of \((\lambda ,\,f)\) are (dependent) draws from the posterior distribution. Step 3 entails generating compound Poisson bridges. By the Markov property, bridges on different segments can be drawn independently. Data augmentation has been used in many Bayesian computational problems; see, e.g., Tanner and Wong (1987). The outlined scheme can be applied to the problem at hand, but, as we explain shortly, imputation of complete CPP bridges (which is nontrivial) is unnecessary and we can do with less imputation, thereby effectively reducing the state space of the Markov chain.

As we assume that the jumps are drawn from a non-atomic distribution, imputation is only necessary on segments with nonzero increments. For this reason we let

$$\begin{aligned} \mathcal {I} =\left\{ i \in \{1,\ldots , n\}{\text {:}}\, z_i \ne 0\right\} , \end{aligned}$$

denote the set of observations with nonzero jump sizes and define the number of segments with nonzero jumps to be \(I =|\mathcal {I}|.\)

Auxiliary variables

Note that if \(Y \sim f\) with f as in (16), then Y can be simulated by first drawing its label L,  which equals j with probability \(\rho _j,\) and next drawing from the \(\mathcal {N}(\mu _{L},\, 1/\tau )\) distribution. Knowing the labels, sampling the jumps conditional on their sum being z is much easier than in the case with unknown labels. Adding auxiliary label variables is a standard trick for inference in mixture models (see, e.g., Diebolt and Robert 1994; Richardson and Green 1997). For the problem at hand, we can do with even less imputation: all we need to know is the number of jumps of each type on every segment with a nonzero jump size. For \(i \in \mathcal {I}\) and \(j \in \{1,\ldots , J\},\) let \(n_{ij}\) denote the number of jumps of type j on segment i. Denote the set of all auxiliary variables by \(\mathbf {a}=\{a_i, \, i \in \mathcal {I}\},\) where

$$\begin{aligned} a_i=\left( n_{i1},\,n_{i2}, \ldots , n_{iJ}\right) . \end{aligned}$$

We will use the following additional notation: for \(i=1,\ldots ,n\) and \(j=1,\ldots , J\) we set

$$\begin{aligned} n_i = \sum _{j=1}^J n_{ij}, \quad s_j = \sum _{i=1}^n n_{ij}, \quad s= \sum _{j=1}^J s_j. \end{aligned}$$

These are the number of jumps on the i-th segment, the total number of jumps of type j (summed over all segments) and the total number of jumps of all types, respectively.
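
In an implementation these counts are simple aggregations of the auxiliary variables; a minimal sketch (the dictionary layout of \(\mathbf {a},\) keyed by segment index, is our own choice):

```python
def jump_counts(a):
    """Given a = {segment index: (n_i1, ..., n_iJ)} for segments with nonzero
    increments, return the totals n_i per segment, s_j per jump type, and s."""
    J = len(next(iter(a.values())))
    n = {i: sum(counts) for i, counts in a.items()}
    s_type = [sum(counts[j] for counts in a.values()) for j in range(J)]
    return n, s_type, sum(s_type)
```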

Reparametrisation and prior specification

Instead of parametrising with \((\lambda ,\, \rho _1,\ldots , \rho _J),\) we define

$$\begin{aligned} \psi _j = \lambda \rho _j, \quad j=1,\ldots , J. \end{aligned}$$

Then
$$\begin{aligned} \lambda = \sum _{j=1}^J \psi _j,\quad \rho _j=\frac{\psi _j}{\sum _{j=1}^J \psi _j}. \end{aligned}$$

The background of this reparametrisation is the observation that a compound Poisson random variable Z whose jumps are of J types can be decomposed as \(Z=\sum _{j=1}^JZ_j,\) where the \(Z_j\) are independent, compound Poisson random variables whose jumps are of type j only, and where the parameter of the Poisson random variable is \(\psi _j.\) In what follows we use \(\theta =(\psi ,\,\mu ,\, \tau )\) with \(\psi =(\psi _1,\ldots , \psi _J)\) and \(\mu =(\mu _1, \ldots , \mu _J).\)
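
The decomposition can be checked exactly: labelling \(N\sim \mathcal {P}(\lambda )\) jumps independently with type probabilities \(\rho _j\) gives type counts whose joint law coincides with that of independent \(\mathcal {P}(\psi _j)\) random variables. A quick numerical verification of this identity for \(J=2\) (function names ours):

```python
import math

def pmf_thinning(n1, n2, lam, rho1):
    """P(N1 = n1, N2 = n2) when N ~ Poisson(lam) jumps are labelled
    type 1 with probability rho1, independently of each other."""
    n = n1 + n2
    return (math.exp(-lam) * lam**n / math.factorial(n)
            * math.comb(n, n1) * rho1**n1 * (1.0 - rho1)**n2)

def pmf_independent(n1, n2, psi1, psi2):
    """P(N1 = n1, N2 = n2) for independent Poisson(psi1) and Poisson(psi2)."""
    return (math.exp(-psi1) * psi1**n1 / math.factorial(n1)
            * math.exp(-psi2) * psi2**n2 / math.factorial(n2))
```

The two functions agree termwise whenever \(\psi _1=\lambda \rho _1\) and \(\psi _2=\lambda (1-\rho _1).\)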

Denote the Gamma distribution with shape parameter \(\alpha \) and rate \(\beta \) by \(\mathcal {G}(\alpha ,\,\beta ).\) We take priors

$$\begin{aligned} \psi _1,\ldots , \psi _J&\mathop {\sim }\limits ^{\text {iid}}\, \mathcal {G}\left( \alpha _0,\, \beta _0\right) , \\ \mu \mid \tau&\sim \, \mathcal {N}\left( \left[ \xi _1,\ldots , \xi _J\right] ^{\prime },\, I_{J\times J}(\tau \kappa )^{-1}\right) , \\ \tau&\sim \, \mathcal {G}\left( \alpha _1,\,\beta _1\right) , \end{aligned}$$

with positive hyperparameters \((\alpha _0,\, \beta _0,\, \alpha _1,\, \beta _1,\, \kappa )\) fixed.

Hierarchical model and data augmentation scheme

We construct a Metropolis–Hastings algorithm to draw from

$$\begin{aligned} p(\theta ,\, \mathbf {a}\mid z) =\frac{p(\theta ,\, z,\, \mathbf {a})}{p(z)}. \end{aligned}$$

For an index \(i\in \mathcal {I}\) we set \(\mathbf {a}_{-i} = \{ a_j,\, j \in \mathcal {I}\setminus \{i\}\}.\) The two main steps of the algorithm are:

  1. (i)

    Update segments for each segment \(i \in \mathcal {I},\) draw \(a_i\) conditional on \((\theta ,\, z,\, \mathbf {a}_{-i});\)

  2. (ii)

    Update parameters draw \(\theta \) conditional on \((z,\, \mathbf {a}).\)

Compared to the full data augmentation scheme discussed previously, the present approach is computationally much cheaper, as the amount of imputation scales with the number of segments that need imputation. If the time between observations is fixed and equal to \(\Delta ,\) then the expected number of segments requiring imputation equals \(n (1- e^{-\lambda \Delta }),\) which for small \(\Delta \) is approximately \(n\lambda \Delta .\)

Denote the Poisson distribution with mean \(\lambda \) by \(\mathcal {P}(\lambda ).\) Including the auxiliary variables, we can write the observation model as a hierarchical model

$$\begin{aligned} z_i \mid a_i, \,\mu , \,\tau&\mathop {\sim }\limits ^{\text {ind}} N\left( a_i^{\prime }\mu , \,n_i/\tau \right) ,\nonumber \\ n_{ij} \mid \psi&\mathop {\sim }\limits ^{\text {ind}} \mathcal {P}\left( \psi _j\Delta _i\right) , \nonumber \\ (\psi ,\, \mu ,\, \tau )&\sim \pi (\psi ,\, \mu , \,\tau ) \end{aligned}$$

(with \(i\in \{1,\ldots , n\}\) and \(j\in \{1,\ldots , J\}\)). This implies

$$\begin{aligned} p(\theta ,\, z,\, \mathbf {a})= \pi (\theta ) \times \prod _{i=1}^n\left( \phi \left( z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau \right) \prod _{j=1}^J e^{-\psi _j\Delta _i} \frac{(\psi _j\Delta _i)^{n_{ij}}}{n_{ij}!}\right) . \end{aligned}$$

Updating segments

Updating the ith segment requires drawing from

$$\begin{aligned} p\left( a_i \mid \theta ,\, z,\, \mathbf {a}_{-i}\right) \propto \phi \left( z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau \right) \prod _{j=1}^J \frac{(\psi _j\Delta _i)^{n_{ij}}}{n_{ij}!}. \end{aligned}$$

We do this with a Metropolis–Hastings step. First we draw a proposal \(n_i^\circ \) (for \(n_i\)) from a \(\mathcal {P}(\lambda \Delta _i)\) distribution, conditioned to have a nonzero outcome. Next, we draw

$$\begin{aligned} a_i^\circ = \left( n_{i1}^\circ ,\ldots , n^\circ _{iJ}\right) \sim \mathcal {MN}\left( n_i^\circ ; \,\psi _1/\lambda ,\ldots , \psi _J/\lambda \right) , \end{aligned}$$

where \(\mathcal {MN}\) denotes the multinomial distribution. Hence the proposal density equals

$$\begin{aligned} q\left( n_{i1}^\circ ,\ldots , n^\circ _{iJ} \mid \theta \right)&= \frac{e^{-\lambda \Delta _i}}{1-e^{-\lambda \Delta _i}} \frac{(\lambda \Delta _i)^{n_i^\circ }}{n_i^\circ !} {n^\circ _i \atopwithdelims ()n^\circ _{i1} \ldots n^\circ _{iJ}} \prod _{j=1}^J \left( \psi _j/\lambda \right) ^{n^\circ _{ij}}\\&=\frac{e^{-\lambda \Delta _i}}{1-e^{-\lambda \Delta _i}} \prod _{j=1}^J \frac{(\psi _j\Delta _i)^{n_{ij}^\circ }}{n^\circ _{ij}!}. \end{aligned}$$

The acceptance probability for the proposal \(a_i^\circ \) equals \(1\wedge A,\) with

$$\begin{aligned} A= \frac{\phi (z_i ;\, (a_i^\circ )^{\prime } \mu ,\, n^\circ _i/\tau )}{\phi (z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau ) }. \end{aligned}$$
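
Note that the Poisson factors in the target density cancel exactly against the proposal density, which is why only the ratio of normal densities remains in A. The update can be sketched as follows (the zero-truncated Poisson sampler and all function names are ours):

```python
import math
import random

def zero_truncated_poisson(rate, rng):
    """Inverse-CDF draw from Poisson(rate) conditioned on being >= 1."""
    u = rng.random() * (1.0 - math.exp(-rate))  # remove the mass of {0}
    k, term, cum = 0, math.exp(-rate), 0.0
    while True:
        k += 1
        term *= rate / k          # now equals e^{-rate} rate^k / k!
        cum += term
        if cum >= u or k > 1000:  # cap is a safety net only
            return k

def update_segment(zi, di, counts, psi, mu, tau, rng):
    """One Metropolis-Hastings update of a_i = (n_i1, ..., n_iJ)."""
    lam = sum(psi)
    n_prop = zero_truncated_poisson(lam * di, rng)
    proposal = [0] * len(psi)     # multinomial split over the J types
    for j in rng.choices(range(len(psi)), weights=psi, k=n_prop):
        proposal[j] += 1

    def log_phi(c):               # normal log-density, up to a constant
        n = sum(c)
        mean = sum(nij * muj for nij, muj in zip(c, mu))
        return -0.5 * math.log(n / tau) - (zi - mean) ** 2 * tau / (2 * n)

    log_A = log_phi(proposal) - log_phi(counts)
    if log_A >= 0.0 or rng.random() < math.exp(log_A):
        return proposal
    return counts
```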

Updating parameters

The proof of the following lemma is given in Appendix 1.

Lemma 2

Conditional on \(\mathbf {a},\, \psi _1,\ldots , \psi _J\) are independent and

$$\begin{aligned} \psi _j \mid \mathbf {a}\sim \mathcal {G}\left( \alpha _0+s_j,\, \beta _0+T\right) . \end{aligned}$$

Furthermore,
$$\begin{aligned} \begin{aligned}&\mu \mid \tau ,\, z,\, \mathbf {a}\sim \mathcal {N}\left( P^{-1}q,\, \tau ^{-1} P^{-1}\right) , \\&\quad \tau \mid z,\, \mathbf {a}\sim \mathcal {G}\left( \alpha _1+I/2,\, \beta _1+\left( R-q^{\prime } P^{-1} q\right) /2\right) , \end{aligned} \end{aligned}$$

where P is the symmetric \(J\times J\) matrix with elements

$$\begin{aligned} P= \kappa I_{J\times J} + \tilde{P}, \quad \tilde{P}_{j,k} = \sum _{i\in \mathcal {I}} n_i^{-1}n_{ij} n_{ik}, \quad j,\,k \in \{1,\ldots , J\}, \end{aligned}$$

q is the J-dimensional vector with

$$\begin{aligned} q_j=\kappa \xi _j +\sum _{i \in \mathcal {I}} n_i^{-1} n_{ij} z_i, \end{aligned}$$

\(R>0\) is given by

$$\begin{aligned} R= \kappa \sum _{j=1}^J \xi _j^2 + \sum _{i \in \mathcal {I}} n_i^{-1} z_i^2, \end{aligned}$$

and \(R-q^{\prime } P^{-1} q >0.\)
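
Lemma 2 translates directly into a conjugate Gibbs update of the parameters; a NumPy sketch (the data layout and the packing of the hyperparameters are our own; \(\psi \) is drawn first, then \(\tau ,\) then \(\mu \mid \tau \)):

```python
import numpy as np

def update_parameters(z, a, Delta, hyper, rng):
    """Draw (psi, mu, tau) conditional on (z, a) via the forms of Lemma 2.
    z, a: {segment index: increment} and {segment index: count vector};
    hyper = (alpha0, beta0, alpha1, beta1, kappa, xi)."""
    alpha0, beta0, alpha1, beta1, kappa, xi = hyper
    xi = np.asarray(xi, dtype=float)
    J, T = len(xi), float(sum(Delta))
    idx = sorted(a)                                  # segments with jumps
    counts = np.array([a[i] for i in idx], dtype=float)
    n_seg = counts.sum(axis=1)                       # n_i
    zvec = np.array([z[i] for i in idx], dtype=float)
    s = counts.sum(axis=0)                           # s_j
    # NumPy's gamma is parametrised by shape and scale, hence scale = 1/rate
    psi = rng.gamma(alpha0 + s, 1.0 / (beta0 + T))   # G(alpha0 + s_j, beta0 + T)
    P = kappa * np.eye(J) + (counts.T / n_seg) @ counts
    q = kappa * xi + counts.T @ (zvec / n_seg)
    R = kappa * np.sum(xi**2) + np.sum(zvec**2 / n_seg)
    Pinv_q = np.linalg.solve(P, q)
    tau = rng.gamma(alpha1 + len(idx) / 2, 1.0 / (beta1 + (R - q @ Pinv_q) / 2))
    mu = rng.multivariate_normal(Pinv_q, np.linalg.inv(P) / tau)
    return psi, mu, tau
```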

Remark 2

If for some \(j\in \{1,\ldots , J\}\) we have \(s_j=0\) (no jumps of type j), then the matrix \(\tilde{P}\) is singular. However, adding \(\kappa I_{J\times J}\) ensures invertibility of P.

Numerical illustrations

The first two examples concern mixtures of two normal distributions. We simulated \(n=5{,}000\) segments with \(\Delta =1,\, \mu _1=2,\, \mu _2={-}1\) and \(\tau =1.\) For the prior hyperparameters we took \(\alpha _0=\beta _0=\alpha _1=\beta _1=1,\, \xi _1=\xi _2=0\) and \(\kappa =1.\)

The results for \(\lambda \Delta =1,\, \rho _1=0.8,\, \rho _2=0.2\) and hence \(\psi _1=0.8\) and \(\psi _2=0.2\) are shown in Fig. 1. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 2. The average acceptance probability for updating the segments was 51%.

The results for \(\lambda \Delta =3,\, \rho _1=0.8,\, \rho _2=0.2\) and hence \(\psi _1=2.4\) and \(\psi _2=0.6\) are shown in Fig. 3. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 4. The average acceptance probability for updating the segments was 41%. Observe that the autocorrelation functions of the iterates of the \(\psi _j\) in the second case display a much slower decay.

We also assessed the performance of our method on a more complicated example where we took a mixture of four normals. Here \(\Delta =1,\, (\mu _1,\, \mu _2,\, \mu _3,\, \mu _4)=({-}1,\, 0,\, 0.8,\, 2),\, (\psi _1,\,\psi _2,\, \psi _3,\, \psi _4)=(0.3,\, 0.4,\, 0.2,\, 0.1)\) (hence \(\lambda =1\)) and \(\tau ^{-1}=0.09.\) The results obtained after simulating \(n=10{,}000\) segments are shown in Figs. 5 and 6.

Mixtures of normals need not be multimodal and can also yield skew densities. As an example, we consider the case where \((\mu _1,\,\mu _2)=(0,\,2),\,(\psi _1,\, \psi _2)=(1.5,\,0.5)\) (hence \(\lambda =2\)) and \(\tau =1.\) Data were generated and discretely sampled with \(\Delta =1\) and \(n=5{,}000\) segments. A plot of the posterior mean is shown in Fig. 7.

Fig. 1

Results for \(\lambda =1\) using 15,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 5,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only every fifth iterate was saved. The horizontal yellow lines are obtained by computing the posterior mean of \(\theta \) based on the true auxiliary variables on all segments

Fig. 2

Results for \(\lambda =1;\) the first 5,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the iterates after burn-in

Fig. 3

Results for \(\lambda =3\) using 25,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 10,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only every fifth iterate was saved. The horizontal yellow lines are obtained by computing the posterior mean of \(\theta \) based on the true auxiliary variables on all segments

Fig. 4

Results for \(\lambda =3;\) the first 10,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the iterates after burn-in

Fig. 5

Results for the example with a mixture of four normals using 100,000 MCMC iterations. The trace plots show all iterations; in the autocorrelation plot the first 20,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only every fifth iterate was saved. The horizontal yellow lines indicate true values. The results for the other parameters are similar and therefore not displayed

Fig. 6

Results for the example with a mixture of four normals; the first 20,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the iterates after burn-in

Fig. 7

Results for the example with a skew density; the first 20,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the iterates after burn-in


As can be seen from the autocorrelation plots, mixing of the chain deteriorates as \(\lambda \Delta \) increases. As the focus in this article is on high frequency data, where there are on average only a few jumps between observations, we do not go into detail on improving the algorithm. We remark that a non-centred parametrisation (see, for instance, Papaspiliopoulos et al. 2007) may give more satisfactory results when \(\lambda \Delta \) is large. A non-centred parametrisation can be obtained by changing the hierarchical model in (17). Denote by \(F^{-1}_\lambda \) the inverse cumulative distribution function of the \(\mathcal {P}(\lambda )\) distribution. Let \(u_{ij}\) (\(i=1,\ldots , n\) and \(j=1,\ldots , J\)) be a sequence of independent \(U(0,\,1)\) random variables and set \(u=\{u_{ij},\, i=1,\ldots , n,\, j=1,\ldots , J\}.\) By considering the hierarchical model

$$\begin{aligned} z_i \mid u,\, \mu ,\, \tau&\mathop {\sim }\limits ^{\text {ind}} N\left( \sum _{j=1}^J \mu _j F^{-1}_{\psi _j \Delta _i}\left( u_{ij}\right) ,\, \tau ^{-1}\sum _{j=1}^J F^{-1}_{\psi _j \Delta _i}\left( u_{ij}\right) \right) ,\nonumber \\ u_{ij}&\mathop {\sim }\limits ^{\text {iid}}U(0,\,1), \nonumber \\ (\psi ,\, \mu , \,\tau )&\sim \pi (\psi ,\, \mu , \,\tau ), \end{aligned}$$

(with \(i\in \{1,\ldots , n\}\) and \(j\in \{1,\ldots , J\}\)), \(\psi \) can be updated using a Metropolis–Hastings step. In this way \(\{n_{ij}\}\) and \(\psi \) are updated simultaneously.
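
A minimal sketch of the inverse distribution function underlying this non-centred scheme (function names ours); note that, given u and \(\psi ,\) the type counts are a deterministic function of the parameters, which is precisely what makes a joint update of \(\psi \) and \(\{n_{ij}\}\) possible:

```python
import math

def poisson_quantile(lam, u):
    """F^{-1}_lam(u): the smallest k with P(N <= k) >= u for N ~ Poisson(lam)."""
    k, term = 0, math.exp(-lam)   # pmf at 0
    cum = term
    while cum < u:
        k += 1
        term *= lam / k           # pmf at k
        cum += term
    return k

def latent_counts(u_row, psi, di):
    """Map the uniforms (u_i1, ..., u_iJ) of segment i to its type counts."""
    return [poisson_quantile(psij * di, uij) for psij, uij in zip(psi, u_row)]
```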

Another option is to integrate out \((\mu ,\,\tau )\) from \(p(\theta ,\,z,\,\mathbf {a}).\) In this model it is even possible to integrate out \(\psi \) as well; in that case only the auxiliary variables \(\mathbf {a}\) have to be updated. Yet another way to improve the efficiency of the algorithm is to use ideas from parallel tempering (cf. Brooks et al. 2011, Chap. 11).

Proof of Theorem 1

There are a number of general results in Bayesian nonparametric statistics, such as the fundamental Theorem 2.1 in Ghosal et al. (2000) and Theorem 2.1 in Ghosal and van der Vaart (2001), which allow determination of posterior contraction rates through checking certain conditions, but none of these results is easily and directly applicable in our case. The principal bottleneck is that a main assumption underlying these theorems is sampling from a fixed distribution, whereas in our high frequency setting the distributions vary with \(\Delta .\) Therefore, for clarity of exposition, in the proof of our main theorem we choose an alternative path, which consists in mimicking the main steps of the proof of Theorem 2.1, involving judiciously chosen statistical tests, as in Ghosal et al. (2000), while also employing some results on Dirichlet location mixtures of normal densities from Ghosal and van der Vaart (2001). However, a significant part of the technicalities we encounter is characteristic of the decompounding problem only.

Throughout this section we assume that Assumptions 1 and 2 hold. Furthermore, in view of the discussion that followed Theorem 1 we will without loss of generality assume that \(0<\delta \le 4.\) All the technical lemmas used in this section are collected in the appendices.

We start with the decomposition

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta }\right) =\Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta } \right) \phi _n+\Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta } \right) \left( 1-\phi _n\right) =:\mathrm {I}_n+\mathrm {II}_n, \end{aligned}$$

where \(0\le \phi _n\le 1\) is a sequence of tests based on observations \(\mathcal {Z}_n^{\Delta }\) and with properties to be specified below. The idea is to show that the terms on the right-hand side of the above display separately converge to zero in probability. The tests \(\phi _n\) allow one to control the behaviour of the likelihood ratio

$$\begin{aligned} \mathcal {L}_n^{\Delta }(\lambda ,\,f)=\prod _{i=1}^n \frac{k_{\lambda ,f}^{\Delta }(Z_i^{\Delta })}{k_{\lambda _0,f_0}^{\Delta }(Z_i^{\Delta })}, \end{aligned}$$

on the set where it is not well-behaved due to the fact that \((\lambda ,\,f)\) is ‘far away’ from \((\lambda _0,\,f_0).\)

Construction of tests

The next lemma is an adaptation of Theorem 7.1 from Ghosal et al. (2000) to decompounding. A proof is given in the appendix. We use the notation \(D(\varepsilon ,\,A,\,d)\) to denote the \(\varepsilon \)-packing number of a set A in a metric space with metric d,  applied in our case with d the scaled Hellinger metric \(h^\Delta .\)

Lemma 3

Let \(\mathcal {Q}\) be an arbitrary set of probability measures \(\mathbb {Q}^\Delta _{\lambda ,f}.\) Suppose for some non-increasing function \(D(\varepsilon ),\) some sequence \(\{\varepsilon _n\}\) of positive numbers and every \(\varepsilon >\varepsilon _n,\)

$$\begin{aligned} D\left( \frac{\varepsilon }{2},\,\left\{ \mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}{\text {:}}\,\varepsilon \le h^\Delta \left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f}\right) \le 2\varepsilon \right\} ,\,h^\Delta \right) \le D(\varepsilon ). \end{aligned}$$

Then for every \(\varepsilon >\varepsilon _n\) there exists a sequence of tests \(\{\phi _n\}\) (depending on \(\varepsilon >0),\) such that

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le D(\varepsilon )\exp \left( {-}Kn\Delta \varepsilon ^2\right) \frac{1}{1-\exp ({-}Kn\Delta \varepsilon ^2)},\\&\sup _{\{\mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}{\text {:}}\,h^\Delta (\mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f})>\varepsilon \}}\mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}Kn\Delta \varepsilon ^2 \right) , \end{aligned} \end{aligned}$$

where \(K>0\) is a universal constant.

In the proofs of Propositions 1 and 2 we need the inequalities below. There exists a constant \(\overline{C}\in (0,\,\infty )\) depending on \(\underline{\lambda }\) and \(\overline{\lambda }\) only, such that for all \(\lambda _1,\,\lambda _2\in [\underline{\lambda },\,\overline{\lambda }]\) and \(f_1,\,f_2\) it holds that

$$\begin{aligned} \mathrm {K}\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\Delta \left( \mathrm {K}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\left| \lambda _1-\lambda _2\right| ^2\right) , \end{aligned}$$
$$\begin{aligned} \mathrm {V}\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\Delta \left( \mathrm {V}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\mathrm {K}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\left| \lambda _1-\lambda _2\right| ^2\right) , \end{aligned}$$
$$\begin{aligned} h\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\sqrt{\Delta }\left( \left| \lambda _1-\lambda _2\right| + h\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) \right) . \end{aligned}$$

These inequalities can be proven in the same way as Lemma 1 in Gugushvili et al. (2015).

Let \(\varepsilon _n\) be as in Theorem 1. Throughout, \(\overline{C}\) denotes the above constant. For a constant \(L>0\) define the sequences \(\{a_n\}\) and \(\{\eta _n\}\) by

$$\begin{aligned} a_n=L\log ^{2/\delta }\left( \frac{1}{\eta _n} \right) , \quad \eta _n=\frac{\varepsilon _n}{4\overline{C}}. \end{aligned}$$

We will show that inequality (24) holds true for every \(\varepsilon =M\varepsilon _n\) with \(M>2\) and the set of measures \(\mathcal {Q}\) equal to

$$\begin{aligned} \mathcal {Q}_n=\left\{ \mathbb { Q}_{\lambda ,f_{H,\sigma }}^{\Delta }{\text {:}}\, \lambda \in [\underline{\lambda },\, \overline{\lambda }],\, H\left[ {-}a_n,\,a_n\right] \ge 1-\eta _n,\, \sigma \in [\underline{\sigma },\,\overline{\sigma }] \right\} . \end{aligned}$$

As a first step, note that we have

$$\begin{aligned} \log D\left( \frac{\varepsilon }{2},\,\mathcal {Q}_n,\,h^\Delta \right)&\le \log D\left( {\varepsilon _n},\,\mathcal {Q}_n,\,h^\Delta \right) \nonumber \\&\le \log N\left( \frac{\varepsilon _n}{2},\,\mathcal {Q}_n,\,h^\Delta \right) = \log N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h \right) , \end{aligned}$$

where \(N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h\right) \) is the covering number of the set \(\mathcal {Q}_n\) with h-balls of size \(\varepsilon _n\sqrt{\Delta }/2.\) The first inequality in (28) follows from the assumption \(M>2.\) For bounding the right-hand side of (28), we have the following proposition.

Proposition 1

We have

$$\begin{aligned} \log N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h\right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$

Proof

Define

$$\begin{aligned} \mathcal {F}_n = \left\{ f_{H,\sigma }{\text {:}}\, H\left[ {-}a_n,\,a_n\right] \ge 1-\eta _n,\, \sigma \in [\underline{\sigma },\, \overline{\sigma }]\right\} . \end{aligned}$$

Let \(\{\lambda _i\}\) be centres of the balls from a minimal covering of \([\underline{\lambda },\, \overline{\lambda }]\) with \(|\cdot |\)-balls of size \(\eta _n.\) Let \(\{f_j\}\) be centres of the balls from a minimal covering of \(\mathcal {F}_n\) with h-balls of size \(\eta _n.\) For any \(\mathbb { Q}_{\lambda , f_{H,\sigma }} \in \mathcal {Q }_n,\) by (27) we have

$$\begin{aligned} h\left( \mathbb { Q}_{\lambda , f_{H,\sigma }} ,\, \mathbb { Q}_{\lambda _i, f_j}\right) \le \frac{\varepsilon _n\sqrt{\Delta }}{2}, \end{aligned}$$

by appropriate choices of i and j. It follows that

$$\begin{aligned} \log N \left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,{h}\right) \le \log N \left( \eta _n,\,[\underline{\lambda },\,\overline{\lambda }],\,|\cdot |\right) + \log N \left( \eta _n ,\, {\mathcal {F}}_n ,\, {h}\right) . \end{aligned}$$

Clearly,

$$\begin{aligned} \log N\left( \eta _n,\,[\underline{\lambda },\,\overline{\lambda }],\,|\cdot |\right) \lesssim \log \left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$

As we assume \(\delta \le 4,\) we can apply the arguments in Ghosal and van der Vaart (2001, pp. 1251–1252); see in particular formulae (5.8)–(5.10) there (cf. also Theorem 3.1 and Lemma A.3). These yield

$$\begin{aligned} \log N\left( \eta _n ,\, {\mathcal {F}}_n ,\, {h}\right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$

Combination of the above three inequalities implies the statement of the proposition. \(\square \)

An application of Proposition 1 to (28) gives

$$\begin{aligned} \log D\left( \frac{\varepsilon }{2},\,\mathcal {Q}_n,\,h^\Delta \right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) \le c_1 n\Delta \varepsilon _n^2, \end{aligned}$$

for some positive constant \(c_1.\) Here, the final inequality follows from our choice for \(\varepsilon _n.\) Hence, (24) is satisfied for

$$\begin{aligned} D(\varepsilon )=\exp \left( c_1 n \Delta \varepsilon _n^2\right) . \end{aligned}$$

By Lemma 3 there exist tests \(\phi _n\) such that for all n large enough

$$\begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le 2 \exp \left( -\left( KM^2-c_1\right) n\Delta \varepsilon _n^2 \right) , \end{aligned}$$
$$\begin{aligned}&\sup _{\{\mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}_n{\text {:}}\,h^\Delta (\mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f})>\varepsilon \}}\mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}Kn\Delta M^2\varepsilon _n^2 \right) . \end{aligned}$$

Bound on \(\mathrm {I}_n\) in (23)

First note that by Eq. (30)

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {I}_n \right] \le \mathbb {E}_{\lambda _0,f_0}\left[ \phi _n \right] \le 2 \exp \left( -\left( KM^2-c_1\right) n\Delta \varepsilon _n^2 \right) . \end{aligned}$$

Chebyshev’s inequality implies that \(\mathrm {I}_n\) converges to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty ,\) as soon as M is chosen so large that \(KM^2-c_1>0.\,\square \)

Bound on \(\mathrm {II}_n\)

Now we consider \(\mathrm {II}_n.\) We have

$$\begin{aligned} \mathrm {II}_n=\frac{ \iint _{A(\varepsilon _n,M)} \mathcal {L}_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) (1-\phi _n) }{ \iint \mathcal {L}_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }=:\frac{\mathrm {III}_n}{\mathrm {IV}_n}. \end{aligned}$$

We will show that the numerator \(\mathrm {III}_n\) goes exponentially fast to zero, in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability, while the denominator \(\mathrm {IV}_n\) is bounded from below by an exponential function, with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one, in such a way that the ratio of \(\mathrm {III}_n\) and \(\mathrm {IV}_n\) still goes to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability.

Bounding \(\mathrm {III}_n\)

As \(1_{\{A(\varepsilon _n,M)\}} \le 1_{\mathcal {Q}_n^c}+1_{\{A(\varepsilon _n,M)\cap \mathcal {Q}_n\}}\) we have

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {III}_n\right] \le \Pi \left( \mathcal {Q}_n^c\right) +\iint _{ \mathcal {Q}_n \cap A(\varepsilon _n,M) } \mathbb {E}_{\lambda ,f} \left[ 1-\phi _n\right] \mathrm {d}\Pi _1(\lambda )\mathrm {d}\Pi _2(f). \end{aligned}$$

Here we applied Fubini’s theorem to obtain the second term on the right-hand side, which by (31) is bounded by \(\exp (-KM^2n\Delta \varepsilon _n^2).\) Furthermore,

$$\begin{aligned} \Pi \left( \mathcal {Q}_n^c\right) =\Pi _2\left( H\left[ {-}a_n,\,a_n\right] <1-\eta _n,\,\sigma \in [\underline{\sigma },\,\overline{\sigma }]\right) \lesssim \frac{1}{\eta _n}e^{-ba_n^{\delta }}, \end{aligned}$$

where the last inequality is formula (5.11) in Ghosal and van der Vaart (2001). Hence

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {III}_n\right] \lesssim \frac{1}{\eta _n}e^{-ba_n^{\delta }} + \exp \left( -KM^2n\Delta \varepsilon _n^2\right) . \end{aligned}$$

Bounding \(\mathrm {IV}_n\)

Recall \(\mathrm {K}^\Delta =\mathrm {K}/\Delta \) and \(\mathrm {V}^\Delta =\mathrm {V}/\Delta .\) Let

$$\begin{aligned} B^{\Delta }\left( \varepsilon ,\,\left( \lambda _0,\,f_0\right) \right) =\left\{ (\lambda ,\,f){\text {:}}\, K^\Delta \left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\, \mathbb {Q}^{\Delta }_{\lambda ,f}\right) \le \varepsilon ^2,\, V^\Delta \left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\, \mathbb {Q}^{\Delta }_{\lambda ,f}\right) \le \varepsilon ^2 \right\} , \end{aligned}$$
and set


$$\begin{aligned} \widetilde{\varepsilon }_n=\frac{\log (n\Delta )}{\sqrt{n\Delta }}. \end{aligned}$$

Note that \(n\Delta \widetilde{\varepsilon }_n^2\rightarrow \infty \) when \(n\rightarrow \infty .\)

We will use the following bound, an adaptation of Lemma 8.1 in Ghosal et al. (2000) to our setting, valid for every \(\varepsilon >0\) and \(C>0,\)

$$\begin{aligned} \mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\left( \iint _{ B^{\Delta }(\varepsilon ,(\lambda _0,f_0)) } \mathcal {L}_n^{\Delta }(\lambda ,\,f)\mathrm {d}\widetilde{\Pi }(\lambda ,\,f) \le \exp \left( -(1+C)n\Delta \varepsilon ^2\right) \right) \le \frac{1}{C^2n\Delta \varepsilon ^2}, \end{aligned}$$
where


$$\begin{aligned} \widetilde{\Pi }(\cdot )=\frac{\Pi (\cdot )}{\Pi ( B^{\Delta }(\varepsilon ,\,(\lambda _0,\,f_0)) )}, \end{aligned}$$

is a normalised restriction of \(\Pi (\cdot )\) to \(B^{\Delta }(\varepsilon ,\,(\lambda _0,\,f_0)).\)

By virtue of (33), with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one, for any constant \(C>0\) we have

$$\begin{aligned} \mathrm {IV}_n&\ge \iint _{B^{\Delta }(\widetilde{\varepsilon }_n,(\lambda _0,f_0))} \mathcal {L}^\Delta _n(\lambda ,\,f)\mathrm {d}\Pi _1(\lambda )\,\mathrm {d}\Pi _2(f)\nonumber \\&>\Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \exp \left( -(1+C)n\Delta \widetilde{\varepsilon }_n^2\right) . \end{aligned}$$

We now derive a lower bound for the prior probability on the right-hand side of this inequality.

Proposition 2

It holds that

$$\begin{aligned} \Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \gtrsim \exp \left( - \bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) , \end{aligned}$$

for some constant \(\bar{c}.\)

Proof


Let \(0<c\le 1/\sqrt{5\overline{C}}\) be a constant. Here \(\overline{C}\) is the constant in (25) and (26). By these inequalities it is readily seen that

$$\begin{aligned} \left\{ (\lambda ,\,f){\text {:}}\,K\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2 \widetilde{\varepsilon }_n^2,\,V\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2 \widetilde{\varepsilon }_n^2 ,\,\left| \lambda _0-\lambda \right| ^2 \le c^2 \widetilde{\varepsilon }_n^2\right\} \subset B^{\Delta }\left( \widetilde{\varepsilon }_n,\, \left( \lambda _0,\,f_0\right) \right) . \end{aligned}$$

It then follows by the independence assumption on \(\Pi _1\) and \(\Pi _2\) that

$$\begin{aligned} \Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right)&\ge \Pi _1\left( \left| \lambda _0-\lambda \right| \le c \widetilde{\varepsilon }_n \right) \\&\quad \times \Pi _2 \left( f {\text {:}}\, \mathrm {K}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2{\widetilde{\varepsilon }_n^2},\, \mathrm {V}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2{\widetilde{\varepsilon }_n^2} \right) . \end{aligned}$$

For the first factor on the right-hand side we have by (13) that

$$\begin{aligned} \Pi _1\left( \left| \lambda _0-\lambda \right| \le c \widetilde{\varepsilon }_n \right) \gtrsim \widetilde{\varepsilon }_n. \end{aligned}$$

As far as the second factor is concerned, for some constants \(\overline{c}_1,\,\overline{c}_2\) it is bounded from below by

$$\begin{aligned} \overline{c}_1 \exp \left( {-} \overline{c}_2 \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) , \end{aligned}$$

by the same arguments as in inequality (5.17) in Ghosal and van der Vaart (2001). The result now follows by combining the two lower bounds.

Combining (34) with Proposition 2, with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one as \(n\rightarrow \infty ,\) for any constant \(C>0\) we have

$$\begin{aligned} \mathrm {IV}_n&> \exp \left( -(1+C)n\Delta \widetilde{\varepsilon }_n^2-\bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) . \end{aligned}$$

We now carry out the final steps of the proof that \(\mathrm {II}_n\) tends to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability. Let \(G_n\) denote the set on which inequality (35) holds. Then by (32) we obtain

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {II}_n 1_{G_n} \right]&\lesssim \exp \left( (1+C)n\Delta \widetilde{\varepsilon }_n^2+\bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) \\&\quad \times \left[ \frac{1}{\eta _n}e^{-ba_n^{\delta }} + \exp \left( -KM^2n\Delta \varepsilon _n^2\right) \right] . \end{aligned}$$

Recall that \(n\Delta \widetilde{\varepsilon }_n^2=\log ^2(n\Delta ).\) Hence, the exponent in the first factor of this display is of order \(\log ^2 (n\Delta ).\) Furthermore \(a_n^\delta =L^\delta \log ^2({4\overline{C}}/{\varepsilon _n}),\) which is of order \(\log ^2(n\Delta )\) as well. It follows that, provided the constants L and M are chosen large enough, the right-hand side of the above display converges to zero as \(n\rightarrow \infty .\) Chebyshev’s inequality then implies that \(\mathrm {II}_n\) converges to zero in probability as \(n\rightarrow \infty .\) This completes the proof of Theorem 1. \(\square \)


  1. Alexandersson H (1985) A simple stochastic model of the precipitation process. J Clim Appl Meteorol 24(12):1282–1295

  2. Belomestny D, Comte F, Genon-Catalot V, Masuda H, Reiß M (eds) (2015) Lévy matters IV, estimation for discretely observed Lévy processes. Lecture notes in mathematics, vol 2128. Springer, Cham

  3. Birgé L (1984) Sur un théorème de minimax et son application aux tests. Probab Math Stat 3:259–282

  4. Brooks S, Gelman A, Jones GL, Meng XL (2011) Handbook of Markov chain Monte Carlo. Chapman and Hall/CRC, Hoboken

  5. Buchmann B, Grübel R (2003) Decompounding: an estimation problem for Poisson random sums. Ann Stat 31:1054–1074

  6. Buchmann B, Grübel R (2004) Decompounding Poisson random sums: recursively truncated estimates in the discrete case. Ann Inst Stat Math 56:743–756

  7. Burlando P, Rosso R (1993) Stochastic models of temporal rainfall: reproducibility, estimation and prediction of extreme events. In: Marco JB, Harboe R, Salas JD (eds) Stochastic hydrology and its use in water resources systems, simulation and optimization. NATO ASI series, vol 237. Springer, p 137–173

  8. Comte F, Genon-Catalot V (2010) Non-parametric estimation for pure jump irregularly sampled or noisy Lévy processes. Stat Neerl 64(3):290–313

  9. Comte F, Genon-Catalot V (2010) Nonparametric adaptive estimation for pure jump Lévy processes. Ann Inst Henri Poincaré Probab Stat 46(3):595–617

  10. Comte F, Genon-Catalot V (2015) Adaptive estimation for Lévy processes. In: Belomestny D, Comte F, Genon-Catalot V, Masuda H, Reiß M (eds) Lévy matters IV, estimation for discretely observed Lévy processes. Lecture notes in mathematics, vol 2128. Springer, Cham, p 77–177

  11. Comte F, Duval C, Genon-Catalot V (2014) Nonparametric density estimation in compound Poisson process using convolution power estimators. Metrika 77:163–183

  12. Comte F, Genon-Catalot V (2011) Estimation for Lévy processes from high frequency data within a long time interval. Ann Stat 39:803–837

  13. Diebolt J, Robert CP (1994) Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc B 56:363–375

  14. Duval C (2013) Density estimation for compound Poisson processes from discrete data. Stoch Process Appl 123:3963–3986

  15. Embrechts P, Klüppelberg C, Mikosch T (1997) Modelling extremal events for insurance and finance. Applications of mathematics (New York), vol 33. Springer, Berlin

  16. van Es B, Gugushvili S, Spreij P (2007) A kernel type nonparametric density estimator for decompounding. Bernoulli 13:672–694

  17. Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–230

  18. Ferguson TS (1983) Bayesian density estimation by mixtures of normal distributions. In: Recent advances in statistics. Academic, New York, p 287–302

  19. Figueroa-López JE (2008) Small-time moment asymptotics for Lévy processes. Stat Probab Lett 78(18):3355–3365

  20. Figueroa-López JE (2009) Nonparametric estimation of Lévy models based on discrete-sampling. In: Optimality. IMS lecture notes monograph series, vol 57. Institute of Mathematical Statistics, Beachwood, p 117–146

  21. Ghosal S (2010) The Dirichlet process, related priors and posterior asymptotics. In: Bayesian nonparametrics. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge, p 35–79

  22. Ghosal S, Ghosh JK, van der Vaart AW (2000) Convergence rates of posterior distributions. Ann Stat 28:500–531

  23. Ghosal S, Tang Y (2006) Bayesian consistency for Markov processes. Sankhyā 68:227–239

  24. Ghosal S, van der Vaart AW (2001) Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann Stat 29:1233–1263

  25. Ghosal S, van der Vaart AW (2007) Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann Stat 35:697–723

  26. Gugushvili S, van der Meulen F, Spreij P (2015) Non-parametric Bayesian inference for multi-dimensional compound Poisson processes. Mod Stoch Theory Appl 2:1–15

  27. Hjort NL, Holmes C, Müller P, Walker SG (2010) Bayesian nonparametrics. Cambridge series in statistical and probabilistic mathematics, vol 28. Cambridge University Press, Cambridge

  28. Ibragimov IA, Khas’minskiĭ RZ (1982) An estimate of the density of a distribution belonging to a class of entire functions (Russian). Teor Veroyatnost i Primenen 27:514–524

  29. Insua DR, Ruggeri F, Wiper MP (2012) Bayesian analysis of stochastic process models. Wiley, Chichester

  30. Jacod J, Shiryaev AN (2003) Limit theorems for stochastic processes, 2nd edn. Grundlehren der mathematischen Wissenschaften, vol 288. Springer, Berlin

  31. Katz RW (2002) Stochastic modeling of hurricane damage. J Appl Meteorol 41(7):754–762

  32. Le Cam LM (1986) Asymptotic methods in statistical decision theory. Springer, New York

  33. Lo AY (1984) On a class of Bayesian nonparametric estimates: I. Density estimates. Ann Stat 12:351–357

  34. McLachlan G, Peel D (2000) Finite mixture models. Wiley series in probability and statistics: applied probability and statistics. Wiley-Interscience, New York

  35. Marron JS, Wand MP (1992) Exact mean integrated squared error. Ann Stat 20(2):712–736

  36. Nickl R, Reiß M (2012) A Donsker theorem for Lévy measures. J Funct Anal 263(10):3306–3332

  37. Nickl R, Reiß M, Söhl J, Trabs M (2016) High-frequency Donsker theorems for Lévy measures. Probab Theory Relat Fields 164(1):61–108

  38. Papaspiliopoulos O, Roberts GO, Sköld M (2007) A general framework for the parametrization of hierarchical models. Stat Sci 22(1):59–73

  39. Prabhu NU (1998) Stochastic storage processes. Queues, insurance risk, dams, and data communication, 2nd edn. Applications of mathematics (New York), vol 15. Springer, New York

  40. Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc B 59:731–792

  41. Scalas E (2006) The application of continuous time random walks in finance and economics. Physica A 362(2):225–239

  42. Shreve SE (2008) Stochastic calculus for finance II, 2nd edn. Springer, New York

  43. Skorohod AV (1964) Random processes with independent increments. Izdat. “Nauka”, Moscow

  44. Tang Y, Ghosal S (2007) Posterior consistency of Dirichlet mixtures for estimating a transition density. J Stat Plan Inference 137:1711–1726

  45. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. J Am Stat Assoc 82:528–540

  46. Ueltzhöfer FAJ, Klüppelberg C (2011) An oracle inequality for penalised projection estimation of Lévy densities from high frequency observations. J Nonparametr Stat 23(4):967–989

Acknowledgements


We wish to thank Wikash Sewlal from Delft University of Technology for the simulation results of the example with a mixture of four normals and the skewed density. The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.

Author information



Corresponding author

Correspondence to Frank van der Meulen.

Additional lemmas and proofs


Proof of Lemma 1

We give a detailed proof of equality (9). As we are interested in small values of \(\Delta ,\) we make some necessary approximations. The starting point is the expansion for the ‘density’ of \(\mathbb {Q}^\Delta _{\lambda ,f}\) with respect to the Lebesgue measure,

$$\begin{aligned} e^{-\lambda \Delta }\delta _0(x)+\left( 1-e^{-\lambda \Delta }\right) \sum _{m=1}^{\infty } a_m(\lambda \Delta ) f^{*m}(x), \end{aligned}$$

see (4), with coefficients \(a_m\) defined in (5). It follows that we have the likelihood ratio

$$\begin{aligned} \frac{\mathrm {d}\mathbb {Q}^\Delta _{\lambda ,f}}{\mathrm {d}\mathbb {Q}^\Delta _{\lambda _0,f_0}}(x)&= \mathbf {1}_{x=0}e^{-(\lambda -\lambda _0)\Delta }+\mathbf {1}_{x\ne 0}\frac{(1-e^{-\lambda \Delta })\sum _{m=1}^{\infty } a_m(\lambda \Delta ) f^{*m}(x)}{(1-e^{-\lambda _0\Delta })\sum _{m=1}^{\infty } a_m(\lambda _0\Delta ) f_0^{*m}(x)} \\&= e^{-(\lambda -\lambda _0)\Delta }\left( \mathbf {1}_{x=0}+\mathbf {1}_{x\ne 0}\frac{\lambda f(x)}{\lambda _0 f_0(x)}+o(\Delta )\right) , \end{aligned}$$

where we collected terms of order \(\Delta ^m\) for \(m\ge 2\) as \(o(\Delta ).\) Hence we get for the Hellinger affinity

$$\begin{aligned} H\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right) = \int \sqrt{ \mathrm {d}\mathbb {Q}^\Delta _{\lambda ,f} \mathrm {d}\mathbb {Q}^\Delta _{\lambda _0,f_0} }, \end{aligned}$$

the approximating expression

$$\begin{aligned} H\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right) = e^{-(\lambda +\lambda _0)\Delta /2}\left( 1+\Delta \sqrt{\lambda _0\lambda }H\left( f,\,f_0\right) +o(\Delta )\right) . \end{aligned}$$

It follows that for \(\Delta \rightarrow 0,\)

$$\begin{aligned} h^2\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= 2-2H\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right) \\&= 2-2e^{-(\lambda +\lambda _0)\Delta /2}\left( 1+\Delta \sqrt{\lambda _0\lambda }H\left( f,\,f_0\right) +o(\Delta )\right) \\&= 2\left( 1-e^{-(\lambda +\lambda _0)\Delta /2}\right) -2e^{-(\lambda +\lambda _0)\Delta /2}\left( \Delta \sqrt{\lambda _0\lambda }H\left( f,\,f_0\right) +o(\Delta )\right) . \end{aligned}$$

Hence, for \(\Delta \rightarrow 0,\)

$$\begin{aligned} \frac{1}{\Delta }h^2\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&\rightarrow \lambda +\lambda _0-2\sqrt{\lambda _0\lambda }H\left( f,\,f_0\right) \\&= \int \left( \sqrt{\lambda f(x)}-\sqrt{\lambda _0f_0(x)}\right) ^2\mathrm {d}x. \end{aligned}$$

Equality (9) follows. The proofs of the equalities (10) and (11) follow a similar line of reasoning.

Proof of Lemma 3

The proof is an adaptation of Theorem 7.1 from Ghosal et al. (2000) to decompounding. In all that follows it is assumed that \(\mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q},\) but we suppress this assumption in the notation. Observe that

$$\begin{aligned}&D\left( \frac{\varepsilon }{2},\,\left\{ \mathbb {Q}^\Delta _{\lambda ,f}{\text {:}}\,\varepsilon \le h^\Delta \left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f}\right) \le 2\varepsilon \right\} ,\,h^\Delta \right) \\&\quad =D\left( \frac{\varepsilon \sqrt{\Delta }}{2},\,\left\{ \mathbb {Q}^\Delta _{\lambda ,f}{\text {:}}\,\varepsilon \sqrt{\Delta }\le h\left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f}\right) \le 2\varepsilon \sqrt{\Delta }\right\} ,\,h \right) . \end{aligned}$$

From this point on the arguments from the proof of Theorem 7.1 in Ghosal et al. (2000) are applicable (with \(\varepsilon \) replaced by \(\varepsilon \sqrt{\Delta }\)) and eventually lead to the desired result. The role of formulae (7.1)–(7.2) in that proof is played in the present context by (36) and (37) below.

For a given \((\lambda _1,\,f_1)\) there exists a sequence of tests \(\phi _n\) based on \(\mathcal {Z}_n^{\Delta },\) such that

$$\begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le \exp \left( {-}\frac{1}{2}n\Delta h^\Delta \left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda _1,f_1}\right) ^2 \right) , \end{aligned}$$
$$\begin{aligned}&\sup _{ h^\Delta (\mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _1,f_1}) < h^\Delta (\mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda _1,f_1}) } \mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}\frac{1}{2}n\Delta h^\Delta \left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda _1,f_1}\right) ^2 \right) . \end{aligned}$$

These two inequalities simply follow by rewriting the inequalities

$$\begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le \exp \left( {-}\frac{1}{2}nh^2\left( \mathbb {Q}_{\lambda _0,f_0}^{\Delta },\,\mathbb {Q}_{\lambda _1,f_1}^{\Delta }\right) \right) ,\\&\sup _{ h(\mathbb {Q}_{\lambda ,f}^{\Delta },\,\mathbb {Q}_{\lambda _1,f_1}^{\Delta }) < h(\mathbb {Q}_{\lambda _0,f_0}^{\Delta },\,\mathbb {Q}_{\lambda _1,f_1}^{\Delta }) } \mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}\frac{1}{2}n h^2\left( \mathbb {Q}_{\lambda _0,f_0}^{\Delta },\,\mathbb {Q}_{\lambda _1,f_1}^{\Delta }\right) \right) , \end{aligned}$$

which are proved in Ghosal et al. (2000, pp. 520–521) and rely upon the results in Birgé (1984) and Le Cam (1986).

Proof of Lemma 2

As the priors for \(\psi _1,\ldots , \psi _J\) are independent, we obtain that

$$\begin{aligned} p(\psi \mid \mu , \,\tau ,\, z,\, \mathbf {a})&= p(\psi \mid \mathbf {a}) \propto \prod _{j=1}^J \left( e^{-\psi _j T} \psi _j^{s_j} \pi \left( \psi _j\right) \right) \\&\propto \prod _{j=1}^J \left( e^{-\psi _j (T+\beta _0)} \psi _j^{s_j+\alpha _0-1} \right) , \end{aligned}$$

which proves the first statement of the lemma.

For \((\mu ,\, \tau )\) we get

$$\begin{aligned} p(\mu ,\, \tau \mid z,\, \mathbf {a})&\propto \prod _{i \in \mathcal {I}} \phi \left( z_i ;\, a_i^{\prime }\mu ,\, n_i/\tau \right) \\&\quad \times \tau ^{\alpha _1-1} e^{-\beta _1\tau } \tau ^{J/2} \exp \left( {-}\frac{\tau \kappa }{2} \sum _{j=1}^J \left( \mu _j-\xi _j\right) ^2\right) . \end{aligned}$$

This is proportional to

$$\begin{aligned} \tau ^{\alpha _1-1+(I+J)/2} \exp \left( {-}\beta _1\tau - \frac{D(\mu )}{2} \tau \right) , \end{aligned}$$
where


$$\begin{aligned} D(\mu )=\kappa \sum _{j=1}^J \left( \mu _j-\xi _j\right) ^2 + \sum _{i \in \mathcal {I}} n_i^{-1} \left( z_i - a_i^{\prime } \mu \right) ^2. \end{aligned}$$

From this expression it is easily seen that we can integrate out \(\mu \) to obtain the distribution of \(\tau ,\) conditional on \((z,\, \mathbf {a}).\) To get this right, write \(D(\mu )\) as a quadratic form in \(\mu {\text {:}}\)

$$\begin{aligned} D(\mu ) = \mu ^{\prime } P \mu - 2q^{\prime } \mu +R. \end{aligned}$$
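Although the coefficients \(P,\) \(q\) and \(R\) are left implicit here, expanding the definition of \(D(\mu )\) and collecting terms suggests the identification (with \(\xi =(\xi _1,\ldots ,\xi _J)^{\prime }\) and \(I_J\) the \(J\times J\) identity matrix)

$$\begin{aligned} P=\kappa I_J+\sum _{i \in \mathcal {I}} n_i^{-1} a_i a_i^{\prime },\qquad q=\kappa \xi +\sum _{i \in \mathcal {I}} n_i^{-1} z_i a_i,\qquad R=\kappa \sum _{j=1}^J \xi _j^2+\sum _{i \in \mathcal {I}} n_i^{-1} z_i^2. \end{aligned}$$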

By completing the square, we find that

$$\begin{aligned} \int \exp \left( { -}\frac{\tau }{2} D(\mu ) \right) d \mu = e^{-\tau R/2} \int \exp \left( { -}\frac{1}{2} \mu ^{\prime } \tau P \mu + \tau q^{\prime } \mu \right) d \mu . \end{aligned}$$

The integrand is, up to a proportionality constant, the density of a \(J\)-variate normal random vector with mean vector \(P^{-1}q\) and covariance matrix \(\tau ^{-1} P^{-1},\) evaluated at \(\mu .\) This implies that the preceding display equals

$$\begin{aligned} e^{-\tau R/2} (2 \pi )^{J/2} \sqrt{\left| \tau ^{-1} P^{-1}\right| } \exp \left( \frac{1}{2} \tau q^{\prime } P^{-1} q\right) . \end{aligned}$$

We conclude that

$$\begin{aligned} p(\tau \mid z, \,\mathbf {a}) \propto \tau ^{\alpha _1+I/2-1} \exp \left( -\left( \beta _1+\frac{1}{2}\left( R - q^{\prime } P^{-1} q\right) \right) \tau \right) , \end{aligned}$$

which proves the asserted Gamma distribution of \(\tau .\) This computation also immediately leads to the assertion on the distribution of \(\mu .\) We finally show that the rate parameter appearing for \(\tau \) is positive. By definition \(D(\mu )\ge 0\) for all \(\mu .\) This implies that \(D(P^{-1} q)=q^{\prime } P^{-1} q-2 q^{\prime } P^{-1}q+R=R-q^{\prime } P^{-1} q\ge 0.\)
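The algebra above can be checked numerically. The sketch below (with arbitrary simulated values for \(\kappa ,\) \(\xi ,\) \(a_i,\) \(n_i\) and \(z_i,\) not data from the paper) verifies the quadratic-form identity \(D(\mu )=\mu ^{\prime }P\mu -2q^{\prime }\mu +R\) and the positivity of the rate parameter \(R-q^{\prime }P^{-1}q.\)

```python
import numpy as np

# Arbitrary simulated inputs (assumptions for illustration only).
rng = np.random.default_rng(1)
J, I = 3, 10                       # dimension of mu; number of indices i in I
kappa = 1.5
xi = rng.normal(size=J)            # prior locations xi_j
a = rng.normal(size=(I, J))        # row i holds the vector a_i
n = rng.uniform(1.0, 5.0, size=I)  # weights n_i
z = rng.normal(size=I)             # observations z_i

def D(mu):
    """D(mu) = kappa * sum_j (mu_j - xi_j)^2 + sum_i (z_i - a_i' mu)^2 / n_i."""
    return kappa * np.sum((mu - xi) ** 2) + np.sum((z - a @ mu) ** 2 / n)

# Coefficients of the quadratic form D(mu) = mu'P mu - 2 q'mu + R,
# obtained by expanding the definition of D and collecting terms.
P = kappa * np.eye(J) + (a.T / n) @ a
q = kappa * xi + a.T @ (z / n)
R = kappa * np.sum(xi ** 2) + np.sum(z ** 2 / n)

mu = rng.normal(size=J)
assert np.isclose(D(mu), mu @ P @ mu - 2 * q @ mu + R)

# Rate parameter of the Gamma posterior for tau is nonnegative,
# since R - q'P^{-1}q = D(P^{-1}q) >= 0.
assert R - q @ np.linalg.solve(P, q) >= 0.0
```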

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Gugushvili, S., van der Meulen, F. & Spreij, P. A non-parametric Bayesian approach to decompounding from high frequency data. Stat Inference Stoch Process 21, 53–79 (2018).

Keywords


  • Compound Poisson process
  • Non-parametric Bayesian estimation
  • Posterior contraction rate
  • High frequency observations

Mathematics Subject Classification

  • Primary: 62G20
  • Secondary: 62M30