1 Introduction

1.1 Problem formulation

Let \(N=(N_{t},\, t\ge 0)\) be a Poisson process with a constant intensity \(\lambda >0\) and let \(Y_1,\,Y_2,\,Y_3,\ldots \) be a sequence of independent random variables independent of N and having a common distribution function F with density f (with respect to the Lebesgue measure). A compound Poisson process (abbreviated CPP) \(X=(X_t,\, t\ge 0)\) is defined as

$$\begin{aligned} X_t=\sum _{j=1}^{N_t} Y_j, \end{aligned}$$
(1)

where the sum over an empty set is by definition equal to zero. CPPs form a basic model in a variety of applied fields, most notably in queueing and risk theory, see Embrechts et al. (1997) and Prabhu (1998) and the references therein, but also in other fields of science, see, e.g., Alexandersson (1985) and Burlando and Rosso (1993) for stochastic models for precipitation, Katz (2002) on modelling of hurricane damage, or Scalas (2006) for applications in economics and finance.

Suppose that corresponding to the ‘true’ parameter values \(\lambda =\lambda _0\) and \(f=f_0,\) a discrete time sample \(X_{\Delta },\,X_{2\Delta },\ldots ,X_{n\Delta }\) is available from (1), where \(\Delta >0.\) Such a discrete time observation scheme is common in a number of applications of CPP, e.g., in the precipitation models of the above references. Based on the sample \(\mathcal {X}_n^\Delta =(X_\Delta ,\,X_{2\Delta },\ldots ,X_{n\Delta }),\) we are interested in (non-parametric) estimation of \(\lambda _0\) and \(f_0.\) Before proceeding further, we notice that by the stationary independent increments property of a CPP, the random variables \(Z_i^{\Delta } = X_{i\Delta }-X_{(i-1)\Delta },\,1\le i \le n,\) are independent and identically distributed. Each \(Z_i^{\Delta }\) has the same distribution as the random variable

$$\begin{aligned} Z^{\Delta }=\sum _{j=1}^{T^{\Delta }} Y_j, \end{aligned}$$
(2)

where \(T^{\Delta }\) is independent of the sequence \(Y_1,\, Y_2,\ldots \) and has a Poisson distribution with parameter \(\Delta \lambda .\) Hence, our problem is equivalent to estimating (non-parametrically) \(\lambda _0\) and \(f_0\) based on the sample \(\mathcal {Z}_n^{\Delta }=(Z_1^{\Delta },\,Z_2^{\Delta },\ldots ,Z_n^{\Delta }).\) We will henceforth use this alternative formulation of the problem. Our emphasis is on high frequency data, \(\Delta =\Delta _n\rightarrow 0\) as \(n\rightarrow \infty ,\) but the obtained results are also valid for low frequency observations, i.e., for fixed \(\Delta .\)
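For concreteness, the following minimal Python sketch (our own illustration; all parameter values are arbitrary and not taken from later sections) simulates a sample \(\mathcal {Z}_n^{\Delta }\) directly via (2), with the jump density chosen as a two-component normal mixture.

```python
import numpy as np

def simulate_increments(n, delta, lam, sample_jumps, rng):
    """Simulate Z_1,...,Z_n, where each Z_i is a compound Poisson increment
    over a time interval of length delta: a Poisson(lam*delta) number of
    i.i.d. jumps drawn by sample_jumps(k)."""
    counts = rng.poisson(lam * delta, size=n)       # T^Delta for each increment
    z = np.zeros(n)
    for i, k in enumerate(counts):
        if k > 0:
            z[i] = np.sum(sample_jumps(k))          # sum of k i.i.d. jump sizes
    return z

rng = np.random.default_rng(0)

def sample_jumps(k):
    # Jump density: 0.8*N(2,1) + 0.2*N(-1,1) (arbitrary illustrative choice).
    first = rng.random(k) < 0.8
    return np.where(first, rng.normal(2.0, 1.0, k), rng.normal(-1.0, 1.0, k))

z = simulate_increments(n=5000, delta=0.1, lam=1.0, sample_jumps=sample_jumps, rng=rng)
```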

Our main result is on the contraction rate of the posterior distribution, which we show to be, up to a logarithmic factor, \((n\Delta )^{-1/2}.\) A by now standard approach to obtaining contraction rates in an IID setting is to verify the assumptions of the fundamental Theorem 2.1 in Ghosal et al. (2000). It should be noted that in the present high frequency setting this theorem is not applicable. One of the model assumptions underlying this theorem, which is satisfied in Gugushvili et al. (2015), is that one deals with samples from a fixed distribution, whereas in our high frequency observation regime the distribution of \(Z^\Delta \) varies with \(\Delta ,\) with the Dirac distribution concentrated at zero as its limit for \(\Delta \rightarrow 0.\) Therefore we propose an alternative approach, circumventing the use of the cited Theorem 2.1. The theoretical contribution of the present paper is therefore not only the statement of the main result itself, but also its proof. In addition, we discuss a practical implementation of our non-parametric Bayesian approach: a Markov chain Monte Carlo algorithm that samples from the joint distribution of the unknown parameters in the mixture density and certain auxiliary variables that we introduce.

1.2 Literature review and present approach

Because adding a Poisson number of \(Y_j\)'s amounts to compounding their distributions, the problem of recovering the intensity \(\lambda _0\) and the density \(f_0\) from the observations \(Z_i\) can be referred to as decompounding. Decompounding already has some history: the early contributions (Buchmann and Grübel 2003, 2004) dealt with estimation of the distribution function \(F_0,\) paying particular attention to the case when \(F_0\) is discrete, while later contributions (Comte et al. 2014; Duval 2013; van Es et al. 2007) concentrated on estimation of the density \(f_0\) instead. More (frequentist) theory on statistical inference for CPPs (and more generally for Lévy processes) can be found in the volume (Belomestny et al. 2015), with the survey paper (Comte et al. 2015) devoted to statistical methods for high frequency discrete observations, including a special section on CPPs. Other references on statistics for Lévy processes in the high frequency data setting are Comte and Genon-Catalot (2011), Comte and Genon-Catalot (2010), Comte et al. (2010), Figueroa-López (2008), Figueroa-López (2009), Nickl and Reiß (2012), Nickl et al. (2016), and Ueltzhöfer and Klüppelberg (2011). All these approaches are frequentist in nature. On the other hand, theoretical and computational advances made over recent years have shown that a non-parametric Bayesian approach is feasible in various statistical settings; see, e.g., Hjort et al. (2010) for an overview. This is the approach we take in this work to estimate \(\lambda _0\) and \(f_0.\)

To the best of our knowledge, a non-parametric Bayesian approach to inference for (a class of) Lévy processes was first considered in Gugushvili et al. (2015). That paper, contrary to the present context, dealt with observations at fixed equidistant times, and relied strongly on an application of Theorem 2.1 of Ghosal et al. (2000), as already alluded to in the problem formulation of Sect. 1.1. The present work complements the results of Gugushvili et al. (2015), in the sense that we now allow high frequency observations, which requires a substantially different route to prove our results, as we explain in more detail in Sect. 1.3.

We will study the non-parametric Bayesian approach to decompounding from a frequentist point of view (in the sense specified below), so that one may also think of it as a means for obtaining a frequentist estimator. Advantages of the non-parametric Bayesian approach include automatic quantification of uncertainty in parameter estimates through Bayesian posterior credible sets and automatic selection of the degree of smoothing required in non-parametric inferential procedures.

1.3 Results

The non-parametric class \(\mathcal {F}\) of densities f that we consider is that of location mixtures of normal densities. So we consider densities specified by

$$\begin{aligned} f(x)=f_{H,\sigma } (x)=\int \phi _{\sigma }(x-z)\mathrm {d}H(z), \end{aligned}$$
(3)

where \(\phi _{\sigma }\) denotes the density of the normal distribution with mean zero and variance \(\sigma ^2\) and H is a mixing measure. These mixtures form a rich and flexible class of densities, see Marron and Wand (1992) and McLachlan and Peel (2000), capable of closely approximating many densities that are not themselves representable in this way. The resulting mixture densities are infinitely smooth, which is arguably the case in many, if not most, practical applications.

Bayesian estimation requires specification of prior distributions on \(\lambda \) and f. We propose independent priors on \(\lambda \) and f that we denote by \(\Pi _1\) and \(\Pi _2,\) respectively. For f,  we take a Dirichlet mixture of normal densities as a prior. This type of prior in the context of Bayesian density estimation was introduced in Ferguson (1983) and Lo (1984); for recent references see, e.g., Ghosal and van der Vaart (2001). The prior for f is defined as the law of the function \(f_{ H , \sigma }\) as in (3), with H assumed to follow a Dirichlet process prior \(D_\alpha \) with base measure \(\alpha \) and \(\sigma \) a priori independent with distribution \(\Pi _3.\) Recall that a Dirichlet process \(D_{\alpha }\) on \(\mathbb {R}\) with base measure \(\alpha \) defined on the Borel \(\sigma \)-algebra \(\mathcal {B}(\mathbb {R})\) (we assume \(\alpha \) to be non-negative and \(\sigma \)-additive) is a random probability measure G on \(\mathbb {R},\) such that for every finite and measurable partition \(B_1,\,B_2,\ldots ,B_k\) of \(\mathbb {R},\) the probability vector \((G(B_1),\,G(B_2),\ldots ,G(B_k))\) possesses the Dirichlet distribution on the k-dimensional simplex with parameters \((\alpha (B_1),\,\alpha (B_2),\ldots ,\alpha (B_k)).\) See, e.g., the original paper (Ferguson 1973), or the overview article (Ghosal 2010) for more information on Dirichlet process priors.
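To make this prior concrete, the following sketch (our own illustration, not the algorithm used later in the paper; the truncation level, concentration parameter and base measure are arbitrary choices) draws an approximate realisation of \(f_{H,\sigma }\) by truncated stick-breaking for the Dirichlet process.

```python
import numpy as np
from scipy.stats import norm

def draw_prior_density(total_mass, base_sampler, sigma, K, rng):
    """Draw f_{H,sigma} with H ~ DP(alpha), approximated by truncating the
    stick-breaking representation of H at K atoms."""
    betas = rng.beta(1.0, total_mass, size=K)               # stick-breaking proportions
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    atoms = base_sampler(K, rng)                             # atom locations ~ normalised base measure
    def f(x):
        x = np.atleast_1d(x)[:, None]
        return np.sum(weights * norm.pdf(x - atoms, scale=sigma), axis=1)
    return f

rng = np.random.default_rng(1)
f_draw = draw_prior_density(total_mass=1.0,
                            base_sampler=lambda k, r: r.normal(0.0, 1.0, size=k),
                            sigma=0.5, K=100, rng=rng)
print(f_draw(np.linspace(-3.0, 3.0, 5)))
```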

A nonparametric Bayesian approach to density estimation employing a Dirichlet mixture of normal densities as a prior can in a very rough sense be thought of as a Bayesian counterpart of kernel density estimation (with a Gaussian kernel), cf. Ghosal and van der Vaart (2007, p. 697).

With the sample size n tending to infinity, the Bayesian approach should be able to discern the true parameter pair \((\lambda _0,\,f_0)\) with increasing accuracy. We can formalise this by requiring, for instance, that for any fixed neighbourhood A (in an appropriate topology) of \((\lambda _0,\,f_0),\, \Pi (A^c|\mathcal {Z}_{n}^\Delta )\rightarrow 0\) in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability. Here \(\Pi \) is used as a shorthand notation for the posterior distribution of \((\lambda ,\,f)\) and we use \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta }\) to denote the law of the random variable \(Z^{\Delta }\) in (2) and \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\) the law of \(\mathcal {Z}_n^{\Delta }.\) More generally, one may take a sequence of shrinking neighbourhoods \(A_n\) of \((\lambda _0,\,f_0)\) and try to determine the rate at which the neighbourhoods \(A_n\) are allowed to shrink, while still capturing most of the posterior mass. This rate is referred to as a posterior convergence rate (we will give the precise definition in Sect. 3). Two fundamental references dealing with establishing it in various statistical settings are Ghosal et al. (2000) and Ghosal and van der Vaart (2001). This convergence rate can be thought of as an analogue of the convergence rate of a frequentist estimator. The analogy can be made precise: contraction of the posterior distribution at a certain rate implies existence of a Bayes point estimate with the same convergence rate (in the frequentist sense); see Theorem 2.5 in Ghosal et al. (2000) and the discussion on pp. 506–507 there.

Obviously, for our programme to be successful, \(\Delta \) has to satisfy the assumption \(n\Delta \rightarrow \infty ,\) which is a necessary condition for consistent estimation of \((\lambda _0,\,f_0),\) as it ensures that asymptotically we observe an infinite number of jumps in the process. We cover both so-called high frequency observation schemes (\(\Delta \rightarrow 0\)) and low frequency observations (fixed \(\Delta \)). A sufficient condition, which covers both observation regimes and which relates \(\Delta \) to n, is \(\Delta =n^{-\alpha },\) where \(0\le \alpha <1.\)

We note that in Ghosal and Tang (2006) and Tang and Ghosal (2007) non-parametric Bayesian inference for Markov processes is studied, of which CPPs form a particular class, but these papers deal with estimation of the transition density of a discretely observed Markov process, which is different from the problem we consider here. A parametric Bayesian approach to inference for CPPs is studied in Insua et al. (2012, Sects. 5.5 and 10.3).

The main result of our paper is Theorem 1, in which we state sufficient conditions on the prior that yield a posterior contraction rate of the order \((\log ^{\kappa }(n\Delta ))/ \sqrt{n\Delta },\) for some constant \(\kappa >0.\) We argue that this rate is a nearly (up to a logarithmic factor) optimal posterior contraction rate for our problem. Our main result complements the one in Gugushvili et al. (2015), in that it treats both the low and high frequency observation schemes simultaneously, with emphasis on the latter. We note (again) a fundamental difference between the present paper and Gugushvili et al. (2015) when it comes to the techniques used to prove the main result. As Theorem 2.1 of Ghosal et al. (2000) cannot immediately be used, we take an alternative route that avoids this theorem, but instead refines a number of technical results involving properties of statistical tests that form essential ingredients of the proof in Ghosal et al. (2000). These refined results are then used as key technical steps in a direct proof of our Theorem 1. Furthermore, Theorem 1 establishes the posterior contraction rate for infinitely smooth jump size densities \(f_0,\) a case not covered by Gugushvili et al. (2015). On the other hand, Gugushvili et al. (2015) deals with multi-dimensional CPPs, while in this paper we consider only the one-dimensional case. Finally, in this work we also discuss a practical implementation of our non-parametric Bayesian approach. The computational problem is dealt with by the inclusion of auxiliary variables. More precisely, we show how a Markov chain Monte Carlo algorithm can be devised that samples from the joint distribution of the unknown parameters in the mixture density and the introduced auxiliary variables. Numerical examples illustrate the feasibility of this approach.

1.4 Organisation

The remainder of the paper is organised as follows. In the next section we state some preliminaries on the likelihood, the prior and notation. In Sect. 3 we first motivate the use of the scaled Hellinger metric to define the neighbourhoods for which the posterior contraction rate is derived in the case of high frequency observations. We then present the main result on the posterior contraction rate (Theorem 1), whose proof is given in Sect. 5. We discuss the numerical implementation of our results in Sect. 4. Technical lemmas used to prove the main theorem, together with their proofs, are gathered in the Appendix.

2 Preliminaries and notation

2.1 Likelihood, prior and posterior

We are interested in Bayesian inference with Bayes’ formula. Therefore we need to specify the likelihood in our model. We use the following notation:

\(\mathbb {P}_{f}\) denotes the law of a random variable with density f; \(\mathbb {Q}_{\lambda ,f}^{\Delta }\) the law of \(Z^{\Delta }\) in (2); \(\mathbb {Q}_{\lambda ,f}^{\Delta ,n}\) the law of the sample \(\mathcal {Z}_n^{\Delta };\) and \(\mathbb {R}_{\lambda ,f}^{\Delta }\) the law of the path \((X_t,\, t\in [0,\,\Delta ]).\)

The characteristic function of the Poisson sum \(Z^{\Delta }\) defined in (2) is given by

$$\begin{aligned} \phi (t)=e^{-\lambda \Delta +\lambda \Delta \phi _{f}(t)}, \end{aligned}$$

where \(\phi _{f}\) is the characteristic function of f. This can be rewritten as

$$\begin{aligned} \phi (t)=e^{-\lambda \Delta }+\left( 1-e^{-\lambda \Delta }\right) \frac{1}{e^{\lambda \Delta }-1}\left( e^{\lambda \Delta \phi _{f}(t)} -1 \right) , \end{aligned}$$

which shows that the distribution of \(Z^{\Delta }\) is a mixture of a point mass at zero and an absolutely continuous distribution. Since \(\phi _{f}\) vanishes at infinity, letting \(t\rightarrow \infty \) we get \(\phi (t)\rightarrow e^{-\lambda \Delta }.\) Hence \(\lambda \) is identifiable from the law of \(Z^{\Delta },\) and then so is f. The density of the law \(\mathbb {Q}_{\lambda , f}^{\Delta }\) of \(Z^{\Delta }\) with respect to the measure \(\mu ,\) which is the sum of the Lebesgue measure and the Dirac measure concentrated at zero, can in fact be written explicitly as (cf. van Es et al. 2007, p. 681 and Proposition 2.1 in Duval 2013)

$$\begin{aligned} \frac{\mathrm {d}\mathbb {Q}_{\lambda , f}^{\Delta }}{\mathrm {d}\mu }(x)=e^{-\lambda \Delta }\mathbf {1}_{\{0\}}(x)+\left( 1-e^{-\lambda \Delta }\right) \sum _{m=1}^{\infty } a_m(\lambda \Delta ) f^{*m}(x)\mathbf {1}_{\mathbb {R}\setminus \{0\}}(x), \end{aligned}$$
(4)

where \(\mathbf {1}_A\) denotes the indicator of a set A,

$$\begin{aligned} a_m(\lambda \Delta )=\frac{1}{ e^{\lambda \Delta } -1 } \frac{(\lambda \Delta )^m}{m!}, \end{aligned}$$
(5)

and \(f^{*m}\) denotes the m-fold convolution of f with itself. However, the expression (4) is useless for Bayesian computations. To work around this problem, we will employ a different dominating measure. Consider the law \(\mathbb {R}_{\lambda ,f}^{\Delta }\) of \((X_t,\, t\in [0,\,\Delta ]).\) By the Theorem in Skorohod (1964, p. 261) \(\mathbb {R}_{\lambda ,f}^{\Delta }\) is absolutely continuous with respect to \(\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta }\) if and only if \(\mathbb {P}_{f}\) is absolutely continuous with respect to \(\mathbb {P}_{\widetilde{f}}\) (we of course assume that \(\lambda ,\,\widetilde{\lambda }>0\)). A simple condition to ensure the latter is to assume that \(\widetilde{f}\) is continuous and does not take the value zero on \(\mathbb {R}.\)

Define the random measure \(\mu \) by

$$\begin{aligned} \mu (B) = \#\left\{ t {\text {:}}\,\left( t,\,X_t-X_{t-}\right) \in B\right\} , \quad B \in \mathcal {B}([0,\,\Delta ])\otimes \mathcal {B}(\mathbb {R}\setminus \{0\}). \end{aligned}$$

Under \(\mathbb {R}_{{\lambda },{f}},\) the random measure \(\mu \) is a Poisson point process on \([0,\,\Delta ]\times (\mathbb {R}\setminus \{0\})\) with intensity measure \( {\Lambda }(dt,\,dx)={\lambda } dt {f}(x)dx,\) which follows, e.g., from Theorem 1 on p. 69 and Corollary on p. 64 in Skorohod (1964). By formula (46.1) on p. 262 in Skorohod (1964), we have

$$\begin{aligned} \frac{\mathrm {d} \mathbb {R}_{\lambda ,f}^{\Delta }}{ \mathrm {d}\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta } }(X)=\exp \left( \int _0^{\Delta } \int _{\mathbb {R}} \log \left( \frac{\lambda f(x)}{{\widetilde{\lambda }} \widetilde{f}(x)} \right) \mu (dt,\,dx) - \Delta (\lambda -{\widetilde{\lambda }}) \right) . \end{aligned}$$
(6)

By Theorem 2 on p. 245 in Skorohod (1964) and Corollary 2 on p. 246 there, the density \(k_{\lambda ,f}^{\Delta }\) of \(\mathbb {Q}_{\lambda ,f}^{\Delta }\) with respect to \(\mathbb {Q}_{\widetilde{\lambda },\widetilde{f}}^{\Delta }\) is given by the conditional expectation

$$\begin{aligned} k_{\lambda ,f}^{\Delta }(x)=\mathrm{{\mathbb E\,}}_{\widetilde{\lambda },\widetilde{f}}\left( \frac{\mathrm {d} \mathbb {R}_{\lambda ,f}^{\Delta }}{ \mathrm {d}\mathbb {R}_{\widetilde{\lambda },\widetilde{f}}^{\Delta } }(X)\Bigg | X_{\Delta }=x \right) , \end{aligned}$$
(7)

where the subscript in the conditional expectation operator signifies the fact that it is evaluated under the probability \(\mathbb {R}_{ \widetilde{\lambda },\widetilde{f} }^{\Delta }.\) Hence the likelihood [in the parameter pair \((\lambda ,\,f)\)] associated with the sample \(\mathcal {Z}_n^{\Delta }\) is given by the product

$$\begin{aligned} L_n^{\Delta }(\lambda ,\,f)=\prod _{i=1}^n k_{\lambda ,f}^{\Delta }\left( Z_i^{\Delta }\right) . \end{aligned}$$
(8)

An advantage of specifying the likelihood in this manner is that it allows one to reduce some of the difficult computations for the laws \(\mathbb {Q}_{\lambda ,f}^{\Delta }\) to those for the laws \(\mathbb {R}_{\lambda ,f}^{\Delta },\) which are simpler.

Observe that the priors on \(\lambda \) and f induce the prior \(\Pi = \Pi _1\times \Pi _2\) on the collection of densities \(k_{\lambda ,f}^{\Delta }.\) We will use the same symbol \(\Pi \) both for the prior on the pair \((\lambda ,\,f)\) and for the induced prior on the density \(k_{\lambda ,f}^{\Delta },\) and likewise for the corresponding posteriors; which of the two is meant will be clear from the context. This simplifies notation in some of the formulations below.

By Bayes’ theorem, the posterior measure of any measurable set \(A\subset (0,\,\infty )\times \mathcal {F}\) is given by

$$\begin{aligned} \Pi \left( A|\mathcal {Z}_n^{\Delta }\right) =\frac{ \iint _A L_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }{ \iint L_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }. \end{aligned}$$

Upon setting \(\overline{A}=\{ k_{\lambda ,f}^{\Delta }{\text {:}}\,(\lambda ,\,f)\in A \}\) and recalling our conventions above, this can also be written as

$$\begin{aligned} \Pi \left( \overline{A}|\mathcal {Z}_n^{\Delta }\right) =\frac{ \int _{\overline{A}} L_n^{\Delta }(k) \mathrm {d}\Pi (k) }{ \int L_n^{\Delta }(k) \mathrm {d}\Pi (k) }. \end{aligned}$$

Once the posterior is available, one can next proceed with computation of other quantities of interest in Bayesian statistics, such as Bayes point estimates or credible sets.

2.2 Notation

Throughout the paper we will use the following notation to compare two sequences \(\{a_n\}\) and \(\{b_n\}\) of positive real numbers: \(a_n\lesssim b_n\) will mean that there exists a constant \(C>0\) that is independent of n and is such that \(a_n\le C b_n,\) while \(a_n\gtrsim b_n\) will signify the fact that \(a_n\ge C b_n.\)

Next we introduce various notions of distances between probability measures. The Hellinger distance \(h(\mathbb {Q}_{0},\,\mathbb {Q}_{1})\) between two probability laws \(\mathbb {Q}_{0}\) and \(\mathbb {Q}_{1}\) on a measurable space \((\Omega ,\,\mathfrak {F})\) is defined as

$$\begin{aligned} h\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) =\left( \int \left( \mathrm {d}\mathbb {Q}_{0}^{1/2} - \mathrm {d}\mathbb {Q}_{1}^{1/2} \right) ^2 \right) ^{1/2}. \end{aligned}$$

Assume further \(\mathbb {Q}_0\ll \mathbb {Q}_1.\) The Kullback–Leibler (or informational) divergence \(\mathrm {K}(\mathbb {Q}_{0},\,\mathbb {Q}_{1})\) is defined as

$$\begin{aligned} \mathrm {K}\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) = \int \log \left( \frac{ \mathrm {d}\mathbb {Q}_{0}}{ \mathrm {d}\mathbb {Q}_{1} } \right) \mathrm {d}\mathbb {Q}_{0}, \end{aligned}$$

while the \(\mathrm {V}\)-discrepancy is defined through

$$\begin{aligned} \mathrm {V}\left( \mathbb {Q}_{0},\,\mathbb {Q}_{1}\right) =\int \log ^2 \left( \frac{ \mathrm {d}\mathbb {Q}_{0}}{ \mathrm {d}\mathbb {Q}_{1} } \right) \mathrm {d}\mathbb {Q}_{0}. \end{aligned}$$

Here is some additional notation. For \(f,\,g\) nonnegative integrable functions, not necessarily densities, we write

$$\begin{aligned} h^2(f,\,g)&=\int (\sqrt{f}-\sqrt{g})^2, \\ \mathrm {K}(f,\,g)&= \int \log \frac{f}{g}f -\int f + \int g,\\ \mathrm {V}(f,\,g)&= \int \log ^2\frac{f}{g}f. \end{aligned}$$

Note that these ‘distances’ are all nonnegative and only zero if \(f=g\) a.e. If f and g are densities of probability measures \(\mathbb {Q}_0\) and \(\mathbb {Q}_1\) on \((\mathbb {R},\,\mathcal {B}),\) respectively, then the above ‘distances’ reduce to the previously introduced ones.

We will also use \(\mathrm {K}(x,\,y)=x\log \frac{x}{y}-x+y\) for \(x,\,y>0.\) Note that also \(\mathrm {K}(x,\,y)\ge 0\) and \(\mathrm {K}(x,\,y)=0\) if and only if \(x=y.\)

3 Main result on posterior contraction rate

Denote the true parameter values for the CPP by \((\lambda _0,\, f_0).\) Recall that the problem is to estimate \(f_0\) and \(\lambda _0\) based on the observations \(\mathcal {Z}^\Delta _n,\) and that \(\Delta \rightarrow 0\) in a high frequency regime. To say that a pair \((f,\,\lambda )\) lies in a neighbourhood of \((f_0,\,\lambda _0),\) one needs a notion of distance on the corresponding measures \(\mathbb {Q}^{\Delta }_{\lambda ,f}\) and \(\mathbb {Q}^{\Delta }_{\lambda _0,f_0},\) the two possible induced laws of \(Z^\Delta _i=X_{i\Delta }-X_{(i-1)\Delta }.\) The Hellinger distance is a popular and rather reasonable choice to that end in non-parametric Bayesian statistics. However, for \(\Delta \rightarrow 0\) the Hellinger metric h between those laws automatically tends to 0. The first assertion of Lemma 1 below states that \(h(\mathbb {Q}^{\Delta }_{\lambda ,f},\,\mathbb {Q}^{\Delta }_{\lambda _0,f_0})\) is of order \(\sqrt{\Delta }\) when \(\Delta \rightarrow 0.\) This motivates replacing the ordinary Hellinger metric h with the scaled metric \(h^\Delta =h/\sqrt{\Delta }\) in our asymptotic analysis for high frequency data. Of course, for fixed \(\Delta \) (in which case one can take \(\Delta =1\) w.l.o.g.), nothing changes with this replacement. The lemma also shows that the Kullback–Leibler divergence and the V-discrepancy are of order \(\Delta \) for \(\Delta \rightarrow 0.\) Therefore we will also use the scaled distances \(\mathrm {K}^\Delta =\mathrm {K}/\Delta \) and \(\mathrm {V}^\Delta =\mathrm {V}/\Delta .\)

Lemma 1

The following expressions hold true:

$$\begin{aligned} \lim _{\Delta \rightarrow 0}\frac{1}{\Delta } h^2\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= h^2\left( \lambda f,\,\lambda _0 f_0\right) =\int \left( \sqrt{\lambda f(x)}-\sqrt{\lambda _0f_0(x)}\right) ^2\mathrm {d}x, \end{aligned}$$
(9)
$$\begin{aligned} \lim _{\Delta \rightarrow 0}\frac{1}{\Delta }\mathrm {K}\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= \mathrm {K}\left( \lambda f,\,\lambda _0 f_0\right) =\lambda \mathrm {K}\left( f,\,f_0\right) +\mathrm {K}\left( \lambda ,\,\lambda _0\right) , \end{aligned}$$
(10)
$$\begin{aligned} \lim _{\Delta \rightarrow 0}\frac{1}{\Delta }\mathrm {V}\left( \mathbb {Q}^\Delta _{\lambda ,f},\,\mathbb {Q}^\Delta _{\lambda _0,f_0}\right)&= \mathrm {V}\left( \lambda f,\,\lambda _0 f_0\right) =\int \log ^2\frac{\lambda f(x)}{\lambda _0 f_0(x)}\lambda f(x)\mathrm {d}x. \end{aligned}$$
(11)

The proof will be presented in the appendix.

Remark 1

The Hellinger process (here deterministic) of order \(\frac{1}{2}\) for continuous observations of X on an interval \([0,\,t]\) is given by (see Jacod and Shiryaev 2003, Sects. IV.3 and IV.4a)

$$\begin{aligned} h_t=\frac{t}{2}\int \left( \sqrt{\lambda f(x)}-\sqrt{\lambda _0f_0(x)}\right) ^2\mathrm {d}x= h_1t, \end{aligned}$$

from which it follows that \(h^2(\mathbb {R}^t_{\lambda ,f},\,\mathbb {R}^t_{\lambda _0,f_0})=2-2\exp (-h_t),\) whose derivative at \(t=0\) is the same as in (9) and thus equal to \(2h_1.\) For the Kullback–Leibler divergence and the discrepancy \(\mathrm {V}\) similar assertions hold. These observations have the following heuristic explanation. For \(\Delta \rightarrow 0,\) there is little difference between observing the whole path of X over the interval \([0,\,\Delta ]\) and observing only \(X_\Delta ,\) as the probability of \(\{N_\Delta \ge 2\}\) is small (of order \(\Delta ^2\)).
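For instance, a direct computation with the Poisson distribution makes this order explicit:

$$\begin{aligned} \mathbb {P}\left( N_\Delta \ge 2\right) =1-e^{-\lambda \Delta }-\lambda \Delta e^{-\lambda \Delta }=\frac{(\lambda \Delta )^2}{2}+O\left( \Delta ^3\right) , \quad \Delta \rightarrow 0. \end{aligned}$$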

In order to determine the posterior contraction rate in our problem, we now specify the neighbourhoods of \((\lambda _0,\,f_0)\) for which this rate will be derived. Let \(M>0\) be a constant and let \(\{\varepsilon _n\}\) be a sequence of positive numbers such that \(\varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty .\) Let

$$\begin{aligned} h^\Delta \left( \mathbb {Q}_0,\,\mathbb {Q}_1\right) =\frac{1}{\sqrt{\Delta }}h\left( \mathbb {Q}_0,\,\mathbb {Q}_1\right) , \end{aligned}$$

be a rescaled Hellinger distance. Lemma 1 suggests that this is the right scaling to use. Introduce the complements of the Hellinger-type neighbourhoods of \((\lambda _0,\,f_0),\)

$$\begin{aligned} A\left( \varepsilon _n,\,M\right) =\left\{ (\lambda ,\,f) {\text {:}}\, h^\Delta \left( \mathbb {Q}_{\lambda _0,f_0}^{\Delta },\,\mathbb {Q}_{\lambda ,f}^{\Delta } \right) > M \varepsilon _n\right\} . \end{aligned}$$

We shall say that \(\varepsilon _n\) is a posterior contraction rate, if there exists a constant \(M>0,\) such that

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta }\right) \rightarrow 0, \end{aligned}$$
(12)

in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty .\) Our goal in this section is to determine the ‘fastest’ rate at which \(\varepsilon _n\) is allowed to tend to zero, while not violating (12).

We will assume that the observations are generated from a CPP that satisfies the following assumption.

Assumption 1

  (i) \(\lambda _0\) is in a compact set \([\underline{\lambda },\, \overline{\lambda }]\subset (0,\,\infty );\)

  (ii) The true density \(f_0\) is a location mixture of normal densities, i.e.,

    $$\begin{aligned} f_0(x)=f_{ H_0 , \sigma _0 } (x)=\int \phi _{\sigma _0 }(x-z)\mathrm {d}H_0(z), \end{aligned}$$

    for some fixed distribution \(H_0\) and a constant \(\sigma _0 \in [\underline{\sigma },\, \overline{\sigma }]\subset (0,\,\infty ).\) Furthermore, for some \(0<\kappa _0<\infty ,\, H_0[{-}\kappa _0,\,\kappa _0]=1,\) i.e., \(H_0\) has compact support.

The more general location-scale mixtures of normal densities,

$$\begin{aligned} f_0(x)=f_{ H_0 , K_0 } (x)=\iint \phi _{\sigma }(x-z)\mathrm {d}H_0(z)\mathrm {d}K_0(\sigma ), \end{aligned}$$

possess even better approximation properties than the location mixtures of the normals (here \(H_0\) and \(K_0\) are distributions) and could also be considered in our setup. However, this would lead to additional technical complications, which could obscure essential contributions of our work.

For obtaining posterior contraction rates we need to make some assumptions on the prior.

Assumption 2

  (i) The prior on \(\lambda ,\, \Pi _1,\) has a density \(\pi _1\) (with respect to the Lebesgue measure) that is supported on the finite interval \([\underline{\lambda },\,\overline{\lambda }]\subset (0,\,\infty )\) and is such that

    $$\begin{aligned} 0<\underline{\pi }_1 \le \pi _1(\lambda )\le \overline{\pi }_1<\infty , \quad \lambda \in [\underline{\lambda },\,\overline{\lambda }], \end{aligned}$$
    (13)

    for some constants \(\underline{\pi }_1\) and \(\overline{\pi }_1;\)

  (ii) The base measure \(\alpha \) of the Dirichlet process prior \(D_{\alpha }\) has a continuous density on an interval \([{-}\kappa _0-\zeta ,\,\kappa _0+\zeta ],\) with \(\kappa _0\) as in Assumption 1(ii), for some \(\zeta >0,\) is bounded away from zero there, and for all \(t>0\) satisfies the tail condition

    $$\begin{aligned} \alpha (|z|>t)\lesssim e^{-b|t|^{\delta }}, \end{aligned}$$
    (14)

    with some constants \(b>0\) and \(\delta >0;\)

  (iii) The prior on \(\sigma ,\,\Pi _3,\) is supported on the interval \([\underline{\sigma },\,\overline{\sigma }]\subset (0,\,\infty )\) and is such that its density \(\pi _3\) with respect to the Lebesgue measure satisfies

    $$\begin{aligned} 0<\underline{\pi }_3 \le \pi _3(\sigma )\le \overline{\pi }_3<\infty , \quad \sigma \in [\underline{\sigma },\,\overline{\sigma }], \end{aligned}$$

    for some constants \(\underline{\pi }_3\) and \(\overline{\pi }_3.\)

Assumptions 1 and 2 parallel those given in Ghosal and van der Vaart (2001) in the context of non-parametric Bayesian density estimation using the Dirichlet location mixture of normal densities as a prior. We refer to that paper for additional discussion.

The following is our main result. Note that it covers both the case of high frequency observations (\(\Delta \rightarrow 0\)) and observations with fixed intersampling intervals. We use \(\Pi \) to denote the posterior on \((\lambda ,\,f).\)

Theorem 1

Under Assumptions 1 and 2, provided \(n\Delta \rightarrow \infty ,\) there exists a constant \(M>0,\) such that for

$$\begin{aligned} \varepsilon _n=\frac{\log ^{\kappa }(n\Delta )}{\sqrt{n\Delta }}, \quad \kappa =\max \left( \frac{2}{\delta },\,\frac{1}{2}\right) +\frac{1}{2}, \end{aligned}$$

we have

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) \Big | \mathcal {Z}_n^{\Delta }\right) \rightarrow 0, \end{aligned}$$

in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty .\)

For fixed \(\Delta \) (w.l.o.g. one may then assume \(\Delta =1\)) the posterior contraction rate in Theorem 1 reduces to \(\varepsilon _n=\frac{\log ^{\kappa }(n)}{\sqrt{n}}.\) We also see that the posterior contraction rate is controlled by the parameter \(\delta \) governing the tail behaviour in (14). Note that if (14) is satisfied for some \(\delta >4,\) it is also automatically satisfied for all \(0<\delta \le 4.\) The stronger the decay in (14), the better the contraction rate, but all \(\delta \ge 4\) give the same value \(\kappa =1.\) The best possible contraction rate in Theorem 1 is therefore already attained at \(\delta =4,\) and in the proof in Sect. 5 we may without loss of generality assume that \(\delta \le 4.\)

As on p. 1239 in Ghosal and van der Vaart (2001), and similarly to Corollary 5.1 there, Theorem 1 implies the existence of a point estimate of \((\lambda _0,\,f_0)\) with frequentist convergence rate \(\varepsilon _n.\) The (frequentist) minimax convergence rate for estimation of \(k_{\lambda ,f}^{\Delta }\) relative to the Hellinger distance is unknown in our problem, but an analogy with Ibragimov and Khas’minskiĭ (1982) suggests that up to a logarithmic factor it should be of order \((n\Delta )^{-1/2}\) (cf. Ghosal and van der Vaart 2001, p. 1236). The logarithmic factor is insignificant for all practical purposes. The convergence rate of an estimator of the Lévy density with loss measured in the \(L_2\)-metric in a more general Lévy model than the CPP model is \((n\Delta )^{-\beta /(2\beta +1)},\) whenever the target density is Sobolev smooth of order \(\beta \) (cf. Comte and Genon-Catalot 2011). Our contraction rate is hence, roughly speaking, the limiting case of the convergence rate in Comte and Genon-Catalot (2011) as \(\beta \rightarrow \infty .\)

4 Algorithms for drawing from the posterior

In this section we discuss computational methods for drawing from the distribution of the pair \((\lambda ,\, f),\) conditional on \(\mathcal {X}_n^\Delta \) (or equivalently: conditional on \(\mathcal {Z}_n^\Delta \)). In the following there is no need for the observation times to be equidistant. We will assume observations at times \(0<t_1<\cdots <t_n\) and set \(\Delta _i =t_i-t_{i-1}\) \((1\le i \le n).\) Further, for consistency with notation introduced shortly, we set \(z_i=X_{t_i}-X_{t_{i-1}}\) and \(z=(z_1,\ldots , z_n).\) We will use “Bayesian notation” throughout and write p for a probability density or mass function and use \(\pi \) similarly for a prior density or mass function.

In general, it is infeasible to generate independent realisations from the posterior distribution of \((\lambda ,\,f).\) To see this, note that from (4) one obtains that the conditional density of a nonzero increment z on a time interval of length \(\Delta \) is given by

$$\begin{aligned} p(z\mid \lambda ,\, f) = \frac{e^{-\lambda \Delta }}{1-e^{-\lambda \Delta }} \sum _{k=1}^\infty \frac{(\lambda \Delta )^k}{k!} f^{*k}(z), \end{aligned}$$
(15)

which generally is rather intractable due to the infinite weighted sum of convolutions. We specialise to the case where the jump size distribution is a mixture of \(J\ge 1\) Gaussians. The richness and versatility of the class of finite normal mixtures is convincingly demonstrated in Marron and Wand (1992).

Hence, we assume

$$\begin{aligned} f(\cdot )=\sum _{j=1}^J \rho _j \phi \left( \cdot ;\, \mu _j,\, 1/\tau \right) , \quad \sum _{j=1}^J \rho _j=1, \end{aligned}$$
(16)

where \(\phi (\cdot ;\, \mu ,\, \sigma ^2)\) denotes the density of a random variable with \(\mathcal {N}(\mu ,\, \sigma ^2)\) distribution. Note that in (16) we parametrise the density with the precision \(\tau .\) In the “simple” case \(J=2\) the convolution density of k independent jumps is given by

$$\begin{aligned} f^{*k}(\cdot ) = \sum _{\ell =0}^k {k \atopwithdelims ()\ell } \rho _1^\ell \rho _2^{k-\ell } \phi \left( \cdot ;\, \ell \mu _1 +(k-\ell ) \mu _2,\, k/\tau \right) . \end{aligned}$$

Plugging this expression into Eq. (15) confirms the intractable form of \(p(z\mid \lambda ,\, f).\)
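As an illustration (our own sketch, with arbitrary truncation level and parameter values), the following Python code evaluates (15) for the two-component mixture by truncating the infinite sum and using the closed-form convolution above.

```python
import numpy as np
from math import factorial
from scipy.stats import norm
from scipy.special import comb

def density_nonzero_increment(z, lam, delta, rho1, rho2, mu1, mu2, tau, k_max=50):
    """Approximate p(z | lambda, f) in (15) for a J=2 normal mixture by
    truncating the sum over the number of jumps k at k_max."""
    total = 0.0
    for k in range(1, k_max + 1):
        # f^{*k}(z): binomial mixture over the number ell of type-1 jumps.
        ell = np.arange(k + 1)
        w = comb(k, ell) * rho1**ell * rho2**(k - ell)
        conv = np.sum(w * norm.pdf(z, loc=ell * mu1 + (k - ell) * mu2,
                                   scale=np.sqrt(k / tau)))
        total += (lam * delta)**k / factorial(k) * conv
    # Normalising factor e^{-lam*delta} / (1 - e^{-lam*delta}).
    return np.exp(-lam * delta) / (1.0 - np.exp(-lam * delta)) * total

# Example evaluation with arbitrary parameter values.
print(density_nonzero_increment(z=1.3, lam=1.0, delta=1.0,
                                rho1=0.8, rho2=0.2, mu1=2.0, mu2=-1.0, tau=1.0))
```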

We will introduce auxiliary variables to circumvent the intractable form of the likelihood. In case the CPP is observed continuously, the problem is much easier as now the continuous time likelihood on an interval \([0,\,T]\) is known to be (Shreve 2008, Theorem 11.6.7)

$$\begin{aligned} \lambda ^{|V|} e^{-\lambda T} \prod _{i \in V} f\left( J_i\right) , \end{aligned}$$

where the \(T_i\) are the jump times of the CPP, \(J_i\) the corresponding jump sizes and \(V=\{i{\text {:}}\, T_i \le T\}.\) The tractability of the continuous time likelihood naturally suggests the construction of a data augmentation scheme. Denote the values of the CPP in between times \(t_{i-1}\) and \(t_i\) by \(x_{(i-1,i)}.\) We will refer to \(x_{(i-1,i)}\) as the missing values on the ith segment. Set

$$\begin{aligned} x^{mis} = \left\{ x_{(i-1,i)},\, 1\le i \le n\right\} . \end{aligned}$$

A data augmentation scheme now consists of augmenting auxiliary variables \(x^{mis}\) to \((\lambda ,\,f)\) and constructing a Markov chain that has \(p(x^{mis},\, \lambda ,\, f \mid z)\) as invariant distribution. More specifically, a standard implementation of this algorithm consists of the following steps:

  1. Initialise \(x^{mis}.\)

  2. Draw \((\lambda ,\, f) \mid (x^{mis},\, z).\)

  3. Draw \(x^{mis} \mid (\lambda ,\, f,\, z).\)

  4. Repeat steps 2 and 3 many times.

Under weak conditions, the iterates for \((\lambda ,\,f)\) are (dependent) draws from the posterior distribution. Step 3 entails generating compound Poisson bridges. By the Markov property, bridges on different segments can be drawn independently. Data augmentation has been used in many Bayesian computational problems, see, e.g., Tanner and Wong (1987). The outlined scheme can be applied to the problem at hand, but we explain shortly that imputation of complete CPP-bridges (which is nontrivial) is unnecessary and we can do with less imputation, thereby effectively reducing the state space of the Markov chain.

As we assume that the jumps are drawn from a non-atomic distribution, imputation is only necessary on segments with nonzero increments. For this reason we let

$$\begin{aligned} \mathcal {I} =\left\{ i \in \{1,\ldots , n\}{\text {:}}\, z_i \ne 0\right\} , \end{aligned}$$

denote the set of observations with nonzero jump sizes and define the number of segments with nonzero jumps to be \(I =|\mathcal {I}|.\)

4.1 Auxiliary variables

Note that if \(Y \sim f\) with f as in (16), then Y can be simulated by first drawing its label L,  which equals j with probability \(\rho _j,\) and next drawing from the \(N(\mu _{L},\, 1/\tau )\) distribution. Knowing the labels, sampling the jumps conditional on their sum being z is much easier compared to the case with unknown labels. Adding auxiliary variables as labels is a standard trick used for inference in mixture models (see, e.g., Diebolt and Robert 1994; Richardson and Green 1997). For the problem at hand, we can do with even less imputation: all we need to know is the number of jumps of each type on every segment with nonzero jump size. For \(i \in \mathcal {I}\) and \(j \in \{1,\ldots , J\},\) let \(n_{ij}\) denote the number of jumps of type j on segment i. Denote the set of all auxiliary variables by \(\mathbf {a}=\{a_i, \, i \in \mathcal {I}\},\) where

$$\begin{aligned} a_i=\left( n_{i1},\,n_{i2}, \ldots , n_{iJ}\right) . \end{aligned}$$

We will also use the following additional notation: for \(i=1,\ldots ,n,\,j=1,\ldots , J\) we set

$$\begin{aligned} n_i = \sum _{j=1}^J n_{ij}, \quad s_j = \sum _{i=1}^n n_{ij}, \quad s= \sum _{j=1}^J s_j. \end{aligned}$$

These are the number of jumps on the i-th segment, the total number of jumps of type j (summed over all segments) and the total number of jumps of all types, respectively.

4.2 Reparametrisation and prior specification

Instead of parametrising with \((\lambda ,\, \rho _1,\ldots , \rho _J),\) we define

$$\begin{aligned} \psi _j = \lambda \rho _j, \quad j=1,\ldots , J. \end{aligned}$$

Then

$$\begin{aligned} \lambda = \sum _{j=1}^J \psi _j,\quad \rho _j=\frac{\psi _j}{\sum _{k=1}^J \psi _k}. \end{aligned}$$

The background of this reparametrisation is the observation that a compound Poisson random variable Z whose jumps are of J types can be decomposed as \(Z=\sum _{j=1}^JZ_j,\) where the \(Z_j\) are independent, compound Poisson random variables whose jumps are of type j only, and where the parameter of the Poisson random variable is \(\psi _j.\) In what follows we use \(\theta =(\psi ,\,\mu ,\, \tau )\) with \(\psi =(\psi _1,\ldots , \psi _J)\) and \(\mu =(\mu _1, \ldots , \mu _J).\)

Denote the Gamma distribution with shape parameter \(\alpha \) and rate \(\beta \) by \(\mathcal {G}(\alpha ,\,\beta ).\) We take priors

$$\begin{aligned} \psi _1,\ldots , \psi _J&\mathop {\sim }\limits ^{\text {iid}}\, \mathcal {G}\left( \alpha _0,\, \beta _0\right) , \\ \mu \mid \tau&\sim \, \mathcal {N}\left( \left[ \xi _1,\ldots , \xi _J\right] ^{\prime },\, I_{J\times J}(\tau \kappa )^{-1}\right) , \\ \tau&\sim \, \mathcal {G}\left( \alpha _1,\,\beta _1\right) , \end{aligned}$$

with positive hyperparameters \((\alpha _0,\, \beta _0,\, \alpha _1,\, \beta _1,\, \kappa )\) fixed.

4.3 Hierarchical model and data augmentation scheme

We construct a Metropolis–Hastings algorithm to draw from

$$\begin{aligned} p(\theta ,\, \mathbf {a}\mid z) =\frac{p(\theta ,\, z,\, \mathbf {a})}{p(z)}. \end{aligned}$$

For an index \(i\in \mathcal {I}\) we set \(\mathbf {a}_{-i} = \{ a_j,\, j \in \mathcal {I}\setminus \{i\}\}.\) The two main steps of the algorithm are:

  (i) Update segments: for each segment \(i \in \mathcal {I},\) draw \(a_i\) conditional on \((\theta ,\, z,\, \mathbf {a}_{-i});\)

  (ii) Update parameters: draw \(\theta \) conditional on \((z,\, \mathbf {a}).\)

Compared to the full data augmentation scheme discussed previously, the present approach is computationally much cheaper as the amount of imputation scales with the number of segments that need imputation. If the time in between observations is fixed and equal to \(\Delta ,\) then the expected number of segments for imputation equals \(n (1- e^{-\lambda \Delta }),\) which is for small \(\Delta \) approximately proportional to \(n\Delta \lambda .\)
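In pseudocode form, the resulting Metropolis-within-Gibbs sampler could be organised as in the following sketch (our own illustration; the functions update_segment and update_parameters stand for the steps detailed in Sects. 4.4 and 4.5, where we also sketch possible implementations).

```python
def mcmc_decompounding(z, deltas, theta0, a0, n_iter, update_segment, update_parameters, rng):
    """Alternate between (i) updating the auxiliary jump-type counts a_i on
    segments with nonzero increments and (ii) updating theta = (psi, mu, tau).
    Returns the list of theta iterates (burn-in removal is done afterwards)."""
    nonzero = [i for i, zi in enumerate(z) if zi != 0.0]     # the index set I
    theta, a = theta0, dict(a0)
    chain = []
    for _ in range(n_iter):
        for i in nonzero:                                    # step (i)
            a[i] = update_segment(z[i], deltas[i], a[i], theta, rng)
        theta = update_parameters(z, deltas, a, theta, rng)  # step (ii)
        chain.append(theta)
    return chain
```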

Denote the Poisson distribution with mean \(\lambda \) by \(\mathcal {P}(\lambda ).\) Including the auxiliary variables, we can write the observation model as a hierarchical model

$$\begin{aligned} z_i \mid a_i, \,\mu , \,\tau&\mathop {\sim }\limits ^{\text {ind}} N\left( a_i^{\prime }\mu , \,n_i/\tau \right) ,\nonumber \\ n_{ij} \mid \psi&\mathop {\sim }\limits ^{\text {ind}} \mathcal {P}\left( \psi _j\Delta _i\right) , \nonumber \\ (\psi ,\, \mu ,\, \tau )&\sim \pi (\psi ,\, \mu , \,\tau ) \end{aligned}$$
(17)

(with \(i\in \{1,\ldots , n\}\) and \(j\in \{1,\ldots , J\}\)). This implies

$$\begin{aligned} p(\theta ,\, z,\, \mathbf {a})= \pi (\theta ) \times \prod _{i=1}^n\left( \phi \left( z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau \right) \prod _{j=1}^J e^{-\psi _j\Delta _i} \frac{(\psi _j\Delta _i)^{n_{ij}}}{n_{ij}!}\right) . \end{aligned}$$

4.4 Updating segments

Updating the ith segment requires drawing from

$$\begin{aligned} p\left( a_i \mid \theta ,\, z,\, \mathbf {a}_{-i}\right) \propto \phi \left( z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau \right) \prod _{j=1}^J \frac{(\psi _j\Delta _i)^{n_{ij}}}{n_{ij}!}. \end{aligned}$$

We do this with a Metropolis–Hastings step. First we draw a proposal \(n_i^\circ \) (for \(n_i\)) from a \(\mathcal {P}(\lambda \Delta _i)\) distribution, conditioned to have a nonzero outcome. Next, we draw

$$\begin{aligned} a_i^\circ = \left( n_{i1}^\circ ,\ldots , n^\circ _{iJ}\right) \sim \mathcal {MN}\left( n_i^\circ ; \,\psi _1/\lambda ,\ldots , \psi _J/\lambda \right) , \end{aligned}$$

where \(\mathcal {MN}\) denotes the multinomial distribution. Hence the proposal density equals

$$\begin{aligned} q\left( n_{i1}^\circ ,\ldots , n^\circ _{iJ} \mid \theta \right)&= \frac{e^{-\lambda \Delta _i}}{1-e^{-\lambda \Delta _i}} \frac{(\lambda \Delta _i)^{n_i^\circ }}{n_i^\circ !} {n^\circ _i \atopwithdelims ()n^\circ _{i1} \ldots n^\circ _{iJ}} \prod _{j=1}^J \left( \psi _j/\lambda \right) ^{n^\circ _{ij}}\\&=\frac{e^{-\lambda \Delta _i}}{1-e^{-\lambda \Delta _i}} \prod _{j=1}^J \frac{(\psi _j\Delta _i)^{n_{ij}^\circ }}{n^\circ _{ij}!}. \end{aligned}$$

The acceptance probability for the proposal \(a_i^\circ \) equals \(1\wedge A,\) with

$$\begin{aligned} A= \frac{\phi (z_i ;\, (a_i^\circ )^{\prime } \mu ,\, n^\circ _i/\tau )}{\phi (z_i ;\, a_i^{\prime } \mu ,\, n_i/\tau ) }. \end{aligned}$$
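A sketch of this update for a single segment (our own illustration, assuming the parametrisation \(\theta =(\psi ,\,\mu ,\,\tau )\) of Sect. 4.2) might look as follows; the zero-truncated Poisson proposal is drawn by simple rejection, which is cheap in the high frequency regime where \(\lambda \Delta _i\) is small.

```python
import numpy as np
from scipy.stats import norm

def update_segment(z_i, delta_i, a_i, psi, mu, tau, rng):
    """One Metropolis-Hastings update of the jump-type counts
    a_i = (n_i1,...,n_iJ) on a segment with nonzero increment z_i."""
    psi, mu, a_i = np.asarray(psi, float), np.asarray(mu, float), np.asarray(a_i)
    lam = psi.sum()
    # Propose the total number of jumps from a zero-truncated Poisson(lam*delta_i).
    n_prop = 0
    while n_prop == 0:
        n_prop = rng.poisson(lam * delta_i)
    # Split the total over the J types with probabilities psi_j / lam.
    a_prop = rng.multinomial(n_prop, psi / lam)
    n_cur = int(a_i.sum())
    # Acceptance probability min(1, A), with A the ratio of the two normal densities.
    log_A = (norm.logpdf(z_i, loc=a_prop @ mu, scale=np.sqrt(n_prop / tau))
             - norm.logpdf(z_i, loc=a_i @ mu, scale=np.sqrt(n_cur / tau)))
    return a_prop if np.log(rng.random()) < log_A else a_i
```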

4.5 Updating parameters

The proof of the following lemma is given in Appendix 1.

Lemma 2

Conditional on \(\mathbf {a},\) the parameters \(\psi _1,\ldots , \psi _J\) are independent and, with \(T=t_n=\sum _{i=1}^n \Delta _i\) denoting the total observation time,

$$\begin{aligned} \psi _j \mid \mathbf {a}\sim \mathcal {G}\left( \alpha _0+s_j,\, \beta _0+T\right) . \end{aligned}$$

Furthermore,

$$\begin{aligned} \begin{aligned}&\mu \mid \tau ,\, z,\, \mathbf {a}\sim \mathcal {N}\left( P^{-1}q,\, \tau ^{-1} P^{-1}\right) , \\&\quad \tau \mid z,\, \mathbf {a}\sim \mathcal {G}\left( \alpha _1+I/2,\, \beta _1+\left( R-q^{\prime } P^{-1} q\right) /2\right) , \end{aligned} \end{aligned}$$
(18)

where P is the symmetric \(J\times J\) matrix with elements

$$\begin{aligned} P= \kappa I_{J\times J} + \tilde{P}, \quad \tilde{P}_{j,k} = \sum _{i\in \mathcal {I}} n_i^{-1}n_{ij} n_{ik}, \quad j,\,k \in \{1,\ldots , J\}, \end{aligned}$$
(19)

q is the J-dimensional vector with

$$\begin{aligned} q_j=\kappa \xi _j +\sum _{i \in \mathcal {I}} n_i^{-1} n_{ij} z_i, \end{aligned}$$
(20)

\(R>0\) is given by

$$\begin{aligned} R= \kappa \sum _{j=1}^J \xi _j^2 + \sum _{i \in \mathcal {I}} n_i^{-1} z_i^2, \end{aligned}$$
(21)

and \(R-q^{\prime } P^{-1} q >0.\)

Remark 2

If for some \(j\in \{1,\ldots , J\}\) we have \(s_j=0\) (no jumps of type j), then the matrix \(\tilde{P}\) is singular. However, adding \(\kappa I_{J\times J}\) ensures invertibility of P.
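For illustration, a direct translation of these conditionals into code (our own sketch; hyperparameter names follow Sect. 4.2, and we read T in Lemma 2 as the total observation time \(\sum _i \Delta _i\)) could be:

```python
import numpy as np

def update_parameters(z, deltas, counts, hyper, rng):
    """Draw (psi, mu, tau) from the conditional distributions of Lemma 2.
    counts: (n x J) array of jump-type counts n_ij (rows of zeros where z_i = 0);
    hyper: dict with keys alpha0, beta0, alpha1, beta1, kappa and xi (length-J array)."""
    z, deltas, counts = np.asarray(z, float), np.asarray(deltas, float), np.asarray(counts, float)
    n, J = counts.shape
    nonzero = counts.sum(axis=1) > 0                     # segments in the index set I
    n_i = counts[nonzero].sum(axis=1)
    s_j = counts.sum(axis=0)                             # total number of jumps of each type
    T = np.sum(deltas)                                   # total observation time (our reading of T)
    # psi_j | a ~ Gamma(alpha0 + s_j, beta0 + T), independently over j (rate parametrisation).
    psi = rng.gamma(hyper["alpha0"] + s_j, 1.0 / (hyper["beta0"] + T))
    # Build P, q and R as in (19)-(21).
    A = counts[nonzero] / n_i[:, None]                   # rows with entries n_ij / n_i
    P = hyper["kappa"] * np.eye(J) + A.T @ counts[nonzero]
    q = hyper["kappa"] * hyper["xi"] + A.T @ z[nonzero]
    R = hyper["kappa"] * np.sum(hyper["xi"] ** 2) + np.sum(z[nonzero] ** 2 / n_i)
    P_inv = np.linalg.inv(P)
    # tau | z, a ~ Gamma(alpha1 + I/2, beta1 + (R - q' P^{-1} q)/2).
    tau = rng.gamma(hyper["alpha1"] + 0.5 * nonzero.sum(),
                    1.0 / (hyper["beta1"] + 0.5 * (R - q @ P_inv @ q)))
    # mu | tau, z, a ~ N(P^{-1} q, (tau P)^{-1}).
    mu = rng.multivariate_normal(P_inv @ q, P_inv / tau)
    return psi, mu, tau
```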

4.6 Numerical illustrations

The first two examples concern mixtures of two normal distributions. We simulated \(n=5000\) segments with \(\Delta =1,\, \mu _1=2,\, \mu _2={-}1\) and \(\tau =1.\) For the prior hyperparameters we took \(\alpha _0=\beta _0=\alpha _1=\beta _1=1,\, \xi _1=\xi _2=0\) and \(\kappa =1.\)

The results for \(\lambda \Delta =1,\, \rho _1=0.8,\, \rho _2=0.2\) and hence \(\psi _1=0.8\) and \(\psi _2=0.2\) are shown in Fig. 1. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 2. The average acceptance probability for updating the segments was 51%.

The results for \(\lambda \Delta =3,\, \rho _1=0.8,\, \rho _2=0.2\) and hence \(\psi _1=2.4\) and \(\psi _2=0.6\) are shown in Fig. 3. The densities obtained from the posterior mean of the parameter estimates and the true density are shown in Fig. 4. The average acceptance probability for updating the segments was 41%. Observe that the autocorrelation functions of the iterations of the \(\psi _i\) in the second case display a much slower decay.

We also assessed the performance of our method on a more complicated example where we took a mixture of four normals. Here \(\Delta =1,\, (\mu _1,\, \mu _2,\, \mu _3,\, \mu _4)=({-}1,\, 0,\, 0.8,\, 2),\, (\psi _1,\,\psi _2,\, \psi _3,\, \psi _4)=(0.3,\, 0.4,\, 0.2,\, 0.1)\) (hence \(\lambda =1\)) and \(\tau ^{-1}=0.09.\) The results obtained after simulating \(n=10{,}000\) segments are shown in Figs. 5 and 6.

Mixtures of normals need not be multimodal and can also yield skewed densities. As an example, we consider the case where \((\mu _1,\,\mu _2)=(0,\,2),\,(\psi _1,\, \psi _2)=(1.5,\,0.5)\) (hence \(\lambda =2\)) and \(\tau =1.\) Data were generated and discretely sampled with \(\Delta =1\) and \(n=5000\) segments. A plot of the posterior mean is shown in Fig. 7.

Fig. 1

Results for \(\lambda =1\) using 15,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 5,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only each fifth iterate was saved. The horizontal yellow lines are obtained from computing the posterior mean of \(\theta \) based on the true auxiliary variables on all segments

Fig. 2

Results for \(\lambda =1;\) the first 5,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates

Fig. 3

Results for \(\lambda =3\) using 25,000 MCMC iterations. The trace plots show all iterations; in the other plots the first 10,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only each fifth iterate was saved. The horizontal yellow lines are obtained from computing the posterior mean of \(\theta \) based on the true auxiliary variables on all segments

Fig. 4

Results for \(\lambda =3;\) the first 10,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates

Fig. 5

Results for the example with a mixture of four normals using 100,000 MCMC iterations. The trace plots show all iterations; in the autocorrelation plot the first 20,000 iterations are treated as burn-in. The figures are obtained after subsampling the iterates, where only each fifth iterate was saved. The horizontal yellow lines indicate true values. The results for the other parameters are similar and therefore not displayed

Fig. 6

Results for the example with a mixture of four normals; the first 20,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates

Fig. 7

Results for the example with a skew density; the first 20,000 iterations are treated as burn-in. Shown are the true jump size density and the density obtained from the posterior mean of the non-burn-in iterates

4.7 Discussion

As can be seen from the autocorrelation plots, mixing of the chain deteriorates when \(\lambda \Delta \) increases. As the focus in this article is on high frequency data, where there are on average only a few jumps in between observations, we do not go into details on improving the algorithm. We remark that a non-centred parametrisation (see for instance Papaspiliopoulos et al. 2007) may give more satisfactory results when \(\lambda \Delta \) is large. A non-centred parametrisation can be obtained by changing the hierarchical model in (17). Denote by \(F^{-1}_\lambda \) the inverse cumulative distribution function of the \(\mathcal {P}(\lambda )\) distribution. Let \(u_{ij}\) (\(i=1,\ldots , n\) and \(j=1,\ldots , J\)) be a sequence of independent \(U(0,\,1)\) random variables and set \(u=\{u_{ij},\, i=1,\ldots , n,\, j=1,\ldots , J\}.\) By considering the hierarchical model

$$\begin{aligned} z_i \mid u,\, \mu ,\, \tau&\mathop {\sim }\limits ^{\text {ind}} N\left( \sum _{j=1}^J \mu _j F^{-1}_{\psi _j \Delta _i}\left( u_{ij}\right) ,\, \tau ^{-1}\sum _{j=1}^J F^{-1}_{\psi _j \Delta _i}\left( u_{ij}\right) \right) ,\nonumber \\ u_{ij}&\mathop {\sim }\limits ^{\text {iid}}U(0,\,1), \nonumber \\ (\psi ,\, \mu , \,\tau )&\sim \pi (\psi ,\, \mu , \,\tau ), \end{aligned}$$
(22)

(with \(i\in \{1,\ldots , n\}\) and \(j\in \{1,\ldots , J\}\)), \(\psi \) can be updated using a Metropolis–Hastings step. In this way \(\{n_{ij}\}\) and \(\psi \) are updated simultaneously.
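A minimal sketch of the likelihood evaluation that such a Metropolis–Hastings update of \(\psi \) would require under (22) (our own illustration; we use SciPy's Poisson quantile function for \(F^{-1}\)) could be:

```python
import numpy as np
from scipy.stats import norm, poisson

def loglik_psi_noncentred(psi, u, z, deltas, mu, tau):
    """Log-likelihood of psi in the non-centred parametrisation (22): the
    uniforms u (n x J array) are mapped to jump counts through the Poisson
    quantile function, which then determines the conditional law of each z_i."""
    psi, mu = np.asarray(psi, float), np.asarray(mu, float)
    z, deltas, u = np.asarray(z, float), np.asarray(deltas, float), np.asarray(u, float)
    counts = poisson.ppf(u, np.outer(deltas, psi))    # n_ij = F^{-1}_{psi_j * delta_i}(u_ij)
    n_i = counts.sum(axis=1)
    ll = np.full(len(z), -np.inf)
    none = n_i == 0                                   # no jumps on the segment forces z_i = 0
    ll[none] = np.where(z[none] == 0.0, 0.0, -np.inf)
    ll[~none] = norm.logpdf(z[~none], loc=counts[~none] @ mu,
                            scale=np.sqrt(n_i[~none] / tau))
    return np.sum(ll)
```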

Another option is to integrate out \((\mu ,\,\tau )\) from \(p(\theta ,\,z,\,\mathbf {a}).\) In this model it is even possible to integrate out \(\psi \) as well. In that case only the auxiliary variables \(\mathbf {a}\) have to be updated. Yet another way to improve the efficiency of the algorithm is to use ideas from parallel tempering (cf. Brooks et al. 2011, Chap. 11).

5 Proof of Theorem 1

There are a number of general results in Bayesian nonparametric statistics, such as the fundamental Theorem 2.1 in Ghosal et al. (2000) and Theorem 2.1 in Ghosal and van der Vaart (2001), which allow determination of posterior contraction rates through checking certain conditions, but none of these results is easily and directly applicable in our case. The principal bottleneck is that a main assumption underlying these theorems is sampling from a fixed distribution, whereas in our high frequency setting the distributions vary with \(\Delta .\) Therefore, for clarity of exposition, in the proof of our main theorem we choose an alternative path, which consists in mimicking the main steps of the proof of Theorem 2.1, involving judiciously chosen statistical tests, as in Ghosal et al. (2000), while also employing some results on Dirichlet location mixtures of normal densities from Ghosal and van der Vaart (2001). However, a significant part of the technicalities we encounter is specific to the decompounding problem.

Throughout this section we assume that Assumptions 1 and 2 hold. Furthermore, in view of the discussion that followed Theorem 1 we will without loss of generality assume that \(0<\delta \le 4.\) All the technical lemmas used in this section are collected in the appendices.

We start with the decomposition

$$\begin{aligned} \Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta }\right) =\Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta } \right) \phi _n+\Pi \left( A\left( \varepsilon _n,\,M\right) |\mathcal {Z}_n^{\Delta } \right) \left( 1-\phi _n\right) =:\mathrm {I}_n+\mathrm {II}_n, \end{aligned}$$
(23)

where \(0\le \phi _n\le 1\) is a sequence of tests based on observations \(\mathcal {Z}_n^{\Delta }\) and with properties to be specified below. The idea is to show that the terms on the right-hand side of the above display separately converge to zero in probability. The tests \(\phi _n\) allow one to control the behaviour of the likelihood ratio

$$\begin{aligned} \mathcal {L}_n^{\Delta }(\lambda ,\,f)=\prod _{i=1}^n \frac{k_{\lambda ,f}^{\Delta }(Z_i^{\Delta })}{k_{\lambda _0,f_0}^{\Delta }(Z_i^{\Delta })}, \end{aligned}$$

on the set where it is not well-behaved due to the fact that \((\lambda ,\,f)\) is ‘far away’ from \((\lambda _0,\,f_0).\)

5.1 Construction of tests

The next lemma is an adaptation of Theorem 7.1 from Ghosal et al. (2000) to decompounding. A proof is given in the appendix. We use the notation \(D(\varepsilon ,\,A,\,d)\) to denote the \(\varepsilon \)-packing number of a set A in a metric space with metric d,  applied in our case with d the scaled Hellinger metric \(h^\Delta .\)

Lemma 3

Let \(\mathcal {Q}\) be an arbitrary set of probability measures \(\mathbb {Q}^\Delta _{\lambda ,f}.\) Suppose for some non-increasing function \(D(\varepsilon ),\) some sequence \(\{\varepsilon _n\}\) of positive numbers and every \(\varepsilon >\varepsilon _n,\)

$$\begin{aligned} D\left( \frac{\varepsilon }{2},\,\left\{ \mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}{\text {:}}\,\varepsilon \le h^\Delta \left( \mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f}\right) \le 2\varepsilon \right\} ,\,h^\Delta \right) \le D(\varepsilon ). \end{aligned}$$
(24)

Then for every \(\varepsilon >\varepsilon _n\) there exists a sequence of tests \(\{\phi _n\}\) (depending on \(\varepsilon >0),\) such that

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le D(\varepsilon )\exp \left( {-}Kn\Delta \varepsilon ^2\right) \frac{1}{1-\exp ({-}Kn\Delta \varepsilon ^2)},\\&\sup _{\{\mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}{\text {:}}\,h^\Delta (\mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f})>\varepsilon \}}\mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}Kn\Delta \varepsilon ^2 \right) , \end{aligned} \end{aligned}$$

where \(K>0\) is a universal constant.

In the proofs of Propositions 1 and 2 we need the inequalities below. There exists a constant \(\overline{C}\in (0,\,\infty )\) depending on \(\underline{\lambda }\) and \(\overline{\lambda }\) only, such that for all \(\lambda _1,\,\lambda _2\in [\underline{\lambda },\,\overline{\lambda }]\) and \(f_1,\,f_2\) it holds that

$$\begin{aligned} \mathrm {K}\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\Delta \left( \mathrm {K}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\left| \lambda _1-\lambda _2\right| ^2\right) , \end{aligned}$$
(25)
$$\begin{aligned} \mathrm {V}\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\Delta \left( \mathrm {V}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\mathrm {K}\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) +\left| \lambda _1-\lambda _2\right| ^2\right) , \end{aligned}$$
(26)
$$\begin{aligned} h\left( \mathbb {Q}^{\Delta }_{\lambda _1,f_1},\,\mathbb {Q}^{\Delta }_{\lambda _2,f_2}\right)&\le \overline{C}\sqrt{\Delta }\left( \left| \lambda _1-\lambda _2\right| + h\left( \mathbb {P}_{f_1},\,\mathbb {P}_{f_2}\right) \right) . \end{aligned}$$
(27)

These inequalities can be proven in the same way as Lemma 1 in Gugushvili et al. (2015).

Let \(\varepsilon _n\) be as in Theorem 1. Throughout, \(\overline{C}\) denotes the above constant. For a constant \(L>0\) define the sequences \(\{a_n\}\) and \(\{\eta _n\}\) by

$$\begin{aligned} a_n=L\log ^{2/\delta }\left( \frac{1}{\eta _n} \right) , \quad \eta _n=\frac{\varepsilon _n}{4\overline{C}}. \end{aligned}$$
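For later reference (this identity is used in the final step of the proof of Theorem 1 below), these definitions immediately give

$$\begin{aligned} a_n^{\delta }=L^{\delta }\log ^2\left( \frac{1}{\eta _n}\right) =L^{\delta }\log ^2\left( \frac{4\overline{C}}{\varepsilon _n}\right) . \end{aligned}$$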

We will show that inequality (24) holds true for every \(\varepsilon =M\varepsilon _n\) with \(M>2\) and the set of measures \(\mathcal {Q}\) equal to

$$\begin{aligned} \mathcal {Q}_n=\left\{ \mathbb { Q}_{\lambda ,f_{H,\sigma }}^{\Delta }{\text {:}}\, \lambda \in [\underline{\lambda },\, \overline{\lambda }],\, H\left[ {-}a_n,\,a_n\right] \ge 1-\eta _n,\, \sigma \in [\underline{\sigma },\,\overline{\sigma }] \right\} . \end{aligned}$$

As a first step, note that we have

$$\begin{aligned} \log D\left( \frac{\varepsilon }{2},\,\mathcal {Q}_n,\,h^\Delta \right)&\le \log D\left( {\varepsilon _n},\,\mathcal {Q}_n,\,h^\Delta \right) \nonumber \\&\le \log N\left( \frac{\varepsilon _n}{2},\,\mathcal {Q}_n,\,h^\Delta \right) = \log N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h \right) , \end{aligned}$$
(28)

where \(N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h\right) \) is the covering number of the set \(\mathcal {Q}_n\) with h-balls of size \(\varepsilon _n\sqrt{\Delta }/2.\) The first inequality in (28) follows from the assumption \(M>2\) (so that \(\varepsilon /2\ge \varepsilon _n\) and packing numbers are non-increasing in the radius), the second inequality is the standard bound of a packing number by a covering number at half the radius, and the equality holds because the scaled Hellinger metric satisfies \(h^\Delta =h/\sqrt{\Delta },\) so that an \(h^\Delta \)-ball of radius \(\varepsilon _n/2\) is an h-ball of radius \(\varepsilon _n\sqrt{\Delta }/2.\) For bounding the right-hand side in (28), we have the following proposition.

Proposition 1

We have

$$\begin{aligned} \log N\left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,h\right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$
(29)

Proof

Define

$$\begin{aligned} \mathcal {F}_n = \left\{ f_{H,\sigma }{\text {:}}\, H\left[ {-}a_n,\,a_n\right] \ge 1-\eta _n,\, \sigma \in [\underline{\sigma },\, \overline{\sigma }]\right\} . \end{aligned}$$

Let \(\{\lambda _i\}\) be the centres of the balls from a minimal covering of \([\underline{\lambda },\, \overline{\lambda }]\) with \(|\cdot |\)-balls of size \(\eta _n,\) and let \(\{f_j\}\) be the centres of the balls from a minimal covering of \(\mathcal {F}_n\) with h-balls of size \(\eta _n.\) For any \(\mathbb {Q}^{\Delta }_{\lambda , f_{H,\sigma }} \in \mathcal {Q}_n\) we can choose i and j such that \(|\lambda -\lambda _i|\le \eta _n\) and \(h(f_{H,\sigma },\,f_j)\le \eta _n,\) and then by (27) and the definition of \(\eta _n,\)

$$\begin{aligned} h\left( \mathbb {Q}^{\Delta }_{\lambda , f_{H,\sigma }} ,\, \mathbb {Q}^{\Delta }_{\lambda _i, f_j}\right) \le \overline{C}\sqrt{\Delta }\left( \left| \lambda -\lambda _i\right| +h\left( f_{H,\sigma },\,f_j\right) \right) \le 2\overline{C}\sqrt{\Delta }\,\eta _n=\frac{\varepsilon _n\sqrt{\Delta }}{2}. \end{aligned}$$

It follows that

$$\begin{aligned} \log N \left( \frac{\varepsilon _n\sqrt{\Delta }}{2},\,\mathcal {Q}_n,\,{h}\right) \le \log N \left( \eta _n,\,[\underline{\lambda },\,\overline{\lambda }],\,|\cdot |\right) + \log N \left( \eta _n ,\, {\mathcal {F}}_n ,\, {h}\right) . \end{aligned}$$

Evidently,

$$\begin{aligned} \log N\left( \eta _n,\,[\underline{\lambda },\,\overline{\lambda }],\,|\cdot |\right) \lesssim \log \left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$
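For completeness, this follows since \([\underline{\lambda },\,\overline{\lambda }]\) can be covered by of the order \(1/\eta _n\) intervals of size \(\eta _n\):

$$\begin{aligned} N\left( \eta _n,\,[\underline{\lambda },\,\overline{\lambda }],\,|\cdot |\right) \lesssim \frac{1}{\eta _n}=\frac{4\overline{C}}{\varepsilon _n}. \end{aligned}$$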

As we assume \(\delta \le 4,\) we can apply the arguments in Ghosal and van der Vaart (2001, pp. 1251–1252), in particular formulae (5.8)–(5.10) (cf. also Theorem 3.1 and Lemma A.3 there), which yield

$$\begin{aligned} \log N\left( \eta _n ,\, {\mathcal {F}}_n ,\, {h}\right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) . \end{aligned}$$

Combination of the above three inequalities implies the statement of the proposition.

An application of Proposition 1 to (28) gives

$$\begin{aligned} \log D\left( \frac{\varepsilon }{2},\,\mathcal {Q}_n,\,h^\Delta \right) \lesssim \log ^{4/\delta +1}\left( \frac{1}{\varepsilon _n} \right) \le c_1 n\Delta \varepsilon _n^2, \end{aligned}$$

for some positive constant \(c_1.\) Here, the final inequality follows from our choice for \(\varepsilon _n.\) Hence, (24) is satisfied for

$$\begin{aligned} D(\varepsilon )=\exp \left( \left( c_1/M^2-K\right) n \Delta \varepsilon ^2\right) . \end{aligned}$$

By Lemma 3 there exist tests \(\phi _n\) such that for all n large enough

$$\begin{aligned}&\mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le 2 \exp \left( -\left( KM^2-c_1\right) n\Delta \varepsilon _n^2 \right) , \end{aligned}$$
(30)
$$\begin{aligned}&\sup _{\{\mathbb {Q}^\Delta _{\lambda ,f}\in \mathcal {Q}_n{\text {:}}\,h^\Delta (\mathbb {Q}^\Delta _{\lambda _0,f_0},\,\mathbb {Q}^\Delta _{\lambda ,f})>\varepsilon \}}\mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] \le \exp \left( {-}Kn\Delta M^2\varepsilon _n^2 \right) . \end{aligned}$$
(31)
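For clarity, (30) follows from the first bound in Lemma 3 applied with \(\varepsilon =M\varepsilon _n\) and \(\mathcal {Q}=\mathcal {Q}_n\): with the above choice of \(D(\varepsilon )\) we have \(D(M\varepsilon _n)=\exp ((c_1-KM^2)n\Delta \varepsilon _n^2),\) and \(1/(1-\exp (-KM^2n\Delta \varepsilon _n^2))\le 2\) for all n large enough, so that

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \phi _n\right] \le 2\exp \left( \left( c_1-KM^2\right) n\Delta \varepsilon _n^2\right) \exp \left( -KM^2n\Delta \varepsilon _n^2\right) \le 2\exp \left( -\left( KM^2-c_1\right) n\Delta \varepsilon _n^2\right) . \end{aligned}$$

The bound (31) is the second inequality of Lemma 3 with the same choices.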

5.2 Bound on \(\mathrm {I}_n\) in (23)

First note that by (30)

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {I}_n \right] \le \mathbb {E}_{\lambda _0,f_0}\left[ \phi _n \right] \le 2 \exp \left( -\left( KM^2-c_1\right) n\Delta \varepsilon _n^2 \right) . \end{aligned}$$

Chebyshev’s inequality implies that \(\mathrm {I}_n\) converges to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty ,\) as soon as M is chosen so large that \(KM^2-c_1>0.\,\square \)

5.3 Bound on \(\mathrm {II}_n\)

Now we consider \(\mathrm {II}_n.\) We have

$$\begin{aligned} \mathrm {II}_n=\frac{ \iint _{A(\varepsilon _n,M)} \mathcal {L}_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) (1-\phi _n) }{ \iint \mathcal {L}_n^{\Delta }(\lambda ,\,f) \mathrm {d}\Pi _1(\lambda ) \mathrm {d}\Pi _2(f) }=:\frac{\mathrm {III}_n}{\mathrm {IV}_n}. \end{aligned}$$

We will show that the numerator \(\mathrm {III}_n\) goes exponentially fast to zero, in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability, while the denominator \(\mathrm {IV}_n\) is bounded from below by an exponential function, with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one, in such a way that the ratio of \(\mathrm {III}_n\) and \(\mathrm {IV}_n\) still goes to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability.

5.3.1 Bounding \(\mathrm {III}_n\)

As \(1_{\{A(\varepsilon _n,M)\}} \le 1_{\mathcal {Q}_n^c}+1_{\{A(\varepsilon _n,M)\cap \mathcal {Q}_n\}}\) we have

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {III}_n\right] \le \Pi \left( \mathcal {Q}_n^c\right) +\iint _{ \mathcal {Q}_n \cap A(\varepsilon _n,M) } \mathbb {E}_{\lambda ,f} \left[ 1-\phi _n\right] \mathrm {d}\Pi _1(\lambda )\mathrm {d}\Pi _2(f). \end{aligned}$$
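To obtain the second term on the right-hand side we used Fubini’s theorem together with the following change-of-measure bound, which holds for every \((\lambda ,\,f)\) because \(\mathcal {L}_n^{\Delta }(\lambda ,\,f)\) is the likelihood ratio of the parameters \((\lambda ,\,f)\) relative to \((\lambda _0,\,f_0)\) (with equality when \(\mathbb {Q}^{\Delta }_{\lambda ,f}\) is absolutely continuous with respect to \(\mathbb {Q}^{\Delta }_{\lambda _0,f_0}\)):

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathcal {L}_n^{\Delta }(\lambda ,\,f)\left( 1-\phi _n\right) \right] \le \mathbb {E}_{\lambda ,f}\left[ 1-\phi _n\right] . \end{aligned}$$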

By (31), the second term on the right-hand side is bounded by \(\exp (-KM^2n\Delta \varepsilon _n^2).\) Furthermore,

$$\begin{aligned} \Pi \left( \mathcal {Q}_n^c\right) =\Pi _2\left( H\left[ {-}a_n,\,a_n\right] <1-\eta _n,\,\sigma \in [\underline{\sigma },\,\overline{\sigma }]\right) \lesssim \frac{1}{\eta _n}e^{-ba_n^{\delta }}, \end{aligned}$$

where the last inequality is formula (5.11) in Ghosal and van der Vaart (2001). Hence

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {III}_n\right] \lesssim \frac{1}{\eta _n}e^{-ba_n^{\delta }} + \exp \left( -KM^2n\Delta \varepsilon _n^2\right) . \end{aligned}$$
(32)

5.3.2 Bounding \(\mathrm {IV}_n\)

Recall \(\mathrm {K}_\Delta =\mathrm {K}/\Delta \) and \(\mathrm {V}_\Delta =\mathrm {V}/\Delta .\) Let

$$\begin{aligned} B^{\Delta }\left( \varepsilon ,\,\left( \lambda _0,\,f_0\right) \right) =\left\{ (\lambda ,\,f){\text {:}}\, \mathrm {K}_{\Delta }\left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\, \mathbb {Q}^{\Delta }_{\lambda ,f}\right) \le \varepsilon ^2,\, \mathrm {V}_{\Delta }\left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\, \mathbb {Q}^{\Delta }_{\lambda ,f}\right) \le \varepsilon ^2 \right\} , \end{aligned}$$

and

$$\begin{aligned} \widetilde{\varepsilon }_n=\frac{\log (n\Delta )}{\sqrt{n\Delta }}. \end{aligned}$$

Note that \(n\Delta \widetilde{\varepsilon }_n^2=\log ^2(n\Delta )\rightarrow \infty \) when \(n\rightarrow \infty .\)

We will use the following bound, an adaptation of Lemma 8.1 in Ghosal et al. (2000) to our setting, valid for every \(\varepsilon >0\) and \(C>0,\)

$$\begin{aligned} \mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\left( \iint _{ B^{\Delta }(\varepsilon ,(\lambda _0,f_0)) } \mathcal {L}_n^{\Delta }(\lambda ,\,f)\mathrm {d}\widetilde{\Pi }(\lambda ,\,f) \le \exp \left( -(1+C)n\Delta \varepsilon ^2\right) \right) \le \frac{1}{C^2n\Delta \varepsilon ^2}, \end{aligned}$$
(33)

where

$$\begin{aligned} \widetilde{\Pi }(\cdot )=\frac{\Pi (\cdot )}{\Pi ( B^{\Delta }(\varepsilon ,\,(\lambda _0,\,f_0)) )} \end{aligned}$$

is the normalised restriction of \(\Pi (\cdot )\) to \(B^{\Delta }(\varepsilon ,\,(\lambda _0,\,f_0)).\)

By virtue of (33), with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one, for any constant \(C>0\) we have

$$\begin{aligned} \mathrm {IV}_n&\ge \iint _{B^{\Delta }(\widetilde{\varepsilon }_n,(\lambda _0,f_0))} \mathcal {L}^\Delta _n(\lambda ,\,f)\mathrm {d}\Pi _1(\lambda )\mathrm {d}\Pi _2(f)\nonumber \\&>\Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \exp \left( -(1+C)n\Delta \widetilde{\varepsilon }_n^2\right) . \end{aligned}$$
(34)
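To spell out the second inequality in (34): the integral restricted to \(B^{\Delta }(\widetilde{\varepsilon }_n,\,(\lambda _0,\,f_0))\) can be rewritten in terms of the normalised prior \(\widetilde{\Pi },\) and (33) applied with \(\varepsilon =\widetilde{\varepsilon }_n\) shows that the resulting lower bound can fail only on an event of probability at most \(1/(C^2n\Delta \widetilde{\varepsilon }_n^2)\rightarrow 0.\) On the complement of that event,

$$\begin{aligned} \iint _{B^{\Delta }(\widetilde{\varepsilon }_n,(\lambda _0,f_0))} \mathcal {L}_n^{\Delta }(\lambda ,\,f)\mathrm {d}\Pi _1(\lambda )\mathrm {d}\Pi _2(f)&=\Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \iint \mathcal {L}_n^{\Delta }(\lambda ,\,f)\mathrm {d}\widetilde{\Pi }(\lambda ,\,f)\\&>\Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \exp \left( -(1+C)n\Delta \widetilde{\varepsilon }_n^2\right) . \end{aligned}$$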

We now derive a lower bound for the prior probability \(\Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \) appearing on the right-hand side of this inequality.

Proposition 2

It holds that

$$\begin{aligned} \Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \gtrsim \exp \left( - \bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) , \end{aligned}$$

for some constant \(\bar{c}.\)

Proof

Let \(0<c\le 1/\sqrt{5\overline{C}}\) be a constant. Here \(\overline{C}\) is the constant in (25) and (26). By these inequalities it is readily seen that

$$\begin{aligned} \left\{ (\lambda ,\,f){\text {:}}\,\mathrm {K}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2 \widetilde{\varepsilon }_n^2,\,\mathrm {V}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2 \widetilde{\varepsilon }_n^2 ,\,\left| \lambda _0-\lambda \right| ^2 \le c^2 \widetilde{\varepsilon }_n^2\right\} \subset B^{\Delta }\left( \widetilde{\varepsilon }_n,\, \left( \lambda _0,\,f_0\right) \right) . \end{aligned}$$
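In detail, for \((\lambda ,\,f)\) in the set on the left-hand side, dividing (25) and (26) by \(\Delta \) and using \(c^2\le 1/(5\overline{C})\) gives

$$\begin{aligned} \mathrm {K}_{\Delta }\left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\,\mathbb {Q}^{\Delta }_{\lambda ,f}\right)&\le \overline{C}\left( \mathrm {K}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) +\left| \lambda _0-\lambda \right| ^2\right) \le 2\overline{C}c^2\widetilde{\varepsilon }_n^2\le \widetilde{\varepsilon }_n^2,\\ \mathrm {V}_{\Delta }\left( \mathbb {Q}^{\Delta }_{\lambda _0,f_0},\,\mathbb {Q}^{\Delta }_{\lambda ,f}\right)&\le \overline{C}\left( \mathrm {V}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) +\mathrm {K}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) +\left| \lambda _0-\lambda \right| ^2\right) \le 3\overline{C}c^2\widetilde{\varepsilon }_n^2\le \widetilde{\varepsilon }_n^2, \end{aligned}$$

so that both defining conditions of \(B^{\Delta }(\widetilde{\varepsilon }_n,\,(\lambda _0,\,f_0))\) are met.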

It then follows by the independence assumption on \(\Pi _1\) and \(\Pi _2\) that

$$\begin{aligned} \Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right)&\ge \Pi _1\left( \left| \lambda _0-\lambda \right| \le c \widetilde{\varepsilon }_n \right) \\&\quad \times \Pi _2 \left( f {\text {:}}\, \mathrm {K}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2{\widetilde{\varepsilon }_n^2},\, \mathrm {V}\left( \mathbb {P}_{f_0},\,\mathbb {P}_{f}\right) \le c^2{\widetilde{\varepsilon }_n^2} \right) . \end{aligned}$$

For the first factor on the right-hand side we have by (13) that

$$\begin{aligned} \Pi _1\left( \left| \lambda _0-\lambda \right| \le c \widetilde{\varepsilon }_n \right) \gtrsim \widetilde{\varepsilon }_n. \end{aligned}$$

The second factor is bounded from below, for some constants \(\overline{c}_1,\,\overline{c}_2,\) by

$$\begin{aligned} \overline{c}_1 \exp \left( {-} \overline{c}_2 \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) , \end{aligned}$$

by the same arguments as in inequality (5.17) in Ghosal and van der Vaart (2001). The result now follows by combining the two lower bounds.
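Explicitly, combining the two lower bounds and writing \(\widetilde{\varepsilon }_n=\exp (-\log (1/\widetilde{\varepsilon }_n)),\) we obtain, for all n large enough (so that \(\log (1/\widetilde{\varepsilon }_n)\ge 1\)),

$$\begin{aligned} \Pi \left( B^{\Delta }\left( \widetilde{\varepsilon }_n,\,\left( \lambda _0,\,f_0\right) \right) \right) \gtrsim \exp \left( -\log \left( \frac{1}{\widetilde{\varepsilon }_n}\right) -\overline{c}_2\log ^2\left( \frac{1}{\widetilde{\varepsilon }_n}\right) \right) \ge \exp \left( -\left( 1+\overline{c}_2\right) \log ^2\left( \frac{1}{\widetilde{\varepsilon }_n}\right) \right) , \end{aligned}$$

so that the proposition holds with \(\bar{c}=1+\overline{c}_2.\)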

Combining (34) with Proposition 2, with \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability tending to one as \(n\rightarrow \infty ,\) for any constant \(C>0\) we have

$$\begin{aligned} \mathrm {IV}_n&> \exp \left( -(1+C)n\Delta \widetilde{\varepsilon }_n^2-\bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) . \end{aligned}$$
(35)

We are now ready to complete the proof that \(\mathrm {II}_n\) tends to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability. Let \(G_n\) denote the event on which inequality (35) holds. Then by (32) and (35) we obtain

$$\begin{aligned} \mathbb {E}_{\lambda _0,f_0}\left[ \mathrm {II}_n 1_{G_n} \right]&\lesssim \exp \left( (1+C)n\Delta \widetilde{\varepsilon }_n^2+\bar{c} \log ^2 \left( \frac{1}{ \widetilde{\varepsilon }_n } \right) \right) \\&\quad \times \left[ \frac{1}{\eta _n}e^{-ba_n^{\delta }} + \exp \left( -KM^2n\Delta \varepsilon _n^2\right) \right] . \end{aligned}$$

Recall that \(n\Delta \widetilde{\varepsilon }_n^2=\log ^2(n\Delta ).\) Hence, the exponent in the first factor of this display is of order \(\log ^2 (n\Delta ).\) Furthermore, \(a_n^\delta =L^\delta \log ^2({4\overline{C}}/{\varepsilon _n}),\) which is of order \(\log ^2(n\Delta )\) as well. It follows that, provided the constants L and M are chosen large enough, the right-hand side of the above display converges to zero as \(n\rightarrow \infty .\) Chebyshev’s inequality then implies that \(\mathrm {II}_n 1_{G_n}\) converges to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability as \(n\rightarrow \infty ;\) since moreover \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}(G_n^c)\rightarrow 0,\) also \(\mathrm {II}_n\) converges to zero in \(\mathbb {Q}_{\lambda _0,f_0}^{\Delta , n}\)-probability. This completes the proof of Theorem 1. \(\square \)