1 Introduction

Diffusions have been widely applied to model continuous-time phenomena of interest, including in molecular dynamics (Boys et al. 2008), neuroscience (Lansky and Ditlevsen 2008), and finance (Karatzas and Shreve 1998). In general, a diffusion on \(\mathbbm{R}^d\) is a Markov process X defined as the solution, whose law we will denote by \(\mathbbm{P}\), to a stochastic differential equation of the following form:

$$\begin{aligned} \mathrm {d}X_t&= b(X_t) \mathrm {d}t + \sigma (X_t) \mathrm {d}W_t,\quad X_0=x_0,\quad t\in [0,T], \end{aligned}$$
(1)

where \(b:\mathbbm{R}^d\rightarrow \mathbbm{R}^d\) and \(\sigma :\mathbbm{R}^d\rightarrow \mathbbm{R}^{d\times d'}\) denote the drift and volatility coefficient respectively, and W is a standard \(d'\)-dimensional Brownian motion. Throughout we assume standard regularity conditions hold which ensure the existence of a unique, global, weak solution to (1) (see for instance Øksendal 2003).

In practice we will typically only have access to discrete observations of (1), and so the statistical problem of interest to practitioners is to use these observations to draw inference on the parameters of b and \(\sigma\). A common Bayesian strategy is to augment the parameter space with the space of complete underlying diffusion trajectories. A Markov chain Monte Carlo algorithm can then explore this augmented space by alternating between updates of the parameters and updates of the unobserved sample path connecting observations, i.e. sampling of diffusion bridges (Roberts and Stramer 2001). As a consequence, a considerable and methodologically diverse literature has developed around simulating diffusion bridges (that is, sampling from the law of (1) conditioned to terminate at the subsequent observation, for instance \(X_T=x_T\), which we denote by \(\mathbbm {P}^{(T,x_0,x_T)}\) or generically by \(\mathbbm{P}^{\star }\)); contributions include Beskos and Roberts (2005); Bladt et al. (2014); Delyon and Hu (2006); Durham and Gallant (2002); Golightly and Wilkinson (2008); Hairer et al. (2011); Roberts and Stramer (2001); Schauer et al. (2017).

One of the common difficulties with Markov chain Monte Carlo strategies is sampling diffusion bridges between distant observations, that is, when the duration of the bridge, which we denote by T, is large. This setting naturally arises when the underlying diffusion (1) is sparsely observed (or high-dimensional), for instance in shape analysis applications (Arnaudon et al. 2020), or in the case of diffusions on graphs (Freidlin and Wentzell 1993). The problem here is that methodologies for sampling diffusion bridges scale poorly with T, and many of the most widely used approaches have computational cost that is exponential in T. Consequently, addressing the poor scaling in T has drawn considerable interest. One popular approach is the blocking scheme introduced by Shephard and Pitt (1997), which has been employed in a number of practical problems with strong empirical evidence of its efficacy (Chib et al. 2004; Golightly and Wilkinson 2008; Kalogeropoulos 2007; Kalogeropoulos et al. 2010; van der Meulen and Schauer 2018; Stramer and Roberts 2007).

Blocking is a conceptually simple idea in which the time domain of the diffusion bridge is overlaid with a set of temporal anchors (\(0=:k_0<k_1<\dots<k_m<k_{m+1}:=T\)), and the values of some initialisation trajectory at those points are recorded (these values are known as knots, and we write \(X_i:=X_{k_i}\) to simplify notation). Simulation from \(\mathbbm {P}^{(T,x_0,x_T)}\) is then achieved by constructing a Gibbs sampler which alternates between updating knots and updating the segments of the trajectory conditional on the knots. For instance, we could begin by simulating from the conditional law \(\mathbbm {P}^{(k_2-k_0,X_0,X_{2})}\) (updating the trajectory on \([k_0,k_2]\), which includes the knot \(X_1\), conditional on \(X_0\) and \(X_2\)), then from \(\mathbbm{P}^{(k_3-k_1,X_1,X_{3})}\) (updating the trajectory on \([k_1,k_3]\), which contains the knot \(X_2\), conditional on \(X_1\) and \(X_3\)), and so on, sweeping across all anchor points. This sweep would then be iterated a number of times to reduce the dependency between the resulting bridge and the initial (or previous) trajectory. In this article we consider the three canonical blocking schemes of Roberts and Sahu (1997) with equidistant anchors: the checkerboard scheme, in which the odd- and even-indexed knots are alternately updated; the lexicographic scheme, in which the knots are updated in temporal order; and the random scheme, in which at each step a random knot is updated. We introduce blocking more formally and define these schemes in Sect. 2.

From a computational perspective, blocking replaces the expensive simulation of a (single, independent) draw from \(\mathbbm {P}^{(T,x_0,x_T)}\) with the cost of simulating repeated sweeps of the \(m+1\) shorter (and computationally cheaper) bridges over the segments given by the temporal anchors. Any analysis of this trade-off needs to take into account the serial correlation induced by the blocking strategy.

Despite widespread adoption of blocking in practice to mitigate the computational cost of simulating diffusion bridges (as indicated above), there is little theoretical support for its efficacy. Furthermore, there is little concrete guidance on how to implement, and then appropriately tune (selecting for instance the number and locations of the anchor points), a blocking scheme.

In this article we provide general guidance for implementing blocking schemes by addressing these practical considerations for particular classes of diffusion process. We analyse the computational cost of several rejection sampling algorithms for bridges as a function of block size and bridge duration. In all cases we consider a fixed regular spacing of m anchor points as \(m,T\rightarrow \infty\), in contrast with the ‘in-fill’ asymptotics studied by Roberts and Stramer (2001), in which T is fixed and \(m\rightarrow \infty\). We analyse the expected cost of a single iteration of various algorithms, and then, to capture the trade-off described above, we consider the overall cost of each algorithm, which comprises both the cost of one iteration and the total number of iterations required to obtain an ‘independent’ sample. We give a more formal description below of what we mean by independence, in terms of the relaxation time of the underlying Markov chain.

In this article we work under the assumption that the underlying measure is a Gaussian diffusion (i.e. \(\mathbbm {P}\) is the law of a scaled Brownian motion or of the Ornstein–Uhlenbeck process). Under this simplification the Gibbs step for updating the bridge segments can be implemented without error, i.e. without discretising time, for example by means of a rejection sampler directly on the path-space of the diffusion (see Appendix 1 for full details). In this setting we prove Theorem 1 below, as the culmination of the results in Sect. 3. We gather all proofs in the appendices.

Theorem 1

Suppose \(\mathbbm {P}^{\star }\) is the conditional law of a Gaussian diffusion which is sampled by rejection on path-space using a checkerboard, lexicographic, or random blocking scheme. Suppose the m anchors are spaced equidistantly such that \(m=c_1T\) (for some constant \(c_1>0\)). Then the expected computational cost of the blocked rejection sampler, \(C_{\texttt {blocking}}(T)\), satisfies:

$$\begin{aligned} C_{\texttt {blocking}}(T)=\mathcal {O}(T^3),\quad \mathrm {as}\quad T \rightarrow \infty , \end{aligned}$$
(2)

whenever \(\mathbbm{P}\) denotes the law of a scaled Brownian motion and

$$\begin{aligned} C_{\texttt {blocking}}(T)=\mathcal {O}(T),\quad \mathrm {as}\quad T \rightarrow \infty , \end{aligned}$$
(3)

whenever \(\mathbbm{P}\) denotes the law of the Ornstein–Uhlenbeck process.

Remark 1

Note that in the case of a Brownian bridge there is long-range dependency in the path, in the sense that the correlation between \(X_s\) and \(X_t\) is non-negligible even for \(0 \ll s \ll t \ll T\). For the Ornstein–Uhlenbeck process, on the other hand, ergodicity breaks this dependency. For an Ornstein–Uhlenbeck process whose drift is of the form \(b(X_t) = -\theta X_t\) and for \(T\gg 0\), there is a phase transition in behaviour as \(\theta \rightarrow 0\): the computational cost of the blocked rejection sampler for Brownian motion is not recovered in the limit. Recall that in this paper we work under the assumption that the underlying measure is a Gaussian diffusion, whereas in most practical settings the target law will be more complicated. In such settings it would be typical to use a Gaussian diffusion as a proposal law for the non-Gaussian target law. In principle, Theorem 1 suggests that an Ornstein–Uhlenbeck proposal for a stationary target law would be advantageous over a Brownian bridge proposal, although in practice the predicted computational saving would depend on how closely the target process matches the invariant distribution of the Ornstein–Uhlenbeck process.

Theorem 1 contrasts sharply with the case without blocking. We show later in Proposition 1 that, for a d-dimensional Brownian bridge proposal in the absence of blocking, the cost is exponential in T. Although what we prove in Theorem 1 addresses a somewhat idealised setting, the requirement \(m=c_1T\) acts as a concrete guide for choosing the number of blocks. Furthermore, our empirical results in Sect. 4 indicate that the guidance we establish can be more broadly useful beyond the class of linear diffusions. Thus we demonstrate that blocking can lead to significantly improved computational efficiency when conducting inference for discretely observed diffusions.

2 Blocking

In this section we provide a systematic definition of blocking for sampling a diffusion path. Define a set of anchors spread across the time domain, \(0<k_1<\dots<k_m<T\), and the knots as the values of the path taken at those anchors:

$$\begin{aligned} \mathcal {K}(\omega ):=\{ X_{k_1}(\omega ), \dots ,X_{k_m}(\omega ) \}. \end{aligned}$$

Each anchor is now assigned to one of \(\mathbbm{k}\) disjoint subsets \(\mathcal {A}_i\), \(i=1,\dots ,\mathbbm{k}\), each comprising \(m_i\) anchors.

In particular, \(\mathcal {A}_i = \{r_{i1},\dots ,r_{im_i}\}\) is the set of anchors associated (uniquely) to the ith block. This allows us to group the knots by associating them with the corresponding subsets of anchors:

$$\begin{aligned} \mathcal {K}_i(\omega ):=\{X_r(\omega )\,;\,r\in \mathcal {A}_i\},\quad i\in \{1,\dots ,\mathbbm{k}\}. \end{aligned}$$

For convenience of notation we let \(\mathcal {K}_{-i}:=\mathcal {K}\setminus \mathcal {K}_i\) denote all knots that do not belong to \(\mathcal {K}_i\), and \(\mathcal {A}_{-i}:=\{k_1,\dots ,k_m\}\setminus \mathcal {A}_i\) denote all anchors that do not belong to \(\mathcal {A}_i\). We then assign labels to an ordered collection of all anchors in \(\mathcal {A}_{-i}\), plus the end-points:

$$\begin{aligned} \{e_{i0},\dots ,e_{i(m+1-m_i)}\} = \mathcal {A}_{-i} \cup \{0, T\},\quad (\text{ with } e_{ij}<e_{i(j+1)}),\quad i\in \{1,\dots ,\mathbbm{k}\}. \end{aligned}$$

Further, define

$$\begin{aligned} \mathcal {B}_i:=\left\{ \left. (e_{ij},e_{i(j+1)})\,\right| \,\exists r\in \mathcal {A}_i \text{ s.t. } r\in [e_{ij},e_{i(j+1)}]\right\} _{j=0}^{m-m_i}, \end{aligned}$$

to be only those intervals between consecutive end-points or anchors in \(\mathcal {A}_{-i}\) that contain at least one anchor belonging to \(\mathcal {A}_i\). The path segments \(X|_{\mathcal {B}_i}\), obtained by restricting X to \(\mathcal {B}_i\), are termed blocks. Finally, in the case \(\mathbbm{k}=2\) we say that \(\mathcal {A}_1\) and \(\mathcal {A}_2\) are interlaced if, whenever \(a,c\in \mathcal {A}_i\) with \(a<c\) (\(i=1,2\)), there exists \(b\in \mathcal {A}_{(i\mod 2)+1}\) s.t. \(a<b<c\).

A sampler for a path equipped with a blocking technique is a Gibbs sampler that updates the full path only one block at a time by drawing from the conditional laws \(\mathbbm{P}^{\star }|_{\mathcal {B}_i}(\cdot | \mathcal {K}_{-i})\)—i.e. the target laws restricted to blocks \(\mathcal {B}_i\) and conditioned on the knots in \(\mathcal {K}_{-i}\). For simplicity we refer to this technique as a blocked sampler in the remainder of the text, and present general pseudo-code for it in Algorithm 1.

Algorithm 1: Blocked sampler (pseudo-code)
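To make the blocked sampler concrete in the Gaussian setting, the following is a minimal sketch (our own illustration, not the paper's implementation; the function name and interface are ours) of a checkerboard blocked Gibbs sampler for the knots of a scaled Brownian bridge. Once the knots are updated, the path segments between them could be filled in with independent Brownian bridges.

```python
import numpy as np

def checkerboard_gibbs(x0, xT, T, m, sigma=1.0, n_sweeps=1000, rng=None):
    """Checkerboard blocked Gibbs sampler for the m interior knots of a
    scaled Brownian bridge from (0, x0) to (T, xT), with equidistant
    anchors (Assumption 1).

    For scaled Brownian motion the full conditional of a knot given its
    two neighbours is Gaussian, with the midpoint as mean and variance
    sigma**2 * delta / 2, so every block update is exact (no time
    discretisation is needed).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = T / (m + 1)
    knots = np.linspace(x0, xT, m + 2)  # positions 0..m+1; ends stay fixed
    sd = sigma * np.sqrt(delta / 2.0)
    samples = np.empty((n_sweeps, m))
    for n in range(n_sweeps):
        # one Gibbs sweep: update A_1 (odd-indexed knots) given the rest,
        # then A_2 (even-indexed knots); the two sets are interlaced
        for start in (1, 2):
            idx = np.arange(start, m + 1, 2)
            mean = 0.5 * (knots[idx - 1] + knots[idx + 1])
            knots[idx] = mean + sd * rng.standard_normal(idx.size)
        samples[n] = knots[1:-1]
    return samples
```

Because all knots within \(\mathcal {A}_1\) have both their neighbours in \(\mathcal {A}_2\) or among the end-points, the knots of each set are conditionally independent and can be refreshed simultaneously.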

There are a number of ways we can update the blocks, and in this article we consider the three canonical blocking schemes of Roberts and Sahu (1997). In particular, we refer to a single, full Gibbs sweep of Algorithm 1 (the inner for-loop) as a:

Definition 1

Checkerboard blocking update scheme if \(\mathbbm{k}=2\), \(\mathcal {A}_1\) and \(\mathcal {A}_2\) are interlaced, and \(q(i,j):=\mathbbm{1}_{\{i\}}(j)\).

Definition 2

Lexicographic blocking update scheme if \(\mathbbm{k}=m\), \(\mathcal {A}_i:=\{k_i\}\), \(i\in \{1,\dots ,m\}\), and \(q(i,j):=\mathbbm{1}_{\{i\}}(j)\).

Definition 3

Random blocking update scheme if \(\mathbbm{k}=m\), \(\mathcal {A}_i:=\{k_i\}\), \(i\in \{1,\dots ,m\}\), and \(q(i,j):=\frac{1}{m}\mathbbm{1}_{\{1,\dots ,m\}}(j)\).

The above are not exhaustive, but they characterise the most widely used schemes, and are tractable enough for analysis. We further simplify various computations by assuming the anchors are equidistant, and defer discussion of this assumption and its relaxation to Sect. 5.

Assumption 1

The anchors are placed on an equidistant grid (with the conventions \(k_0:=0\) and \(k_{m+1}:=T\)):

$$\begin{aligned} k_{i+1}-k_i=\frac{T}{m+1}=:\delta _{m,T},\quad i\in \{0,\dots ,m\}. \end{aligned}$$

As mentioned in the Introduction, we will study the asymptotic regime in which \(\delta _{m,T}\) is fixed as \(m,T\rightarrow \infty\).

3 Computational Analysis

3.1 Cost of a Single Sweep

We begin by quantifying the computational cost of a rejection sampling algorithm for diffusion bridges in the absence of blocking. The setting here, as given in further detail in Appendix 1, is one in which the target law is absolutely continuous with respect to a d-dimensional Brownian bridge proposal path.

Proposition 1

Under Assumptions 3–9 enumerated in Appendix 1, the expected computational cost as a function of T of obtaining a single draw with a path-space rejection sampling algorithm, denoted by \(C_{\texttt {rej}}(T)\), is given by

$$\begin{aligned} C_{\texttt {rej}}(T) = f(T)Te^{c_2T}, \end{aligned}$$
(4)

where \(c_2>0\) is some constant independent of T, and the function \(f:\mathbbm{R}_+\rightarrow \mathbbm{R}\) is continuous and such that \(f(T)\sim T^{-d/2}\) as \(T\rightarrow \infty\). In particular, for large enough T there is a constant \(c_3>0\) such that:

$$\begin{aligned} C_{\texttt {rej}}(T) \ge c_3T^{1-d/2}e^{c_2T}. \end{aligned}$$

Remark 2

Note that Proposition 1 does not stipulate in what way the constant \(c_3\) might vary with dimension. Without further structure it is impossible to characterise this behaviour. However it is highly likely that \(c_3\) will increase at least linearly with dimension (as for instance would be the case for diffusions consisting of d independent components).

Remark 3

If \(\mathbbm{P}\) is the law of a drifted Brownian motion, then Proposition 1 cannot be applied directly, because Assumption 9 does not hold. However, for this case an easy calculation shows that the acceptance probability of a rejection sampler with Brownian bridge proposals is equal to 1, implying (under Assumption 8) that \(C_{\texttt {rej}}(T)\) is proportional to T.

Now considering a single sweep of the blocking schemes introduced in Sect. 2, note that we have substituted sampling a single diffusion bridge (of length T) with sampling a number of diffusion bridges of shorter time horizon, \(2\delta _{m,T}\) (for example, to ensure the point \(X_{2\delta _{m,T}}\) is updated one could sample a new bridge of length \(2\delta _{m,T}\) connecting \(X_{\delta _{m,T}}\) with \(X_{3\delta _{m,T}}\)). By application of Proposition 1, the expected computational cost of simulating each of these shorter bridges is therefore \(C_{\texttt {rej}}(2\delta _{m,T})\), and hence the expected cost of a single Gibbs sweep is:

$$\begin{aligned} C_{\texttt {sweep}}(T,m):=m\cdot C_{\texttt {rej}}(2\delta _{m,T})= f(2\delta _{m,T})\frac{2mT}{m+1}\exp \{2c_2\delta _{m,T}\}. \end{aligned}$$
(5)

Equation (5) holds for all m and T and follows from (4); however, as the behaviour of f(t) for small t is not immediately transparent, to learn about \(C_{\texttt {sweep}}(T,m)\) when \(\delta _{m,T}\) is small we may use the fact that the acceptance probability of the rejection sampler approaches 1 as the bridge duration decreases to 0. This implies that for small enough t, \(C_{\texttt {rej}}(t)\sim c_5t\), and thus if \(\delta _{m,T}<c_4\) as \(T\rightarrow \infty\) for some constant \(c_4\), then

$$\begin{aligned} C_{\texttt {sweep}}(T,m) \sim c_5T, \end{aligned}$$
(6)

for some \(c_5>0\). For instance, upon setting \(m=\lfloor T \rfloor\), the cost in (5) becomes \(\mathcal {O}(T)\) as \(T\rightarrow \infty\). Contrast this with (4) to see that the relative gain in efficiency, \(C_{\texttt {rej}}(T)/C_{\texttt {sweep}}(T,m)\), grows exponentially in T, suggesting that blocking is to be preferred for large enough T. However, this ignores the costs associated with mixing; we address this in the next subsection.
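To illustrate this trade-off numerically, the following stylised sketch (our own; all constants are set to 1 and f(t) is replaced by its asymptotic form \(t^{-d/2}\), so the numbers are purely indicative) evaluates the costs (4) and (5):

```python
import math

def c_rej(t, c2=1.0, d=1):
    # stylised single-bridge cost from (4): f(t) * t * exp(c2 * t),
    # using f(t) = t**(-d/2) and suppressing multiplicative constants
    return t ** (1 - d / 2) * math.exp(c2 * t)

def c_sweep(T, m, c2=1.0, d=1):
    # cost (5) of one Gibbs sweep: m bridges, each of duration 2 * delta
    delta = T / (m + 1)
    return m * c_rej(2 * delta, c2, d)

# with m = floor(T) knots the sweep cost grows roughly linearly in T,
# while the unblocked cost explodes exponentially
for T in (10.0, 20.0, 40.0):
    print(T, c_rej(T), c_sweep(T, int(T)))
```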

3.2 Cost of Multiple Sweeps

Direct comparison of the exponential cost \(C_{\texttt {rej}}(T)\) of direct rejection sampling (as given by Proposition 1), with the linear cost \(C_{\texttt {sweep}}(T,m)\) of a single sweep of a blocking scheme (as given by (6)), does not capture the remnant dependency structure introduced by the blocking scheme. In addition we need to consider the number of sweeps required to render this dependency negligible. In order to do that we first introduce the following notion.

Definition 4

(Roberts and Sahu 1997) The [\(\mathcal {L}^2\)-]convergence rate \(\rho\) of a Markov chain \(\{X^{(n)};n=1,\dots ,N\}\) with transition kernel P and invariant density \(\pi\) is defined as the smallest number such that, for all square \(\pi\)-integrable functions f and for all \(r>\rho\),

$$\begin{aligned} \Vert P^nf-\pi (f) \Vert ^2_{\mathcal {L}^2(\pi )}:=\int \left[ P^nf(X^{(0)}) - \pi (f)\right] ^2\pi (\mathrm {d}X^{(0)}) \le V_f r^n, \end{aligned}$$

where \(P^nf(X^{(0)}):=\mathbbm{E}_\pi [f(X^{(n)})|X^{(0)}]\), \(\pi (f):=\mathbbm{E}_\pi [f(X)]\), and \(V_f\) is a positive number that depends on f.

We can now capture the cost of reducing the dependency on the past by considering the relaxation time, denoted \(\mathcal {T}=\mathcal {T}(T,m)\), and defined as:

$$\begin{aligned} \mathcal {T}= -\frac{1}{\log \left( \rho \right) }. \end{aligned}$$
(7)

It represents the time required by the underlying Markov chain to produce a draw that is approximately distributed according to its stationary distribution (Levin and Peres 2017). This makes it possible to compare \(C_{\texttt {rej}}(T)\) with the expected computational cost of the blocked rejection sampler as follows:

$$\begin{aligned} C_{\texttt {blocking}}(T,m):= \mathcal {T}(T,m)\cdot C_{\texttt {sweep}}(T,m). \end{aligned}$$
(8)

We will later consider the most appropriate choice of blocking scheme, and how to optimise m.

Instead of analysing the chain targeting the law \(\mathbbm{P}^{\star }\) it is sufficient to consider a related chain that targets the marginal law of the vector

$$\begin{aligned} \mathcal {G}:=(X_{k_1},\dots ,X_{k_m})|(X_0,X_T)=\mathcal {K}|(X_0,X_T), \end{aligned}$$
(9)

which we denote by \(\mathbbm {G}\). To see this, notice that conditionally on the knots \(\mathcal {K}\) being distributed according to \(\mathbbm {G}\), a path X returned after a single Gibbs sweep of a blocking scheme is distributed exactly according to \(\mathbbm {P}^{\star }\). The object of interest becomes a Markov chain with a transition kernel P denoting a single Gibbs sweep, and with stationary distribution \(\pi =\mathbbm {G}\) (Roberts and Rosenthal 2001).

Throughout, we additionally assume that the following condition holds, which makes the subsequent calculations tractable.

Assumption 2

The target law \(\mathbbm{P}^{\star }\) is such that \(\mathcal {G}\) is a Gaussian process.

We discuss this key technical assumption in Sect. 4, where we note that the established results seem to hold empirically more broadly.

Under Assumption 2 and using either the lexicographic or checkerboard updating scheme, a single Gibbs step (i.e. an update \(\mathcal {G}|_{\mathcal {B}_I}\sim \mathbbm {P}^{\star }|_{\mathcal {B}_I\cap \{X_{k_1},\dots ,X_{k_m}\}}(\cdot |\mathcal {K}_{-I})\)) has a tractable, Gaussian transition density, and thus so does the entire Gibbs sweep \(\mathcal {G}^{(n)}\mapsto \mathcal {G}^{(n+1)}\). Denote the stationary mean and covariance of \(\mathcal {G}\) by

$$\begin{aligned} \mu :=\mathbbm{E}[\mathcal {G}],\qquad \Sigma := \mathbbm {C}ov[\mathcal {G}]. \end{aligned}$$

As a consequence it is possible to explicitly characterise the transition kernel P, as follows.

Lemma 1

Under the lexicographic and checkerboard updating schemes, the n-step transition kernel \(P^n\) of the Markov chain \(\{\mathcal {G}^{(l)}\,;\,l=0,\dots \}\) is Gaussian, with mean and covariance matrix given respectively by:

$$\begin{aligned} \mathbbm{E}[\mathcal {G}^{(l+n)}|\mathcal {G}^{(l)}]=B^{n}\mathcal {G}^{(l)}+(I-B)^{-1}(I-B^{n})b,\quad \mathbbm {C}ov[\mathcal {G}^{(l+n)}|\mathcal {G}^{(l)}]=\Sigma -B^n\Sigma (B^n)^\mathrm {T}, \end{aligned}$$
(10)

with \(B\in \mathbbm{R}^{m\times m}\) and \(b\in \mathbbm{R}^m\).

Under the lexicographic or the checkerboard updating schemes \(\{\mathcal {G}^{(l)}\,;\,l=0,\dots \}\) is an AR(1) process, and so the spectral radius \(\rho _{\texttt {spec}}(B)\) of the matrix B must satisfy \(\rho _{\texttt {spec}}(B)<1\) for the process to converge, and equals the \(\mathcal {L}^2\)-convergence rate (Amit 1991). This connection extends to the random updating scheme. In the following lemma we derive the spectral radius of each blocking scheme as a function of m and T, which aids in optimising their parameterisation and analysing their scaling. We denote by \(\Lambda :=\Sigma ^{-1}\) the precision matrix of \(\mathcal {G}\) and define

$$\begin{aligned} A:=I-\text{ diag }\{\Lambda _{11}^{-1},\dots ,\Lambda _{mm}^{-1}\}\Lambda . \end{aligned}$$

Lemma 2

(Roberts and Sahu 1997) Under the checkerboard and lexicographic updating schemes, the spectral radius of the matrix B and the \(\mathcal {L}^2\)-convergence rate of a blocked rejection sampler coincide. More explicitly, under the checkerboard, lexicographic, and random updating schemes respectively the \(\mathcal {L}^2\)-convergence rates (\(\rho _{\texttt {check}}\), \(\rho _{\texttt {lex}}\), and \(\rho _{\texttt {rand}}\) resp.) are equal to:

$$\begin{aligned} \rho _{m,T}&:=\rho _{\texttt {check}}=\rho _{\texttt {lex}}=\rho _{\texttt {spec}}(B_{\texttt {check}})=\rho _{\texttt {spec}}(B_{\texttt {lex}})=\lambda _{\texttt {max}}^2(A),\\ \rho _{\texttt {rand}}&=\left[ \frac{m-1+\lambda _{\texttt {max}}(A)}{m}\right] ^m, \end{aligned}$$

where \(\lambda _{\texttt {max}}(A)\) denotes the maximum eigenvalue of the matrix A and where we write \(B_{\texttt {check}}\) (resp. \(B_{\texttt {lex}}\)) to denote a matrix B corresponding to the checkerboard (resp. lexicographic) updating scheme.

The quantity \(\lambda _{\texttt {max}}(A)\) can be found more explicitly by exploiting the close connection between the precision matrix \(\Lambda\) and the matrix of partial correlations (given precisely in (19) in Appendix 2).

Theorem 2

We have

$$\begin{aligned} \lambda _{\texttt {max}}(A)=2|c(\delta _{m,T})|\cos \left( \frac{\pi }{m+1}\right) , \end{aligned}$$

with \(c(\delta _{m,T}):=\mathbbm {C}orr(X_\delta ,X_{2\delta }|X_0,X_{3\delta })\). In particular:

$$\begin{aligned} \rho _{m,T} = 4c^2(\delta _{m,T})\cos ^2\left( \frac{\pi }{m+1}\right) ,\qquad \rho _{\texttt {rand}} = \left[ \frac{ m - 1 + 2|c(\delta _{m,T}) |\cos \left( \frac{\pi }{m+1} \right) }{m} \right] ^m. \end{aligned}$$
(11)

The form of \(c(\delta _{m,T})\) will, in general, depend on the type of Gaussian process being considered. In the following corollaries we present more explicit versions of the statements of Theorem 2 for two choices of \(\mathbbm{P}\): scaled Brownian motion \(\sigma W\), with \(\sigma >0\); and the Ornstein–Uhlenbeck process. Without loss of generality we centre the latter at 0:

$$\begin{aligned} \mathrm {d}X_t = -\theta X_t\mathrm {d}t + \sigma \mathrm {d}W_t,\quad X_0=x_0,\quad t\in [0,T]. \end{aligned}$$
(12)

Corollary 1

If \(\mathbbm{P}\) is the law of a scaled Brownian motion \(\sigma W\), \(\sigma > 0\), then:

$$\begin{aligned} \rho _{m,T} = \cos ^2\left( \frac{\pi }{m+1}\right) , \quad \rho _{\texttt {rand}} = \left[ \frac{ m - 1 + \cos \left( \frac{\pi }{m+1} \right) }{m} \right] ^m. \end{aligned}$$

In particular, independently of T, as \(m\rightarrow \infty\)

$$\begin{aligned} \rho _{m,T} = 1-\left( \frac{\pi }{m+1}\right) ^2+\mathcal {O}(m^{-4}), \qquad \rho _{\texttt {rand}} = 1-\frac{1}{2}\left( \frac{\pi }{m+1}\right) ^2+\mathcal {O}(m^{-4}). \end{aligned}$$
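Corollary 1 can be checked numerically: for equidistant knots of a scaled Brownian bridge the precision matrix \(\Lambda\) of \(\mathcal {G}\) is tridiagonal with stencil \((-1,2,-1)/(\sigma ^2\delta )\), and the sketch below (our own illustration) builds the matrix A of Sect. 3 and confirms that \(\lambda _{\texttt {max}}(A)=\cos (\pi /(m+1))\), i.e. that \(c(\delta _{m,T})=1/2\).

```python
import numpy as np

def lambda_max_A(Lam):
    # A := I - diag(1/Lam_11, ..., 1/Lam_mm) @ Lam, as in Sect. 3
    A = np.eye(Lam.shape[0]) - np.diag(1.0 / np.diag(Lam)) @ Lam
    return np.max(np.linalg.eigvals(A).real)

def bm_precision(m, delta=1.0, sigma=1.0):
    # precision of the interior knots of a scaled Brownian bridge:
    # tridiagonal with stencil (-1, 2, -1) / (sigma**2 * delta)
    Lam = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return Lam / (sigma ** 2 * delta)

m = 25
lam = lambda_max_A(bm_precision(m))
print(lam, np.cos(np.pi / (m + 1)))  # the two agree
```

Note that \(\delta\) and \(\sigma\) only rescale \(\Lambda\), so they cancel in A; this is the numerical counterpart of \(\rho _{m,T}\) being independent of T in Corollary 1.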

Corollary 2

If \(\mathbbm{P}\) is the law of the Ornstein–Uhlenbeck process (12), then:

$$\begin{aligned} \rho _{m,T} = \cos ^2\left( \frac{\pi }{m+1}\right) {{\,\mathrm{sech}\,}}^2(\theta \delta _{m,T}), \quad \rho _{\texttt {rand}} = \left[ \frac{ m - 1 + \cos \left( \frac{\pi }{m+1}\right) {{\,\mathrm{sech}\,}}(\theta \delta _{m,T}) }{m} \right] ^m. \end{aligned}$$

In particular, when \(\delta _{m,T}=\delta\) is set to a constant, as \(m,T\rightarrow \infty\):

$$\begin{aligned} \rho _{m,T}&= {{\,\mathrm{sech}\,}}^2(\theta \delta )\left[ 1-\left( \frac{\pi }{m+1}\right) ^2+\mathcal {O}(m^{-4})\right] ,\\ \rho _{\texttt {rand}}&= e^{{{\,\mathrm{sech}\,}}(\theta \delta )-1}\left[ 1-\frac{(1-{{\,\mathrm{sech}\,}}(\theta \delta ))^2}{2(m+1)}+\mathcal {O}(m^{-2})\right] . \end{aligned}$$
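An analogous check applies in the Ornstein–Uhlenbeck case: writing \(\phi :=e^{-\theta \delta }\), the knots form a conditioned AR(1) chain whose precision matrix is tridiagonal with stencil proportional to \((-\phi ,1+\phi ^2,-\phi )\), so that \(c(\delta _{m,T})=\phi /(1+\phi ^2)=\tfrac{1}{2}{{\,\mathrm{sech}\,}}(\theta \delta )\). The sketch below (again our own; the common scale factor is omitted since it cancels in A) confirms \(\lambda _{\texttt {max}}(A)\), and hence \(\rho _{m,T}\), in Corollary 2.

```python
import numpy as np

def ou_precision(m, theta, delta):
    # interior-knot precision of an OU bridge: tridiagonal with stencil
    # (-phi, 1 + phi**2, -phi), phi = exp(-theta * delta); the common
    # scale factor is irrelevant because it cancels in A
    phi = np.exp(-theta * delta)
    return ((1 + phi ** 2) * np.eye(m)
            - phi * np.eye(m, k=1) - phi * np.eye(m, k=-1))

def lambda_max_A(Lam):
    A = np.eye(Lam.shape[0]) - np.diag(1.0 / np.diag(Lam)) @ Lam
    return np.max(np.linalg.eigvals(A).real)

m, theta, delta = 25, 0.7, 0.3
lam = lambda_max_A(ou_precision(m, theta, delta))
# Theorem 2 with c(delta) = sech(theta * delta) / 2, as in Corollary 2
print(lam, np.cos(np.pi / (m + 1)) / np.cosh(theta * delta))
```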

Remark 4

Results obtained by Pitt and Shephard (1999), who studied the discrete-time first-order autoregressive process \(\alpha _t = \phi \alpha _{t-1} + \eta _t\), \(\eta _t \sim \mathcal {N}(0,\sigma ^2)\), observed with Gaussian noise, are closely related to Corollaries 1 and 2. In the context of their model, where \(c(\delta _{m,T}) = \phi /(1+\phi ^2)\), they derive the expression (11) for \(\rho _{m,T}\) as well as bounds on \(\rho _{m,T}\) which exhibit the same asymptotic behaviour as in Corollary 2.

We can now combine the above results with (7) to find the relaxation time:

Theorem 3

Suppose we use one of the checkerboard, lexicographic, and random updating schemes. If \(\mathbbm{P}\) is the law of a scaled Brownian motion \(\sigma W\), then we have:

$$\begin{aligned} \mathcal {T}(m)=\mathcal {O}(m^2), \qquad m\rightarrow \infty . \end{aligned}$$

If \(\mathbbm {P}\) is the law of the Ornstein–Uhlenbeck process in (12), and additionally the sequence T(m) is chosen so that \(m=c_1T\) for some constant \(c_1>0\), then we have:

$$\begin{aligned} \mathcal {T}(m)=\mathcal {O}(1), \qquad m\rightarrow \infty . \end{aligned}$$

Remark 5

Note that if \(\mathbbm{P}\) is the law of the Ornstein–Uhlenbeck process in (12), then Theorem 3 holds only for sequences T(m) with \(m=c_1T\); in the case of scaled Brownian motion there is no such constraint. See Remark 1.

Remark 6

From the proof of Theorem 3, one can show that for the Ornstein–Uhlenbeck process (12):

$$\begin{aligned} \mathcal {T}(m) = -\frac{1}{\log (\rho _{m,T})}\rightarrow -\frac{1}{2\log ({{\,\mathrm{sech}\,}}(\theta \delta ))}, \qquad m\rightarrow \infty . \end{aligned}$$

This provides insight into the influence of \(\theta\) and \(\delta\) on mixing.
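The scalings in Theorem 3 and the limit in Remark 6 can also be read off numerically by plugging the convergence rates of Corollaries 1 and 2 into (7) (a small sketch of our own; function names are ours):

```python
import numpy as np

def relax_time(rho):
    # relaxation time (7): T = -1 / log(rho)
    return -1.0 / np.log(rho)

def rho_bm(m):
    # Corollary 1 (checkerboard/lexicographic), scaled Brownian motion
    return np.cos(np.pi / (m + 1)) ** 2

def rho_ou(m, theta, delta):
    # Corollary 2, Ornstein-Uhlenbeck with fixed knot spacing delta
    return rho_bm(m) / np.cosh(theta * delta) ** 2

for m in (10, 100, 1000):
    # Brownian motion: relaxation time grows like m**2 / pi**2;
    # OU: it settles at the constant -1 / (2 * log(sech(theta * delta)))
    print(m, relax_time(rho_bm(m)) / m ** 2, relax_time(rho_ou(m, 1.0, 1.0)))
```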

We can minimize the cost of blocking \(C_{\texttt {blocking}}(T,m)\) over the remaining parameter, m, using Theorem 3 and (8). This leads to Theorem 1, which is the main result of this paper (as presented in Sect. 1, with accompanying proof in Appendix 2).

4 Numerical Experiments

Consider a target process defined to be the solution of the following stochastic differential equation (with law \(\mathbbm{P}\)):

$$\begin{aligned} \mathrm {d}X_t = (2-2\sin (8X_t))\mathrm {d}t + \frac{1}{2}\mathrm {d}W_t,\quad X_0=0,\quad t\in [0,T]. \end{aligned}$$
(13)

This diffusion exhibits highly multimodal behaviour, and so in practice it is challenging to simulate trajectories of \(\mathbbm {P}\) (and in particular of the conditioned bridge law \(\mathbbm {P}^{\star }\) over large time horizons). It is possible to simulate trajectories exactly by means of path-space rejection sampling (as detailed in Appendix 1). However, X is not a Gaussian process (it violates Assumption 2), and so Theorem 1 does not hold in a rigorous sense. As such, (13) makes an interesting case study for the practical limitations of Theorem 1. Since (13) is not an ergodic diffusion, of the two theoretical results in Theorem 1 the one for Brownian motion is expected to be the more relevant. As we show below, the empirical results suggest the theory holds more broadly.

We consider six problems (increasing in difficulty) of simulating paths according to the laws \(\mathbbm{P}^{(T,x_0,x_T)}\), with parameters \(\mathbbm{P}^{(0.2,0,0.1)}\), \(\mathbbm{P}^{(0.4,0,0.85)}\), \(\mathbbm {P}^{(0.5,0,0.85)}\), \(\mathbbm{P}^{(1,0,0.95)}\), \(\mathbbm{P}^{(2,0,2.5)}\), and \(\mathbbm {P}^{(4,0,4.85)}\). The values of the end-points were chosen by fixing T, simulating multiple paths according to (13), and picking \(x_T\) to be a point in the vicinity of the (largest) mode, as these are the bridges we will most commonly be interested in. For \(T=0.2\) the sampled paths resemble Brownian bridges, but as T increases the non-linear dynamics become pronounced: the diffusion is effectively attracted to a ladder of values and repelled at the intermediate points, leading to multimodal behaviour of the trajectories. Drawing paths from the last three laws using path-space rejection sampling without blocking (an unmodified rejection sampler) is computationally infeasible.

For each of the six examples we ran a blocked rejection sampler with checkerboard updating scheme for \(10^5\) iterations and with various numbers of knots. For the first three problems we also employed an unmodified rejection sampler. We recorded the time required to sample a single path (which for a blocked rejection sampler is counted as one execution of the inner for-loop of Algorithm 1) and plotted it in Fig. 1 against the number of used knots. Code sufficient for reproducing these results can be found at https://github.com/mmider/blocking.

Fig. 1: Time (in seconds; log-transformed) required to sample a single path of the sine diffusion (13), as a function of the number of knots used

For \(T=0.2\) the unmodified rejection sampler clearly outperforms any blocking scheme. This is unsurprising, as paths under \(\mathbbm {P}^{(0.2,0,0.1)}\) closely resemble Brownian bridges (and indeed every diffusion behaves as a drifted Brownian motion on a small enough time-scale). However, as T increases this pattern changes, and blocking reduces the cost of obtaining any single sample path. In particular, notice the steep, exponential reduction in cost, especially pronounced for \((T,x_T)=(1,0.95)\); this would be illustrated even more emphatically by \((T,x_T)=(2,2.5)\) and \((T,x_T)=(4,4.85)\) had the corresponding experiments with smaller numbers of knots been run, but their costs were prohibitively high and they had to be omitted.

Figure 1, though helpful in confirming Proposition 1, does not take into account the cost due to decreased speed of mixing, which is the main motivation for the developments presented in Sect. 3. To incorporate this cost as well, we plot in Fig. 2 the time-adjusted effective sample size (taESS), with

$$\begin{aligned} \text {taESS}:=[\text {effective sample size}]/[\text {elapsed time in seconds to sample an entire chain}] \end{aligned}$$

where the ESS was computed according to Gelman et al. (2013, Section 11.5); taESS is plotted against the half-length of the blocks (i.e. \(\delta _{m,T}\)). So defined, taESS is approximately the number of independent samples that can be drawn in one second; the larger the taESS, the more efficient the algorithm.
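The taESS computation can be sketched as follows. The single-chain initial-positive-sequence estimator below is one common variant in the spirit of Gelman et al. (2013, Section 11.5) (which there combines several split chains); the function names are our own.

```python
import numpy as np

def ess(chain):
    """Effective sample size of a scalar chain via the empirical
    autocorrelations, truncated at the first negative adjacent pair
    (Geyer's initial-positive-sequence rule)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # lags 0..n-1
    rho = acov / acov[0]
    s, k = 0.0, 1
    while k + 1 < n:
        pair = rho[k] + rho[k + 1]       # sum of adjacent autocorrelations
        if pair < 0:                     # truncate once the pair goes negative
            break
        s += pair
        k += 2
    return n / (1.0 + 2.0 * s)

def ta_ess(chain, elapsed_seconds):
    """Time-adjusted ESS: independent-sample equivalents per second."""
    return ess(chain) / elapsed_seconds

rng = np.random.default_rng(3)
chain = rng.standard_normal(4000)        # white-noise chain: ESS ≈ n
```

In the experiments the `chain` would be a scalar functional of the sampled paths (e.g. the value at a fixed time point) and `elapsed_seconds` the wall-clock time for the whole run.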

Fig. 2
figure 2

Time-adjusted effective sample size vs half-length of blocks (i.e. \(\delta _{m,T}\))

First, for each experiment we expect there to be a point beyond which increasing the number of knots only decreases taESS; there all costs are dominated by the slowdown in mixing, and this is clearly illustrated by the sharp dips of the curves on the left side of Fig. 2. Second, for examples in which the target law differs sufficiently from the law of a Brownian bridge, we expect some level of blocking to improve the overall computational cost; this is confirmed by the decline of the taESS curves toward the right side of Fig. 2. We note that under the most difficult sampling regimes it was impractical to run the algorithm with even fewer blocks owing to excessive execution times; had those runs been feasible and the curves continued, the decline in performance would have been even starker. Additionally, Fig. 2 suggests that there is an optimal value of \(\delta _{m,T}\) (somewhere around \(\delta _{m,T}\approx 0.1\)), almost independent of T and m, that yields the highest taESS in each experiment. This is consistent with the results of Sect. 3, where the optimal number of knots was found to be \(m=c_1T\) for some \(c_1>0\), which implies that the optimal \(\delta _{m,T}\) is (asymptotically) constant in T and m.

Finally, we verify the bound from (2) empirically. To this end, notice that taESS\(^{-1}\) is approximately the amount of time needed to obtain a single independent sample, which is consistent with the characterisation of the computational cost of a blocked rejection sampler given in (8). Theorem 1 asserts that this cost scales at a cubic rate in the duration of the bridge, so long as \(\delta _{m,T}\) is held constant as \(T\rightarrow \infty\). Consequently, taESS\(^{-1}(T)\) should grow at most cubically in T; on a log-log scale this is equivalent to taESS\(^{-1}(T)\) tracking a line with slope 3. Figure 3 gives precisely this plot, showing that the prediction (2) is indeed satisfied.
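The slope check can be carried out with a simple least-squares fit on the log-log scale. The (T, cost) pairs below are synthetic stand-ins generated to scale cubically (the real values are those behind Fig. 3), so the example only illustrates the mechanics of the check.

```python
import numpy as np

rng = np.random.default_rng(7)
T = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
# hypothetical taESS^{-1} values: cost ∝ T^3 with mild multiplicative noise
cost = 0.3 * T ** 3 * np.exp(0.05 * rng.standard_normal(T.size))

# least-squares line through (log T, log cost); a slope near 3 is
# consistent with O(T^3) computational cost
slope, intercept = np.polyfit(np.log(T), np.log(cost), deg=1)
```

With real experimental data one would also report the residuals, since a slope estimate alone cannot distinguish \(\mathcal{O}(T^3)\) from, say, \(\mathcal{O}(T^3\log T)\) over a short range of T.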

Fig. 3
figure 3

Computational cost (taESS\(^{-1}\)) as a function of the bridge duration T for the sine example

5 Discussion

In this article we have analysed and provided practical guidance for using blocking schemes when conducting Bayesian inference for discretely observed diffusions. We achieved this by studying the computational cost of diffusion bridge sampling algorithms. We have shown rigorously that the computational cost of rejection sampling on path-space (modified with blocking) targeting the law of scaled Brownian motion scales as \(\mathcal {O}(T^3)\) as \(T\rightarrow \infty\), and as \(\mathcal {O}(T)\) in the case of the Ornstein–Uhlenbeck process, so long as the number of equidistant anchors is \(m=c_1T\) (for some \(c_1>0\)). In Remark 1 we discussed the practicality of exploiting the computational saving achievable in the case of the Ornstein–Uhlenbeck process. Furthermore, using the example of a non-linear sine diffusion we provided empirical evidence suggesting that the conclusions established for Brownian motion also hold for non-ergodic diffusions outside of the restrictive class of Gaussian processes.

Our theory indicates that choosing too few knots results in the computational cost being dominated by the exponential cost of imputing diffusion bridges between successive knots (see Proposition 1). As such, our guideline of choosing \(m=c_1T\) (for some \(c_1>0\)) is useful for ensuring the robustness of blocking schemes and is a reasonable heuristic for practitioners. Note that although choosing too many knots is likely to be penalized less than choosing too few, an excessive number of knots can still negatively impact the mixing of the underlying chain.

Naturally, for more general target laws \(\mathbbm {P}\) it might be useful to consider using irregularly spaced anchors (and so relaxing Assumption 1). Heuristically, we may wish to place more knots in areas in which the proposal law does not approximate the target law well. Developing more general theory to support the use of an irregular spacing of anchors is likely to require more knowledge of the specific diffusion under study. Of course, from a methodological perspective this motivates future research looking at how to place knots by assessing proposal-target discrepancy, or developing adaptive schemes.

Finally, it is worth recalling that within the context of Bayesian inference for discretely observed diffusion processes, the full chain is a Gibbs sampler that alternates between updating the unknown parameters and imputing the unobserved path. Since the mixing time of the imputed path influences the mixing time of the parameter chain, in light of the work in this paper it may, as a future extension, also be possible to study the mixing behaviour of the parameter chain.