1 Introduction

In this work, we aim at modelling variability of shapes using a theory of stochastic perturbations consistent with the action of the diffeomorphism group underlying the large deformation diffeomorphic metric mapping framework (LDDMM, see [65]). In applications, such variability arises and can be observed, for example, when human organs are influenced by disease processes, as analysed in computational anatomy [66]. Spatially independent white noise contains insufficient information to describe these large-scale variabilities of shapes. In addition, the coupling of the spatial correlations of the noise must be adapted to a variety of transformation properties of the shape spaces. The theory developed here addresses this problem by introducing spatially correlated transport noise which respects the geometric structure of the data. This method provides a new way of characterizing stochastic variability of shapes using spatially correlated noise in the context of the standard LDDMM framework.

We will show that this specific type of noise can be used for all the data structures to which the LDDMM framework applies. The LDDMM theory was initiated by [6, 12, 19, 46, 60] based on the pattern theory of [23]. LDDMM models the dynamics of shapes by the action of diffeomorphisms (smooth invertible transformations) on shape spaces. It gives a unified approach to shape modelling and shape analysis that is valid for a range of structures such as landmarks, curves, surfaces, images, densities or even tensor-valued images. For any such data structure, the optimal shape deformations are described via the Euler–Poincaré equation of the diffeomorphism group, usually referred to as the EPDiff equation [26, 27, 66]. In this work, we will show how to obtain a stochastic EPDiff equation valid for any data structure, and in particular for the finite-dimensional spaces of landmarks. For this, we will follow the LDDMM derivation in [8] based on geometric mechanics [24, 43]. This view is based on the existence of momentum maps, which are characterized by the transformation properties of the data structures for images and shapes. These momentum maps persist in the process of introducing noise into the EPDiff equation, and they thereby preserve most of the technology developed for shape analysis in the deterministic context and in computational anatomy.

This work is not the first to consider stochastic evolutions in LDDMM. Indeed, [61, 64] and more recently [44] have already investigated the possibility of stochastic perturbations of landmark dynamics. In these works, the noise is introduced into the momentum equation, as though it was an external random force acting on each landmark independently. In [44], an extra dissipative force was added to balance the energy input from the noise and to make the dynamics correspond to a certain type of heat bath used in statistical physics. Refs. [55, 56] considered evolutions on the landmark manifold with stochastic parts being Brownian motion with respect to a Riemannian metric and estimated parameters of the models from observed data. Here, we will introduce Eulerian noise directly into the reconstruction relation used to find the deformation flows from the velocity fields, which are solutions of the EPDiff equation [26, 65]. As we will see, this derivation of stochastic models is compatible with variational principles, preserves the momentum map structure and yields a stochastic EPDiff equation with a novel type of multiplicative noise, depending on the gradient of the solution, as well as its magnitude. This model is based on the previous works [2, 25], where, respectively, stochastic perturbations of infinite- and finite-dimensional mechanical systems were considered. The Eulerian nature of the noise discussed here implies that the noise correlation depends on the image position and not, as for example in [44, 61], on the landmarks themselves. Consequently, the present method for the introduction of noise is compatible with any data structure, for any choice of its spatial correlation. We also mention the conference paper [3] in which the basic theory underlying the present work was applied to shape transformations of the corpus callosum. 
In the conclusion of the paper, we discuss possibilities for including Lagrangian noise advected with the flow, in contrast to the present Eulerian case, as well as nonstationary correlation statistics that respond to the evolution of advected quantities.

Fig. 1

In this figure, we compare the deterministic evolution of landmarks arranged in an ellipse (black line) with a translated ellipse as final position (black dashed line), to two different stochastically perturbed evolutions. The radius for the landmark kernel is twice their average initial distances. In blue is the stochastic perturbation developed in this paper. The black dots represent the J Eulerian noise fields arranged in a grid configuration. In magenta is the evolution resulting from additive noise in the momentum equation, different for each landmark but with the same amplitude as the Eulerian noise. We run three initial value simulations to compare the limit of a large number of landmarks and small noise correlation. The Eulerian noise model (blue) is robust to the continuum limit and can reproduce the general behaviour of the additive noise model. Furthermore, the choice of the noise fields provides an additional freedom in parameterization which will be studied and exploited in this work. a Low resolution and large noise correlation (100 landmarks, \(6\times 6\) noise fields), b high resolution and large noise correlation (200 landmarks, \(6\times 6\) noise fields), c high resolution and small noise correlation (200 landmarks, \(12\times 12\) noise fields) (Color figure online)

To illustrate this framework and give an immediate demonstration of stochastic landmark dynamics, we display in Fig. 1 three experiments which compare the proposed model with a stochastic forcing model, of the type studied in [61]. The proposed model introduces the following stochastic Hamiltonian system for the positions of the landmarks, \({\mathbf {q}}_i\), and their canonically conjugate momenta, \(\mathbf p_i\),

$$\begin{aligned} \begin{aligned} \mathrm{d} {\mathbf {q}}_i&= \frac{\partial h}{\partial {\mathbf {p}}_i} \mathrm{d}t + \sum _l\sigma _l({\mathbf {q}}_i) \circ \mathrm{d} W_t^l \,,\\ \mathrm{d} {\mathbf {p}}_i&= -\frac{\partial h}{\partial {\mathbf {q}}_i} \mathrm{d}t - \sum _l \frac{\partial }{\partial {\mathbf {q}}_i}\left( {\mathbf {p}}_i \cdot \sigma _l({\mathbf {q}}_i)\right) \circ \mathrm{d} W_t^l \,. \end{aligned} \end{aligned}$$
(1.1)

In (1.1), the \(\sigma _l\) are prescribed functions of space which represent the spatial correlations of the noise. In Fig. 1, the \(\sigma _l\) fields are Gaussians whose variance is equal to twice their separation distance and whose locations are indicated by black dots. We compare this model with the system,

$$\begin{aligned} \mathrm{d} {\mathbf {q}}_i^\alpha&= \frac{\partial h}{\partial \mathbf p_i^\alpha } \mathrm{d}t \qquad \mathrm {and} \qquad \mathrm{d} {\mathbf {p}}_i^\alpha = -\frac{\partial h}{\partial {\mathbf {q}}_i^\alpha } \mathrm{d}t + \sigma \mathrm{d}W_t^i , \end{aligned}$$
(1.2)

where \(\sigma \) is a constant. In this case, the noise corresponds to a stochastic force acting on the landmarks, whose corresponding Brownian motion is different for each landmark. We show in the first panel of Fig. 1 that for a small number of landmarks and a large range of spatial correlations of the noise, both types of stochastic deformations in (1.1) and (1.2) visually coincide. This is shown for a simple experiment in translating an ellipse (from the black ellipse to the black dashed ellipse). By doubling the number of landmarks (middle panel), the dynamics of (1.2) results in small-scale noise correlation (magenta), whereas the proposed model (blue) remains equivalent to the first experiment. This figure illustrates shape evolution when the noise is Eulerian and independent of the data structure. Indeed, the limit of a large number of landmarks corresponds to a certain continuum limit, in this case corresponding to curve dynamics. Finally, in the right-most panel, we reduce the range of the spatial correlation of the noise by adding more noise fields. This arrangement allows us to qualitatively reproduce the dynamics of equation (1.2) with the same number of landmarks, as the amount of noise and its spatial correlation are similar in both cases. Indeed, the spatial correlations are dictated by the Eulerian functions \(\sigma _l\) defined in fixed space for our model, and by the density of landmarks in the stochastically forced landmark model.
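The difference between the two noise models can be illustrated numerically by computing how strongly the position noise of model (1.1) is correlated between landmarks. The following is a minimal sketch, not the code used for Fig. 1. It assumes scalar Gaussian profiles centred on a regular grid, with the same profile driving each coordinate direction through independent Wiener processes; the function names, the grid layout and the width are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_noise_fields(q, centers, width, amplitude):
    """Evaluate assumed Gaussian profiles sigma_l at the landmark positions.

    q:       (N, d) landmark positions
    centers: (J, d) fixed (Eulerian) centres of the noise fields
    Returns an (N, J) array, amplitude * exp(-|q_i - c_l|^2 / (2 width^2)).
    """
    diff = q[:, None, :] - centers[None, :, :]
    return amplitude * np.exp(-np.sum(diff**2, axis=-1) / (2.0 * width**2))

def eulerian_noise_correlation(q, centers, width, amplitude=1.0):
    """Correlation matrix of the position noise increments in model (1.1).

    Per coordinate direction, the covariance per unit time between
    landmarks i and j is sum_l sigma_l(q_i) sigma_l(q_j).
    """
    s = gaussian_noise_fields(q, centers, width, amplitude)
    cov = s @ s.T
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

# 6 x 6 grid of noise centres on the unit square (hypothetical layout)
g = np.linspace(0.0, 1.0, 6)
centers = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

# Two neighbouring landmarks and one distant landmark
q = np.array([[0.40, 0.50], [0.42, 0.50], [0.90, 0.50]])
corr = eulerian_noise_correlation(q, centers, width=0.4)
print(np.round(corr, 3))
```

In this sketch, neighbouring landmarks receive almost perfectly correlated noise, regardless of how many landmarks are used, whereas in model (1.2) the Wiener processes are independent across landmarks, so the analogous correlation matrix for the stochastic force is the identity.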

Modelling large-scale shape variability with noise is of interest for applications in computational anatomy, in which sources of variability include natural ageing, the influence of diseases such as Alzheimer’s disease, and intra-subject population scale variations. In the LDDMM context, these effects are sometimes modelled using the random orbit model [45]. The random orbit approach models variability in the observed data by using an ensemble of initial velocities in matching a template to a set of observations via geodesic flows, see [62]. The randomness is confined to the initial velocity as opposed to the evolving stochastic processes used in the present work. A prior can be defined by assuming a distribution of the initial velocities, and Bayesian approaches can then be used for inference of the template shape as well as additional unknown parameters [1, 41, 67]. The stochastic model developed here can also be applied to model random warps and to generate distributions used in Bayesian shape modelling, and for coupling warps and functional variations such as those in [40, 51]. Indeed, because the proposed probabilistic approach assigns a likelihood to random deformations, the model can be used for general likelihood-based inference tasks.

In the present model, the observed shape variability indicates the required spatial correlation of the noise, which must be specified or inferred for each application. As this correlation is generally unknown, estimating the parameters of the correlation structure becomes an important part of the framework. We will address the problem of inferring the noise parameters by considering two different methods in the context of the representation of shapes by landmarks. The first method is based on estimating the time evolution of the probability distribution of each landmark. We will derive a set of differential equations approximating the time evolution of the complete distribution via its first moments. We can then solve the inverse problem of estimating the noise correlation from known initial and final distributions of landmarks by minimizing a certain cost function, using a genetic algorithm. The second method is based on an expectation-maximization (EM) algorithm, which can infer unknown parameters for a parametric statistical model from observed data. In this context, since only the initial and final landmark positions are observed, the full stochastic trajectories are regarded as missing information. For this algorithm, we need to estimate the likelihood of stochastic paths connecting sets of observed landmarks. We achieve this by adapting the theory of diffusion bridges to the stochastic landmark equation. As discussed in the concluding remarks, inference methods for other data structures, in particular for infinite-dimensional shape representations, are not treated in this paper and are left as outstanding problems for future work.

Finally, we wish to mention that multiple additional approaches to shape analysis exist outside the LDDMM context, particularly exemplified by the Kendall shape spaces [37], see also [18]. We focus this paper on the LDDMM framework, leaving possibilities for extending the presented methods to include stochastic dynamics and noise inference in other shape analysis approaches to future work.

1.1 Plan of this Work

We begin by developing a general theory of stochastic perturbations for inexact matching in Sect. 2. We then focus on exact landmark matching in Sect. 3, which is the simplest example of this theory. In particular, we derive the Fokker–Planck equation in Sect. 3.2 and diffusion bridge simulation in Sect. 3.3. In Sect. 4, we describe the two methods we use for estimating parameters of the noise from observations. The Fokker–Planck based method is discussed in Sect. 4.2 and the expectation-maximization algorithm is treated in Sect. 4.3. We end the paper with numerical examples in Sect. 5, in which we investigate the effect of the noise on landmark dynamics and compare the two methods for estimating the noise amplitude.

2 Stochastic Large Deformation Matching

In this section, we will first review the geometrical framework of LDDMM, following [8], and then introduce noise following [25] to preserve the geometrical structure of LDDMM. The key ingredient for both topics is the momentum map, which we will use as the main tool for reducing the infinite-dimensional equation on the diffeomorphism group to equations on shape spaces.

2.1 The Deterministic LDDMM Model

Here, we will briefly review the theory of reduction by symmetry, as applied to the LDDMM context, following the presentation of [8]. We detail the proof of the formulas below in the next section when we include noise. Define an energy functional E by

$$\begin{aligned} E(u_t) = \int _0^1 l( u_t) \mathrm{d}t + \frac{1}{2\lambda ^2} \Vert g_1.I_0 - I_1\Vert ^2, \end{aligned}$$
(2.1)

where \(I_0,I_1\in V\) are shapes represented in a vector space V on which the diffeomorphism group \(\mathrm {Diff}({\mathbb {R}}^d)\) acts, \(u_t\) is a time-dependent vector field, and \(\lambda \) is a weight, or tolerance, which allows the matching to be inexact. The flow \(g_t\in \mathrm {Diff}({\mathbb {R}}^d)\) corresponding to \(u_t\) is found by solving the reconstruction relation

$$\begin{aligned} \partial _t g_t = u_t g_t, \end{aligned}$$
(2.2)

and \(I_0\) is matched against \(I_1\) through the action \(g_1.I_0\) of \(g_1\) on \(I_0\). The vector field \(u_t\) can be considered an element of the Lie algebra \({\mathfrak {X}}({\mathbb {R}}^d)\). In the case of \(I_0,I_1\) being images \(I: {\mathbb {R}}^d\rightarrow {\mathbb {R}}\), the action is by push-forward, \(g.I=I\circ g^{-1}\), and when I represents N landmarks with positions \({\mathbf {q}}_i\in {\mathbb {R}}^d\), the action is by evaluation \(g.{\mathbf {q}} = (g({\mathbf {q}}_1),\ldots , g({\mathbf {q}}_N))\) (see [8] for more details). The group elements can act on various additional shape structures such as tensor fields.
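The two actions just described can be sketched in a few lines. The example below is purely illustrative: it assumes a translation of \({\mathbb {R}}^2\) as the diffeomorphism, and the helper names `act_on_landmarks` and `act_on_image` are hypothetical.

```python
import numpy as np

def act_on_landmarks(g, q):
    """Action by evaluation: g.q = (g(q_1), ..., g(q_N))."""
    return np.array([g(qi) for qi in q])

def act_on_image(g_inv, image):
    """Action by push-forward: (g.I)(x) = I(g^{-1}(x)); returns a new function."""
    return lambda x: image(g_inv(x))

# Assumed diffeomorphism of R^2: a translation and its inverse
shift = np.array([0.5, -0.25])
g = lambda x: x + shift
g_inv = lambda x: x - shift

# An image with a single Gaussian bump centred at c
c = np.array([1.0, 1.0])
image = lambda x: np.exp(-np.sum((x - c) ** 2))

# Two landmarks, the second placed on the bump
q = np.array([[0.0, 0.0], c])
q_moved = act_on_landmarks(g, q)
image_moved = act_on_image(g_inv, image)
```

With these definitions, the pushed-forward image evaluated at a moved landmark agrees with the original image at the original landmark, which is the sense in which both data structures are transported by the same flow.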

Remark 2.1

(Nonlinear shape structures) This framework can be extended to structures that are not represented by a vector space V, such as curves or surfaces. We leave this extension for future work.

Using the calculus of variations for the functional (2.1) results in the equation of motion for \(u_t\) of the form

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t}\frac{\delta l}{\delta u} + \mathrm {ad}^*_{u_t} \frac{\delta l}{\delta u}=0, \end{aligned}$$
(2.3)

which is called the Euler–Poincaré equation. The operation \(\mathrm {ad}^*\) is the coadjoint action of the Lie algebra of vector fields associated with the diffeomorphism group. It acts on the variations \({\delta l}/{\delta u}\), which are 1-form densities, in the dual of the Lie algebra of vector fields, under the \(L^2\) pairing. When l(u) is defined by a norm, this equation is the geodesic equation for that norm, in the case that \(\lambda =\infty \); that is, with exact matching. We will focus on this case later in Sect. 3 when discussing landmark dynamics. Here, the inexact matching term constrains the form of the momentum \(m= \frac{\delta l}{\delta u}\) to depend on the geodesic path. Following the notation of [8], the momentum map is defined as

$$\begin{aligned} m(t)= - \frac{1}{\lambda ^2} J_t^0\diamond ( g_{t,1}(J_1^0-J_1^1)^\flat ), \end{aligned}$$
(2.4)

where \(g_{t,s}\) is the solution of (2.2) at time t with initial conditions at time s, while \(J_t^0 = g_{t,0} I_0\) and \(J_t^1= g_{t,1} I_1\). The value \(J_1^0\) corresponds to the initial shape, pushed forward to time \(t=1\), and \(J_1^1= I_1\) is the target shape.

The operations \(\diamond \) and \(\flat \) in the momentum map formula (2.4) are defined as follows. The Lagrangian l in (2.1) may be taken as kinetic energy, which defines a scalar product and norm \(l(u) = \frac{1}{2}\langle u,L u\rangle _{L^2}= \frac{1}{2}\Vert u\Vert ^2_{L}\) on the space of vector fields \({\mathfrak {X}}(\mathbb R^d)\). The quantity \(Lu={\delta l}/{\delta u}\) may then be regarded as the momentum conjugate to the velocity u. Similarly, for the image data space V, we define the dual space \(V^*\) with the \(L^2\) pairing \(\langle f,I\rangle = \int _\Omega f(x)I(x) \mathrm{d}x\), where \(f\in V^*\) and \(\Omega \subset {\mathbb {R}}^d\) is the image domain. This identification defines the \(\flat \) operator as \(\flat : V\rightarrow V^*\). When an element \(g_t\) of the diffeomorphism group acts on V by push-forward, \(I_t=g_t.I_0 = (g_t)_*I_0\), the corresponding infinitesimal action of the velocity u in the Lie algebra of vector fields \(u\in {\mathfrak {X}}({\mathbb {R}}^d)\) is given by \(u.I:= [g_t^*\frac{d}{dt}(g_t)_*I_0]_{t=0}\). In terms of this infinitesimal action, we can then define the operation \(\diamond :V\times V^*\rightarrow {\mathfrak {g}}^*\) as

$$\begin{aligned} \langle I\diamond f, u\rangle _{{\mathfrak {g}} \times {\mathfrak {g}}^*}:= \langle f, u.I\rangle _{V\times V^*}\, . \end{aligned}$$
(2.5)

A detailed derivation of this formula for the momentum map can be found in [8].
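As a worked illustration of definition (2.5), consider N landmarks. We assume here that the infinitesimal action is by evaluation, \(u.{\mathbf {q}} = (u({\mathbf {q}}_1),\ldots ,u({\mathbf {q}}_N))\), and that positions are paired with dual variables \({\mathbf {p}}=({\mathbf {p}}_1,\ldots ,{\mathbf {p}}_N)\) through the Euclidean inner product; the symbol \({\mathbf {p}}\) plays the role of \(f\) in (2.5). Under these assumptions,

$$\begin{aligned} \langle {\mathbf {q}}\diamond {\mathbf {p}}, u\rangle = \sum _{i=1}^N {\mathbf {p}}_i\cdot u({\mathbf {q}}_i) = \int \Big ( \sum _{i=1}^N {\mathbf {p}}_i\, \delta ({\mathbf {x}}-{\mathbf {q}}_i)\Big ) \cdot u({\mathbf {x}})\, \mathrm{d}{\mathbf {x}}\, , \end{aligned}$$

so that \({\mathbf {q}}\diamond {\mathbf {p}} = \sum _{i=1}^N {\mathbf {p}}_i\, \delta ({\mathbf {x}}-{\mathbf {q}}_i)\). This is the singular momentum supported on the landmarks that reappears in (3.1).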

Remark 2.2

(Solving this equation) We will just add here the important remark that the relation (2.4) introduces nonlocality into the problem, as the momentum implicitly depends on the value of the group at later times. This is exactly what is needed in order to solve the boundary value problem coming from the matching of images \(I_1\) and \(I_0\). The optimal vector field can be found with a shooting method or a gradient descent algorithm on the energy functional (2.1), see [6]. For more information about the relation of the momentum map approach of [8] to the LDDMM approach of [6], see [9].

2.2 Stochastic Reduction Theory

The aim here is to introduce noise in the Euler–Poincaré equation (2.3) while preserving the momentum map (2.4), so that the noise descends to the shape spaces. Following [25], we introduce noise in the reconstruction relation (2.2) and proceed with the theory of reduction by symmetry. We will focus on a noise described by a set of J real-valued independent Wiener processes \(W^l_t\), \(l=1,\ldots ,J\), together with J associated vector fields \(\sigma _l\in {\mathfrak {X}}({\mathbb {R}}^d)\) on the data domain. We will later discuss particular forms of these fields and methods for estimating unknown parameters of the fields in the context of landmark matching.

Remark 2.3

(Dimension of the noise) We proceed here with a finite number J of vector fields, and hence with finite-dimensional noise, leaving possible extensions to infinite-dimensional noise, such as that considered in [64], for later work.

We replace the reconstruction relation (2.2) by the following stochastic process

$$\begin{aligned} \mathrm{d}g_t = u_t g_t \mathrm{d}t + \sum _{l=1}^J \sigma _l g_t \circ \mathrm{d}W^l_t, \end{aligned}$$
(2.6)

where \(\circ \) denotes Stratonovich integration. That is, the Lie group trajectory \(g_t\) is now a stochastic process. With this noise construction, the previous derivations of (2.3) and (2.4) in [8] still apply and we obtain the following result for the stochastic vector field, \(u_t\).

Proposition 2.4

Under stochastic perturbations of the form (2.6), the momentum map (2.4) persists, and the Euler–Poincaré equation takes the form

$$\begin{aligned} \mathrm{d} \frac{\delta l}{\delta u} + \mathrm {ad}^*_{u_t} \frac{\delta l}{\delta u}\mathrm{d}t+ \sum _{l=1}^J \mathrm {ad}^*_{\sigma _l} \frac{\delta l}{\delta u}\circ \mathrm{d}W^l_t=0\, . \end{aligned}$$
(2.7)

Proof

We first show that the momentum map formula (2.4) persists in the presence of noise. The key step in its computation is to prove the formula in Lemma 2.5 of [8], which is given by \(\partial _t (g^{-1} \delta g ) = \mathrm {Ad}_g\delta u\), where \(\mathrm {Ad}\) is the adjoint action of the diffeomorphism group on its Lie algebra. To this end, we compute the variations of (2.6)

$$\begin{aligned} \delta \mathrm{d} g_t = \delta u g \mathrm{d}t + u \delta g \mathrm{d}t + \sum _{l=1}^J \sigma _l \delta g \circ \mathrm{d}W_t^l, \end{aligned}$$

and then prove this formula by a direct computation

$$\begin{aligned} \mathrm{d} ( g^{-1} \delta g)&= - g^{-1} \mathrm{d}g g^{-1}\delta g + g^{-1} \mathrm{d} \delta g \\&= - g^{-1}( u \mathrm{d}t {+} \sum _{l=1}^J \sigma _l\circ \mathrm{d}W_t^l ) \delta g + g^{-1}( \delta u g \mathrm{d}t + u \delta g \mathrm{d}t {+} \sum _{l=1}^J \sigma _l \delta g \circ \mathrm{d}W_t^l) \\&= g^{-1} \delta u g\, \mathrm{d}t\\&:= \mathrm {Ad}_g \delta u\, \mathrm{d}t \, . \end{aligned}$$

This key formula is the same as in [6] and [8] for the deterministic case. In particular, it does not explicitly depend on the Wiener processes \(W_t^l\). This ensures that the momentum map formula (2.4) remains the same as in the deterministic case. The last step of the proof is to derive the stochastic Euler–Poincaré equation (2.7). This is done by computing the stochastic evolution of the momentum, given by

$$\begin{aligned} \frac{\delta l}{\delta u} =\mathrm {Ad}^*_{g^{-1}} ( I_0\diamond (g_1^{-1} \pi )), \quad \mathrm {where} \quad \pi = \frac{1}{\lambda ^2} (g_1 I_0 - I_1)^\flat \, . \end{aligned}$$

The only time dependence is in the coadjoint action, and, by the standard formula

$$\begin{aligned} \mathrm{d} \mathrm {Ad}^*_{g^{-1}} \eta = - \mathrm {ad}^*_{dg g^{-1}} \mathrm {Ad}^*_{g^{-1}} \eta , \end{aligned}$$

we obtain the result

$$\begin{aligned} \mathrm{d}\frac{\delta l }{\delta u}&= \mathrm{d} \mathrm {Ad}^*_{g^{-1}} ( I_0\diamond (g_1^{-1} \pi ))\\&= - \mathrm {ad}^*_{\mathrm{d}g g^{-1}}\mathrm {Ad}^*_{g^{-1}} ( I_0\diamond (g_1^{-1} \pi ))\\&= - \mathrm {ad}^*_{\mathrm{d}g g^{-1}}\frac{\delta l }{\delta u}, \end{aligned}$$

where we have used the stochastic reconstruction relation (2.6) in the form

$$\begin{aligned} \mathrm{d}g g^{-1}= u \mathrm{d}t + \sum _{l=1}^J \sigma _l \circ \mathrm{d}W_t^l\,. \end{aligned}$$

\(\square \)

In summary, this stochastic perturbation of the LDDMM framework preserves the form of momentum map (2.4), although it does affect the reconstruction relation (2.6) and the Euler–Poincaré equation (2.7). As shown in [8], various data structures fit into this framework including landmarks, images, shapes, and tensor fields. In practice, for inexact matching, a gradient descent algorithm can be used to minimize the energy functional (2.1). The noise will only appear in the evaluation of the matching cost via the reconstruction relation. The algorithm of [6] then directly applies, provided the stochastic reconstruction relation can be integrated with enough accuracy. We will not treat the full inexact matching problem here. Instead, we will study the simpler case of exact matching, where the energy functional consists only of the kinetic term.

The exact matching problem in computational anatomy possesses many parallels with the geometric approach to classical mechanics and ideal fluid dynamics. In particular, Poincaré’s fundamental paper in 1901, which started the field of geometric mechanics in finite dimensions, has recently been generalized to the stochastic case [14]. In addition, the fundamental analytical properties of Euler’s fluid equations have been shown to extend to the stochastic case in [13].

We expect that these advances in the analysis of SPDEs occurring in fluid dynamics and other parallel fields will inform computational anatomy, and eventually will apply to infinite-dimensional representations of shape. One reason for our optimism is that the fundamental analytical properties of incompressible Euler fluid dynamics in three dimensions have already been found in [13] to persist under the introduction of the present type of stochasticity. Namely, the properties of local-in-time existence and uniqueness, as well as the Beale–Kato–Majda criterion for blow-up for the deterministic 3D Euler fluid motion equations, all persist in detail for stochastic Euler fluid motion, under the introduction of the type of stochastic Lie transport by cylindrical Stratonovich noise that we have proposed here for stochastic shape analysis.

The persistence of deterministic analytical properties in passing to the SPDEs governing stochastic 3D incompressible continuum fluid dynamics is a type of infinite-dimensional result that has not yet been proven for the evolution of shapes. The corresponding results in the analysis of SPDEs for embeddings, immersions and curves representing data structures for shape evolution, for example, have not yet been discovered, and they remain now as outstanding open problems. However, we believe that the prospects for successfully performing the necessary analysis are hopeful because the type of noise we propose here preserves the fundamental properties of diffeomorphic flow for both continuum fluids and shapes. For example, the momentum maps for the deterministic and stochastic evolution of shapes of any data structure are identical. Thus, the only difference in the present approach from the deterministic case is that the diffeomorphic time evolution of the various shape momentum maps proceeds by the action of Lie derivative by a stochastic vector field, instead of a deterministic one. Since the stochastic part of the vector field is as smooth as we wish, we are hopeful that the analytical properties for the deterministic evolution of a large class of infinite-dimensional representations of shape (such as smooth embeddings) will also persist under the introduction of the type of stochastic transport proposed here. For the remainder of the paper, we restrict ourselves to the treatment of stochastic landmark dynamics.

3 Exact Stochastic Landmark Matching

In this section, we apply the previous ideas of stochastic deformation of LDDMM to exact matching with landmark dynamics. This is the simplest data structure in the LDDMM framework, and it will serve to give interesting insights into the effect of the noise in this context. Since exact matching means that the energy functional contains only a kinetic energy, the optimal vector field is found from a boundary value problem with the Euler–Poincaré equation (2.3). For exact matching, the singular momentum map for landmarks takes the simple familiar form for the reduction of the EPDiff equation (see [11, 26])

$$\begin{aligned} {\mathbf {m}}({\mathbf {x}},t)= \sum _{i=1}^N {\mathbf {p}}_i(t) \delta (\mathbf x-{\mathbf {q}}_i(t)), \end{aligned}$$
(3.1)

for N landmarks with momenta \({\mathbf {p}}_i\) and positions \(\mathbf q_i\), with \(i=1,2,\dots ,N\). A direct substitution of \(u= K*m\) into the stochastic Euler–Poincaré equation (2.7) gives the stochastic landmark equations in (3.6). Here, K is a given kernel corresponding to the Green’s function of the differential operator L used to construct the Lagrangian. Below, we take a different approach and proceed from a variational principle in which the stochastic landmark dynamics is constrained. We refer the interested reader to, e.g., [34] for a detailed exposition of this derivation in the deterministic context.

3.1 Stochastic Landmarks Dynamics

Recall that for N landmarks in \({\mathbb {R}}^d\), the diffeomorphism group elements g act on the landmarks by evaluation of their position \(g.{\mathbf {q}}= (g({\mathbf {q}}_1),\ldots , g({\mathbf {q}}_N))\), and the associated momentum map is (3.1). The original action functional (2.1) can be equivalently written as a constrained variational principle where the \({\mathbf {p}}_i\) play the role of Lagrange multipliers enforcing the stochastic reconstruction relation (2.6). This procedure is based on the Clebsch action principle, which for landmark dynamics has been studied for one-dimensional motion of landmarks on the real line in [32]

$$\begin{aligned} S({\mathbf {u}}, {\mathbf {q}},{\mathbf {p}}) = \iint l({\mathbf {u}})\, \mathrm{d}\mathbf x\, \mathrm{d}t + \sum _i \int {\mathbf {p}}_i\cdot \left( \circ \mathrm{d}{\mathbf {q}}_i - {\mathbf {u}}( {\mathbf {q}}_i)\, \mathrm{d}t - \sum _l \sigma _l({\mathbf {q}}_i)\circ \mathrm{d}W_t^l\right) \, . \end{aligned}$$
(3.2)

Notice that only the Lagrangian depends on the spatial (Eulerian) variable \({\mathbf {x}}\) on the image domain. We now use the singular momentum map (3.1) which provides us with the relation

$$\begin{aligned} 2\,l({\mathbf {u}})= \int {\mathbf {m}}({\mathbf {q}},{\mathbf {p}})({\mathbf {x}})\cdot {\mathbf {u}}({\mathbf {x}}) \mathrm{d}{\mathbf {x}} = \sum _i {\mathbf {p}}_i \cdot \mathbf u({\mathbf {q}}_i)\, . \end{aligned}$$

This relation reduces the action functional (3.2) to the finite-dimensional space of landmarks: the kinetic term contributes \(h\) and the constraint term contributes \(-\sum _i {\mathbf {p}}_i\cdot {\mathbf {u}}({\mathbf {q}}_i) = -2h\), with \(h\) given in (3.4). We arrive at the action integral

$$\begin{aligned} S({\mathbf {q}},{\mathbf {p}})&= -\int h({\mathbf {q}},{\mathbf {p}}) \,\mathrm{d}t + \sum _i \int {\mathbf {p}}_i\cdot \left( \circ \,\mathrm{d}{\mathbf {q}}_i - \sum _{l} \sigma _l({\mathbf {q}}_i) \circ \mathrm{d}W_t^l\right) \, , \end{aligned}$$
(3.3)

where the Hamiltonian only depends on the landmark variables, as

$$\begin{aligned} h({\mathbf {q}},{\mathbf {p}}) = \frac{1}{2} \sum _{i,j=1}^N ({\mathbf {p}}_i\cdot {\mathbf {p}}_j) K({\mathbf {q}}_i-{\mathbf {q}}_j)\, . \end{aligned}$$
(3.4)
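The Hamiltonian (3.4) is straightforward to evaluate numerically. The sketch below assumes a Gaussian kernel \(K\) (one common choice, taken here only for illustration) and checks by central finite differences that \(\partial h/\partial {\mathbf {p}}_i = \sum _j {\mathbf {p}}_j K({\mathbf {q}}_i-{\mathbf {q}}_j)\), which is the landmark velocity. The function names and the length scale `ell` are hypothetical.

```python
import numpy as np

def kernel(r2, ell=1.0):
    """Gaussian kernel K as a function of squared distance (an assumed choice)."""
    return np.exp(-r2 / (2.0 * ell**2))

def hamiltonian(q, p, ell=1.0):
    """h(q, p) = 1/2 sum_ij (p_i . p_j) K(q_i - q_j), as in (3.4)."""
    diff = q[:, None, :] - q[None, :, :]
    K = kernel(np.sum(diff**2, axis=-1), ell)
    return 0.5 * np.sum((p @ p.T) * K)

def velocity(q, p, ell=1.0):
    """dh/dp_i = sum_j p_j K(q_i - q_j), the landmark velocity."""
    diff = q[:, None, :] - q[None, :, :]
    K = kernel(np.sum(diff**2, axis=-1), ell)
    return K @ p

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 2))
p = rng.normal(size=(4, 2))

# Central finite-difference approximation of dh/dp
eps = 1e-6
fd = np.zeros_like(p)
for i in range(p.shape[0]):
    for a in range(p.shape[1]):
        dp = np.zeros_like(p)
        dp[i, a] = eps
        fd[i, a] = (hamiltonian(q, p + dp) - hamiltonian(q, p - dp)) / (2 * eps)
```

Since \(h\) is quadratic in the momenta, the central difference `fd` agrees with `velocity(q, p)` up to round-off.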

The action integral in (3.3) involves the Hamiltonian (3.4) and the stochastic potentials, given by

$$\begin{aligned} \phi _l({\mathbf {q}},{\mathbf {p}}):= \sum _i {\mathbf {p}}_i\cdot \sigma _l({\mathbf {q}}_i)\, . \end{aligned}$$
(3.5)

Taking free variations of (3.3) gives the stochastic Hamilton equations in the form

$$\begin{aligned} \begin{aligned} \mathrm{d} {\mathbf {q}}_i&= \frac{\partial h}{\partial {\mathbf {p}}_i} \mathrm{d}t + \sum _l \frac{\partial \phi _l}{\partial {\mathbf {p}}_i} \circ \mathrm{d} W_t^l\,,\\ \mathrm{d} {\mathbf {p}}_i&= -\frac{\partial h}{\partial {\mathbf {q}}_i} \mathrm{d}t - \sum _l\frac{\partial \phi _l}{\partial {\mathbf {q}}_i} \circ \mathrm{d} W_t^l\, . \end{aligned} \end{aligned}$$
(3.6)

Explicitly, we have

$$\begin{aligned} \begin{aligned} \mathrm{d} {\mathbf {q}}_i&= \sum _j {\mathbf {p}}_jK({\mathbf {q}}_i-{\mathbf {q}}_j) \mathrm{d}t + \sum _l\sigma _l({\mathbf {q}}_i) \circ \mathrm{d} W_t^l \,,\\ \mathrm{d} {\mathbf {p}}_i&= -\sum _j {\mathbf {p}}_i\cdot {\mathbf {p}}_j\frac{\partial }{\partial {\mathbf {q}}_i}K({\mathbf {q}}_i-{\mathbf {q}}_j)\ \mathrm{d}t - \sum _l \frac{\partial }{\partial {\mathbf {q}}_i}\left( {\mathbf {p}}_i \cdot \sigma _l({\mathbf {q}}_i)\right) \circ \mathrm{d} W_t^l \,. \end{aligned} \end{aligned}$$
(3.7)

In coordinates, the stochastic equations (3.6) become

$$\begin{aligned} \begin{aligned} \mathrm{d} q_i^\alpha&= \frac{\partial h}{\partial p_i^\alpha } \mathrm{d}t + \sum _l \sigma _l^\alpha ({\mathbf {q}}_i) \circ \mathrm{d} W_t^l \,, \\ \mathrm{d} p_i^\alpha&= -\frac{\partial h}{\partial q_i^\alpha } \mathrm{d}t - \sum _{l,\beta }\frac{\partial \sigma _l^\beta ({\mathbf {q}}_i)}{\partial q_i^\alpha } p_i^\beta \circ \mathrm{d} W_t^l , \end{aligned} \end{aligned}$$
(3.8)

where \(\alpha , \beta \) run through the domain directions, \(\alpha ,\beta =1,\ldots ,d\).
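The explicit equations (3.7) can be integrated numerically with a scheme that is consistent with Stratonovich calculus. The following is a minimal sketch using the stochastic Heun (predictor-corrector) method. It is not necessarily the integrator used for the experiments in this paper, and it rests on several assumptions made for illustration: a Gaussian kernel \(K\) of width `ell`, noise fields of the form \(\sigma _l({\mathbf {x}}) = \lambda _l \exp (-|{\mathbf {x}}-{\mathbf {c}}_l|^2/(2w^2))\) with constant vectors \(\lambda _l\) (`lam`) and centres \({\mathbf {c}}_l\) (`centers`), and hypothetical function names and parameter values.

```python
import numpy as np

def drift(q, p, ell):
    """Deterministic part of (3.7) for a Gaussian kernel K of width ell."""
    diff = q[:, None, :] - q[None, :, :]                    # q_i - q_j, shape (N, N, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * ell**2))  # K(q_i - q_j)
    dq = K @ p                                              # sum_j p_j K_ij
    # -sum_j (p_i . p_j) dK/dq_i, with dK/dq_i = -(q_i - q_j) K_ij / ell^2
    dp = np.einsum("ij,ijk->ik", (p @ p.T) * K, diff) / ell**2
    return dq, dp

def noise(q, p, centers, lam, width, dW):
    """Noise part of (3.7), summed over l, for Wiener increments dW of shape (J,).

    Assumed fields: sigma_l(x) = lam_l * exp(-|x - c_l|^2 / (2 width^2)).
    """
    diff = q[:, None, :] - centers[None, :, :]              # q_i - c_l, shape (N, J, d)
    G = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * width**2))
    dq = (G * dW) @ lam                                     # sum_l sigma_l(q_i) dW^l
    # -d/dq_i (p_i . sigma_l(q_i)) = (p_i . lam_l) G_l(q_i) (q_i - c_l) / width^2
    dp = np.einsum("il,ilk->ik", (p @ lam.T) * G * dW, diff) / width**2
    return dq, dp

def heun_step(q, p, dt, dW, ell, centers, lam, width):
    """One Stratonovich-Heun (predictor-corrector) step."""
    aq, ap = drift(q, p, ell)
    bq, bp = noise(q, p, centers, lam, width, dW)
    q_pred, p_pred = q + aq * dt + bq, p + ap * dt + bp
    aq2, ap2 = drift(q_pred, p_pred, ell)
    bq2, bp2 = noise(q_pred, p_pred, centers, lam, width, dW)
    q_new = q + 0.5 * (aq + aq2) * dt + 0.5 * (bq + bq2)
    p_new = p + 0.5 * (ap + ap2) * dt + 0.5 * (bp + bp2)
    return q_new, p_new

def simulate(q0, p0, centers, lam, ell=0.5, width=0.5, T=1.0, steps=200, seed=0):
    """Integrate from t=0 to t=T and return the final (q, p)."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    q, p = q0.copy(), p0.copy()
    for _ in range(steps):
        dW = rng.normal(scale=np.sqrt(dt), size=centers.shape[0])
        q, p = heun_step(q, p, dt, dW, ell, centers, lam, width)
    return q, p

# Hypothetical set-up: 20 landmarks on a circle, pushed in the x direction
theta = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
q0 = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
p0 = np.tile([0.1, 0.0], (20, 1))

# 3 x 3 grid of centres, each carrying one field along x and one along y
g = np.linspace(-1.5, 1.5, 3)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
centers = np.concatenate([grid, grid])
lam = 0.1 * np.concatenate([np.tile([1.0, 0.0], (9, 1)), np.tile([0.0, 1.0], (9, 1))])

q1, p1 = simulate(q0, p0, centers, lam)
```

Two structural features of (3.7) are visible in this sketch. Setting `lam` to zero recovers the deterministic landmark equations, for which the total momentum \(\sum _i {\mathbf {p}}_i\) is conserved with a translation-invariant kernel. Setting the initial momenta to zero leaves the momenta at zero for all times, because the noise in the momentum equation is linear in \({\mathbf {p}}_i\), while the positions are still transported by the noise fields.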

In order to have a unique strong solution of this stochastic differential equation, we need the drift and volatility to be Lipschitz functions with a linear growth condition after converting to Itô form, and for the volatility to be uniformly bounded, see [36]. This requirement is achieved when the functions \(\sigma _l\) are twice continuously differentiable and uniformly bounded in the position variable. The latter property will hold with these functions being \(C^2\) kernel functions. The particular form of the stochastic potential in (3.5) arises from the Legendre transformation of (3.2). The solutions of (3.8) represent the singular solutions of the stochastic EPDiff equation, corresponding to a stochastic path in the diffeomorphism group. In previous works such as [44, 61, 64], noise has been introduced additively and only in the momentum equation, corresponding to a stochastic force. Also, the noise has typically been taken to be different for each landmark, and one can interpret it as being attached to each landmark. In the present case, the noise is not additive and the Wiener processes are not related to the landmarks, but to the domain of the image. Nearby landmarks will thus be affected by a similar noise, controlled by the spatial correlations of the noise. We refer to Fig. 1 in the Introduction for a numerical experiment demonstrating this effect.
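To make the conversion to Itô form concrete, the following worked equation applies the standard rule \(\mathrm{d}X = a\,\mathrm{d}t + \sum _l b_l(X)\circ \mathrm{d}W_t^l \;\Rightarrow \; \mathrm{d}X = \big (a + \tfrac{1}{2}\sum _l (b_l\cdot \nabla _X) b_l\big )\mathrm{d}t + \sum _l b_l(X)\, \mathrm{d}W_t^l\) to the coordinate form (3.8). It is a sketch obtained under the assumption that the \(\sigma _l\) are twice differentiable; repeated Greek indices \(\beta ,\gamma \) are summed, and all fields are evaluated at \({\mathbf {q}}_i\).

$$\begin{aligned} \begin{aligned} \mathrm{d} q_i^\alpha&= \left( \frac{\partial h}{\partial p_i^\alpha } + \frac{1}{2}\sum _l \sigma _l^\beta \frac{\partial \sigma _l^\alpha }{\partial q_i^\beta }\right) \mathrm{d}t + \sum _l \sigma _l^\alpha \, \mathrm{d} W_t^l \,, \\ \mathrm{d} p_i^\alpha&= \left( -\frac{\partial h}{\partial q_i^\alpha } + \frac{1}{2}\sum _l \left[ \frac{\partial \sigma _l^\gamma }{\partial q_i^\alpha }\frac{\partial \sigma _l^\beta }{\partial q_i^\gamma } - \sigma _l^\gamma \frac{\partial ^2 \sigma _l^\beta }{\partial q_i^\gamma \partial q_i^\alpha }\right] p_i^\beta \right) \mathrm{d}t - \sum _{l}\frac{\partial \sigma _l^\beta }{\partial q_i^\alpha } p_i^\beta \, \mathrm{d} W_t^l \, . \end{aligned} \end{aligned}$$

Only the landmark's own variables \(({\mathbf {q}}_i,{\mathbf {p}}_i)\) enter the correction terms, because the noise coefficients of landmark \(i\) do not depend on the other landmarks. The appearance of second derivatives of \(\sigma _l\) in the Itô drift is one way to see why regularity of the noise fields beyond mere continuity enters the conditions stated above.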

Remark 3.1

(Geometric noise) The geometric origin of the Hamiltonian stochastic equations in (3.6) deserves further explanation. In the position equation (3.6), the noise arises as the infinitesimal transformation by the action of the stochastic vector field in (2.6), namely \(\mathrm{d}g g^{-1}= u \mathrm{d}t + \sum _l \sigma _l \circ \mathrm{d}W_t^l\), on the manifold of positions of the landmarks, which is generated by the J stochastic potentials, \(\Phi _l(\mathbf q_i,{\mathbf {p}}_i):= {\mathbf {p}}_i \cdot \sigma _l({\mathbf {q}}_i)\). Since this stochastic Hamiltonian is linear in the canonical momenta, the noise perturbing the evolution of the landmark positions is independent of the landmark momenta. On the other hand, the noise in the momentum equations arises as the cotangent lift of the action of the stochastic vector field \(\mathrm{d}g g^{-1}\) on the positions of the landmarks. This cotangent lift determines the action on the momentum fibres attached to the perturbed position of each of the landmarks in phase space. The cotangent lift transformation is given explicitly by the product of the momentum and the gradient of the spatial fields \(\sigma _l\) with respect to the position \({\mathbf {q}}_i\) of the i-th landmark. This difference increases the effect of the noise in regions where the \(\sigma _l\) fields have large spatial gradients, provided the landmarks are moving rapidly enough for their momenta to be nonnegligible. We will see in the example that in certain cases this balance in the product of the momentum and the spatial gradient of the noise parameters can significantly affect the dynamics of the landmarks.

3.2 The Fokker–Planck Equation

In this section, we study the evolution of the probability density function of the stochastic landmarks by using the Fokker–Planck equation. This study is possible in the case of landmarks because the associated phase space is finite-dimensional.

We will denote the probability density by \({\mathbb {P}}(\mathbf q,{\mathbf {p}},t)\), on the phase space \({\mathbb {R}}^{2dN}\) at time t. The Fokker–Planck equation can be computed using standard procedures and is given in the following proposition.

Proposition 3.2

The Fokker–Planck equation associated with the stochastic process (3.6) for the probability distribution \({\mathbb {P}}:\mathbb R^{2dN}\times {\mathbb {R}}\rightarrow {\mathbb {R}}\) is given by

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t}{\mathbb {P}}({\mathbf {q}},{\mathbf {p}},t ) = \{h,\mathbb P\}_\mathrm {can} + \frac{1}{2} \sum _l \{\phi _l,\{\phi _l,\mathbb P\}_\mathrm {can}\}_\mathrm {can}:= {\mathscr {L}}^* {\mathbb {P}}, \end{aligned}$$
(3.9)

where \(\{F,G\}_\mathrm {can} = \nabla F^T {\mathbb {J}} \nabla G\) is the canonical Poisson bracket with \({\mathbb {J}}=\left( \begin{matrix} 0 &{}1\\ -1 &{}0 \end{matrix}\right) \) and \(\phi _l({\mathbf {q}},{\mathbf {p}})= \sum _i {\mathbf {p}}_i\cdot \sigma _l({\mathbf {q}}_i)\) are the stochastic potentials. This formula also defines the forward Kolmogorov operator, \({\mathscr {L}}^*\).

Proof

The proof follows the standard derivation of the Fokker–Planck equation, by taking into account the geometrical structure of the stochastic process (3.6). The time evolution of an arbitrary function \(f:{\mathbb {R}}^{2dN}\rightarrow {\mathbb {R}}\) can be written as

$$\begin{aligned} \mathrm{d}f({\mathbf {p}},{\mathbf {q}})= \{f,h\}_\mathrm {can} \mathrm{d}t + \sum _l \{f,\phi _l\}_\mathrm {can}\circ \mathrm{d}W_t^l, \end{aligned}$$

as both drift and volatility have the same Hamiltonian form in the Stratonovich formulation. We then compute the Itô correction of this stochastic process, which can be written in double Poisson bracket form; namely, \(\frac{1}{2} \sum _l \{\{ f,\phi _l\}_\mathrm {can},\phi _l\}_\mathrm {can}\mathrm{d}t\). The Itô correction is the quadratic variation of the Stratonovich term in the stochastic differential equation, which equals the nonstochastic part of one half of the time derivative of the volatility (where a squared Brownian increment becomes \(\mathrm{d}t\)). We refer to [2, 14] for a more detailed derivation of this formula in a general setting. Taking the expectation of the Itô process then removes the noise term and defines the backward Kolmogorov operator \(\mathscr {L}\) such that \(\dot{f} = \mathscr {L} f\). Pairing this formula with the density function \(\mathbb P({\mathbf {q}},{\mathbf {p}},t)\) over the phase space \(({\mathbf {q}},\mathbf p)\) using the usual \(L^2\) pairing, as

$$\begin{aligned} \int {\mathbb {P}}({\mathbf {q}} ,{\mathbf {p}},t) \mathscr {L}f({\mathbf {q}} ,{\mathbf {p}}) \mathrm{d}{\mathbf {q}} \mathrm{d}{\mathbf {p}} = \int {\mathscr {L}}^* \mathbb P({\mathbf {q}} ,{\mathbf {p}},t) f({\mathbf {q}} ,{\mathbf {p}}) \mathrm{d}{\mathbf {q}} \mathrm{d}{\mathbf {p}}, \end{aligned}$$

we obtain the Fokker–Planck equation \(\dot{ {\mathbb {P}}}= {\mathscr {L}}^* {\mathbb {P}}\), which is explicitly given by (3.9) as the double bracket term is self-adjoint and the advection term anti-self-adjoint. Notice that here we have used a special property of the Poisson bracket; namely, that the Poisson bracket is also a symplectic 2-form, which is exact and whose integral over the whole phase space vanishes, provided we choose suitable boundary conditions. We again refer to [2, 14] for more details about this derivation. \(\square \)
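The double-bracket form of the Itô correction can be checked numerically in the simplest setting. The sketch below assumes a single landmark in one dimension, a Gaussian noise field with arbitrarily chosen amplitude and length scale, and evaluates the canonical bracket by central finite differences; all names and parameter values are illustrative only. It compares \(\frac{1}{2}\{\{f,\phi \}_\mathrm {can},\phi \}_\mathrm {can}\) for \(f=q\) and \(f=p\) with the closed forms \(\frac{1}{2}\sigma \sigma '\) and \(\frac{1}{2}p(\sigma '^2-\sigma \sigma '')\).

```python
import numpy as np

def bracket(F, G, h=1e-4):
    """Canonical bracket {F,G} = dF/dq dG/dp - dF/dp dG/dq via central differences."""
    def d_q(f, q, p):
        return (f(q + h, p) - f(q - h, p)) / (2 * h)
    def d_p(f, q, p):
        return (f(q, p + h) - f(q, p - h)) / (2 * h)
    return lambda q, p: d_q(F, q, p) * d_p(G, q, p) - d_p(F, q, p) * d_q(G, q, p)

# Assumed one-dimensional Gaussian noise field and its first two derivatives.
lam, r = 0.7, 1.3
sigma = lambda q: lam * np.exp(-q**2 / (2 * r**2))
dsigma = lambda q: -q / r**2 * sigma(q)
d2sigma = lambda q: (q**2 / r**4 - 1 / r**2) * sigma(q)

# Stochastic potential phi(q, p) = p * sigma(q).
phi = lambda q, p: p * sigma(q)

def ito_correction(f):
    """Ito correction (1/2) {{f, phi}, phi} as a function of (q, p)."""
    return lambda q, p: 0.5 * bracket(bracket(f, phi), phi)(q, p)

q0, p0 = 0.4, -1.2
corr_q = ito_correction(lambda q, p: q)(q0, p0)
corr_p = ito_correction(lambda q, p: p)(q0, p0)
exact_q = 0.5 * sigma(q0) * dsigma(q0)
exact_p = 0.5 * p0 * (dsigma(q0)**2 - sigma(q0) * d2sigma(q0))
print(corr_q - exact_q, corr_p - exact_p)  # both differences should be close to zero
```

The nested finite differences limit the attainable accuracy, so the comparison is meaningful only up to a tolerance several orders of magnitude above machine precision.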

Of course, the direct study of this equation is not possible, even numerically, because of its high dimensionality. The main use here of the Fokker–Planck equation will be to understand the time evolution of uncertainties around each landmark. Indeed, for each landmark \({\mathbf {q}}_i\), the corresponding marginal distribution (integrating \({\mathbb {P}}\) over all the other variables) will represent the time evolution of the error on the mean trajectory of this landmark. We will show in the next section how to approximate the Fokker–Planck equation with a finite set of ordinary differential equations which describe the dynamics of the first moments of the distribution \({\mathbb {P}}\). This will then be used to estimate parameters of the noise fields \(\sigma _l\) for given sets of initial and final landmarks.

Remark 3.3

(On ergodicity) The question of ergodicity of the process (3.6) is not relevant here, as we will only consider this process for finite times, usually between \(t=0\) and \(t=1\). The existence of stationary measures of the Fokker–Planck equation via Hörmander’s theorem is thus not needed. Nevertheless, we will rely on a notion of reachability in the landmark position in the next section, where we will show how to sample diffusion bridges for landmarks with fixed initial and final positions. This ensures that there exists a noise realization which can bring any set of landmarks to any other set of landmarks. This property is weaker than the Hörmander condition and was introduced in [58].

3.3 Diffusion Bridges

The transition probability and solution to the Fokker–Planck equation \({\mathbb {P}}({\mathbf {q}},{\mathbf {p}},t)\) can also be estimated by Monte Carlo sampling of diffusion bridges. This approach will, in particular, be natural for maximum likelihood estimation of parameters of landmark processes using the expectation-maximization (EM) algorithm that will involve expectation over unobserved landmark trajectories, or for direct optimization of the data likelihood. The EM estimation approach will be used in Sect. 4.3. Here, we develop a theory of conditioned bridge processes for landmark dynamics which we will employ in the estimation. The approach is based on the method of [15] with two main modifications. The scheme and its modifications will be detailed after a short description of the general situation. Alternative methods for simulating conditioned diffusion bridges can be found in, e.g. [7, 50, 52].

In [15], a Girsanov formula [22], generalized to account for unbounded drifts, is used to show that when the diffusion field \(\Sigma ({\mathbf {x}},t)\) of an \({\mathbb {R}}^d\)-valued diffusion process

$$\begin{aligned} \mathrm{d}{\mathbf {x}} = b({\mathbf {x}},t)\mathrm{d}t + \Sigma ({\mathbf {x}},t)\mathrm{d}W \ ,\ \mathbf x_0={\mathbf {u}} \end{aligned}$$
(3.10)

is uniformly invertible, the corresponding process conditioned on hitting a point \({\mathbf {v}}\in {\mathbb {R}}^d\) at time \(T>0\) is absolutely continuous with respect to an explicitly constructed unconditioned process \(\hat{{\mathbf {x}}}\) that will hit \({\mathbf {v}}\) at time T almost surely (a.s.). The modified process \(\hat{{\mathbf {x}}}\) is constructed by adding an extra drift term that forces the process towards the target \({\mathbf {v}}\). In [15], this process is constructed as a modification of (3.10)

$$\begin{aligned} \mathrm{d}\hat{{\mathbf {x}}} = b(\hat{{\mathbf {x}}},t)\mathrm{d}t - \frac{\hat{\mathbf x}-{\mathbf {v}}}{T-t} \mathrm{d}t + \Sigma (\hat{{\mathbf {x}}},t)\mathrm{d}W \, . \end{aligned}$$
(3.11)
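A minimal sketch of how the guided process (3.11) can be simulated is given below for a scalar diffusion, using the Euler–Maruyama scheme for the Itô equation. The function name `guided_path`, the example drift and the constant diffusion coefficient are assumptions for illustration and are not taken from [15].

```python
import numpy as np

def guided_path(b, sigma, u, v, T=1.0, steps=1000, rng=None):
    """Euler-Maruyama discretisation of (3.11) for a scalar process.

    b(x, t) and sigma(x, t) are the drift and diffusion coefficient of (3.10),
    u is the starting point and v the target at time T.
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = T / steps
    x = np.empty(steps + 1)
    x[0] = u
    for k in range(steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        guide = -(x[k] - v) / (T - t)      # extra drift pulling the path towards v
        x[k + 1] = x[k] + (b(x[k], t) + guide) * dt + sigma(x[k], t) * dW
    return x

# Example: mean-reverting drift with constant noise, guided from 0 to 1.
path = guided_path(lambda x, t: -x, lambda x, t: 0.5, u=0.0, v=1.0,
                   rng=np.random.default_rng(0))
```

In this discretisation the guiding term in the final step cancels the remaining distance to the target, so the simulated endpoint differs from \({\mathbf {v}}\) only by the last drift and noise increments, which vanish as the step size decreases.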

Letting \(P_{{\mathbf {x}}|{\mathbf {v}}}\) denote the law of \({\mathbf {x}}\) conditioned on hitting \({\mathbf {v}}\) with corresponding expectation \({\mathbb {E}}_{{\mathbf {x}}|{\mathbf {v}}}\), the Cameron–Martin–Girsanov theorem implies that \(P_{{\mathbf {x}}|{\mathbf {v}}}\) is absolutely continuous with respect to \(P_{\hat{{\mathbf {x}}}}\), see for example [49] and the discussion in [50]. An explicit expression for the Radon–Nikodym derivative \(\mathrm{d}P_{{\mathbf {x}}|\mathbf v}/\mathrm{d}P_{\hat{{\mathbf {x}}}}\) can be computed, and this derivative is central for using simulations of the process \(\hat{{\mathbf {x}}}\) to compute expectations over the conditioned process \({\mathbf {x}}|\mathbf v\). In particular, as shown in [15], the conditioned process \({\mathbf {x}}|{\mathbf {v}}\) and the modified process \(\hat{{\mathbf {x}}}\) are related by

$$\begin{aligned} {\mathbb {E}}_{{\mathbf {x}}|{\mathbf {v}}}(f({\mathbf {x}})) = \frac{\mathbb E_{\hat{{\mathbf {x}}}}\Big ( f(\hat{{\mathbf {x}}}) \varphi (\hat{{\mathbf {x}}}) \Big )}{ {\mathbb {E}}_{\hat{{\mathbf {x}}}}(\varphi (\hat{{\mathbf {x}}})) }, \end{aligned}$$
(3.12)

where \(\varphi (\hat{{\mathbf {x}}})\) is a correction factor applied to each stochastic bridge \(\hat{{\mathbf {x}}}\). Notice here that f is a real-valued function of the stochastic path from \(t=0\) to \(t=T\).
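In practice, (3.12) is approximated by a self-normalized Monte Carlo average over simulated bridges. The sketch below assumes that the values \(f(\hat{{\mathbf {x}}})\) and the logarithms of the correction factors have already been computed for each sample; the function name and the additional effective sample size diagnostic are our own additions for illustration.

```python
import numpy as np

def bridge_expectation(f_values, log_phi):
    """Self-normalised estimate of (3.12) from M simulated bridges.

    f_values: array of shape (M,) or (M, k) with f evaluated on each bridge.
    log_phi:  array of shape (M,) with the log correction factor of each bridge.
    Returns the weighted mean and the effective sample size.
    """
    f_values = np.asarray(f_values, dtype=float)
    log_phi = np.asarray(log_phi, dtype=float)
    w = np.exp(log_phi - log_phi.max())    # subtract the maximum to avoid overflow
    w /= w.sum()
    estimate = np.tensordot(w, f_values, axes=1)
    ess = 1.0 / np.sum(w**2)
    return estimate, ess
```

A small effective sample size relative to the number of bridges indicates that few samples dominate the average, which is the situation described as inefficient sampling when the correction factors are low for most paths.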

Returning to landmark evolutions in the phase space \(\mathbb R^{2dN}\), the process (3.6) has two vector variables \(({\mathbf {q}},{\mathbf {p}})\) that typically will be conditioned on hitting a fixed set of landmark positions \({\mathbf {v}}\) at time T. The conditioning thus happens only in the \({\mathbf {q}}\) variables by requiring \({\mathbf {q}}_T={\mathbf {v}}\). To construct bridges with an approach similar to the scheme of [15], we need to find an appropriate extra drift term and handle the fact that the diffusion field may not be invertible in general. Recall first that the landmark process (3.6) has diffusion field

$$\begin{aligned} \Sigma ({\mathbf {q}},{\mathbf {p}}) = \begin{pmatrix} \Sigma _{{\mathbf {q}}}({\mathbf {q}})\\ \Sigma _{{\mathbf {p}}}({\mathbf {q}},{\mathbf {p}}) \end{pmatrix} := \begin{pmatrix} \sigma _1({\mathbf {q}}), \ldots , \sigma _J({\mathbf {q}}) \\ -\nabla _{{\mathbf {q}}}({\mathbf {p}}\cdot \sigma _1({\mathbf {q}})), \ldots , -\nabla _{{\mathbf {q}}}({\mathbf {p}}\cdot \sigma _J({\mathbf {q}})) \end{pmatrix}, \end{aligned}$$
(3.13)

where \(\sigma _j({\mathbf {q}})\) denotes the vector \((\sigma _j(q_1),\ldots ,\sigma _j(q_N))^T\). Notice that this matrix is not square and has dimension \(2dN\times J\) so that \(\Sigma (\mathbf q,{\mathbf {p}}) \circ \mathrm{d}W_t\) with \(\mathrm{d}W_t\) a J-vector corresponds to the stochastic term of (3.6). Though \(\Sigma ({\mathbf {q}},\mathbf p)\) couples the \({\mathbf {q}}\) and \({\mathbf {p}}\) equations, when the number of noise fields J is sufficiently large, the \({\mathbf {q}}\) part \(\Sigma _{{\mathbf {q}}}({\mathbf {q}})\) will often be surjective as a linear map \({\mathbb {R}}^J\rightarrow {\mathbb {R}}^{dN}\). In this situation, by letting \(\Sigma _{{\mathbf {q}}}({\mathbf {q}})^\dagger \) denote the Moore–Penrose pseudo-inverse of \(\Sigma _{{\mathbf {q}}}({\mathbf {q}})\), we can construct a guiding drift term as

$$\begin{aligned} G({\mathbf {q}}, {\mathbf {p}}): = -\frac{ \Sigma ({\mathbf {q}},\mathbf p)\Sigma _{{\mathbf {q}}}({\mathbf {q}})^\dagger ({\mathbf {q}}-\mathbf v)}{T-t}\, . \end{aligned}$$
(3.14)

This term, when added to the process (3.6) and when measures are taken to control the unbounded drift of (3.6), will ensure that the modified process hits the target \({\mathbf {v}}\) a.s. at time T. The drift term (3.14) is a direct generalization of the term added in (3.11). If \(\Sigma \) had been invertible then \(\Sigma \Sigma ^\dagger =\mathrm{Id}\), resulting in the guiding term of [15] used in equation (3.11). In the current noninvertible case, \(\Sigma \Sigma _{\mathbf {q}}^\dagger ({\mathbf {q}}-{\mathbf {v}})\) uses the difference \({\mathbf {q}}-{\mathbf {v}}\), which only involves the landmark positions but affects both the position and the momentum equations. We stress here the fact that the introduction of noise in the \({\mathbf {q}}\) equation in (3.6) is essential in our present approach. When conditioning on the \({\mathbf {q}}\) variable, a guided process could not directly be constructed in this way if the noise had been introduced only in the \({\mathbf {p}}\) equation, as in [44, 61, 64]. The fact that this term is weighted by \(\Sigma \Sigma _{\mathbf {q}}^\dagger \) is also important as it allows the guiding term to be more efficient in the noisy regions of the image, where there is more freedom to deviate from the deterministic path. The guiding term can be interpreted as originating from a time-rescaled gradient flow, and with the guiding term added, the diffusion process can be seen as a stochastically perturbed gradient flow, see [3].
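The construction of the guiding term (3.14) from the diffusion field (3.13) can be sketched as follows. The example assumes Gaussian noise fields with amplitudes `lam` and centres `delta`, stacks the landmark coordinates in row-major order, and uses `numpy.linalg.pinv` for the Moore–Penrose pseudo-inverse. The function names are hypothetical and the code is only meant to make the matrix shapes explicit.

```python
import numpy as np

def diffusion_field(q, p, lam, delta, r=1.0):
    """Blocks Sigma_q and Sigma_p of (3.13), each of shape (dN, J).

    q, p: (N, d) landmark positions and momenta.
    lam, delta: (J, d) amplitudes and centres of the assumed Gaussian noise fields.
    """
    N, d = q.shape
    dq = q[:, None, :] - delta[None, :, :]                     # (N, J, d)
    k = np.exp(-(dq**2).sum(-1) / (2 * r**2))                  # (N, J)
    sig_q = k[:, None, :] * lam.T[None, :, :]                  # sigma_l^a(q_i), (N, d, J)
    grad_k = -dq * k[..., None] / r**2                         # (N, J, d)
    p_lam = p @ lam.T                                          # p_i . lam_l, (N, J)
    sig_p = -(p_lam[..., None] * grad_k).transpose(0, 2, 1)    # -grad_q(p_i . sigma_l)
    return sig_q.reshape(N * d, -1), sig_p.reshape(N * d, -1)

def guiding_term(q, p, v, t, T, lam, delta, r=1.0):
    """Guiding drift (3.14) as a vector of length 2dN (positions first, then momenta)."""
    sig_q, sig_p = diffusion_field(q, p, lam, delta, r)
    sigma = np.vstack([sig_q, sig_p])                          # full field, (2dN, J)
    return -sigma @ np.linalg.pinv(sig_q) @ (q - v).ravel() / (T - t)
```

When \(\Sigma _{{\mathbf {q}}}\) has full row rank, the position block of the returned vector reduces to \(-({\mathbf {q}}-{\mathbf {v}})/(T-t)\), as in (3.11), while the momentum block is generally nonzero because the same noise increments drive both equations.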

The guiding term (3.14) is, in practice, not always appropriate for landmarks. Because the correction is dependent only on the difference to the target in the position equation, a phenomenon of over-shooting is often observed. In such cases, the landmarks travel too fast initially due to a large momentum, strengthened by the guiding term that forces the landmarks towards \({\mathbf {v}}\). The high initial speed is only corrected when the time approaches T and the guiding term brings the landmarks back to their final positions. This effect is illustrated in Fig. 4 in Sect. 5.2 and results in low values of the correction factor \(\varphi (\mathbf q,{\mathbf {p}})\) used to compute the expectation in (3.12). This effect tends to produce inefficient samples when approximating (3.12) by Monte Carlo sampling. As an alternative, upon letting \(b({\mathbf {q}},{\mathbf {p}})\) denote the drift term of (3.6), we employ a guided diffusion process of the form

$$\begin{aligned} \begin{pmatrix} \mathrm{d}\hat{{\mathbf {q}}} \\ \mathrm{d}\hat{{\mathbf {p}}} \end{pmatrix} = b(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})\mathrm{d}t - \frac{\Sigma (\hat{\mathbf q},\hat{{\mathbf {p}}})\Sigma _{{\mathbf {q}}}(\hat{\mathbf q})^\dagger (\phi _{t,T}(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})-{\mathbf {v}})}{T-t}\mathrm{d}t + \Sigma (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}) \circ \mathrm{d}W, \end{aligned}$$
(3.15)

for some appropriately chosen function \(\phi _{t,T}:\mathbb R^{2dN}\rightarrow {\mathbb {R}}^{dN}\) that gives an estimate of the value of \(\hat{{\mathbf {q}}}_T\) using the value of the modified stochastic process \((\hat{{\mathbf {q}}}_t,\hat{{\mathbf {p}}}_t)\) at time t. The hat denotes the solution of the process (3.15), which is different from the original dynamics of the process (3.6) written without the hats. The choice \(\phi _{t,T}(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}):=\hat{{\mathbf {q}}}\) recovers the guiding term (3.14). It would be natural to define \(\phi _{t,T}(\hat{ {\mathbf {q}}},\hat{\mathbf p}):={\mathbb {E}}_{( {\mathbf {q}},{\mathbf {p}})}({\mathbf {q}}_T| ({\mathbf {q}}_t, {\mathbf {p}}_t)=(\hat{ {\mathbf {q}}},\hat{ {\mathbf {p}}}))\). The resulting guiding term will only be driven by the expected amount needed at the endpoint, not from the value at time t. A similar choice that is easier to handle is to let \(\phi _{t,T}(\hat{{\mathbf {q}}},\hat{\mathbf p})\) be the solution at time T of the original deterministic landmark dynamics (2.3), obtained from the initial conditions \((\hat{{\mathbf {q}}}_t,\hat{{\mathbf {p}}}_t)=(\hat{{\mathbf {q}}},\hat{\mathbf p})\). We will use this latter choice with a modification to ensure its time derivative is bounded. The approach is visualized in Fig. 4. To ensure convergence of \(\hat{{\mathbf {q}}}_t\rightarrow {\mathbf {v}}\) for \(t\rightarrow T\), a bounded approximation \({\tilde{b}}\) will be chosen to replace the original unbounded drift b in (3.15). As it turns out, this choice has little influence in practice.
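The deterministic choice of \(\phi _{t,T}\) can be sketched as a forward integration of the landmark dynamics from the current state at time t up to time T. The example below assumes a Gaussian LDDMM kernel and a classical fourth-order Runge–Kutta integrator; it does not include the modification that bounds the time derivative, and the function names are hypothetical.

```python
import numpy as np

def landmark_rhs(q, p, r=1.0):
    """Right-hand side of the deterministic landmark dynamics (drift of (3.7))."""
    diff = q[:, None, :] - q[None, :, :]                  # q_i - q_j, (N, N, d)
    K = np.exp(-(diff**2).sum(-1) / (2 * r**2))           # assumed Gaussian kernel
    gradK = -diff * K[..., None] / r**2                   # dK(q_i - q_j)/dq_i
    dq = K @ p
    dp = -((p @ p.T)[..., None] * gradK).sum(axis=1)
    return dq, dp

def predict_endpoint(q, p, t, T, steps=50, r=1.0):
    """Estimate of q at time T from the state (q, p) at time t, using RK4."""
    h = (T - t) / steps
    for _ in range(steps):
        k1q, k1p = landmark_rhs(q, p, r)
        k2q, k2p = landmark_rhs(q + 0.5 * h * k1q, p + 0.5 * h * k1p, r)
        k3q, k3p = landmark_rhs(q + 0.5 * h * k2q, p + 0.5 * h * k2p, r)
        k4q, k4p = landmark_rhs(q + h * k3q, p + h * k3p, r)
        q = q + h / 6.0 * (k1q + 2 * k2q + 2 * k3q + k4q)
        p = p + h / 6.0 * (k1p + 2 * k2p + 2 * k3p + k4p)
    return q
```

For a single landmark the momentum is constant and the prediction is the straight line \({\mathbf {q}}+(T-t){\mathbf {p}}\), which makes explicit that the guiding term then acts on the predicted miss distance at time T rather than on the current distance to the target.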

The matrix \(\Sigma (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})\Sigma _{\mathbf q}(\hat{{\mathbf {q}}})^\dagger \) in (3.15) only accounts for the \({\mathbf {q}}\) dynamics in the pseudo-inverse \(\Sigma _{\mathbf q}(\hat{{\mathbf {q}}})^\dagger \). When the momentum is high and the noise fields \(\sigma _j\) have high gradients, this fact can again lead to improbable sample paths. In such cases, the scheme can be further generalized by using a guiding term of the form

$$\begin{aligned} \frac{1}{T-t} \Sigma (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}) \Big ( D_h\big ( \phi _{t,T}(\Sigma (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}){\mathbf {h}}) \big )|_{{\mathbf {h}}={\mathbf {0}}} \Big )^{\dagger }(\phi _{t,T}(\hat{\mathbf q},\hat{{\mathbf {p}}})-{\mathbf {v}}) \, . \end{aligned}$$
(3.16)

The matrix \( D_h\big ( \phi _{t,T}(\Sigma (\hat{{\mathbf {q}}},\hat{\mathbf p}){\mathbf {h}}) \big )|_{{\mathbf {h}}={\mathbf {0}}} \) is a linear approximation of the expected endpoint dynamics as a function of the noise vector \({\mathbf {h}}\in {\mathbb {R}}^J\). Again, with \(\phi _{t,T}(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}):=\hat{{\mathbf {q}}}\), the original guiding term (3.14) is recovered, and the term is close to the guiding term of (3.15) when the momentum or gradients of \(\sigma _j\) are low. We use this term for the experiments in Sect. 5.2 involving high momentum dynamics, e.g. Fig. 6.

The following result is an extension of [15, Theorem 5] and [42, Theorem 3] to the modified guided SDE (3.15). It is the basis for the EM approach for estimating the parameters of the landmark processes developed in Sect. 4.3. Note that the Girsanov theorem [15, Thm. 1], which relates the modified and original processes, does not assume that \(\Sigma \) is invertible. The main analytic consequence of the noninvertibility is that the process is semi-elliptic and the transition density, therefore, cannot be bounded by Aronson’s estimate [4]. Instead, we here assume continuity and boundedness of the density of \({\mathbf {q}}\) in small intervals of (0, T] in the sense of the assumption below. We write \({\mathbb {P}}({\mathbf {q}}_0, {\mathbf {p}}_0; {\mathbf {q}}, {\mathbf {p}}, t)\) for the transition density at time t of a solution \(({\mathbf {q}},{\mathbf {p}})\) to (3.6) started at \(({\mathbf {q}}_0, {\mathbf {p}}_0)\). Similarly, when conditioning only on \({\mathbf {q}}\), we write \({\mathbb {P}}({\mathbf {q}}_0, {\mathbf {p}}_0;\mathbf q,t) = \int _{{\mathbb {R}}^{dN}} {\mathbb {P}}({\mathbf {q}}_0, {\mathbf {p}}_0; {\mathbf {q}},{\mathbf {p}},t)\mathrm{d}{\mathbf {p}}\).

Assumption 1

For any \(({\mathbf {q}}_0,{\mathbf {p}}_0)\) and \(({\mathbf {q}},{\mathbf {p}})\), the process \(({\mathbf {q}}_t,{\mathbf {p}}_t)\) has a density \({\mathbb {P}}({\mathbf {q}}_0, {\mathbf {p}}_0; {\mathbf {q}},\mathbf p,t)\) and the map \(({\mathbf {q}},t)\mapsto \int _{\mathbb R^{dN}}g_0({\mathbf {q}}_0,{\mathbf {p}}_0){\mathbb {P}}({\mathbf {q}}_0,\mathbf p_0; {\mathbf {q}},t)\mathrm{d}({\mathbf {q}}_0,{\mathbf {p}}_0)\) is continuous in t and \({\mathbf {q}}\) and bounded on sets \(\{({\mathbf {q}},t)|s-\epsilon \le t\le s\}\) for \(s\in (0,T]\), sufficiently small \(\epsilon >0\), and any integrable function \(g_0\).

The interpretation of Assumption 1 is that, given any distribution of initial conditions \(({\mathbf {q}}_0,\mathbf p_0)\) with density \(g_0\), the resulting \({\mathbf {q}}\)-transition density of the process is continuous and bounded in \({\mathbf {q}}\) and t. As shown in Lemma A.2, Assumption 1 can be slightly weakened if Theorem 3.4 is only used to approximate the transition density at time T as opposed to expectations \(\mathbb E[f({\mathbf {q}},{\mathbf {p}})|{\mathbf {q}}_T={\mathbf {v}}]\) for arbitrary measurable functions f.

We let \(W({\mathbb {R}}^{2dN})\) denote the Wiener space of continuous paths \([0,T]\rightarrow {\mathbb {R}}^{2dN}\).

Theorem 3.4

Assume \(\Sigma _{{\mathbf {q}}}({\mathbf {q}}):{\mathbb {R}}^J\rightarrow \mathbb R^{dN}\) is surjective for all \({\mathbf {q}}\) with \(\Sigma _{\mathbf q}({\mathbf {q}})^\dagger \) bounded, and that \(\Sigma \) is \(C^{1,2}\), bounded, and with bounded derivatives. Let \({\tilde{b}}_{\mathbf {q}}\) be a bounded approximation of the \({\mathbf {q}}\)-part of the drift b, and set \({\tilde{b}}=b+\Sigma ({\mathbf {q}},{\mathbf {p}})\Sigma _{\mathbf {q}}({\mathbf {q}})^\dagger ({\tilde{b}}_{\mathbf {q}}-b_{\mathbf {q}})\). Let \({\mathbf {v}}\in {\mathbb {R}}^{dN}\) be a point with \({\mathbb {P}}({\mathbf {q}}_0,\mathbf p_0; {\mathbf {v}},t)\) positive, and let \(P_{({\mathbf {q}},\mathbf p)|{\mathbf {v}}}\) be the law of \(({\mathbf {q}},{\mathbf {p}})\,|\,\mathbf q_T={\mathbf {v}}\). Let \((\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})\) be a solution to (3.15) with \((\hat{{\mathbf {q}}}_0,\hat{\mathbf p}_0)=({\mathbf {q}}_0,{\mathbf {p}}_0)\), where the endpoint map of (3.15), written here as \(\varphi _{t,T}:\mathbb R^{2dN}\rightarrow {\mathbb {R}}^{dN}\), is such that \(\frac{\varphi _{t,T}({\mathbf {q}},{\mathbf {p}})-{\mathbf {q}}}{T-t}\) is bounded on [0, T). Then, for positive measurable \(f:W(\mathbb R^{2dN})\rightarrow {\mathbb {R}}\),

$$\begin{aligned} {\mathbb {E}}_{({\mathbf {q}},{\mathbf {p}})|{\mathbf {v}}}[f({\mathbf {q}},{\mathbf {p}})] = \lim _{t\rightarrow T} \frac{ {\mathbb {E}}_{(\hat{\mathbf q},\hat{{\mathbf {p}}})}\left[ f(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}) \varphi (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}},t) \right] }{\mathbb E_{(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})}[\varphi (\hat{\mathbf q},\hat{{\mathbf {p}}},t)]}, \end{aligned}$$
(3.17)

with

$$\begin{aligned}&\log \varphi ({{\mathbf {q}}},{{\mathbf {p}}},t) = -\int _0^t\frac{({\mathbf q}-{\mathbf {v}})^TA({{\mathbf {q}}}){\tilde{b}}({{\mathbf {q}}},{{\mathbf {p}}})\mathrm{d}s }{T-s} \\&\qquad -\int _0^t\frac{ ({{\mathbf {q}}}-{\mathbf {v}})^T\big (dA({\mathbf q})\big )({{\mathbf {q}}}-{\mathbf {v}}) }{2(T-s)} -\sum _{i,j}\int _0^t\frac{d[A_{ij}({{\mathbf {q}}}),({{\mathbf {q}}}-{\mathbf {v}})_i({{\mathbf {q}}}-{\mathbf {v}})_j]}{T-s} \\&\qquad +\int _0^t(b_{\mathbf {q}}({{\mathbf {q}}},{\mathbf p})-{\tilde{b}}_{\mathbf {q}}({{\mathbf {q}}},{{\mathbf {p}}}))^T\Sigma _{\mathbf {q}}({{\mathbf {q}}})^{\dagger ,T}dW\\&\qquad -\frac{1}{2} \int _0^t \Vert \Sigma _{\mathbf {q}}({{\mathbf {q}}})^{\dagger } (b_{\mathbf {q}}({{\mathbf {q}}},{\mathbf p})-{\tilde{b}}_{\mathbf {q}}({{\mathbf {q}}},{{\mathbf {p}}})) \Vert ^2 \mathrm{d}s \\ {}&\qquad +\int _0^t\frac{(\varphi _{t,T}({\mathbf {q}},{\mathbf {p}})-{\mathbf {q}})^T\Sigma _{\mathbf {q}}({\mathbf {q}})^{\dagger ,T}dW}{T-t} -\frac{1}{2}\int _0^t\left\| \frac{ \Sigma _{\mathbf {q}}({\mathbf {q}})^{\dagger } (\varphi _{t,T}({\mathbf {q}},{\mathbf {p}})-{\mathbf {q}})}{T-t}\right\| ^2\mathrm{d}s , \end{aligned}$$

where \(A({{\mathbf {q}}})=\big (\Sigma _{{\mathbf {q}}}({\mathbf q})\Sigma _{{\mathbf {q}}}({{\mathbf {q}}})^T\big )^{-1}\). In addition,

$$\begin{aligned} {\mathbb {P}}({\mathbf {q}}_0,{\mathbf {p}}_0; {\mathbf {q}},T) = \left( \frac{\left| A({\mathbf {v}})\right| }{2\pi T}\right) ^{\frac{d}{2}} e^{-\frac{\Vert \Sigma _{\mathbf {q}}({\mathbf {q}}_0)^\dagger ({\mathbf {q}}_0-{\mathbf {v}})\Vert ^2}{2T}} \lim _{t\rightarrow T} {\mathbb {E}}_{(\hat{\mathbf q},\hat{{\mathbf {p}}})} [\varphi (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}},t)] \ . \end{aligned}$$
(3.18)

In Theorem 3.4, \([\cdot ,\cdot ]\) denotes the quadratic variation of semimartingales. As mentioned above, a bounded approximation \({\tilde{b}}\) must be used to replace the original drift term b in (3.15). The last integrals in the expression for \(\log \varphi ({\mathbf {q}},{\mathbf {p}},t)\) result from this approximation and from the use of the map \(\varphi _{t,T}\).

The result is proved in “Appendix A”. If \(\Sigma \) had been invertible and if the guidance scheme (3.11) was used, the result of [15] would imply that the right-hand side limit of (3.17) would equal

$$\begin{aligned} \frac{{\mathbb {E}}_{(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})}\left[ f(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}) \varphi (\hat{\mathbf q},\hat{{\mathbf {p}}},T) \right] }{{\mathbb {E}}_{(\hat{\mathbf q},\hat{{\mathbf {p}}})}[\varphi (\hat{{\mathbf {q}}},\hat{{\mathbf {p}}},T)]}\, . \end{aligned}$$

Extending the convergence argument to the present noninvertible case is nontrivial, and we postpone investigating this to future work. For numerical computations, \(\varphi (\hat{\mathbf q},\hat{{\mathbf {p}}},t)\) can be approximated by finite differences. As described later in the paper, we do this using a framework that allows symbolic evaluation of gradients and thus subsequent optimization for parameters of the processes.

4 Estimating the Spatial Correlation of the Noise

We now assume a set of n observed landmark configurations \(\mathbf q^1,\ldots ,{\mathbf {q}}^n\) at time T, i.e. the observations are considered realizations of the stochastic process at some positive time T. From these data, we aim at inferring parameters of the model. These can be parameters of the noise fields \(\sigma _l\) as well as parameters of the initial configuration \(({\mathbf {q}}(0),\mathbf p(0))\). The initial configuration can be deterministic with fixed known or unknown parameters, or it can be randomly chosen from a distribution with known or unknown parameters. We develop two different strategies for performing the inference. The first inference method, in Sect. 4.2, is a shooting method based on solving the evolution of the first moments of the probability distribution of the landmark positions, while the second method, in Sect. 4.3, is based on the expectation-maximization (EM) algorithm. The discussion is here in the context of landmarks, although these ideas may also apply in the more general context of Sect. 2.

4.1 The Noise Fields

We start by discussing the form of the unknown J noise fields \(\sigma _l\). To estimate them from a finite amount of observed data, we are forced to require the fields to be specified by a finite number of parameters. A possible choice for a family of noise fields is to select J linearly independent elements \(\sigma _1,\ldots ,\sigma _J\) from a dense subset of \(C^1(\mathbb R^d,{\mathbb {R}}^d)\). We here use a kernel k with length scale \(r_l\) and a noise amplitude \(\lambda _l\in {\mathbb {R}}^d\), that is

$$\begin{aligned} \sigma _l^\alpha ({\mathbf {q}}_i) = \lambda _l^\alpha k_{r_l}(\Vert \mathbf q_i-\delta _l\Vert )\,, \end{aligned}$$
(4.1)

where \(\delta _l\) denotes the kernel positions. Possible choices for the kernel include Gaussians \(k_{r_l}(x)=e^{-x^2/(2r_l^2)}\), or cubic B-splines \(k_{r_l}(x)=S_3(x/r_l)\). The Gaussian kernel has the advantage of simplifying calculations of the moment equations, whereas the B-spline representation is compactly supported and gives a partition of unity when used in a regular grid. Other interesting choices may include a cosine or a polynomial basis of the image domain.
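The two kernel choices in (4.1) can be sketched as follows. The Gaussian is as stated above; for \(S_3\) we assume the standard cardinal cubic B-spline supported on \([-2,2]\), which is one common convention and may differ from the normalization used in the experiments. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, r):
    """Gaussian profile k_r(x) = exp(-x^2 / (2 r^2))."""
    return np.exp(-np.asarray(x, dtype=float)**2 / (2 * r**2))

def cubic_bspline(x):
    """Cardinal cubic B-spline S_3 (assumed convention), supported on [-2, 2]."""
    a = np.abs(np.asarray(x, dtype=float))
    inner = 2.0 / 3.0 - a**2 + 0.5 * a**3        # |x| < 1
    outer = (2.0 - a)**3 / 6.0                   # 1 <= |x| < 2
    return np.where(a < 1, inner, np.where(a < 2, outer, 0.0))

def noise_fields(q, lam, delta, r, kernel="gaussian"):
    """Evaluate sigma_l(q_i) of (4.1) for all landmarks and fields.

    q: (N, d) positions, lam: (J, d) amplitudes, delta: (J, d) kernel positions.
    Returns an array of shape (N, J, d).
    """
    dist = np.linalg.norm(q[:, None, :] - delta[None, :, :], axis=-1)   # (N, J)
    if kernel == "gaussian":
        k = gaussian_kernel(dist, r)
    else:
        k = cubic_bspline(dist / r)
    return k[..., None] * lam[None, :, :]
```

In one dimension, the integer translates of this B-spline sum to one, which is the partition of unity property referred to above for a regular grid with spacing equal to the length scale.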

In principle, the methods below allow all parameters of the noise fields to be estimated given a sufficient amount of data. However, for simplicity, we will fix the length scale and the position of the kernels. The unknown parameters for the noise can then be specified in a single vector variable \(\theta =(\lambda _1,\ldots ,\lambda _J)\). The aim of the next sections will be to estimate this vector, possibly in addition to the initial configuration \((\mathbf q(0),{\mathbf {p}}(0))\), from data using the method of moments in Sect. 4.2 and EM in Sect. 4.3, respectively.

Remark 4.1

For the bridge simulation scheme, we required \(\Sigma _{\mathbf q}({\mathbf {q}})\) to be surjective as a linear map \(\mathbb R^J\rightarrow {\mathbb {R}}^{dN}\). This assumption can be satisfied when the number of landmarks is low relative to the number of noise fields having nonzero support in the area where the landmarks reside. On the other hand, if the number of landmarks is increased while the number of noise fields is fixed, the assumption eventually cannot be satisfied. Intuitively, in such cases, the extra drift added to the bridge SDE must guide through a nonlinear submanifold of the phase space to ensure the landmarks will hit the target point \({\mathbf {v}}\) exactly. This limitation can be handled in three ways: (1) The method of moments as described below avoids matching individual point configurations, and it can, therefore, be used in situations where the surjectivity condition is not satisfied. (2) As discussed in Remark 2.3, the noise can be made infinite dimensional. This can be done while keeping a correlation structure similar to the case with finite J. See also [3] for a discussion of noise in the form of a Gaussian process. (3) The bridge matching can be made inexact, mimicking the inexact matching pursued in deterministic LDDMM. This could potentially relax the requirements on the extra drift term to only ensure convergence to within a given distance of the target. Inexact observations of stochastic processes are for example treated in [63].

4.2 Method of Moments

We describe here our first method for estimating the parameters \(\theta \) by solving a shooting problem on the space of first and second-order moments. Given an estimate of the endpoint distributions \({\mathbb {P}}({\mathbf {q}},{\mathbf {p}}, T)\), we will solve the inverse problem which consists in using the Fokker–Planck equation (3.9) to find the values of \(\theta \) such that we can reproduce the observed final distribution. Solving the Fokker–Planck equation directly is infeasible due to its high dimensionality. Instead, we will derive a set of finite-dimensional equations approximating the solution of the Fokker–Planck equation (3.9) for the probability distribution \({\mathbb {P}}\) in terms of its first moments. This approach has been developed in the field of plasma physics for the Liouville equation, which is similar to the Fokker–Planck equation (3.9).

Remark 4.2

(Geometric moment equation) As the Fokker–Planck equation (3.9) is written in terms of the canonical bracket, we could expect to be able to apply a geometrical version of the method of moments such as the one developed by [28]. Although this method seems to fit the present geometric derivation of the stochastic equations, we will not use it as it is not practically useful in our case. Indeed, it requires the expansion of the Hamiltonian functions in terms of the moments, but we cannot obtain here a valid expansion with a finite number of terms. This is due to the fact that the LDDMM kernel and the noise kernels cannot generally be globally approximated by finite polynomials with bounded approximation error for large distances. This would, in turn, produce spurious strong interactions between distant landmarks.

The method for approximating the Fokker–Planck equation that we will use here is the following. We first define the moments

$$\begin{aligned} \langle q_i^\alpha \rangle&:= \int q_i^\alpha {\mathbb {P}}_\theta ({\mathbf {q}},{\mathbf {p}},t)\, \mathrm{d}{\mathbf {q}} \mathrm{d}{\mathbf {p}} \end{aligned}$$
(4.2)
$$\begin{aligned} \langle q_i^\alpha p_j^\beta \rangle&:= \int q_i^\alpha p_j^ \beta {\mathbb {P}}_\theta ({\mathbf {q}},{\mathbf {p}},t)\, \mathrm{d}{\mathbf {q}} \mathrm{d}{\mathbf {p}}\,, \end{aligned}$$
(4.3)

where we have written only two possible moments, although any combination of p and q at any order is possible. In this work, we will only consider moments up to the second order, that is, the moments \({\langle q_i^\alpha \rangle ,\langle p_i^\alpha \rangle ,\langle q_i^\alpha q_j^\beta \rangle ,\langle q_i^\alpha p_j^\beta \rangle }\) and \({\langle p_i^\alpha p_j^\beta \rangle }\). Notice that the first moments are (1, 1)-tensors, and the second moments are (2, 2)-tensors, although we will only use index notation here.
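In practice, target moments of this kind are estimated from samples. The following minimal Python sketch (NumPy; the function name sample_moments and the array layout are illustrative assumptions, not taken from the authors' code) estimates the first and second moments from realizations of \(({\mathbf {q}},{\mathbf {p}})\). Subtracting the product of the means from the second moment of the positions gives their sample covariance.

```python
import numpy as np

def sample_moments(q, p):
    """Estimate moments up to second order from samples.

    q, p: arrays of shape (n_samples, N, d) holding positions and momenta
    of N landmarks in d dimensions (hypothetical layout).
    """
    n = q.shape[0]
    mean_q = q.mean(axis=0)                        # <q_i^alpha>, shape (N, d)
    mean_p = p.mean(axis=0)                        # <p_i^alpha>, shape (N, d)
    qq = np.einsum("sia,sjb->iajb", q, q) / n      # <q_i^alpha q_j^beta>
    qp = np.einsum("sia,sjb->iajb", q, p) / n      # <q_i^alpha p_j^beta>
    pp = np.einsum("sia,sjb->iajb", p, p) / n      # <p_i^alpha p_j^beta>
    # Second moment of the positions minus the product of the means.
    cov_qq = qq - np.einsum("ia,jb->iajb", mean_q, mean_q)
    return mean_q, mean_p, qq, qp, pp, cov_qq

# Synthetic samples, used only to exercise the function.
rng = np.random.default_rng(0)
q = rng.normal(size=(2000, 3, 2))
p = rng.normal(size=(2000, 3, 2))
mean_q, mean_p, qq, qp, pp, cov_qq = sample_moments(q, p)
```

The block cov_qq[i, :, i, :] is the \(d\times d\) covariance of landmark i, while blocks with \(i\ne j\) hold the correlations between two landmarks.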

We illustrate this method with the first moment \({\langle q_i^\alpha \rangle }\), which represents the mean position of the landmarks. We compute its time derivative and use the property of the Kolmogorov operator \({\mathscr {L}}\) defined in (3.9) to obtain

$$\begin{aligned} {\frac{\mathrm{d}}{\mathrm{d}t} \langle q_i^\alpha \rangle = \int q_i^\alpha \mathscr {L}^*{\mathbb {P}}_\theta \, \mathrm{d}{\mathbf {q}} \mathrm{d}{\mathbf {p}} = \int \mathscr {L}q_i^\alpha {\mathbb {P}}_\theta \, \mathrm{d}{\mathbf {q}} \mathrm{d}\mathbf p=\left\langle \mathscr {L}q_i^\alpha \right\rangle \, .} \end{aligned}$$
(4.4)

We thus first need to apply the Kolmogorov operator \({\mathscr {L}} \) to \(q_i^\alpha \) to obtain

$$\begin{aligned} \begin{aligned} \mathscr {L}q_i^\alpha&= -\{h,q_i^\alpha \}_\mathrm {can} + \frac{1}{2}\sum _l \{\phi _l,\{\phi _l,q_i^\alpha \}_\mathrm {can}\}_\mathrm {can}\\&= \frac{\partial h}{\partial p_i^\alpha } +\frac{1}{2}\frac{\partial \sigma _l^\alpha ({\mathbf {q}}_i) }{\partial q_i^\beta } \sigma _l^\beta ({\mathbf {q}}_i) , \end{aligned} \end{aligned}$$
(4.5)

which corresponds to the q part of the drift of the stochastic process with Itô correction. Similarly, for the momentum evolution, we obtain

$$\begin{aligned} \begin{aligned} {\mathscr {L}} p_i^\alpha&= -\{h,p_i^\alpha \}_\mathrm {can} + \frac{1}{2}\sum _l \{\phi _l,\{\phi _l,p_i^\alpha \}_\mathrm {can}\}_\mathrm {can}\\&= - \frac{\partial h}{\partial q_i^\alpha } +\frac{1}{2} p_i^\gamma \frac{\partial \sigma _l^\gamma ({\mathbf {q}}_i)}{\partial q_i^\beta }\frac{\partial \sigma _l^\beta ({\mathbf {q}}_i)}{\partial q_i^\alpha } - \frac{1}{2} p_i^\beta \frac{\partial ^2\sigma _l^\beta ({\mathbf {q}}_i) }{\partial q_i^\alpha \partial q_i^\gamma } \sigma _l^\gamma ({\mathbf {q}}_i)\, . \end{aligned} \end{aligned}$$
(4.6)
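As a check on the signs, the double brackets can be worked out by hand in one dimension for a single landmark and a single noise field. The computation below assumes the convention \(\{f,g\}_\mathrm {can} = \partial _q f\,\partial _p g - \partial _p f\,\partial _q g\) and the noise Hamiltonian \(\phi = p\,\sigma (q)\); both are assumptions made here for illustration, chosen because they reproduce the signs of (4.5) and (4.6).

$$\begin{aligned} \{\phi ,q\}_\mathrm {can}&= -\sigma , \qquad \frac{1}{2}\{\phi ,\{\phi ,q\}_\mathrm {can}\}_\mathrm {can} = \frac{1}{2}\sigma '\sigma , \\ \{\phi ,p\}_\mathrm {can}&= p\,\sigma ' , \qquad \frac{1}{2}\{\phi ,\{\phi ,p\}_\mathrm {can}\}_\mathrm {can} = \frac{1}{2}p\,(\sigma ')^2 - \frac{1}{2}p\,\sigma ''\sigma \, . \end{aligned}$$

These agree with the noise terms of (4.5) and (4.6) once the indices are dropped. With the same convention, \(-\{h,q\}_\mathrm {can} = \partial h/\partial p\) and \(-\{h,p\}_\mathrm {can} = -\partial h/\partial q\).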

Most of the terms on the right-hand sides of (4.5) and (4.6) are nonlinear, so their expected values cannot be written in terms of the first moments alone. This is the usual closure problem of moment equations, as in the BBGKY hierarchy arising in many-body problems in quantum mechanics. The standard remedy is to truncate the hierarchy of moments at a given order and to treat the resulting system of ODEs as an approximation of the complete Fokker–Planck solution. Here, we will apply the so-called cluster expansion method described in [38]. We refer to “Appendix B.1” for more details about this method.

Apart from the first approximation \({\langle q_i^\alpha q_j^\beta \rangle \rightarrow \langle q_i^\alpha \rangle \langle q_j^\beta \rangle }\), the next order of approximation is to keep track of the correlations

$$\begin{aligned} {\Delta _2 \langle q_i^\alpha q_j^\beta \rangle := \langle q_i^\alpha q_j^\beta \rangle -\langle q_i^\alpha \rangle \langle q_j^\beta \rangle \,. }\end{aligned}$$
(4.7)

This quantity is also called a centred second moment as for \(i=j\) it corresponds to the covariance of the probability distribution for the landmark i. In general, it corresponds to the correlation between the positions of two landmarks. The dynamical equation for this correlation is found from the equation of the second moment, which gives

$$\begin{aligned} \frac{\partial }{\partial t} \Delta _2\langle q_i^\alpha q_j^\beta \rangle&= \frac{\partial }{\partial t} \langle q_i^\alpha q_j^\beta \rangle - \langle q_i^\alpha \rangle \frac{\partial }{\partial t} \langle q_j^\beta \rangle - \langle q_j^\beta \rangle \frac{\partial }{\partial t} \langle q_i^\alpha \rangle \\&= \sum _l \left\langle \sigma _l^\alpha ({\mathbf {q}}_i) \sigma _l^\beta ({\mathbf {q}}_j) \right\rangle + \left\langle q_i^\alpha \frac{\partial h}{\partial p_j^\beta } \right\rangle - \left\langle q_i^\alpha \right\rangle \left\langle \frac{\partial h}{\partial p_j^\beta } \right\rangle \\&\quad +\frac{1}{2} \sum _l\left( \left\langle q_i^\alpha \sigma _l^\gamma ({\mathbf {q}}_j) \frac{\partial \sigma _l^\beta (\mathbf q_j)}{\partial q_j^\gamma } \right\rangle - \left\langle q_i^\alpha \right\rangle \left\langle \sigma _l^\gamma ({\mathbf {q}}_j) \frac{\partial \sigma _l^\beta ({\mathbf {q}}_j)}{\partial q_j^\gamma } \right\rangle \right) \\&\quad + (i\leftrightarrow j) , \end{aligned}$$

where \((i\leftrightarrow j) \) stands for the same term but with i and j exchanged. This equation is interesting to study in more detail, as it already gives us information about the nature of the dynamics for the spatial covariance of landmarks. Indeed, we have three types of terms with the following effects.

  1. (1)

    The \(\sigma _l\)-dependent terms. This first term is quadratic in the \(\sigma \)’s, not proportional to any linear or quadratic polynomial in q or p. This term is a direct contribution from the noise in the q equation and will have the effect of almost linearly increasing the centred covariance, wherever a \(\sigma _l>0\).

  2. (2)

    The h-dependent terms. From the form of this term, we expect it to be proportional to a correlation. It will thus have an exponential effect on the dynamics, triggered by the linear contribution of the first term. Notice that this term only depends on the Hamiltonian, and, thus, on the interaction between landmarks. If two landmarks interact, we expect their covariance to be averaged. This term will capture their averaged covariance.

  3. (3)

    The \(\nabla _q\sigma _l\)-dependent terms. These terms are related to the noise in the p equation and will account for the effect on the landmark position of the interaction of the momentum of the landmark with the gradients of the noise.

Notice that the last two types of terms describe second order effects with respect to the spatial covariance of the landmarks, as they depend linearly on the correlations. In the expansion of these nonlinear terms, the other correlations involving p will appear. This means that all of the possible second-order correlations must be computed. This computation is done in “Appendix B”, where we also approximate the expected value of the kernels as \({\langle K({\mathbf {q}}) \rangle \approx K(\langle {\mathbf {q}} \rangle )}\). As we will see in the numerical examples in Sect. 5, these approximations can give a reliable estimate of the landmark covariance, but this should be rigorously justified to obtain a precise estimate of the errors. Such a study is beyond the scope of this work and is left open.
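The size of the error made by the approximation \({\langle K({\mathbf {q}}) \rangle \approx K(\langle {\mathbf {q}} \rangle )}\) can be illustrated in a case where the expectation is known in closed form. The sketch below is a scalar toy example, not the setting of the paper: it assumes a one-dimensional Gaussian kernel \(K(x)=e^{-x^2/(2\alpha ^2)}\) and a Gaussian variable \(q\sim N(\mu ,s^2)\), for which \(\langle K(q)\rangle = \alpha (\alpha ^2+s^2)^{-1/2} e^{-\mu ^2/(2(\alpha ^2+s^2))}\). All names and values are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, alpha):
    """Scalar Gaussian kernel K(x) = exp(-x^2 / (2 alpha^2))."""
    return np.exp(-x**2 / (2.0 * alpha**2))

def expected_kernel(mu, s, alpha):
    """Exact <K(q)> for q ~ N(mu, s^2) and the Gaussian kernel above."""
    var = alpha**2 + s**2
    return alpha / np.sqrt(var) * np.exp(-mu**2 / (2.0 * var))

mu, alpha = 0.5, 1.0
# Gap between <K(q)> and K(<q>) for increasing standard deviation s.
closure_gap = {
    s: abs(expected_kernel(mu, s, alpha) - gaussian_kernel(mu, alpha))
    for s in (0.01, 0.1, 0.5)
}
```

In this toy case the gap shrinks with the variance of q relative to the squared kernel width, which is consistent with the approximation being most accurate for small noise.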

Given the equations for the moment evolution, we can estimate the parameters \(\theta \) by minimizing the cost function

$$\begin{aligned} {C(\langle {\mathbf {p}} \rangle (0), \lambda _l) = \frac{1}{\gamma _1} \left\| \langle {\mathbf {q}} \rangle - \langle {\mathbf {q}} \rangle (T)\right\| ^2 + \frac{1}{\gamma _2}\left\| \Delta _2 \langle \mathbf {qq} \rangle - \Delta _2\langle \mathbf {qq} \rangle (T)\right\| ^2, }\end{aligned}$$
(4.8)

where \(\gamma _1\) and \(\gamma _2\) are weights. We denote by \({\langle {\mathbf {q}} \rangle }\) and \({\Delta _2\langle \mathbf {qq} \rangle }\) the target first and second moments and by \(\langle {\mathbf {q}} \rangle (T)\) and \(\Delta _2\langle \mathbf {qq} \rangle (T)\) the estimated moments, which implicitly depend on the parameters of the noise and the initial momentum. The choice of the norm is free here, and we chose a norm which only considers \(i=j\) and normalizes each term to 1 so that the covariances of all landmarks contribute equally to the cost. Other choices could be made, depending on applications. Also, the cost function may depend on other parameters, but this would make its minimization more difficult.
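A cost of the form (4.8) can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: it uses plain squared Euclidean norms, keeps only the \(i=j\) covariance blocks, and omits the per-term normalization mentioned above. The function name moment_cost and the array layout are hypothetical.

```python
import numpy as np

def moment_cost(mean_target, cov_target, mean_est, cov_est,
                gamma1=1.0, gamma2=1.0):
    """Cost in the spirit of (4.8) with plain squared Euclidean norms.

    mean_*: arrays of shape (N, d); cov_*: arrays of shape (N, d, d) holding
    only the i = j covariance blocks. The normalization of each term is
    omitted here (assumption made for brevity).
    """
    mean_term = np.sum((mean_target - mean_est) ** 2)
    cov_term = np.sum((cov_target - cov_est) ** 2)
    return mean_term / gamma1 + cov_term / gamma2

# Two landmarks in the plane with unit covariance as a dummy target.
mean_target = np.zeros((2, 2))
cov_target = np.stack([np.eye(2), np.eye(2)])
cost_at_target = moment_cost(mean_target, cov_target, mean_target, cov_target)
cost_shifted = moment_cost(mean_target, cov_target, mean_target + 1.0,
                           cov_target, gamma1=2.0)
```

In an actual estimation, mean_est and cov_est would be obtained by integrating the moment equations to time T for given initial momentum and noise parameters.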

To minimize the cost (4.8), we can use gradient-based methods such as the BFGS algorithm. Such methods require the evaluation of the Jacobian of C with respect to all of its arguments. Usually, for the estimation of the initial momentum, a linear adjoint equation is used. However, the derivative with respect to the parameters of the noise cannot be computed in this way. We will evaluate the gradients symbolically by using the Theano library in Python [59]. To improve the efficiency of the algorithm, we first match the mean final position by updating only the initial momentum. Then, with this initial condition, we match both the first and second moments and update the initial momentum as well as the parameters \(\lambda _l\). As we will see in the numerical experiments in Sect. 5, gradient-based methods are not optimal, and genetic algorithms, such as the differential evolution algorithm of [57] designed for global minimization, turn out to perform better.
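The difference between a local gradient-based search and a population-based global search can be seen on a toy problem. The sketch below uses SciPy's minimize with the BFGS method and its differential_evolution routine on a Rastrigin-type function with many local minima. This function is only a stand-in for C in (4.8), and the use of SciPy with finite-difference gradients is an assumption of this example, not the setup of the paper.

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def toy_cost(x):
    """Rastrigin-type function: many local minima, global minimum 0 at x = 0."""
    x = np.asarray(x)
    return float(10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x)))

bounds = [(-5.12, 5.12)] * 2

# Local, gradient-based search from a fixed starting point.
bfgs = minimize(toy_cost, x0=np.array([3.0, 3.0]), method="BFGS")

# Population-based global search; the seed is fixed for reproducibility.
de = differential_evolution(toy_cost, bounds, seed=0)
```

The BFGS result depends on the starting point, whereas differential evolution explores the whole bounded domain at the price of many more cost evaluations.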

4.3 Maximum Likelihood and Expectation-Maximization

We now describe how to estimate the unknown parameters collected in the variable \(\theta \) by a maximum likelihood estimation based on the expectation-maximization (EM) algorithm of [16]. The likelihood of n independent observations \(({\mathbf {q}}^1,\ldots ,{\mathbf {q}}^{n})\) at time T given parameters \(\theta \) takes the form

$$\begin{aligned} {\mathcal {L}}({\mathbf {q}}^1,\ldots ,{\mathbf {q}}^{n},\theta ) = \prod _{i=1}^{n} {\mathbb {P}}_\theta ({\mathbf {q}}^i,T) = \prod _{i=1}^{n} \int _{{\mathbb {R}}^{dN}} {\mathbb {P}}_\theta ({\mathbf {q}}^i,{\mathbf {p}},T) \mathrm{d}{\mathbf {p}} \, . \end{aligned}$$
(4.9)

The parameters \(\theta \) can be estimated by maximizing the likelihood, that is by letting

$$\begin{aligned} {\hat{\theta }} \in \mathrm {argmax}_\theta \, {\mathcal {L}}(\theta ; {\mathbf {q}}^1,\ldots ,{\mathbf {q}}^{n}) \, . \end{aligned}$$

For this, the likelihood could be directly computed by numerical approximation of \({\mathbb {P}}_\theta ({\mathbf {q}}^i,T)\) using an approximation of the Fokker–Planck equation (3.9). Alternatively, the fact that the stochastic process is only sampled at time T suggests a missing data approach that marginalizes out the unobserved trajectories up to time T. Let \(({\mathbf {q}},\mathbf p; \theta )\) denote the stochastic landmark process with parameters \(\theta \), and let \(P({\mathbf {q}},{\mathbf {p}}; \theta )\) denote its law. Let \({\mathcal {L}}({\mathbf {q}},{\mathbf {p}}; \theta )\) denote the likelihood of the entire stochastic path for a given realization of the noise, and computed with respect to the parameter \(\theta \). Notice that this likelihood is only defined for finite time discretizations of the process and there is no notion of path density for the infinite-dimensional process. We thus proceed formally, while noting that the approach can be justified rigorously, see e.g. [17]. An alternative approach is to optimize the likelihood (4.9) directly using (3.18). This is pursued in [55].

The EM algorithm finds a sequence of parameter estimates \(\{\theta _k\}\) converging to a \({\hat{\theta }}\) by iterating over the following two steps:

  1. (1)

    Expectation: Compute the expected value of the log-likelihood given the previous parameter estimate \(\theta _{k-1}\):

    $$\begin{aligned} \begin{aligned} Q(\theta | \theta _{k-1})&:= {\mathbb {E}}_{\theta _{k-1}} [ \log {\mathcal {L}}({\mathbf {q}},{\mathbf {p}};\theta ) \,|\, {\mathbf {q}}^1,\ldots ,{\mathbf {q}}^{n} ] \\&= \sum _{i=1}^{n} {\mathbb {E}}_{\theta _{k-1}} [ \log {\mathcal {L}} ({\mathbf {q}},{\mathbf {p}}; \theta |{\mathbf {q}}^i) ] \, . \end{aligned} \end{aligned}$$
    (4.10)

    The expectation (4.10) over the process conditioned on the observations \({\mathbf {q}}^i\) integrates the likelihood over all sample paths reaching \({\mathbf {q}}^i\). For this, we employ the bridge simulation approach developed in Sect. 3.3. For each \({\mathbf {q}}^i\), we thus exchange \(({\mathbf {q}}_t,{\mathbf {p}}_t; \theta )\) with a guided process \((\hat{{\mathbf {q}}},\hat{{\mathbf {p}}}; \theta , {\mathbf {q}}^i)\) and use the equality (3.17) from Theorem 3.4. The expectation on the right-hand side of (3.17) can be approximated by drawing samples from the guided process. Note that the correction factor \(\varphi (\mathbf q,{\mathbf {p}}|\theta _{k-1}, {\mathbf {q}}^i)\) makes the approach equivalent to importance sampling of the conditioned process with the guided process as proposal distribution.

  2. (2)

    Maximization: Find the new parameter estimate

    $$\begin{aligned} \theta _k= \mathrm {argmax}_\theta \, Q(\theta |\theta _{k-1})\, . \end{aligned}$$
    (4.11)

    The maximization step can be approximated by updating \(\theta _k\) such that it increases \(Q(\theta | \theta _{k-1})\) instead of maximizing it. This is the approach of the generalized EM algorithm [48]. The update of \(\theta \) is thus computed by taking a gradient step

    $$\begin{aligned} \theta _k=\theta _{k-1}+\epsilon \nabla _\theta Q(\theta | \theta _{k-1}), \end{aligned}$$
    (4.12)

    where \(\epsilon >0\). The gradient which is evaluated for each of the sampled paths can be computed symbolically using the Theano library [59]. Theano allows the entire computational chain from the definition of the Hamiltonian and noise fields to the time-discrete stochastic integration to be specified symbolically. The framework can therefore automatically derive gradients symbolically before the expressions are compiled to efficient numerical code. See also [39] for more details on the use of Theano for differential geometric and stochastic computations.

The resulting estimation algorithm is listed in Algorithm 1. For each \({\mathbf {q}}^i\), the expectation \({\mathbb {E}}_{\theta _{k-1}}[\log {\mathbb {P}}_\theta ({\mathbf {q}},\mathbf p|{\mathbf {q}}^i)]\) is estimated by sampling \(N_{\mathrm {bridges}}\) bridges. The algorithm can perform a fixed number K of updates to the estimate \(\theta _k\) or stop at convergence.

figure a
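The structure of the generalized EM iteration (sample from a proposal forced towards each observation, reweight, take a gradient step as in (4.12)) can be sketched on a toy model. The example below is not the landmark SDE: it assumes a scalar latent variable \(z\sim N(\theta ,1)\) observed through \(x = z + N(0,1)\) noise, with a proposal centred on the observation playing the role of the guided process. All names are hypothetical.

```python
import numpy as np

def generalized_em(x, theta0, n_iter=30, n_samples=200, seed=0):
    """Generalized EM with a gradient step for a toy latent-Gaussian model.

    Toy model (assumption, not the landmark SDE): z ~ N(theta, 1) is
    unobserved and x = z + N(0, 1) noise is observed.
    """
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    eps = 1.0 / len(x)                      # step size epsilon in (4.12)
    for _ in range(n_iter):
        grad = 0.0
        for xi in x:
            # Proposal centred on the observation, analogous to a guided
            # process that is forced towards the data.
            z = xi + rng.normal(size=n_samples)
            # Correction factor: target density over proposal density, which
            # here reduces to exp(-(z - theta)^2 / 2) up to a constant.
            w = np.exp(-0.5 * (z - theta) ** 2)
            w /= w.sum()
            # The gradient of the complete-data log-likelihood is (z - theta).
            grad += np.sum(w * (z - theta))
        theta += eps * grad                 # gradient step on Q
    return theta

# Synthetic observations: marginally x ~ N(2, 2) in this toy model.
rng = np.random.default_rng(1)
observations = 2.0 + np.sqrt(2.0) * rng.normal(size=50)
theta_hat = generalized_em(observations, theta0=0.0)
```

In this toy model the maximum likelihood estimate is the sample mean of the observations, so theta_hat can be compared with observations.mean() as a sanity check.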

5 Numerical Examples

We now present several numerical tests of the stochastic perturbation of the landmark dynamics. In particular, we want to illustrate aspects of the effect of the noise on the landmarks and test the algorithms for estimation of the spatial correlation of the noise. We will focus here on synthetic examples and refer to [3] for an application of the methods on a dataset of Corpora Callosa shapes represented by 77 landmarks. The numerical simulations of this work have been done in Python, using the symbolic computation framework Theano [59]. The code is available from the public repository https://bitbucket.org/stefansommer/stochlandyn. See also [39] for additional details.

5.1 Solution of the Fokker–Planck Equation

We first consider a simple experiment with a single landmark, subjected to a square array of noise fields with Gaussian noise kernel. To a first-order approximation, the mean trajectory of the landmark is a straight line with constant momenta as the Hamiltonian is a pure kinetic energy.

Fig. 2
figure 2

We plot on the left panel three simulations of a single landmark dynamics subject to an array of Gaussian noise fields. Their parameters are either \(\lambda _l= (0.08,0)\) or \(\lambda _l= (0,0.08)\). We used three different length scales \(r_l\) for the noise fields to analyse the effects of small or large Gaussian fields \(\sigma _l\) on the mean path of the landmark (with Gaussian kernels) and final covariance (ellipses). We used 2000 timesteps to integrate the moment equation forwards from \(t=0\) to \(t=1\). The initial momenta were found using a shooting method in the deterministic landmark equation. We display on the other two panels a zoom on two of the simulations of the left panel and compare the estimation of the final covariance from a Monte Carlo sampling of 10,000 realizations (magenta) and from the solution of the moment equation (red) for two values of \(r_l\). The black density represents the probability distribution of the landmark estimated from samples, and the dashed lines two level sets. a Moment dynamics, b \(r_l=0.5\), c \(r_l=0.03\) (Color figure online)

This experiment is displayed in Fig. 2a where we used two arrays of four by four noise fields with either \(\lambda _l= (0.08,0)\) or \(\lambda _l= (0,0.08)\) and three values of the noise radius \(r_l=0.5,0.05,0.03\). For large values of \(r_l\), the noise is mostly uniform and the gradients of the \(\sigma _l\) are negligible. The only term contributing to the final covariance of the landmark is therefore \(\left\langle \sigma _l^\alpha ({\mathbf {q}}_i) \sigma _l^\beta ({\mathbf {q}}_j) \right\rangle \). Notice that because there is only one landmark, and thus a linear drift, the deterministic part does not affect the covariance. This term only has a linear effect on the covariance, which is thus an ellipse proportional to the noise fields. Here the noise has equal strength in both the x and y coordinates, so we observe a circle. For smaller values of \(r_l\), the gradient of \(\sigma _l\) is large enough for the other term in the momentum equation, which couples the momentum and the gradient of \(\sigma \), to affect the moment dynamics. This effect is shown in Fig. 2a where the covariance has a larger value in the direction of the gradient of \(\sigma _l\) than in the other directions. This is explained by the fact that this coupling is of the form \(\frac{\partial }{\partial q_i}(\sigma _l(\mathbf q_i)\cdot {\mathbf {p}}_i)\), thus the ellipse is in the direction of the gradient, not the momenta. Notice that there should be some noise in the direction of the momenta for this term to have an effect.
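The linear growth of the covariance in the large-\(r_l\) regime can be reproduced in an idealized setting. The sketch below assumes exactly uniform noise fields (one per coordinate direction, with the amplitudes \(\lambda _l\) quoted above), a constant momentum, and the position equation \(\mathrm{d}{\mathbf {q}} = {\mathbf {p}}\,\mathrm{d}t + \sum _l \lambda _l\,\mathrm{d}W^l_t\); the momentum value and the number of fields are hypothetical simplifications of the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_steps, T = 10_000, 100, 1.0
dt = T / n_steps

# Two spatially uniform noise fields (idealized large-r_l limit, assumption).
lambdas = np.array([[0.08, 0.0], [0.0, 0.08]])   # row l holds lambda_l
p0 = np.array([1.0, 0.5])                        # constant momentum (hypothetical)

q = np.zeros((n_samples, 2))
for _ in range(n_steps):
    # One independent Brownian increment per noise field and per sample.
    dW = rng.normal(scale=np.sqrt(dt), size=(n_samples, 2))
    q += p0 * dt + dW @ lambdas                  # dq = p dt + sum_l lambda_l dW^l

sample_cov = np.cov(q.T)
predicted_cov = T * lambdas.T @ lambdas          # T * sum_l lambda_l lambda_l^T
```

With equal amplitudes in x and y, predicted_cov is a multiple of the identity, which corresponds to the circular covariance described above.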

Using the same experiment, we compared the estimation of the covariance from the moment equation with a direct sampling obtained by solving the stochastic landmarks equations. We did this experiment for \(r_l= 0.5\) and \(r_l=0.03\) in Figure 2b, c. The left panel with \(r_l=0.5\) shows an excellent agreement between the two methods but the right panel with \(r_l=0.03\) shows differences. This type of error in the estimation of the covariance is explained by the fact that the final distribution has a large skewness. This effect is not captured by the moment equations as we neglected the effects of order higher than 2, and the skewness is a third order effect described by terms such as \(\Delta _3\langle q_i^\alpha q_j^\beta q_k^\gamma \rangle \). Nevertheless, the final covariance is close enough to the correct one to be able to use it in the estimation of the noise fields. This demonstrates that even in rather extreme cases, which are not realistic for applications, the second-order approximation used to derive the moment equation still produces reliable results.

Fig. 3
figure 3

In these two panels, we present a study similar to Fig. 2a but with 5 interacting landmarks. On the left, we illustrate the effect of varying the landmark length scale \(\alpha \), and, on the right, we compare the result of the moment equation and a Monte Carlo simulation in the case of \(\alpha =0.1\), again with 10,000 realizations. As before, the black density plot shows the probability density of the landmarks, the magenta curve the covariance from sampling and the red curve the covariance from the moment equation. a Landmark interaction, b sampling comparison

We did a similar experiment but with 5 interacting landmarks arranged in an ellipse configuration and with initial conditions obtained from the deterministic shooting method such that the endpoint of the deterministic landmark equations matches another ellipse. We display these experiments in Fig. 3 with the same noise as in the previous tests and with \(r_l = 0.2\). We modified here the landmark interaction length scale \(\alpha \) from \(\alpha =0.02\) (no interactions) to \(\alpha =0.2\) (interactions between neighbours) to see the effect of the noise with the landmark interactions. Due to the different length scales, the trajectories to the target ellipse are slightly different, so the landmarks will be subject to different noise. The larger length scale has the effect of reducing the differences between the covariances of interacting landmarks.

5.2 Bridge Sampling

Here, we aim at visualizing the effect of the constructed bridge sampling scheme. In Fig. 4, the effect of the guiding term is visualized on a sample path. At \(t=T/2\), the predicted endpoint \(\phi _{t,T}(\hat{{\mathbf {q}}}(t),\hat{\mathbf p}(t))\) is calculated and the difference \(\phi _{t,T}(\hat{\mathbf q},\hat{{\mathbf {p}}})-{\mathbf {v}}\) is used to guide the evolution of the path towards the target \({\mathbf {v}}\). The guiding term ensures that \(\hat{{\mathbf {q}}}\) will hit \({\mathbf {v}}\) almost surely at time T. Notice that the difference \(\phi _{t,T}(\hat{{\mathbf {q}}},\hat{\mathbf p})-{\mathbf {v}}\) is generally much smaller than the difference \(\hat{{\mathbf {q}}}-{\mathbf {v}}\). The introduction of \(\phi _{t,T}\) therefore implies that the process is modified less giving more likely bridges. Without \(\phi _{t,T}\), the process is generally attracted too quickly towards the target as can be seen by the landmarks at \(t=0.5\) being almost at their final positions in Fig. 4b. The path thus overshoots the target. This effect is not present when using \(\phi _{t,T}\) in Fig. 4a.
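The mechanism of a guiding term of the form \(-(T-t)^{-1}(\,\cdot \,-{\mathbf {v}})\) can be seen in the simplest possible setting. The sketch below is a one-dimensional toy example with a plain Brownian motion, unit diffusion and no momentum, so that the predicted endpoint is the current position itself; these simplifications are assumptions of the example and it is not the scheme (3.15).

```python
import numpy as np

def guided_brownian_path(x0, v, T=1.0, n_steps=10_000, seed=0):
    """Euler-Maruyama for dx = -(x - v)/(T - t) dt + dW (guided 1D Brownian motion)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        guide = -(x[k] - v) / (T - t)      # pulls harder as t approaches T
        x[k + 1] = x[k] + guide * dt + rng.normal(scale=np.sqrt(dt))
    return x

path = guided_brownian_path(x0=0.0, v=1.0)
```

The discretized path ends within one noise increment of the target v, illustrating how the growing factor \((T-t)^{-1}\) forces the process onto the target at time T.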

Fig. 4
figure 4

a Visualization of the process (3.15). From the initial landmark configuration \({\mathbf {q}}(0)\) (blue crosses), the target \({\mathbf {v}}\) (blue dots) is hit using the modified process \((\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})\) (black lines: \(\hat{{\mathbf {q}}})\). At time \(t=T/2\), \(\phi _{t,T}(\hat{{\mathbf {q}}}(t),\hat{{\mathbf {p}}}(t))\) is calculated (green dots) and the process is guided by \(-(T-t)^{-1}\Sigma \Sigma _{\mathbf q}^{\dagger }(\phi _{t,T}(\hat{{\mathbf {q}}},\hat{{\mathbf {p}}})-{\mathbf {v}})\) (\({\mathbf {q}}\) part: green arrows, length doubled for visualization). The use of \(\phi _{t, T}\) implies small guiding and high-probability sample bridges. b Similar setup but using the guiding term \(-(T-t)^{-1}\Sigma \Sigma _{{\mathbf {q}}}^\dagger (\hat{{\mathbf {q}}}-\mathbf v)\) without \(\phi _{t,T}\). The momentum couples with the guiding term, and, intuitively, the path travels too fast towards the target (\({\mathbf {q}}\) at \(t=T/2\) much closer than halfway towards \(\mathbf v\)) and overshoots. This effect gives low probability sample bridges and the guiding term (green arrows) is much larger than in a. a Guided process using \(\phi _{t,T}\), b guided process without \(\phi _{t,T}\) (Color figure online)

5.3 Estimating the Noise Amplitudes

We here aim at estimating the noise amplitude from sampled data using both the method of moments and maximum likelihood.

Fig. 5
figure 5

This figure shows the results of estimating parameters of the \(\sigma _l\) fields with the moment equation. Black arrows: The original \(\sigma _l\). Blue arrows: The estimated \(\sigma _l\). The error in the final covariance for the differential evolution genetic algorithm is of the order of \(10^{-10}\) and for the BFGS algorithm, it is of the order of \(5\cdot 10^{-2}\). a Genetic algorithm, b gradient descent (Color figure online)

We first use the genetic algorithm of [57], called the differential evolution algorithm, to minimize the cost function C in (4.8). This algorithm has in experiments proven successful in avoiding local minima during the optimization. We compared it with the standard gradient-based BFGS algorithm with a single landmark in Fig. 5. The BFGS algorithm relies on the Jacobian of the cost functional computed symbolically using the Theano package of [59]. It is able to estimate the noise amplitude along the trajectory of the landmark, where the signal from the gradient of C is the strongest. For the other regions of the image, the algorithm cannot detect any signal to update the noise fields. The genetic algorithm can overcome this issue as it is based on evolving a population of solutions which uniformly cover the entire parameter space. In this way, the solution obtained is a better approximation of the global minimum of C. It is interesting that even if the final moment of Fig. 5 is well matched with the genetic algorithm, the noise amplitude is not perfectly recovered. This illustrates the expected degeneracy of this model for a low number of landmarks. When more landmarks are added, the noise amplitude estimation is closer to the expected one, see Fig. 7. In these experiments, we set the initial variance of the momentum and the position/momentum correlation to 0 for simplicity, and because we used these values to generate the final variance. In practice, one may have another prior for the initial variance of the momenta, or one can treat it as an unknown parameter of the problem. Treating these quantities as unknowns may result in a large parameter space, so simplifications could be used, such as assuming that all the landmarks have the same initial variance in the momentum. We leave such an investigation for later applications to real data, possibly with a meaningful prior.

Fig. 6
figure 6

The noise amplitudes are here estimated using maximum likelihood with the bridge sampling scheme. We assume \(\lambda _l\) are equal for all l resulting in two parameters for the noise. Thus by assumption, the estimated noise will be uniform over the domain. a The parameters are estimated correctly in the low momentum setting. b While the sample covariance matches the covariance of the original data in the high momentum case, the estimated parameters are different from the original. a Landmark and estimated noise, low momentum, b landmark and estimated noise, high momentum

In Fig. 6, the same experiment is performed with MLE and the bridge sampling scheme. The noise kernels are in this experiment cubic B-splines placed in a grid providing a partition of unity. In the optimization, \(\lambda _l\) are fixed to be equal for all \(l=1,\ldots , J\) implying that the total noise variance will be uniform at each point of the domain. The figure shows the experiment performed with low momentum (Fig. 6a) and high momentum (Fig. 6b). In the low momentum case, the noise parameters are estimated correctly and the sample covariance with the estimated parameters matches the covariance of the original samples. The SDE (3.15) is here used for the bridge sampling scheme. In contrast to the previous method, the algorithm is now optimizing for the maximum likelihood of the samples and not directly for matching the final covariance. A higher difference in the endpoint covariance is, therefore, to be expected.

With higher initial momentum, the coupling between the guidance and noise makes the scheme (3.15) overestimate the variance. Instead, the guidance term (3.16) is used. Notice that even though the sample covariance with the estimated parameters matches the covariance of the original samples, the estimated \(\lambda _l\) differ from the original values. This indicates that the maximum likelihood estimate of the parameters may not match the original setting in the highly nonlinear case occurring when the coupling between noise and momentum is high. Because of the nonlinearity, the noise is able to generate horizontal variation in the final position of the landmark even though the variation with the estimated parameters is mainly vertical along the trajectory.

Fig. 7
figure 7

This figure shows the result of noise estimation using the moment equation as in Fig. 5 but for the ellipse experiment. The error in the final covariance for the differential evolution genetic algorithm is of the order of \(10^{-9}\) and for the BFGS algorithm it is of the order of \(5\cdot 10^{-3}\). a Genetic algorithm, b gradient descent

Figures 7 and 8 show the result of noise estimation using different configurations of the ellipse and both the method of moments and MLE. The noise parameters \(\lambda _l\) are allowed to vary with l in both cases giving spatially nonuniform noise amplitude. The algorithms find the correct noise parameters in the areas covered by the landmark trajectories.

Fig. 8
figure 8

a Setup as Fig. 6 but with five landmarks in an ellipse configuration. b Examples of simulated bridges as used in the approximation of the Q function in the EM procedure. a Landmarks and estimated noise, b EM iterations

6 Discussion and Outlook

As the first topic of this work, we raised the issue of how to include stochasticity and uncertainty in the framework of large deformation matching in a systematic and geometrically consistent way. In Sect. 2, we exposed a general theory of stochastic deformations in the LDDMM framework, based on the momentum map representation of images in [8], by introducing spatially correlated time-dependent noise in the reconstruction relation that is used to compute the deformation map from its velocity field. By taking this approach, we have preserved most of the advantages of the theory of reduction by symmetry. In particular, we have preserved the capability of applying this stochastic model to general data structures. The dynamical equation is the stochastic EPDiff equation, in which the noise appears in a certain multiplicative form with spatial correlations encoded in a set of spatially dependent functions \(\sigma _l\). The key feature of this noise is that the structure of the original equation provided by the theory of reduction by symmetry still remains. In particular, the persistence of the momentum map allows for both exact and inexact matching in this stochastic context.

The question of local-in-time existence and uniqueness of this equation is important, but it is not treated in this work. We refer to [10] for such a study for the 2D Euler equation and to [13] for the 3D case. Another possible extension would be to consider an infinite number of \(\sigma _l\) fields with an infinite-dimensional Wiener process for the stochastic EPDiff equation as investigated in [64], also in the context of stochastic shape analysis. We have considered time-independent \(\sigma _l\) fields. However, there are several approaches for making these fields time dependent besides simply prescribing them as functions of time. Some of these other approaches were derived by [21] in the context of stochastic fluid dynamics. In particular, the idea of having the noise fields being carried by the deformation could be of interest in this context as well. Yet another possibility would be to introduce two different types of noise fields, one modelling small-scale noise correlations and the other one for larger scale noise correlations. In this case, it would make sense for the small scale variability to be advected by the large-scale deformation, as in the multi-scale model of [30].

After defining the general model in Sect. 2, we applied it to exact landmark matching in Sect. 3, which is the simplest nontrivial application of the LDDMM framework. This approach allowed investigation of the effects of the noise on large deformation matching in a finite-dimensional model. Introducing the noise in both the momentum and the position equations of the landmarks made the landmark trajectories rougher than they would have been, otherwise, had the noise been only in the momentum equation. The noise in the position equation also increased the flexibility for controlling the landmark trajectories. This flexibility was used to derive a scheme for simulating diffusion bridges with corresponding sampling correction factor that allowed evaluation of expectations with respect to the original conditioned landmark dynamics. In addition, we used the finite dimensionality of the system to derive the Fokker–Planck equation and apply it to the dynamics of moments of the probability distribution function.

Some modifications to the standard theory of diffusion bridges were made to accommodate the case of landmark dynamics and to improve the speed and accuracy of the estimation of expectations over conditioned landmark trajectories. Landmarks represent the simplest case for numerical shape analysis, especially in the context of stochastic systems. We used a simple Heun method to solve the stochastic landmark equations. Higher order integration schemes could have been used, such as the stochastic variational integrators of [31]. The next step in extending the landmark example is to allow for inexact matching and to study the trade-off between the effect of noise and the tolerance of the matching.
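
To make the reference to the Heun method concrete, the following is a minimal sketch of a stochastic Heun (predictor-corrector) step for a generic Stratonovich equation \(dx = a(x)\,dt + \sum_l b_l(x)\circ dW^l\). It is not the implementation used for the experiments. The function names `heun_step` and `integrate`, and the representation of the state as a flat list of floats (for landmarks, the stacked positions and momenta), are assumptions made for illustration only.

```python
import math
import random


def heun_step(x, drift, noise_fields, dt, dW):
    """One stochastic Heun step for dx = drift(x) dt + sum_l noise_fields[l](x) o dW^l.

    x is a list of floats; drift and each noise field map a state to a list
    of the same length; dW holds one Brownian increment per noise field.
    All names here are hypothetical and chosen for illustration.
    """
    def increment(y):
        # Total increment a(y) dt + sum_l b_l(y) dW^l evaluated at state y.
        inc = [a * dt for a in drift(y)]
        for b, dw in zip(noise_fields, dW):
            inc = [v + s * dw for v, s in zip(inc, b(y))]
        return inc

    k1 = increment(x)
    x_pred = [xi + ki for xi, ki in zip(x, k1)]  # Euler predictor
    k2 = increment(x_pred)
    # Corrector: average the increments at the start and predicted states.
    return [xi + 0.5 * (u + v) for xi, u, v in zip(x, k1, k2)]


def integrate(x0, drift, noise_fields, T, n_steps, rng):
    """Integrate on [0, T]; returns the final state and the final Brownian values."""
    dt = T / n_steps
    x = list(x0)
    W = [0.0] * len(noise_fields)
    for _ in range(n_steps):
        dW = [rng.gauss(0.0, math.sqrt(dt)) for _ in noise_fields]
        x = heun_step(x, drift, noise_fields, dt, dW)
        W = [w + dw for w, dw in zip(W, dW)]
    return x, W
```

Because the corrector averages the noise coefficients at the current and predicted states, the scheme is consistent with the Stratonovich interpretation of the noise. For example, for \(dx = \tfrac{1}{2}\,x\circ dW\) the numerical solution can be compared with the exact solution \(x_0\exp (W_T/2)\) along the same Brownian path.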

Several issues regarding ergodicity and other properties of the Kolmogorov operator were left open in this paper; treating them in future work could add to the theoretical understanding of the model. Finally, the stochastic LDDMM framework can be applied to other types of data structures, in particular to images with inexact matching as originally done in [6]. Studying the effects of the stochastic model on other nonlinear data structures such as curves or surfaces would also be of great interest for future work.

As a second topic, we raised the issue of determining the noise correlation from data sets, which would allow the theory of stochastic deformations to be used with observed data. We developed two independent methods, which we implemented and applied to several test examples. First, the moment equation allows matching of the sample moments. It is deterministic, making optimization of the noise parameters stable and efficient, and it does not require special conditions on the noise fields. Its accuracy depends on the approximation order in the moment equation. Scaling the moment equation to a large number of landmarks or to continuous shapes such as curves may be challenging, as may optimizing over a large number of unknown parameters. In the landmark experiments we presented above, this approach allowed us to reliably estimate the underlying noise, but an extension of this method to infinite-dimensional representations of shapes is not possible unless a discretized version of the equations is used. For this method, we also made two approximations that could possibly be improved. One of them is the truncation that retains only second-order terms in the moment equation, and the other is the approximation of the expectation of a kernel function by the kernel of the expected values. Both approximations were shown to work well for cases with small enough noise, which would be the case in most applications. Finally, it is also important to notice that we did not exploit the freedom in choosing the initial values of the momentum variance and of the position/momentum correlation. These parameters could either be inferred using this scheme (with a larger parameter space) or be obtained by using other information about the data.
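
The second approximation above, replacing the expectation of a kernel by the kernel of the expected values, can be illustrated with a small numerical check. The sketch below is not taken from the experiments. It assumes a Gaussian kernel \(K(x) = \exp (-|x|^2/(2\alpha ^2))\) and assumes that the landmark separation \(x\) is normally distributed with mean \(\mu \) and covariance \(s^2 I\), in which case \(\mathbb {E}[K(x)]\) has a closed form. The function names and parameter values are hypothetical.

```python
import math


def kernel_of_mean(mu, alpha):
    """K(E[x]) for the Gaussian kernel K(x) = exp(-|x|^2 / (2 alpha^2))."""
    r2 = sum(m * m for m in mu)
    return math.exp(-r2 / (2.0 * alpha ** 2))


def mean_of_kernel(mu, s, alpha):
    """Exact E[K(x)] under the assumption x ~ N(mu, s^2 I)."""
    d = len(mu)
    r2 = sum(m * m for m in mu)
    shrink = alpha ** 2 / (alpha ** 2 + s ** 2)
    return shrink ** (d / 2.0) * math.exp(-r2 / (2.0 * (alpha ** 2 + s ** 2)))


def relative_error(mu, s, alpha):
    """Relative error made by replacing E[K(x)] with K(E[x])."""
    exact = mean_of_kernel(mu, s, alpha)
    return abs(kernel_of_mean(mu, alpha) - exact) / exact


if __name__ == "__main__":
    mu, alpha = [1.0, 0.0], 1.0  # illustrative mean separation and kernel width
    for s in (0.0, 0.05, 0.2, 0.5):
        print(f"s = {s:4.2f}  relative error = {relative_error(mu, s, alpha):.4f}")
```

Under these assumptions the error vanishes as \(s \rightarrow 0\) and grows with the spread \(s\) relative to the kernel width \(\alpha \), which is consistent with the observation that the approximation works well when the noise is small enough.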

The second method is the MLE optimization, a Monte Carlo method which evaluates expectations over conditioned stochastic trajectories. The bridge sampling scheme we used requires the noise fields to span the entire \({\mathbf {q}}\)-space to allow guiding the landmarks towards their target. With high nonlinearity, as may happen with large initial momentum and high gradients of the noise fields, guiding the trajectories towards their target with high-probability bridges can be challenging. In general, the stochastic nature of the algorithm makes it harder to control than the matching provided by the moment equation. The bridge sampling scheme can be interpreted as a gradient flow, as discussed in [3] when applied to images. It allows the likelihood of observed images to be evaluated without a prior image registration step. The method may thus be applicable to image analysis problems, and more generally to inexact matching of shapes, in which case the requirement that the noise span the \({\mathbf {q}}\)-space may be relaxed.

The inference of noise parameters treated here can be extended to more general statistical inference problems on shape spaces. Inferring the initial \({\mathbf {q}}_0\) positions can be regarded as estimating a most-likely mean, thereby drawing similarities to the Fréchet mean [20] and to means defined by the maximum likelihood of probability distributions in nonlinear spaces [54]. When generalized to images, the approach can be used for simultaneous estimation of template images [35], possibly time-dependent transformations in the momentum as caused for example by disease processes [47], and population variation in the spatial noise correlation.

It is possible to generalize the stochastic equations we have introduced here to allow for time-dependent noise amplitude, as done in [21] for fluid dynamics. In this case, the noise fields could be advected by the diffeomorphism and only the initial condition of the noise field would have to be inferred. This requires the choice of a meaningful advection scheme. By construction of its metric, LDDMM is right-invariant, and the flow energy is therefore measured in Eulerian coordinates. This leads us to define stochastic flows that are compatible with this right-invariant geometry, thus giving noise in Eulerian coordinates. In the deterministic setting, left-invariant metrics [53] provide a Lagrangian view of the metric which, in a medical context, follows the advected anatomy. We leave it as an open and very relevant problem to consider advected, or left-invariant Lagrangian, noise.

Extending the inference methods presented here to other data structures, in particular to infinite-dimensional shape spaces, would again constitute an interesting future direction. As discussed in detail at the end of Sect. 2, we believe that the methods presented here can, with suitable modifications, also be applied to infinite-dimensional representations of shapes, and that additional methods could be introduced, such as stochastic filtering for further data assimilation of the results in infinite-dimensional cases, see e.g. [5].