1 Introduction

Deformable registration computes a dense correspondence between two images, and is fundamental to many medical image analysis tasks. Traditional methods solve an optimization over the space of deformations, such as elastic-type models [4], B-splines [25], dense vector fields [27], or discrete methods [8, 12]. Constraining the allowable transformations to diffeomorphisms ensures certain desirable properties, such as preservation of topology. Diffeomorphic transforms have seen extensive methodological development, yielding state-of-the-art tools, such as LDDMM [6, 29], DARTEL [2], and SyN [3]. However, these tools often demand substantial time and computational resources for a given image pair.

Recent methods have proposed to train neural networks that map a pair of input images to an output deformation. These approaches usually require ground truth registration fields, often derived via more conventional registration tools, which can introduce a bias and necessitate significant preprocessing [23, 26, 28]. Some preliminary papers [9, 18] explore unsupervised strategies that build on the spatial transformer network [15], but are only demonstrated with constrained deformation models such as affine or small displacement fields. Furthermore, they have only been validated on limited volumes, such as 3D patches or 2D slices. A recent paper [5] avoids these pitfalls, but still does not provide topology-preserving guarantees or probabilistic uncertainty estimates, both of which yield meaningful information for downstream image analysis.

In this paper, we formulate registration as variational inference on a probabilistic generative model. This framework naturally results in a learning algorithm that uses a convolutional neural network with an intuitive cost function. We introduce novel diffeomorphic integration layers combined with a transform layer to enable unsupervised end-to-end learning for diffeomorphic registration. We present extensive experiments demonstrating that our algorithm achieves state-of-the-art registration accuracy while providing diffeomorphic deformations, fast runtime, and estimates of registration uncertainty.

1.1 Diffeomorphic Registration

Although the method presented in this paper applies to a multitude of deformable representations, we choose to work with diffeomorphisms, and in particular with a stationary velocity field representation [2]. Diffeomorphic deformations are differentiable and invertible, and thus preserve topology. Let \(\varvec{\phi }: R^3 \rightarrow R^3\) represent the deformation that maps the coordinates from one image to coordinates in another image. In our implementation, the deformation field is defined through the following ordinary differential equation (ODE):

$$\begin{aligned} \frac{\partial \varvec{\phi }^{(t)}}{\partial t} = \varvec{v}(\varvec{\phi }^{(t)}) \end{aligned}$$
(1)

where \(\varvec{\phi }^{(0)} = Id\) is the identity transformation and t is time. We integrate the stationary velocity field \(\varvec{v}\) over \(t \in [0,1]\) to obtain the final registration field \(\varvec{\phi }^{(1)}\).

We compute the integration numerically using scaling and squaring [1], which we briefly review here. The integration of a stationary ODE represents a one-parameter subgroup of diffeomorphisms. In group theory, \(\varvec{v}\) is a member of the Lie algebra and is exponentiated to produce \(\varvec{\phi }^{(1)}\), which is a member of the Lie group: \(\varvec{\phi }^{(1)} =\text {exp}(\varvec{v})\). From the properties of one-parameter subgroups, for any scalars t and \(t'\), \(\exp ((t+t')\varvec{v}) = \exp (t\varvec{v}) \circ \exp (t'\varvec{v})\), where \(\circ \) is a composition map associated with the Lie group. Starting from \(\varvec{\phi }^{(1/2^T)} =\varvec{p}+ \varvec{v}(\varvec{p})/2^T\) where \(\varvec{p}\) is a map of spatial locations, we use the recurrence \(\varvec{\phi }^{(1/2^{t-1})} = \varvec{\phi }^{(1/2^t)} \circ \varvec{\phi }^{(1/2^t)}\) to obtain \(\varvec{\phi }^{(1)} = \varvec{\phi }^{(1/2)} \circ \varvec{\phi }^{(1/2)}\). T is chosen so that \(\varvec{v}/2^T \approx 0\).
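The recurrence above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the implementation used in this paper: the grid is 1D rather than 3D, deformations are stored as displacements (so that \(\varvec{\phi }(p) = p + \text{disp}(p)\)), and the function names `compose_1d` and `integrate_velocity_1d` are hypothetical.

```python
import numpy as np

def compose_1d(disp_a, disp_b):
    """Displacement of phi_a o phi_b on a 1D grid, where phi(p) = p + disp(p)."""
    grid = np.arange(disp_a.shape[0], dtype=float)
    # phi_b(p) is generally a non-integer location, so disp_a is
    # linearly interpolated there (np.interp clamps outside the grid).
    warped_locations = grid + disp_b
    return disp_b + np.interp(warped_locations, grid, disp_a)

def integrate_velocity_1d(velocity, num_steps=7):
    """Approximate the displacement of exp(v) with scaling and squaring."""
    # Scaling: phi^(1/2^T) = p + v(p) / 2^T.
    disp = np.asarray(velocity, dtype=float) / (2 ** num_steps)
    # Squaring: phi^(1/2^(t-1)) = phi^(1/2^t) o phi^(1/2^t), applied T times.
    for _ in range(num_steps):
        disp = compose_1d(disp, disp)
    return disp
```

For a spatially constant velocity field the result is a pure translation by that constant, which gives a quick sanity check. In 1D, a valid diffeomorphism corresponds to a strictly increasing map \(p \mapsto p + \text{disp}(p)\).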

2 Methods

Let \(\varvec{x}\) and \(\varvec{y}\) be 3D images, such as MRI volumes, and let \(\varvec{z}\) be a latent variable that parametrizes a transformation function \(\varvec{\phi }_{\varvec{z}}: R^3 \rightarrow R^3\). We use a generative model to describe the formation of \(\varvec{x}\) by warping \(\varvec{y}\) into \(\varvec{y}\circ \varvec{\phi }_{\varvec{z}}\). We propose a variational inference method that uses a neural network of convolutions, diffeomorphic integration, and spatial transform layers. We learn the network parameters in an unsupervised fashion, i.e., without access to ground truth registrations. We describe how the network yields fast diffeomorphic registration of a new image pair \(\varvec{x}\) and \(\varvec{y}\), while providing uncertainty estimates.

2.1 Generative Model

We model the prior probability of \(\varvec{z}\) as:

$$\begin{aligned} p(\varvec{z}) = \mathcal {N}(\varvec{z}; \varvec{0}, \varvec{\varSigma }_z), \end{aligned}$$
(2)

where \(\mathcal {N}(\cdot ;\varvec{\mu },\varvec{\varSigma })\) is the multivariate normal distribution with mean \(\varvec{\mu }\) and covariance \(\varvec{\varSigma }\). Our work applies to a wide range of representations \(\varvec{z}\). For example, \(\varvec{z}\) could be a low-dimensional embedding of a dense displacement field, or even the displacement field itself. In this paper, we let \(\varvec{z}\) be a stationary velocity field that specifies a diffeomorphism through the ODE (1). We let \(\varvec{L}= \varvec{D}- \varvec{A}\) be the Laplacian of a neighborhood graph defined on a voxel grid, where \(\varvec{D}\) is the graph degree matrix, and \(\varvec{A}\) is a voxel neighbourhood adjacency matrix. We encourage spatial smoothness of \(\varvec{z}\) by letting \(\varvec{\varSigma }_z^{-1} = \varvec{\Lambda }_z = \lambda \varvec{L}\), where \(\varvec{\Lambda }_z\) is a precision matrix and \(\lambda \) denotes a parameter controlling the scale of the velocity field \(\varvec{z}\).
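As a small illustration of the prior precision \(\varvec{\Lambda }_z = \lambda \varvec{L}\), the sketch below builds a dense graph Laplacian for a tiny grid. The face-adjacent neighbourhood (6-connected in 3D) is an assumption, since the text does not specify the neighbourhood, and the helper names are hypothetical. A dense matrix is only practical for toy grids.

```python
import numpy as np

def grid_laplacian(shape):
    """Dense graph Laplacian L = D - A for a grid with face-adjacent neighbours."""
    num_voxels = int(np.prod(shape))
    index = np.arange(num_voxels).reshape(shape)
    adjacency = np.zeros((num_voxels, num_voxels))
    for axis in range(len(shape)):
        # Pair each voxel with its successor along this axis.
        first = np.take(index, np.arange(shape[axis] - 1), axis=axis).ravel()
        second = np.take(index, np.arange(1, shape[axis]), axis=axis).ravel()
        adjacency[first, second] = 1.0
        adjacency[second, first] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def prior_precision(shape, lam):
    """Precision matrix Lambda_z = lambda * L of the prior in Eq. (2)."""
    return lam * grid_laplacian(shape)
```

Each row of \(\varvec{L}\) sums to zero, so \(\varvec{L}\) is singular and the prior is most conveniently handled through its precision matrix rather than an explicit covariance.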

We let \(\varvec{x}\) be a noisy observation of warped image \(\varvec{y}\):

$$\begin{aligned} p(\varvec{x}| \varvec{z}; \varvec{y})&= \mathcal {N}(\varvec{x}; \varvec{y}\circ \varvec{\phi }_{\varvec{z}}, \sigma ^2\varvec{\mathbbm {I}}), \end{aligned}$$
(3)

where \(\sigma ^2\) reflects the variance of additive image noise.

We aim to estimate the posterior registration probability \(p(\varvec{z}|\varvec{x};\varvec{y})\). Using this, we can obtain the most likely registration field \(\varvec{\phi }_{\varvec{z}}\) for a new image pair \((\varvec{x}, \varvec{y})\) via MAP estimation, along with an estimate of uncertainty for this registration.

2.2 Learning

With our assumptions, computing the posterior probability \(p(\varvec{z}|\varvec{x}; \varvec{y})\) is intractable. We use a variational approach, and introduce an approximate posterior probability \(q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y})\) parametrized by \(\varvec{\psi }\). We minimize the KL divergence

$$\begin{aligned} \min _{\psi } \text {KL}&\left[ q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y}) || p(\varvec{z}|\varvec{x}; \varvec{y}) \right] \nonumber \\&= \min _{\psi } \mathrm{IE}_{q} \left[ \log q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y}) - \log p(\varvec{z}|\varvec{x}; \varvec{y}) \right] \nonumber \\&= \min _{\psi } \mathrm{IE}_{q} \left[ \log q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y}) - \log p(\varvec{z}, \varvec{x}; \varvec{y}) \right] + \log p(\varvec{x}; \varvec{y}) \nonumber \\&= \min _{\psi } \text {KL}\left[ q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y}) || p(\varvec{z}) \right] - \mathrm{IE}_{q} \left[ \log p(\varvec{x}| \varvec{z}; \varvec{y}) \right] , \end{aligned}$$
(4)

which is the negative of the variational lower bound of the model evidence [16]. We model the approximate posterior \(q_{\varvec{\psi }}(\varvec{z}| \varvec{x}; \varvec{y})\) as a multivariate normal:

$$\begin{aligned} q_{\varvec{\psi }}(\varvec{z}| \varvec{x}; \varvec{y}) = \mathcal {N}(\varvec{z}; \varvec{\mu }_{z | x, y}, \varvec{\varSigma }_{z |x, y}), \end{aligned}$$
(5)

where \(\varvec{\varSigma }_{z | x, y}\) is diagonal.

We estimate \(\varvec{\mu }_{z | x, y}\), and \(\varvec{\varSigma }_{z | x, y}\) using a convolutional neural network \(\text {def}_{\varvec{\psi }}(\varvec{x},\varvec{y})\) parameterized by \(\varvec{\psi }\), as described below. We can therefore learn the parameters \(\varvec{\psi }\) by optimizing the variational lower bound (4) using stochastic gradient methods. Specifically, for each image pair \(\{\varvec{x}, \varvec{y}\}\) and samples \(\varvec{z}_k\sim q_{\psi }(\varvec{z}|\varvec{x}; \varvec{y})\), we can compute \(\varvec{y}\circ \varvec{\phi }_{z_k}\), with the resulting loss:

$$\begin{aligned}&\mathcal {L}(\varvec{\psi }; \varvec{x}, \varvec{y}) = - \mathrm{IE}_{q} \left[ \log p(\varvec{x}| \varvec{z}; \varvec{y}) \right] + \text {KL}\left[ q_{\varvec{\psi }}(\varvec{z}|\varvec{x}; \varvec{y}) || p(\varvec{z}) \right] \\&= \frac{1}{2\sigma ^2K} \sum _k ||\varvec{x}- \varvec{y}\circ \varvec{\phi }_{z_k}||^2 + \frac{1}{2} \left[ \text {tr}(\lambda \varvec{D}\varvec{\varSigma }_{z|x,y}) - \log |\varvec{\varSigma }_{z|x,y}| + \varvec{\mu }_{z | x, y}^T \varvec{\Lambda }_z \varvec{\mu }_{z | x, y} \right] + \text {const}, \nonumber \end{aligned}$$
(6)

where K is the number of samples used. In our experiments, we use \(K=1\). The first term encourages the warped image \(\varvec{y}\circ \varvec{\phi }_{z_k}\) to be similar to \(\varvec{x}\). The second term encourages the posterior to be close to the prior \(p(\varvec{z})\). Although the variational covariance \(\varvec{\varSigma }_{z|x,y}\) is diagonal, the last term spatially smooths the mean: \(\varvec{\mu }_{z | x, y}^T \varvec{\Lambda }_z \varvec{\mu }_{z | x, y} = \frac{\lambda }{2} \sum _i \sum _{j\in N(i)} (\varvec{\mu }[i] - \varvec{\mu }[j])^2\), where N(i) are the neighbors of voxel i. We treat \(\sigma ^2\) and \(\lambda \) as fixed hyper-parameters.
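To make the terms of the loss (6) concrete, the following NumPy sketch evaluates it for a single sample (\(K=1\)), up to the additive constant. It is an illustrative sketch under assumptions: `mu` and `var` hold one scalar component of the velocity field on the voxel grid (a 3D velocity field would sum this over its three components), neighbours are face-adjacent, and all function names are hypothetical.

```python
import numpy as np

def smoothness_term(mu, lam):
    """mu^T Lambda_z mu computed from differences between adjacent voxels."""
    total = 0.0
    for axis in range(mu.ndim):
        # Each neighbouring pair is counted once here, which equals
        # (1/2) * sum_i sum_{j in N(i)} (mu[i] - mu[j])^2.
        total += np.sum(np.diff(mu, axis=axis) ** 2)
    return lam * total

def degree_map(shape):
    """Number of face-adjacent neighbours of every voxel (the diagonal of D)."""
    deg = np.zeros(shape)
    for axis, size in enumerate(shape):
        counts = np.full(size, 2.0)
        counts[0] -= 1.0   # no neighbour before the first voxel
        counts[-1] -= 1.0  # no neighbour after the last voxel
        broadcast_shape = [1] * len(shape)
        broadcast_shape[axis] = size
        deg += counts.reshape(broadcast_shape)
    return deg

def registration_loss(x, warped_y, mu, var, sigma2, lam):
    """Single-sample loss: data term plus KL term, without the constant."""
    data = np.sum((x - warped_y) ** 2) / (2.0 * sigma2)
    trace = lam * np.sum(degree_map(mu.shape) * var)  # tr(lambda * D * Sigma)
    log_det = np.sum(np.log(var))                     # log |Sigma| for diagonal Sigma
    kl = 0.5 * (trace - log_det + smoothness_term(mu, lam))
    return data + kl
```

Because the covariance is diagonal, the trace and log-determinant reduce to per-voxel sums, and the quadratic form never requires building \(\varvec{L}\) explicitly.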

Fig. 1. Overview of the end-to-end unsupervised architecture. The first part of the network, \(\text {def}_{\psi }(\varvec{x}, \varvec{y})\), takes the input images and outputs the approximate posterior probability parameters representing the velocity field mean, \(\varvec{\mu }_{z|x,y}\), and variance, \(\varvec{\varSigma }_{z|x,y}\). A velocity field \(\varvec{z}\) is sampled and transformed to a diffeomorphic deformation field \(\varvec{\phi }_z\) using novel differentiable scaling and squaring layers. Finally, a spatial transform warps \(\varvec{y}\) to obtain \(\varvec{y}\circ \varvec{\phi }_z\).

2.3 Neural Network Framework

We design the network \(\text {def}_{\varvec{\psi }}(\varvec{x},\varvec{y})\), which takes as input \(\varvec{x}\) and \(\varvec{y}\) and outputs \(\varvec{\mu }_{z|x,y}\) and \(\varvec{\varSigma }_{z|x,y}\), based on a 3D UNet-style architecture [24]. The network includes a convolutional layer with 16 filters, four downsampling layers with 32 convolutional filters and a stride of two, and finally three upsampling convolutional layers with 32 filters. All convolutional layers use LeakyReLU activations and a 3\(\,\times \,\)3 kernel.

To enable unsupervised learning of parameters \(\varvec{\psi }\) using (6), we must form \(\varvec{y}\circ \varvec{\phi }_z\) to compute the data term. We first implement a layer that samples a new \(\varvec{z}_k \sim \mathcal {N}(\varvec{\mu }_{z|x,y}, \varvec{\varSigma }_{z|x,y})\) using the “re-parameterization trick” [16].
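The sampling step can be sketched as follows. The idea of the re-parameterization trick is to isolate the randomness in a standard normal variable, so that the sample is a deterministic function of the mean and the diagonal covariance. This NumPy sketch only shows the forward computation; it assumes the network outputs the log-variance, which is a common convention but is not stated in the text, and the function name is hypothetical.

```python
import numpy as np

def sample_velocity(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) and diagonal covariance."""
    eps = rng.standard_normal(mu.shape)
    # exp(0.5 * log_var) is the per-voxel standard deviation.
    return mu + np.exp(0.5 * log_var) * eps
```

A call such as `sample_velocity(mu, log_var, np.random.default_rng(0))` returns an array with the same shape as `mu`.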

We propose novel scaling and squaring network layers to compute \(\varvec{\phi }_{z_k} = \exp (\varvec{z}_k)\). Specifically, these involve compositions within the neural network architecture using a differentiable spatial transformation operation. Given two 3D vector fields \(\varvec{a}\) and \(\varvec{b}\), for each voxel \(\varvec{p}\) this layer computes \((\varvec{a}\circ \varvec{b})(\varvec{p}) = \varvec{a}(\varvec{b}(\varvec{p}))\). Because \(\varvec{b}(\varvec{p})\) is generally a non-integer voxel location, \(\varvec{a}\) is evaluated there using linear interpolation. Starting with \(\varvec{\phi }^{(1/2^T)} = \varvec{p}+ \varvec{z}_k/2^T\), we compute \(\varvec{\phi }^{(1/2^{t-1})} = \varvec{\phi }^{(1/2^t)} \circ \varvec{\phi }^{(1/2^t)}\) recursively using these layers, leading to \(\varvec{\phi }^{(1)} \triangleq \varvec{\phi }_{z_k} = \exp (\varvec{z}_k)\). In our experiments, we use \(T=7\).

Finally, we use a spatial transform layer to warp volume \(\varvec{y}\) according to the computed diffeomorphic field \(\varvec{\phi }_{z_k}\). This network results in three outputs, \(\varvec{\mu }_{z|x,y}, \varvec{\varSigma }_{z|x,y}\) and  \(\varvec{y}\circ \varvec{\phi }_{z_k}\), which are used in the model loss (6).

In summary, the neural network takes as input \(\varvec{x}\) and \(\varvec{y}\), computes \(\varvec{\mu }_{z|x,y}\) and \(\varvec{\varSigma }_{z|x,y}\), samples a new \(\varvec{z}_k \sim \mathcal {N}(\varvec{\mu }_{z|x,y}, \varvec{\varSigma }_{z|x,y})\), computes a diffeomorphic \(\varvec{\phi }_{z_k}\) and applies it to \(\varvec{y}\). Since all the steps are designed to be differentiable, we learn the network parameters using stochastic gradient descent based methods on the loss (6). The framework is summarized in Fig. 1. We implement our method as part of the VoxelMorph package using Keras with a TensorFlow backend.

2.4 Registration and Uncertainty

Given learned parameters, we approximate registration of a new scan pair \((\varvec{x}, \varvec{y})\) using \(\varvec{\phi }_{\hat{z}_k}\). We first obtain \(\hat{\varvec{z}}_k\) using

$$\begin{aligned} \hat{\varvec{z}}_k&= \arg \max _{\varvec{z}_k} p(\varvec{z}_k | \varvec{x}; \varvec{y}) = \varvec{\mu }_{z|x;y}, \end{aligned}$$
(7)

by evaluating the neural network \(\text {def}_\psi (\varvec{x}, \varvec{y})\) on the two input images. We then compute \(\varvec{\phi }_{\hat{z}_k}\) using the scaling and squaring network layers. We also obtain \(\varvec{\varSigma }_{z|x,y}\), enabling an estimation of the uncertainty of the velocity field \(\varvec{z}\) at each voxel j:

$$\begin{aligned} H(\varvec{z}[j])&\approx \mathrm{IE}\left[ -\log q_{\varvec{\psi }}(\varvec{z}|\varvec{x},\varvec{y}) \right] = \frac{1}{2} \log \left( 2 \pi e \, \varvec{\varSigma }_{z|x,y}[j,j] \right) . \end{aligned}$$
(8)

We also estimate uncertainty in the deformation field \(\varvec{\phi }_z\) empirically. We sample several representations \(\varvec{z}_{k'} \sim q_{\psi }(\varvec{z}| \varvec{x}; \varvec{y})\), propagate them through the diffeomorphic layers to compute \(\varvec{\phi }_{z_{k'}}\), and compute the empirical diagonal covariance \(\hat{\varvec{\varSigma }}_{\phi _z}[j,j]\) across samples. The uncertainty is then \(H(\varvec{\phi }[j]) \approx \frac{1}{2} \log \left( 2 \pi e \, \hat{\varvec{\varSigma }}_{\phi _z}[j,j] \right) \).
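Both uncertainty estimates can be sketched in a few lines of NumPy. The sketch uses the standard closed form for the differential entropy of a univariate Gaussian, \(\frac{1}{2}\log (2\pi e\,\sigma ^2)\). The argument `integrate` is a hypothetical callable that maps a sampled velocity field to a deformation field (for instance, a scaling and squaring routine); it and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_entropy(var):
    """Per-voxel differential entropy 0.5 * log(2 * pi * e * var) of a Gaussian."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def empirical_deformation_entropy(mu, var, integrate, num_samples, rng):
    """Entropy estimate for the deformation field from sampled velocity fields."""
    samples = []
    for _ in range(num_samples):
        # Sample z ~ N(mu, diag(var)) and push it through the integration.
        z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
        samples.append(integrate(z))
    # Diagonal empirical covariance: per-voxel variance across samples.
    emp_var = np.var(np.stack(samples), axis=0, ddof=1)
    return gaussian_entropy(emp_var)
```

When `integrate` is the identity map, the empirical estimate should approach `gaussian_entropy(var)` as the number of samples grows, which is a convenient consistency check.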

3 Experiments

We focus on 3D atlas-based registration, a common task in population analysis. Specifically, we register each scan to an atlas computed using external data [11].

Fig. 2. Example MR slices of input moving image, atlas, and resulting warped image for our method and ANTs, with overlaid boundaries of ventricles, thalami and hippocampi. Our resulting registration field is shown as a warped grid and RGB image, with each channel representing one dimension. Due to space constraints, we omit VoxelMorph examples, which are visually similar to our results and ANTs.

Data and Preprocessing. We use a large-scale, multi-site dataset of 7829 T1-weighted brain MRI scans from eight publicly available datasets: ADNI [22], OASIS [19], ABIDE [10], ADHD200 [21], MCIC [13], PPMI [20], HABS [7], and Harvard GSP [14]. Acquisition details, subject age ranges and health conditions are different for each dataset. We perform standard pre-processing steps on all scans, including resampling to 1 mm isotropic voxels, affine spatial normalization and brain extraction for each scan using FreeSurfer [11]. We crop the final images to \(160\times 192 \times 224\). Segmentation maps including 29 anatomical structures, obtained using FreeSurfer for each scan, are used in evaluating registration results. We split the dataset into 7329, 250, and 250 volumes for train, validation, and test sets respectively, although we underscore that the training is unsupervised.

Evaluation Metric. To evaluate a registration algorithm, we register each subject to an atlas, propagate the segmentation map using the resulting warp, and measure volume overlap using the Dice metric. We also evaluate the diffeomorphic property, a focus of our model. Specifically, the Jacobian matrix \(J_{\phi }(p) = \nabla \varvec{\phi }(p) \in \mathcal {R}^{3\times 3}\) captures the local properties of \(\varvec{\phi }\) around voxel p. The local deformation is diffeomorphic, both invertible and orientation-preserving, only at locations for which \(|J_{\phi }(p)| > 0\) [2]. We count all other voxels, where \(|J_{\phi }(p)| \le 0\).
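The two metrics can be sketched as follows. This is an illustrative NumPy version under assumptions: segmentations are integer label maps, the deformation `phi` is stored as an array of shape (X, Y, Z, 3) holding the mapped coordinate of every voxel, the Jacobian is approximated with finite differences via `np.gradient`, and the function names are hypothetical.

```python
import numpy as np

def dice(seg_a, seg_b, label):
    """Dice overlap of one anatomical label between two segmentation maps."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # label absent from both maps
    return 2.0 * np.logical_and(a, b).sum() / denom

def count_nonpositive_jacobian(phi):
    """Count voxels where the Jacobian determinant of phi is <= 0."""
    # grads[i][..., j] approximates d phi_j / d p_i by finite differences.
    grads = np.gradient(phi, axis=(0, 1, 2))
    # jac[..., j, i] = d phi_j / d p_i, one 3x3 matrix per voxel.
    jac = np.stack(grads, axis=-1)
    det = np.linalg.det(jac)
    return int(np.sum(det <= 0))
```

The identity deformation has a Jacobian determinant of one everywhere, so its count is zero, whereas a field that mirrors one axis has a determinant of minus one at every voxel.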

Table 1. Summary of results: mean Dice scores over all anatomical structures and subjects (higher is better), mean runtime, and mean number of locations with a non-positive Jacobian of each registration field (lower is better). All methods have comparable Dice scores, while our method and the original VoxelMorph are orders of magnitude faster than ANTs. Only our method achieves both high accuracy and fast runtime while also having nearly zero non-positive Jacobian locations and providing uncertainty prediction.

Fig. 3. Boxplots indicating Dice scores for anatomical structures for ANTs, VoxelMorph, and our algorithm. Left and right hemisphere structures are merged for visualization, and ordered by average ANTs Dice score. We include the brain stem (BS), thalamus (Th), cerebellum cortex (CblmC), lateral ventricle (LV), cerebellum white matter (CblmWM), putamen (Pu), cerebral white matter (CeblWM), ventral DC (VDC), caudate (Ca), pallidum (Pa), hippocampus (Hi), 3rd ventricle (3V), 4th ventricle (4V), amygdala (Am), CSF (CSF), cerebral cortex (CeblC), and choroid plexus (CP).

Baseline Methods. We compare our approach with the popular ANTs software package using Symmetric Normalization (SyN) [3], a top-performing algorithm [17]. We found that the default ANTs settings were sub-optimal for our task, so we performed a wide parameter and similarity metric search across a multitude of datasets. We identified top performing parameter values on the Dice metric and used cross-correlation as the ANTs similarity measure. We also test our recent CNN-based method, VoxelMorph, which aims to produce fast registration but does not yield diffeomorphic results or uncertainty estimates [5]. We sweep the regularization parameter using our validation set, and use the optimal parameters in our results.

Results on Test Set. Figure 2 shows representative results. Figure 3 illustrates Dice results on several anatomical structures, and Table 1 gives a summary of the results. Not only does our algorithm achieve state-of-the-art Dice results and the fastest runtimes, but it also produces diffeomorphic registration fields (having nearly no non-positive Jacobian voxels per scan) and uncertainty estimates.

Specifically, all methods achieve comparable Dice results on each structure and overall. Our method and VoxelMorph require a fraction of the ANTs runtime to register two images: less than a second on a GPU, and less than a minute on a CPU (for our method). To the best of our knowledge, ANTs does not have a GPU implementation. Algorithm runtimes were computed for an NVIDIA TitanX GPU and an Intel Xeon (E5-2680) CPU, and exclude preprocessing common to all methods. Importantly, while our method achieves positive Jacobians at nearly all voxels, the flow fields resulting from the baseline methods contain a few thousand locations of non-positive Jacobians. This can be alleviated with increased spatial regularization, but this in turn leads to a drop in performance on the Dice metric.

Uncertainty. Figure 4 shows representative uncertainty maps, unique to our model. The velocity field is more certain near anatomical structure edges, and less certain in homogeneous regions, such as the white matter or ventricle interior.

Parameter Analysis. We perform a grid search for the two fixed hyper-parameters \(\lambda \) and \(\sigma ^2\). We train a model for each parameter pair and evaluate Dice on the validation set. We search 30 values within two orders of magnitude around meaningful initial values for both parameters: \(\sigma ^2 \sim (0.07)^2\), the variance of the intensity difference between an affinely aligned image and the atlas, and \(\lambda = 10000\), equivalent to a diagonal standard deviation of 1 voxel for \(\varvec{\phi }_z\). We found \(\sigma ^2 \sim (0.035)^2\) and \(\lambda \in (20000, 100000)\) to perform well, and set \(\lambda = 70000\).

Fig. 4. Example velocity field uncertainty \(H(\varvec{z})\) (left) indicates low uncertainty near structure boundaries, as seen in the line graph (middle). This correlation is less obvious in the final registration field uncertainty \(H(\varvec{\phi }_z)\) (right).

4 Conclusion

We propose a probabilistic model for diffeomorphic image registration and derive a learning algorithm that makes use of a convolutional neural network and an intuitive resulting loss function. To achieve unsupervised, end-to-end learning for diffeomorphic registrations, we introduce novel differentiable scaling and squaring layers. Our derivation is generalizable. For example, \(\varvec{z}\) can be a low-dimensional embedding representation of a deformation field, or the displacement field itself. Our algorithm can infer the registration of new image pairs in under a second. Compared to traditional methods, our method is significantly faster, and compared to recent learning-based methods, our method offers diffeomorphic guarantees and provides natural uncertainty estimates for resulting registrations.