1 Introduction

Registration, the process of aligning images, is an important technique that allows visual inspection and computational analysis of images in a common coordinate system. For fetal abnormality screening, registered Magnetic Resonance (MR)/Ultrasound (US) images may assist diagnosis, as the two modalities capture complementary anatomical information. For example, in the fetal brain, MR images have better contrast between important structures such as cortical Grey Matter (GM) and White Matter (WM), whereas the higher spatial resolution of US gives better discrimination between fine structures such as the septum pellucidum and the choroid plexus [7].

A voxel-wise image similarity measure, or cost function, is commonly used in medical imaging to register images. This function quantifies the alignment of the images, with an extremum corresponding to the optimal alignment. Unfortunately, image similarity-based methods are ill-suited to the challenging task of US/MR image registration, as there is no global intensity relationship between the two modalities. Primarily this is due to the imaging artefacts present in US images, such as view-dependent shadows, speckle noise, anisotropy, attenuation, reverberation and refraction. Popular similarity measures developed for other multi-modal registration problems, such as Normalised Mutual Information (NMI), often fail, even with a good initialisation [12].

Consequently, an alternative approach has arisen for registration of images with non-global intensity relationships, whereby image intensities are first transformed to a modality-independent representation. These representations are typically derived from hand-crafted descriptors that capture structural information from images, such as edges and corners. Representations used by previous authors include local gradient orientation [4], local phase [9] and local entropy [16]. Notably, [5] use the concept of self-similarity, computing the similarity of small image patches within a local neighbourhood of an image, and achieved state-of-the-art performance on a challenging US/MRI registration dataset. Another approach to this problem is modality synthesis, which aims to transform image intensities from one modality to another, allowing the registration task to be treated as a mono-modal problem. [7] used this approach to register the fetal brain imaged by US and MR for the first time.

More recently, deep neural networks have been applied to the problem of registration. Two common strategies for registration with deep learning are estimating a similarity measure [2, 15] and predicting transformations directly [1, 13]. An advantage of the first approach is that it allows established transformation models and optimisers to be used; however, this can be a hindrance if the learnt similarity function is not smooth or convex. The second approach, predicting the parameters of a transformation model directly, has recently received more research attention as it allows more robust transformation estimates.

1.1 Proposed Method

In this work, we adopt a deep learning approach to tackle the challenging task of paired 3D MR/US fetal brain registration. Our Long Short-Term Memory (LSTM) network simultaneously predicts a joint isotropic rescaling plus independent rigid transformations for both MR/US images, aligning them to a dual-modality spatio-temporal atlas (Sect. 2.6). Transformation estimates are refined iteratively over time, allowing for higher accuracy. For this, we extend the iterative spatial transformer [8] for co-transformation of multiple images (see Fig. 1). The main contributions of this work are as follows:

  • A network architecture inspired by spatial transformer networks [6] for group-wise registration of images to a common pose.

  • A loss function which encourages convergence and fine alignment of images.

Fig. 1.

Proposed LSTM spatial co-transformer for coalignment of 3D MR/US images. Flow of image intensities is shown in blue while flow of transformation parameters is shown in red. An LSTM network predicts residual transformations \(\mathbf {M}_{mr}^{\delta },\,\mathbf {M}_{us}^{\delta }\) conditioned on the current warped images \(O^{us},\, O^{mr}\), iteratively refining their alignment. (Color figure online)

2 Methods

2.1 Overview

The spatial transformer module [6] allows geometric transformation of network inputs or feature maps within a network, conditioned on the input or feature map itself. Importantly, the spatial transformer module is differentiable, allowing end-to-end training of any network it is inserted into. This allows reorientation of an image into a canonical pose, simplifying the task of subsequent layers. [8] proposed an elegant iterative version of the spatial transformer that passes composed transformation parameters through the network instead of warped images, preserving image intensities until the final transformation. This allows the same geometric predictor, with a much simpler network architecture, to be applied in a recurrent manner for more accurate alignment.

In this work, we propose a novel extension, the recurrent/LSTM “spatial co-transformer”, which allows simultaneous transformation of multiple images to a common pose. Commonly, registration algorithms estimate a warp from one image (the source) towards another (the target). However, we found that fine alignment is more easily learnt between images in a common pose. Thus, we simultaneously co-align pairs of MR/US images to a common atlas space (Sect. 2.6), which will also facilitate future computational image analysis.

Additionally, we propose an LSTM-based parameter prediction network (Fig. 2) and a temporally varying loss function (Sect. 2.5) for more accurate alignments.

2.2 Recurrent Spatial Co-transformer

The recurrent spatial co-transformer consists of three main components: (1) the warper, (2) the residual parameter prediction network and (3) the composer. The first component, the warper, is the computational machinery needed to transform an image and does not contain any learnable parameters. For simplicity of discourse, we treat this as a single function \(f_{warp}\) and refer the reader to [6] for a detailed description of grid transformation and differentiable interpolation. The second component, the parameter prediction network, \(f_{predict}\), predicts residual transformations conditioned on the current warped output images. Finally, the third component, the composer, updates the transformation estimates. The recurrent spatial co-transformer cycles through three steps, which are now described in more detail.

Step 1 - Image Warping. For iteration t, let \(\mathcal {I}=(I^{1},\, I^{2},\,\dots \,,\, I^{N})\) denote an N-tuple of input images, \(\varTheta _{t}=(\theta _{t}^{1},\,\theta _{t}^{2},\,\dots \,,\,\theta _{t}^{N})\) denote an N-tuple of corresponding transformation estimates and \(\mathcal {O}_{t}=(O_{t}^{1},\, O_{t}^{2},\,\dots \,,\, O_{t}^{N})\) denote an N-tuple of corresponding warped output images. Each input image \(I^{i}\) is first warped independently, given its last transformation estimate \(\theta _{t-1}^{i}\)

$$\begin{aligned} O_{t-1}^{i}=f_{warp}(I^{i},\,\mathbf {G},\,\theta _{t-1}^{i})\quad \forall i\in [1,\,\dots \,,\, N]. \end{aligned}$$
(1)

Here, \(\mathbf {G}=[\mathbf {g}_{1},\,\dots \,,\mathbf {g}_{g}]\in \mathbb {R}^{4\times g}\) is a matrix of homogeneous grid coordinates.
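
As a concrete illustration of Step 1, the sketch below builds the homogeneous grid \(\mathbf {G}\) and implements \(f_{warp}\) using PyTorch's grid_sample for the trilinear interpolation. It assumes 4 x 4 homogeneous transformation matrices and grid coordinates normalised to [-1, 1]; the function names are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def make_homogeneous_grid(depth, height, width):
    """Build G in R^{4 x g}: homogeneous coordinates of a regular sampling
    grid, with each axis normalised to [-1, 1] (grid_sample convention)."""
    zs = torch.linspace(-1, 1, depth)
    ys = torch.linspace(-1, 1, height)
    xs = torch.linspace(-1, 1, width)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    ones = torch.ones_like(x)
    return torch.stack([x.flatten(), y.flatten(), z.flatten(), ones.flatten()])  # (4, g)

def f_warp(image, grid, theta):
    """Warp a (1, C, D, H, W) volume: apply the 4 x 4 matrix `theta` to the
    target grid, then resample the image by trilinear interpolation."""
    d, h, w = image.shape[2:]
    coords = (theta @ grid)[:3]                       # transformed (x, y, z) coordinates
    sample_grid = coords.t().reshape(1, d, h, w, 3)   # grid_sample expects (N, D, H, W, 3)
    return F.grid_sample(image, sample_grid, mode="bilinear", align_corners=True)
```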

Step 2 - Residual Parameter Prediction. The warped images \(\mathcal {O}_{t-1}\) are concatenated along the channel axis and passed as a single tensor to \(f_{predict}\), which simultaneously predicts an N-tuple of corresponding residual transformations \(\varDelta _{t}=(\delta _{t}^{1},\,\delta _{t}^{2},\,\dots \,,\,\delta _{t}^{N})\)

$$\begin{aligned} \varDelta _{t}=f_{predict}(\mathcal {O}_{t-1}). \end{aligned}$$
(2)

\(f_{predict}\) can take any form, but typically consists of a feed-forward network with several interleaved convolutional and max-pooling layers, followed by a fully connected layer and a final fully connected regression layer whose number of units equals the number of model parameters. An illustrative sketch is given below.
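
As a hedged illustration of such a predictor (layer counts and sizes are our own choices, not taken from the paper), a feed-forward \(f_{predict}\) for a pair of concatenated volumes might look as follows:

```python
import torch.nn as nn

class FeedForwardPredictor(nn.Module):
    """Illustrative f_predict: interleaved 3D convolutions and max pooling,
    a fully connected layer, then a regression layer with one unit per
    transformation parameter (n_images x params_per_image outputs)."""
    def __init__(self, in_channels=2, n_images=2, params_per_image=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.regress = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, n_images * params_per_image),   # final regression layer
        )

    def forward(self, warped_concat):                      # (B, in_channels, D, H, W)
        return self.regress(self.features(warped_concat))
```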

Step 3 - Parameter Composition. Finally, each transformation estimate \(\theta _{t-1}^{i}\) is composed with its residual transformation estimate \(\delta _{t}^{i}\), yielding a new transformation estimate \(\theta _{t}^{i}\)

$$\begin{aligned} \theta _{t}^{i}=f_{update}(\theta _{t-1}^{i},\,\delta _{t}^{i})\quad \forall i\in [1,\,\dots \,,\, N]. \end{aligned}$$
(3)

The composition function \(f_{update}\) will vary depending on the transformation model. For example, if \(\theta \) parametrises a homogeneous transformation matrix, \(f_{update}\) would be matrix multiplication.
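
Putting the three steps together, one iteration of the recurrent spatial co-transformer can be sketched as follows. This is a schematic only: it assumes each \(\theta \) is a 4 x 4 homogeneous matrix (so \(f_{update}\) is matrix multiplication, with the residual post-multiplied as in Sect. 2.4), reuses the f_warp sketch above and leaves the prediction network abstract.

```python
import torch

def co_transform_step(images, thetas, grid, predict_fn):
    """One iteration of Eqs. (1)-(3): warp each image with its current
    estimate, predict one residual 4 x 4 matrix per image from the
    channel-wise concatenation, then compose estimates with residuals."""
    warped = [f_warp(I_i, grid, theta_i) for I_i, theta_i in zip(images, thetas)]
    residuals = predict_fn(torch.cat(warped, dim=1))            # list of 4 x 4 matrices
    thetas = [theta_i @ delta_i for theta_i, delta_i in zip(thetas, residuals)]
    return warped, thetas
```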

2.3 LSTM Spatial Co-transformer

For more accurate parameter prediction, we propose an LSTM network architecture for \(f_{predict}\). LSTMs store information in a cell state, which allows them to learn long-term dependencies in sequential data far more effectively than vanilla recurrent neural networks. For this, we modify the prediction function \(f_{predict}\) (Eq. 2) so that it now takes a feature vector \(\mathbf {x}_{t}\) and a cell state vector \(\mathbf {c}_{t}\)

$$\begin{aligned} \varDelta _{t}=f_{predict}(\mathbf {x}_{t},\,\mathbf {c}_{t}),\quad \text {where}\quad \mathbf {x}_{t}=f_{extract}(\mathcal {O}_{t-1}). \end{aligned}$$
(4)

Here, \(f_{extract}\) is a function that extracts the feature vector \(\mathbf {x}_{t}\) from the concatenation of the output images, \(\mathcal {O}_{t-1}\). For this, we chose a neural network with a series of convolutions and max pooling operations followed by a flattening procedure (see Fig. 2 for a schematic; however, any network architecture that produces a vector may be used). At each iteration t, the cell state \(\mathbf {c}_{t}\) is updated by a linear blend of the previous cell state \(\mathbf {c}_{t-1}\) and a vector of candidate values \(\tilde{\mathbf {c}}_{t}\) [3]

$$\begin{aligned} \mathbf {c}_{t}=\mathbf {f}_{t}\odot \mathbf {c}_{t-1}+(\mathbf {1}-\mathbf {f}_{t})\odot \tilde{\mathbf {c}}_{t}. \end{aligned}$$
(5)

Here, \(\odot \) denotes the Hadamard or element-wise product and \(\mathbf {f}_{t}\) is the forget mask, a real-valued vector that determines which information is forgotten from the cell state and which candidate values are added. We define \(\mathbf {f}_{t}\) as the result of a single function \(f_{forget}\) that takes the extracted feature vector \(\mathbf {x}_{t}\) and also the previous cell state \(\mathbf {c}_{t-1}\). We implement both the forget and candidate functions as a sequence of two dense layers with weight matrices \(\mathbf {W}_{f1}\), \(\mathbf {W}_{f2}\) and \(\mathbf {W}_{c1}\), \(\mathbf {W}_{c2}\), respectively

$$\begin{aligned} \mathbf {f}_{t}=f_{forget}(\mathbf {c}_{t-1},\,\mathbf {x}_{t})=\sigma (\mathbf {W}_{f2}\,.\text {max}(\mathbf {W}_{f1}\,.\,\left[ \mathbf {c}_{t-1},\,\mathbf {x}_{t}\right] ,\,0)), \end{aligned}$$
(6)
$$\begin{aligned} \tilde{\mathbf {c}}_{t}=f_{candidate}(\mathbf {c}_{t-1},\,\mathbf {x}_{t})=\text {tanh}(\mathbf {W}_{c2}\,.\text {max}(\mathbf {W}_{c1}\,.\,\left[ \mathbf {c}_{t-1},\,\mathbf {x}_{t}\right] ,\,0)). \end{aligned}$$
(7)
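
The cell update of Eqs. (4)-(7) can be written compactly as below. This sketch assumes the coupled-gate blend reconstructed in Eq. (5) and adds a final regression matrix (W_out) mapping the cell state to the residual parameters, which is our own assumption rather than a detail given in the paper.

```python
import torch

def lstm_predict(x_t, c_prev, W_f1, W_f2, W_c1, W_c2, W_out):
    """One LSTM-style update: forget mask (Eq. 6), candidate values (Eq. 7),
    cell-state blend (Eq. 5), then regression of the residual parameters."""
    h = torch.cat([c_prev, x_t], dim=-1)                   # [c_{t-1}, x_t]
    f_t = torch.sigmoid(W_f2 @ torch.relu(W_f1 @ h))       # forget mask, Eq. (6)
    c_tilde = torch.tanh(W_c2 @ torch.relu(W_c1 @ h))      # candidate values, Eq. (7)
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde             # linear blend, Eq. (5)
    delta_params = W_out @ c_t                             # residual transformation parameters
    return delta_params, c_t
```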
Fig. 2.

LSTM parameter prediction architecture for rigid alignment of MR/US images. The image feature extractor encodes a dual-channel image as a vector that is passed into an LSTM network which predicts a residual transformation. Fourteen parameters are predicted: three for rotation, three for translation and one for isotropic scale, per modality (note, weights for scaling are shared between modalities).

2.4 Rigid Parameter Prediction

For rigid coalignment, our network predicts seven residual update parameters per image: an isotropic log scaling s, three rotation parameters \(r_{x}\), \(r_{y}\), \(r_{z}\) and three translation parameters \(t_{x}\), \(t_{y}\), \(t_{z}\). Here, \([r_{x},\, r_{y},\, r_{z}]\) gives the axis of rotation, while \(\phi =\left\| [r_{x},\, r_{y},\, r_{z}]\right\| _{2}\) gives the angle of rotation. Note, weights are shared between images for the scaling parameters. Our transformation parameters now become rigid transformation matrices \(\delta _{t}=\mathbf {M}_{t}^{\delta }\), \(\theta _{t}=\mathbf {M}_{t}\). Note, for simplicity, transformations \(\mathbf {M}\) are applied to the target grid \(\mathbf {G}\) before resampling, i.e. the inverse transformation. For consistency, we define \(\mathbf {M}^{\delta }\) as the inverse update and \((\mathbf {M}^{\delta })^{-1}\) as the forward update. Learning a series of forward update transformations is inherently easier for the network; thus, we post-multiply the current transformation matrix by the residual matrix, \(\mathbf {M}\leftarrow \mathbf {M}\mathbf {M}^{\delta }\). This is equivalent to updating the forward transformation as \(\mathbf {M}^{-1}\leftarrow (\mathbf {M}^{\delta })^{-1}\mathbf {M}^{-1}\). The forward update transformation is composed as a translation, followed by a rotation, followed by an isotropic rescaling, \((\mathbf {M}^{\delta })^{-1}=\mathbf {S}\mathbf {R}\mathbf {T}\). In practice, we predict the inverse of the update directly by reversing the composition and inverting the operations, \(\mathbf {M}^{\delta }=\mathbf {T}^{-1}\mathbf {R}^{-1}\mathbf {S}^{-1}\).
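
The following numpy sketch shows how the seven predicted parameters could be assembled into the inverse update \(\mathbf {M}^{\delta }=\mathbf {T}^{-1}\mathbf {R}^{-1}\mathbf {S}^{-1}\), using the Rodrigues formula for the axis-angle rotation. It illustrates the composition described above and is not the authors' implementation.

```python
import numpy as np

def residual_matrix(s, r, t):
    """Build M_delta = T^{-1} R^{-1} S^{-1} from an isotropic log scale s,
    an axis-angle rotation r = [rx, ry, rz] and a translation t = [tx, ty, tz]."""
    r, t = np.asarray(r, float), np.asarray(t, float)
    phi = np.linalg.norm(r)                                # rotation angle
    if phi > 1e-12:
        k = r / phi                                        # rotation axis
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R3 = np.eye(3) + np.sin(phi) * K + (1 - np.cos(phi)) * (K @ K)  # Rodrigues formula
    else:
        R3 = np.eye(3)
    S_inv = np.diag([np.exp(-s)] * 3 + [1.0])              # inverse isotropic rescaling
    R_inv = np.eye(4); R_inv[:3, :3] = R3.T                # inverse rotation
    T_inv = np.eye(4); T_inv[:3, 3] = -t                   # inverse translation
    return T_inv @ R_inv @ S_inv
```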

2.5 Training and Loss Function

Let \(\mathbf {X}=\{\mathcal {I}_{1},\,\mathcal {I}_{2},\,\dots \,,\,\mathcal {I}_{n}\}\) denote a training set of n aligned image tuples. Images in the training set are initially aligned to a common pose (in our case, we affinely align our MR and US images to a dual-modality atlas, see Sect. 2.6). For each training iteration, an image tuple \(\mathcal {I}=(I^{1},\, I^{2},\,\dots \,,\, I^{N})\) is selected and each image \(I^{i}\) is transformed by a randomly generated matrix \(\mathbf {D}^{i}\) before being fed into the network. \(\mathbf {D}^{i}\) incorporates an affine augmentation (shared across the input tuple) and an initial rigid disorientation. For augmentation, we randomly sample and compose a shearing, an anisotropic scaling and an isotropic scaling. For disorientation, we compose a random rotation and translation. Crucially, the use of a recurrent network allows us to back-propagate errors through time. We take advantage of this by designing a temporally varying loss function comprising a relative and an absolute term, which allows our network to learn a long-term strategy for alignment. For k alignment iterations of N images, we define our loss

$$\begin{aligned} \mathcal {L}=\sum _{i=1}^{N}\sum _{t=1}^{k}\dfrac{d(\mathbf {M}_{t}^{i}\,\mathbf {D}^{i})}{d(\mathbf {M}_{t-1}^{i}\,\mathbf {D}^{i})}+\lambda \dfrac{t}{k}\, d(\mathbf {M}_{t}^{i}\,\mathbf {D}^{i}). \end{aligned}$$
(8)

Here, d measures the distance of a transformation matrix from the identity and \(\lambda \) weights the two loss terms. The first term rescales the distance error \(d(\mathbf {M}_{t}^{i}\,\mathbf {D}^{i})\) relative to the previous distance error \(d(\mathbf {M}_{t-1}^{i}\,\mathbf {D}^{i})\). This encourages the network to learn fine alignments and to converge. Note, \(d(\mathbf {M}_{t-1}^{i}\,\mathbf {D}^{i})\) is treated as a constant here. The second term penalises the absolute error with increasing weight, encouraging initial exploration while still penalising poor final alignments. The distance function \(d(\mathbf {M})\) is computed by first decomposing the matrix \(\mathbf {M}\) into an isotropic scale s, a translation vector \(\mathbf {t}\) and a rotation matrix \(\mathbf {R}\). We then compute \(d(\mathbf {M})\) as the sum of separate distance measures for each of these components

$$\begin{aligned} d(\mathbf {M})&=d_{scale}(s)+d_{rotate}(\mathbf {R})+d_{translate}(\mathbf {t}),\quad \text {where}\quad d_{translate}(\mathbf {t})=\left\| \mathbf {t}\right\| _{2},\\ d_{scale}(s)&=\mu \left| \log (s)\right| \quad \text {and}\quad d_{rotate}(\mathbf {R})=\dfrac{1}{g}\sum _{i=1}^{g}\left\| \mathbf {g}_{i}-\mathbf {R}\mathbf {g}_{i}\right\| _{2}. \end{aligned}$$
(9)

Here, \(\mu \) weights \(d_{scale}\) relative to the other two distance measures. Rotation distance, \(d_{rotate}\), is given by the mean distance between transformed grid points \(\mathbf {R}\mathbf {g}_{i}\) and their initial locations \(\mathbf {g}_{i}\). This gives a natural weighting between translation and rotation components.
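
The distance of Eq. (9) and one summand of Eq. (8) could be computed as in the sketch below. The decomposition assumes \(\mathbf {M}\) is a similarity transform built as in Sect. 2.4; the helper names and the choice of grid are illustrative.

```python
import numpy as np

def transform_distance(M, grid, mu=1.0):
    """d(M) = d_scale + d_rotate + d_translate for a 4 x 4 similarity matrix M,
    with homogeneous grid points stored as the columns of `grid` (4 x g)."""
    A = M[:3, :3]
    s = np.cbrt(np.linalg.det(A))                  # isotropic scale factor
    R = A / s                                      # rotation once the scale is removed
    t = M[:3, 3]
    d_scale = mu * abs(np.log(s))
    d_rotate = np.mean(np.linalg.norm(grid[:3] - R @ grid[:3], axis=0))
    d_translate = np.linalg.norm(t)
    return d_scale + d_rotate + d_translate

def loss_term(M_t, M_prev, D, grid, t, k, lam=1.0):
    """One (i, t) summand of Eq. (8); the previous distance acts as a constant."""
    d_now = transform_distance(M_t @ D, grid)
    d_prev = transform_distance(M_prev @ D, grid)  # no gradient flows through this
    return d_now / d_prev + lam * (t / k) * d_now
```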

2.6 Joint Affine MR/US Spatio-Temporal Atlas (Ground Truth)

We followed the approach of [14], constructing average image-intensity templates for each week of gestation (20–31 weeks) from 166 3D reconstructed MR/3D US image pairs. A set of templates was constructed for each modality separately, with a final registration step between templates to establish correspondences across modalities. This process comprised three parts: (1) manual reorientation, (2) age-dependent template bootstrapping and (3) unified template bootstrapping. All images were carefully manually reoriented to a standard pose, with the yz plane aligned with the brain midline and the top of the brain stem centred at the origin. Averaging the reoriented image intensities yielded an initial template estimate, which was refined using a bootstrapping procedure. This involved alternating between two steps: (1) affinely registering images to the current template and (2) averaging the registered image intensities. The bootstrapping procedure was then repeated between templates to establish correspondences across time. MR templates were constructed first, allowing us to fix the shearing and scaling parameters for US template construction. For US registration, we restricted the optimisation to three degrees of freedom (rotation around x, and translation along y and z), thus respecting the manual definition of the midline. With additional masking, this allowed robust registration of US images for template construction using [10].
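
The template bootstrapping alternates registration and intensity averaging; a minimal sketch of that loop is given below, assuming a hypothetical affine_register helper (the paper performs these registrations with [10]).

```python
import numpy as np

def bootstrap_template(images, initial_template, affine_register, n_rounds=5):
    """Alternate (1) affinely registering every image to the current template
    and (2) averaging the registered image intensities."""
    template = initial_template
    for _ in range(n_rounds):
        registered = [affine_register(moving, template) for moving in images]
        template = np.mean(registered, axis=0)
    return template
```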

3 Results and Discussion

3.1 Alignment Error

To demonstrate the accuracy of our method (LSTM ST), we compute registration errors with respect to two ground truth alignments: the first derived from our spatio-temporal atlas and the second derived from anatomical landmarks picked by clinical experts (fourteen per image), which offers an unbiased alternative. For comparison, two image similarity-based registration methods were chosen: NMI with block-matching (NMI+BM) [10] and self-similarity context descriptors with discrete optimisation (SSC+DO) [5]. Both of these methods were developed for robust registration and have previously been used for multi-modality registration tasks. To compare the accuracy of the methods and also their ability to register highly misaligned images, we created three test sets with different ranges of disorientation: [, ], [, ] and [, ].

Table 1. Mean alignment error. Mean rotation and translation errors over our test set are shown for three automated registration methods, relative to two ground truth alignments.
Fig. 3.

Median (blue) and 95th percentile (red) alignments by rotation error for SSC+DO and LSTM ST. Alignments for other methods are shown for comparison. Each column shows the same MR image for a subject from our test set with its corresponding US image thresholded, colour-mapped, overlayed and aligned, by each of the automated methods. (Color figure online)

Fig. 4.

Template sharpness. Templates are constructed by averaging image intensities for US images registered to an MR template via their corresponding MR images (see Sect. 3.2). Higher Variance of the Laplacian (VAR) indicates sharper templates and better registration accuracy, while higher Peak Signal-to-Noise Ratio (PSNR) indicates greater similarity with the atlas ground truth template.

As we can see from Table 1 our method outperforms both similarity-based methods for all disorientation levels and both ground truth datasets. Furthermore, our method converges to the same alignment for each image pair, irrespective of initial orientation and positioning, which explains the very similar mean errors seen for the three disorientation levels. Conversely, similarity-based methods failed to register images for higher levels of disorientation. All pairs of images registered by our method were visually inspected and a reasonable alignment was found in all cases (see Fig. 3 for example alignments). The worst rotation and translation errors seen were \(7.9^{\circ }\) and \(1.8\,\text {mm}\) respectively, showing our method is relatively robust.

3.2 Mean Templates

We construct US mean templates by first registering each US image to its corresponding MR image rigidly, then affinely transforming the image pair to the MR atlas space and finally averaging the intensities of all transformed US images. If registration between modalities is accurate, the constructed US template should be crisp. To evaluate the constructed templates, we compute two measures: Peak Signal-to-Noise Ratio (PSNR) with respect to our ground truth US template (Sect. 2.6), and the Variance of the image Laplacian (VAR), which provides an unbiased measure of sharpness [11]. Figure 4 shows that our method produces the sharpest template as measured by VAR and also has the highest PSNR. Furthermore, templates for our method have the same sharpness at any level of initial disorientation.
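
The two template measures could be computed as in the sketch below, using scipy's Laplacian filter; the intensity maximum used for PSNR is an assumption and would need to match the templates' normalisation.

```python
import numpy as np
from scipy.ndimage import laplace

def sharpness_var(template):
    """Variance of the Laplacian (VAR): higher values indicate a sharper template."""
    return float(np.var(laplace(template.astype(np.float64))))

def psnr(template, reference, max_value=1.0):
    """Peak Signal-to-Noise Ratio of the constructed template with respect to
    the ground-truth atlas template."""
    mse = np.mean((template.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_value ** 2 / mse))
```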

4 Conclusion

In this work, we proposed the LSTM spatial co-transformer, a deep learning-based method for group-wise registration of images to a standard pose. We applied it to the challenging task of fetal MR/US brain image registration. Our method automatically coaligns brain images with a dual-modality spatio-temporal atlas, in which future computational image analysis may be performed. Our results show that our method registers images more accurately than the state-of-the-art similarity-based registration method, self-similarity context descriptors [5]. Furthermore, it is able to robustly register highly misaligned images, where similarity-based methods fail.