1 Introduction

Registration of images with a focus on a region of interest (ROI) is essential in image fusion and atlas-based segmentation (e.g. [9]). Traditional algorithms compute a dense mapping between two images by minimizing an objective function based on some similarity criterion. However, besides the challenges of solving this ill-posed and non-convex problem, many approaches have difficulties in handling large deformations or large variability in appearance. Recently, promising results using deep representation learning have been presented for learning similarity metrics [8], predicting optical flow [1] or predicting the large deformation diffeomorphic metric mapping momentum [10]. These approaches either remove the above-mentioned limitations only partially, as they stick to an energy minimization framework (cf. [8]), or rely on a large number of training samples derived from existing registration results (cf. [1, 10]).

Inspired by recent works in reinforcement learning [2, 6], we propose a reformulation of the non-rigid registration problem following a methodology similar to the 3-D rigid registration of [4]: to optimize the parameters of a deformation model, we apply an artificial agent – learned solely from experience – that does not require explicitly designed similarity measures, regularization or optimization strategy. Trained in a supervised way, the agent explores the space of deformations by choosing from a set of actions that update the parameters. By iteratively selecting actions, the agent moves on a trajectory towards the final deformation parameters. To decide which action to take, we present a deep dual-stream neural network for implicit image correspondence learning. This work generalizes [4] to non-rigid registration problems by using a larger number of actions with a low-dimensional parametric deformation model. Since ground-truth (GT) deformation fields are typically not available for deformable registration, and training based on landmark-aligned images as in rigid registration (cf. [4]) is not applicable, we propose a novel GT generator combining synthetically deformed and real image pairs. The GT deformation parameters of the real training pairs were extracted by constraining existing registration algorithms with known correspondences in the ROI in order to obtain the best possible organ-focused results. The main contributions of this work are: (1) the creation and use of a low-dimensional parametric statistical deformation model for organ-focused, deep learning-based non-rigid registration; (2) a ground-truth generator which allows generating millions of synthetically deformed training samples while requiring only a few (<1000) real deformation estimations; (3) a novel way of fuzzy action control.

2 Method

2.1 Training Artificial Agents

Image registration consists in finding a spatial transformation \(\mathcal {T}_\theta \), parameterized by \(\theta \in \mathbb {R}^d\), which best warps the moving image \(\mathbf {M}\) so as to match the fixed image \(\mathbf {F}\). Traditionally, this is done by minimizing an objective function of the form \({{\mathrm{arg\,min}}}_\theta \mathcal {F}(\theta ,\mathbf {M},\mathbf {F})= \mathcal {D}\left( \mathbf {F},\mathbf {M} \,{\circ }\, \mathcal {T}_\theta \right) + \mathcal {R}\left( \mathcal {T}_\theta \right) \) with the image similarity metric \(\mathcal {D}\) and a regularizer \(\mathcal {R}\). In many cases, an iterative scheme is applied where at each iteration t the current parameter value \(\theta _t\) is updated through gradient descent: \(\theta _{t+1}=\theta _t-\lambda \nabla \mathcal {F}(\theta _t,\mathbf {M}_t,\mathbf {F})\), where \(\mathbf {M}_t\) is the deformed moving image at time step t: \(\mathbf {M} \,{\circ }\, \mathcal {T}_{\theta _t}\).

Inspired by [4], we propose an alternative approach to optimizing \(\theta \) based on an artificial agent which decides to perform a simple action \(a_t\) at each iteration t, consisting in applying a fixed increment \(\delta \theta _{a_t}\): \(\theta _{t+1}=\theta _{t}+\delta \theta _{a_t}\). If \(\theta \) is a d-dimensional vector of parameters, we define 2d possible actions \(a\in \mathcal {A}\) such that \(\delta \theta _{2i}[j]= \epsilon _i \delta _i^j\) and \(\delta \theta _{2i+1}[j]= -\epsilon _i \delta _i^j\) with \(i \in \{0 \ldots d-1\}\) and \(\delta _i^j\) the Kronecker delta. In other words, the application of an action \(a_t\) increases or decreases a specific parameter within \(\theta _t\) by a fixed amount \(\epsilon _i\), a per-dimension scaling factor that is set to 1 in our experiments but could be used, e.g., to allow larger magnitudes first and smaller ones in later iterations for fine-tuning the registration.
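
To make this concrete, here is a minimal NumPy sketch of the action set and parameter update (the function and variable names are ours, for illustration only):

```python
import numpy as np

def make_action_increments(d, eps):
    """Build the 2d fixed increments delta_theta_a.

    Action 2i adds +eps[i] to parameter i, action 2i+1 adds -eps[i],
    matching delta_theta_{2i}[j] = eps_i * kronecker(i, j).
    """
    increments = np.zeros((2 * d, d))
    for i in range(d):
        increments[2 * i, i] = eps[i]       # increase parameter i
        increments[2 * i + 1, i] = -eps[i]  # decrease parameter i
    return increments

# Example: d = 3 parameters with unit step per dimension.
d = 3
increments = make_action_increments(d, eps=np.ones(d))
theta = np.zeros(d)
theta = theta + increments[4]  # apply action a_t = 4, i.e. theta[2] += 1
```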

The difficulty in this approach lies in selecting the action \(a_t\) as a function of the current state \(s_t\), consisting of the fixed and current moving image: \(s_t=(\mathbf {F},\mathbf {M}_t)\). To this end, the framework models a Markov decision process (MDP), where the agent interacts with an environment and receives feedback for each action. In reinforcement learning (RL), the best action is selected by maximizing the quality function: \(a_t = {{\mathrm{arg\,max}}}_{a\in {\mathcal A}} Q^\star (s_t,a)\). In the most general setting, this optimal action-value function is computed based on the reward function \(\mathcal {R}(s_1,a,s_2)\) defined between two states, which serves as the feedback signal for the agent to quantify the improvement or worsening caused by applying a certain action. Thus, \(Q^\star (s_t,a)\) may take into account not only the immediate but also future rewards starting from state \(s_t\), so as to evaluate the performance of an action a.

Recently, powerful deep neural networks approximating the optimal \(Q^\star \) have been presented in RL [6]. Ghesu et al. [2] used deep reinforcement learning (DRL) for landmark detection in 2-D medical images. In the rigid registration approach by Liao et al. [4], the agent’s actions are defined as translation and rotation movements of the moving image in order to match the fixed image.

In this work, the quality function \(\mathbf {y}_a(s_t)\approx Q^\star (s_t, a)\) is learned in a supervised manner through a deep regression network. More precisely, we adopt a single-stage MDP for which \(Q^\star (s_t,a)=\mathcal {R}(s_t,a, s_{t+1})\), implying that only the immediate reward, i.e. the next best action, is accounted for. During training, a batch of random states, pairs of \(\mathbf {F}\) and \(\mathbf {M}\), is considered with known transformation \(\mathcal {T}_{\theta _{GT}}\) (with \(\mathbf {F}\approx \mathbf {M} \,{\circ }\, \mathcal {T}_{\theta _{GT}}\)). The target quality is defined such that actions that bring the parameters closer to their ground-truth values are rewarded:

$$\begin{aligned} Q^\star (s_t,a)=\mathcal {R}(s_t,a,s_{t+1}) = \Vert \theta _{GT}-\theta _{s_t}\Vert _2 - \Vert \theta _{GT}-\theta _{s_{t+1}}^{a}\Vert _2 . \end{aligned}$$
(1)

The training loss function is the sum of squared differences between the explicitly computed Q-values (Eq. 1) and the network’s quality predictions \(\mathbf {y}_a(s_t)\), over all actions \(a \in \mathcal {A}\). For a training batch \(\mathcal {B}\) with random states \(s_b\), the loss is defined as: \(L = \sum _{s_b \in \mathcal {B}} { \sum _{a \in \mathcal {A}}{\left\| \mathbf {y}_a(s_b) - Q^\star (s_b, a)\right\| ^2}} .\)
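
The following minimal NumPy sketch (our own illustrative names, not the original implementation) makes the supervised target of Eq. 1 and the batch loss explicit:

```python
import numpy as np

def q_targets(theta_gt, theta, increments):
    """Q*(s,a) of Eq. 1: reduction in parameter-space distance per action."""
    dist_now = np.linalg.norm(theta_gt - theta)
    # Parameters after each of the 2d actions, shape (2d, d).
    theta_next = theta[None, :] + increments
    dist_next = np.linalg.norm(theta_gt[None, :] - theta_next, axis=1)
    return dist_now - dist_next  # positive if the action improves theta

def batch_loss(y_pred, targets):
    """Sum of squared errors over all states in the batch and all actions."""
    return np.sum((y_pred - targets) ** 2)
```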

In testing, the agent iteratively selects the best action, updates the parameters \(\theta _t\) and warps the moving image \(\mathbf {M}_t\) so as to converge to a final parameter set representing the best mapping from moving to fixed image (see Fig. 1b).
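
A sketch of this single-stage MDP test loop follows, assuming hypothetical helpers `predict_q` (the trained network) and `warp` (applies \(\mathcal {T}_\theta \)), which are not part of the original code:

```python
import numpy as np

def register(fixed, moving, predict_q, warp, increments, n_steps=200):
    """Run the single-stage MDP at test time.

    predict_q(F, M_t) -> (2d,) predicted Q-values y_a (the trained network);
    warp(M, theta)    -> moving image resampled with T_theta.
    """
    theta = np.zeros(increments.shape[1])   # start at the SDM mean
    moving_t = moving
    for _ in range(n_steps):
        q_values = predict_q(fixed, moving_t)
        a = int(np.argmax(q_values))        # greedy choice; the method samples
                                            # among the 3 best (see Sect. 2.2)
        theta = theta + increments[a]
        moving_t = warp(moving, theta)      # resample M with current T_theta
    return theta, moving_t
```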

Fig. 1. (a) Training data generation: synthetic deformations (blue arrows) and inter-subject GT deformations (black) are used to build intra-subject (green) and inter-subject (red) image pairs for training. (b) Dual-stream network used for Q-value prediction \(\mathbf {y}_a\), including the complete single-stage Markov decision process for testing (blue background).

2.2 Statistical Deformation Model

One challenge of the proposed framework is to find a low-dimensional representation of non-rigid transformations that minimizes the number of possible actions (equal to 2d) while keeping enough degrees of freedom to correctly match images. In this work, we base our registration method on statistical deformation models (SDM) built from free-form deformations (FFD); other parametrizations could work as well. Typically, the dense displacement field is defined as the summation of tensor products of cubic B-splines on a rectangular grid. Rueckert et al. [7] proposed to further reduce the dimensionality by constructing an SDM through a principal component analysis (PCA) on the B-spline displacements.

We propose to use the modes of the PCA as the parameter vector \(\theta \) describing the transformation \(\mathcal {T}_{\theta }\) that the agent aims to optimize. The agent’s basic increment per action, \(\epsilon _i\), is normalized according to the mean value of each mode estimated during training. To explore the parameter space stochastically, the predicted action \(a_t\) is selected among the 3 best actions with given fixed probabilities (see [4]).
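
A minimal sketch of such an SDM parametrization, assuming the PCA has already been fitted to the training B-spline displacements (variable names are ours):

```python
import numpy as np

def theta_to_bspline_displacements(theta, mean_disp, modes):
    """Reconstruct B-spline control-point displacements from PCA amplitudes.

    mean_disp : (n,)   mean of the flattened control-point displacements
    modes     : (d, n) principal components (one row per retained mode)
    theta     : (d,)   mode amplitudes, the agent's parameter vector
    """
    return mean_disp + theta @ modes

# Fitting the SDM itself (cf. Rueckert et al. [7]) could use e.g.
# sklearn.decomposition.PCA(n_components=d) on the stacked training
# displacement fields; the dense field then follows from the usual
# cubic B-spline tensor-product interpolation of the control points.
```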

Fuzzy Action Control. Since the parameters \(\theta \) are the amplitudes of principal components, the deviation of each parameter \(\theta _m\) (modified by actions \(a=2m\) and \(a=2m+1\)) from the mean \(\mu _m\) should stay within k times the standard deviation \(\sigma _m\) during testing. In order to keep \(\theta \) inside this reasonable parametric space of the SDM, we propose fuzzy action control: actions that push parameter values of \(\theta \) outside that space are stochastically penalized – after being predicted by the network. Inspired by rejection sampling, if an action a moves parameter \(\theta _m\) to a value \(f_m\), the move is accepted if a random number drawn uniformly from [0, 1] is less than the ratio \(\mathcal {N}(f_m;\mu _m, \sigma _m)/\mathcal {N}(h_m;\mu _m, \sigma _m)\), where \(h_m=\mu _m + k\sigma _m\) and \(\mathcal {N}\) is the Gaussian density. Therefore, if \(|f_m-\mu _m|\le k\sigma _m\), the ratio is greater than 1 and the action is always accepted. If \(|f_m-\mu _m|> k\sigma _m\), the action is accepted randomly, with a likelihood that decreases as \(f_m\) moves away from \(\mu _m\). This stochastic thresholding is performed for all actions at each iteration, and a rejection is translated into adding a large negative value to the quality function \(\mathbf {y}_a\). The factor k controls the tightness of the parametric space and is empirically chosen as 1.5. Fuzzy action control makes the MDP more robust, since the agent’s access to the less well-known subspace of the SDM is restricted.
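
A minimal sketch of this fuzzy action filter, assuming per-mode statistics \(\mu \), \(\sigma \) estimated in training (our own illustrative code, not the original implementation):

```python
import numpy as np
from scipy.stats import norm

def fuzzy_action_filter(q_values, theta, increments, mu, sigma, k=1.5,
                        big_neg=-1e6, rng=np.random):
    """Stochastically penalize actions leaving the SDM's trusted space."""
    q_values = q_values.copy()
    for a, delta in enumerate(increments):
        m = int(np.argmax(np.abs(delta)))    # the single mode this action moves
        f_m = theta[m] + delta[m]            # parameter value after the action
        h_m = mu[m] + k * sigma[m]
        ratio = norm.pdf(f_m, mu[m], sigma[m]) / norm.pdf(h_m, mu[m], sigma[m])
        if rng.uniform() >= ratio:           # ratio >= 1 inside +-k*sigma: keep
            q_values[a] = big_neg            # reject: action never selected
    return q_values
```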

2.3 Training Data Generation

Since it is difficult to obtain trustworthy ground-truth (GT) deformation parameters \(\theta _{GT}\) for training, we propose to generate two different kinds of training pairs, inter- and intra-subject, where in both cases moving and fixed images are synthetically deformed. The intra-subject pairs serve as a data augmentation method to improve the generalization of the neural network.

To produce the ground-truth deformations of the available training images, one possibility would be to apply existing registration algorithms with optimally tuned parameters. However, the trained artificial agent would then only be as good as those already available algorithms. Instead, we make use of manually segmented regions of interest (ROI) available for both images of a pair. By constraining the registration algorithms to enforce the correspondence between the two ROIs (for instance by artificially outlining the ROIs in the images as brighter voxels, or by using point correspondences in the ROI), the estimated registration improves significantly around the ROI. From the resulting deformations, represented on an FFD grid, the d principal components are extracted. Finally, these modes are used to generate the synthetic training samples by warping the original training images with randomly drawn deformation samples from the SDM. The amplitudes of the modes are bounded so as not to exceed the variations observed in the real image pairs, similar to [7].
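
A minimal sketch of such bounded random sampling from the SDM, assuming the per-mode standard deviations `sigma` and observed amplitude bounds `amp_max` come from the real training pairs (names are ours):

```python
import numpy as np

def sample_synthetic_theta(sigma, amp_max, rng=np.random):
    """Draw random mode amplitudes for one synthetic deformation.

    sigma   : (d,) standard deviation of each PCA mode
    amp_max : (d,) largest amplitudes seen in the real image pairs
    """
    theta = rng.normal(0.0, sigma)             # sample according to the SDM
    return np.clip(theta, -amp_max, amp_max)   # bound to observed variation
```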

Intra-subject training pairs can be any combination of synthetically deformed images of the same subject. Since the ground-truth deformation parameters are known exactly, it is guaranteed that the agent learns correct deformations. In the case of inter-subject pairs, a synthetically deformed image \(i_{mb}\) of one subject \(I_m\) may be paired with any synthetically deformed image \(i_{nc}\) of any other subject \(I_n\), with b, c denoting random synthetic deformations (see Fig. 1a). The GT parameters \(\theta _{GT}\) for an image pair \((i_{mb},i_{nc})\) are extracted via composition of the different known deformations, such that \(((i_{mb} \,{\circ }\, \mathcal {T}_\theta ^{i_{mb},I_m})\,{\circ }\,\mathcal {T}_\theta ^{I_{m},I_n})\,{\circ }\,\mathcal {T}_\theta ^{I_{n},i_{nc}}\). Note that the first deformation would require the inverse of a known deformation, which we approximate by its opposite parameters for reasons of computational efficiency. The additional error due to this approximation, computed on a few pairs, remained below 2% in terms of the DICE score.

Mini-batches are created online – during training – via random image pairing, where intra- and inter-subject pairs are selected with equal probability. Online random pairing enforces the continual experience of new pairs, since the number of possible image combinations can be extremely high (e.g. \(10^{12}\)) depending on the number of synthetic deformations.
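
As an illustration, the online pairing could look like the following generator, where, as a simplifying assumption, we compose deformations by adding their mode amplitudes (a first-order approximation; the method composes the transformations themselves), and all names are ours:

```python
import numpy as np

def compose_gt_parameters(theta_b, theta_mn, theta_c):
    """Compose GT mode amplitudes for an inter-subject pair (i_mb, i_nc).

    Simplifying assumption: deformations are composed by adding amplitudes;
    the inverse of the first deformation is replaced by its opposite
    parameters, as described above.
    """
    return -theta_b + theta_mn + theta_c

def online_pairs(subjects, inter_thetas, rng=np.random):
    """Yield (fixed_img, moving_img, theta_gt), mixing pair types 50/50.

    subjects     : list of dicts {'imgs': [...], 'thetas': [...]} holding each
                   subject's synthetically deformed images and amplitudes
    inter_thetas : inter_thetas[m][n] = GT amplitudes from subject m to n
    """
    while True:
        m = rng.randint(len(subjects))
        b = rng.randint(len(subjects[m]['imgs']))
        if rng.uniform() < 0.5:                       # intra-subject pair
            n = m
            c = rng.randint(len(subjects[m]['imgs']))
            # Inverse of deformation b approximated by its opposite parameters.
            theta_gt = subjects[m]['thetas'][c] - subjects[m]['thetas'][b]
        else:                                         # inter-subject pair
            n = rng.choice([k for k in range(len(subjects)) if k != m])
            c = rng.randint(len(subjects[n]['imgs']))
            theta_gt = compose_gt_parameters(subjects[m]['thetas'][b],
                                             inter_thetas[m][n],
                                             subjects[n]['thetas'][c])
        yield subjects[n]['imgs'][c], subjects[m]['imgs'][b], theta_gt
```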

3 Experiments

We focused on organ-centered registration of MR prostate images in 2-D and 3-D, with the use case of image fusion and atlas-based segmentation [9]. The task is very challenging since texture and anatomical appearance can vary considerably. 25 volumes were selected from the MICCAI challenge PROMISE12 and 16 from the Prostate-3T database, both including prostate segmentations. Duplicate images and cases with rectal probes were excluded. 8 cases were randomly chosen for testing (56 pairs) and 33 for training. As preprocessing, translation-based registration was carried out in 3-D for all pairs using the elastix framework [3] with standard parameters, followed by cropping and downsampling the images (to 100\(\,\times \,\)100 pixels in 2-D and 75\(\,\times \,\)75\(\,\times \,\)20 voxels in 3-D). For the 2-D experiments, the middle slice of each volume was taken. For GT generation, mutual information as similarity metric and a bending-energy regularization metric were used. The objective function was further constrained by a Euclidean point-correspondence metric; to this end, evenly distributed points were extracted from the given mask surfaces. elastix was used to retrieve the solution, with weights of 1, 3 and 0.2 for the above-mentioned metrics and a B-spline spacing of 16\(\,\times \,\)16(\(\,\times \,\)8) voxels. As a surrogate measure of registration performance we used the DICE score and the Hausdorff distance (HD) on the prostate region. The extracted GT resulted in median DICE coefficients of .96 in 2-D and .88 in 3-D. Given the B-spline displacements, the PCA was trained with \(d=15\) modes in 2-D and \(d=25\) in 3-D (leading to 30 and 50 actions, respectively), with a reconstruction error <5% in terms of the DICE score, as a compromise to keep the number of modes relatively small.

The network’s two independent processing streams contained 3 convolutional layers (with 32, 64 and 64 filters, kernel size 3) and 2 max-pooling layers for feature extraction. The concatenated outputs of the two streams were processed by 3 fully-connected layers (with 128, 128 and 64 units), resulting in an output of size 2d (the number of actions). Batch normalization and ReLU units were used in all layers. The mini-batch size was 65 in 2-D and 30 in 3-D. For updating the network weights, we used the adaptive learning-rate gradient-based method RMSprop with a learning rate of 0.001 and a decay factor of 0.8 every 10k mini-batch back-propagations. Training took about 12 h in 2-D and 1 day in 3-D. All experiments were implemented in Python using the deep learning library Theano including Lasagne; DL tasks ran on GPUs (NVIDIA GeForce GTX TITAN X). During testing, 200 MDP iterations (incl. resampling of the moving image) took 10 s in 2-D and 90 s in 3-D on the GPU. The number of testing steps was set empirically, since registration results change only marginally when increasing the number of steps; in 2-D experiments with 1000 steps, the agent’s convergence was observable.
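
For illustration, a sketch of such a dual-stream network in Lasagne (a reconstruction from the description above; the exact placement of the pooling and batch-normalization layers is our assumption):

```python
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DenseLayer, ConcatLayer, batch_norm)
from lasagne.nonlinearities import rectify, linear

def build_stream(shape=(None, 1, 100, 100)):
    """One stream: 3 conv layers (32/64/64 filters, kernel 3), 2 poolings."""
    net = InputLayer(shape)
    net = batch_norm(Conv2DLayer(net, 32, 3, nonlinearity=rectify))
    net = MaxPool2DLayer(net, pool_size=2)
    net = batch_norm(Conv2DLayer(net, 64, 3, nonlinearity=rectify))
    net = MaxPool2DLayer(net, pool_size=2)
    net = batch_norm(Conv2DLayer(net, 64, 3, nonlinearity=rectify))
    return net

def build_network(d):
    """Dual-stream Q-value regressor with a 2d-dimensional linear output."""
    fixed_stream = build_stream()    # processes the fixed image F
    moving_stream = build_stream()   # processes the current moving image M_t
    net = ConcatLayer([fixed_stream, moving_stream])
    for n_units in (128, 128, 64):
        net = batch_norm(DenseLayer(net, n_units, nonlinearity=rectify))
    return DenseLayer(net, 2 * d, nonlinearity=linear)  # y_a for each action
```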

Table 1. Results of prostate MR registration on the 56 testing pairs. 2-D and 3-D results in comparison to elastix with a B-spline spacing of 8 (e8) or 16 (e16) as proposed in [3] and the LCC-Demons [5] algorithm (dem). T denotes the initial scores after translation registration with elastix. 3-D* denotes results with perfect rigid alignment T*. nfc denotes our results without fuzzy action control (HD in mm).
Fig. 2. 2-D and 3-D registration results of extreme cases with segmentation mask overlays (fixed: green, moving: orange) and DICE scores in parentheses.

For testing, the initial translation registration was done with elastix by registering each test image to an arbitrarily chosen template from the training base. Table 1 shows that our method reaches a median DICE coefficient of .88/.76 in 2-D/3-D and therefore performs similarly to [3], whose best reported median DICE is .76 on a different data set. Moreover, on our challenging test data, our method outperformed the LCC-Demons algorithm [5] with manually tuned parameters and elastix with parameters similar to those proposed for prostate registration in [3], using B-spline spacings of 8 and 16 pixels. We found that better rigid registration can significantly improve the algorithm’s performance, as shown in the experiments with perfect rigid alignment according to the segmentation (3-D*). Extreme cases are shown visually in Fig. 2.

Regarding the results of elastix and LCC-Demons, a rising DICE score was observed while the HD increased, due to local spikes introduced in the masks (visible in Fig. 2b), as we focused on the DICE score during optimization for a fair comparison. In the 3-D* setting, DICE scores and HDs improved when applying fuzzy action control compared to not applying any constraints (see Table 1).

4 Conclusion

In this work, we presented a generic learning-based framework that uses an artificial agent to approach organ-focused non-rigid registration tasks arising in image fusion and atlas-based segmentation. The proposed method overcomes limitations of traditional algorithms by learning optimal features for decision-making; consequently, neither segmentations nor handcrafted features are required for registration at test time. Additionally, we proposed a novel ground-truth generator to learn from synthetically deformed and inter-subject image pairs.

In conclusion, we evaluated our approach on inter-subject registration of prostate MR images, showing first promising results in 2-D and 3-D. In future work, the deformation parametrization needs to be evaluated further. Rigid registration as in [4] could be included in the network or applied as preprocessing to improve results, as shown in our experiments. Furthermore, an extension to multi-modal registration is desirable.

Disclaimer. This feature is based on research and is not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.