Abstract
Existing deformable registration methods require exhaustive iterative optimization, along with careful parameter tuning, to estimate the deformation field between images. Although some learning-based methods have been proposed for initiating deformation estimation, they are often template-specific and not flexible in practical use. In this paper, we propose a convolutional neural network (CNN) based regression model to directly learn the complex mapping from the input image pair (i.e., a pair of template and subject) to their corresponding deformation field. Specifically, our CNN architecture is designed in a patch-based manner to learn the complex mapping from the input patch pairs to their respective deformation field. First, the equalized active-points guided sampling strategy is introduced to facilitate accurate CNN model learning upon a limited image dataset. Then, the similarity-steered CNN architecture is designed, where we propose to add the auxiliary contextual cue, i.e., the similarity between input patches, to more directly guide the learning process. Experiments on different brain image datasets demonstrate promising registration performance based on our CNN model. Furthermore, it is found that the trained CNN model from one dataset can be successfully transferred to another dataset, although brain appearances across datasets are quite variable.
1 Introduction
Deformable registration is a fundamental image processing step for many medical image analysis tasks, since it helps build anatomical correspondences across images. Most existing deformable registration algorithms regard image registration as a high-dimensional optimization problem, which aims to maximize the similarity between the template and subject images under a regularization upon the deformation field. Generally, these methods require iterative optimization to estimate the deformation field between images, as well as careful parameter tuning. Moreover, the registration performance may decline significantly when there is large appearance variation between the template and the to-be-registered subject.
Some learning-based methods [1,2,3,4] have been proposed to predict the initial deformation field or parameters for registration. The roughly predicted deformation field can then be refined efficiently by existing registration algorithms. Although these methods can partially improve the performance of registration, there are still some limitations. (1) The learning is template-specific, so changing the template requires retraining from scratch. (2) The prediction models often ignore the intrinsic matching associations between the to-be-registered image pair along with their local correspondence. (3) The predicted deformation field still needs further refinement, i.e., by employing a conventional registration method.
Recently, deep learning techniques such as convolutional neural networks (CNNs) have become well known for their strong end-to-end learning ability. In this paper, we propose to learn a general CNN-based regression model, in order to directly construct a mapping from the input image pair (e.g., a pair of template and subject) to their final deformation field. Then, in the application stage, we can input an unseen image pair to the CNN and efficiently obtain an accurate deformation field between them. Our main contributions and novelties can be summarized as follows.

(1) To learn a general CNN regression model that is independent of any particular template, we propose to regress from any image pair to their corresponding deformations. In particular, given two patches at the same locations of two different images, the CNN produces the displacement vector to align the two patches. A whole-image deformation field can then be derived accordingly, which relies on robust machine learning, rather than tedious parameter tuning in optimization.

(2) In order to bridge the large appearance gap between the pair of template and subject, we introduce an auxiliary contextual cue to guide the learning of the CNN. This cue encodes the easy-to-compute image patch similarities in a multi-scale way, which is shown to be important for successfully establishing the final deformation field and is also robust to large appearance variations.

(3) To make the CNN regression model more accurate, we introduce the equalized active-points guided sampling strategy, such that the training set complies well with the distributions of image patches and displacements. This strategy significantly enhances the accuracy when estimating the deformation field, and helps avoid further refinement by conventional registration methods.
2 Method
In this paper, we propose a similarity-steered CNN regression architecture to learn the mapping \( \mathcal{M} \) from the image pair (e.g., a template \( \mathcal{T} \) and a subject \( \mathcal{S} \)) to their final deformation field \( \phi \): \( \mathcal{M}: (\mathcal{T}, \mathcal{S}) \Rightarrow \phi \). In particular, the inputs consist of two independent images, and the learning target is the local matching correspondence between them. This is substantially different from conventional CNN-based tasks.
As shown in Fig. 1, our CNN model is designed in a patch-wise manner to encode both the patch appearance information and the local displacement. First, two patches are extracted from the same locations in the template and the subject. Then, we generate a multi-scale contextual cue to describe the similarity within the patch pair (Network Part I). The patches and the cue are concatenated as the multi-channel input to the CNN, which regresses the final displacement vector for the template patch center (Network Part II). Finally, we predict the displacements for many sampled locations, and obtain the dense deformation field by thin-plate spline (TPS) interpolation in an iterative manner.
2.1 Training Set Preparation
For a pair of registered template image \( \mathcal{T} \) and subject image \( \mathcal{S} \) along with their deformation field \( \phi \), a local patch pair \( (p_{\mathcal{T}}(u), p_{\mathcal{S}}(u)) \) is extracted, centered at location \( u \). We then obtain a training sample \( S_{i} = \{ (p_{\mathcal{T}}(u), p_{\mathcal{S}}(u)) \mid \phi(u) \} \), where \( \phi(u) = [d_{x},\,d_{y},\,d_{z}] \) is the displacement vector of \( u \).
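As an illustration, the construction of one training sample can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function name, the array layout of the deformation field (displacements stored in a trailing axis of length 3) and the patch edge length are all assumptions made for the example.

```python
import numpy as np

def extract_training_sample(template, subject, phi, u, patch_size=5):
    """Return ((p_T(u), p_S(u)), phi(u)) for a voxel u = (x, y, z).

    template, subject: 3D arrays of identical shape.
    phi: array of shape template.shape + (3,) with [d_x, d_y, d_z] per voxel
         (assumed layout).
    patch_size: odd edge length of the cubic patch (hypothetical value).
    """
    r = patch_size // 2
    # The patch must lie fully inside the volume; no padding is applied here.
    for c, n in zip(u, template.shape):
        if c - r < 0 or c + r >= n:
            raise ValueError("patch centered at u exceeds the image bounds")
    x, y, z = u
    window = (slice(x - r, x + r + 1),
              slice(y - r, y + r + 1),
              slice(z - r, z + r + 1))
    # Both patches are cut from the same location in the two images.
    return (template[window], subject[window]), phi[x, y, z]
```

Both patches share the same window, so the regression target \( \phi(u) \) is the only quantity that ties the pair to the known deformation.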
Obviously, a well-prepared training set is important to the accuracy and the robustness of the learned CNN. Conventional sampling often collects training patches randomly or uniformly in the input image space, while ignoring the distribution of the displacements in the output space. Figure 2(a) presents the distribution of the displacement magnitudes measured from 20 real deformation fields (excluding background voxels). If the training patches are extracted randomly from the input image space only, the displacement magnitudes of more than 74% of the patches are below 1 mm. Such an imbalanced training set limits the generalization performance of the CNN and leads to underestimation of the displacement magnitude. An instance is shown in Fig. 2(b) and (c) for comparison. Therefore, we argue that training patches should be sampled by referring not only to the input image space, but also to the output displacement space.
In the input image space, we apply the active-points guided sampling strategy, where the importance \( I(u) \) of each point \( u \) is related to the gradient magnitude in the template image space. Voxels with rich anatomical information (e.g., strong edges) have high importance and are more likely to be sampled. Consequently, the density of the active points is higher in informative brain regions and lower in smooth regions.
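A gradient-based importance map of this kind could be computed as in the sketch below. The text only states that \( I(u) \) is related to the gradient magnitude; the use of central differences and the normalization to \( [0, 1] \) are assumptions of this example, and the function name is hypothetical.

```python
import numpy as np

def gradient_importance(template):
    """Importance I(u) in [0, 1], taken as the normalized gradient magnitude."""
    # np.gradient returns one finite-difference array per image axis.
    gx, gy, gz = np.gradient(np.asarray(template, dtype=np.float64))
    magnitude = np.sqrt(gx ** 2 + gy ** 2 + gz ** 2)
    peak = magnitude.max()
    # A constant image has no edges, so every voxel gets zero importance.
    if peak == 0:
        return np.zeros_like(magnitude)
    return magnitude / peak
```

With such a map, voxels on strong edges receive values close to 1, while voxels in flat regions receive values close to 0.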
In the output displacement space, we adopt the equalized sampling strategy based on the displacement distribution. By incorporating the information from the input image space, we can sample the point \( u \) with the integrated probability \( P(u) \):
Here, \( \omega \) is a parameter that controls the sampling probability as well as the number of samples, and \( \tau \) is a cutoff threshold. A point \( u \) with larger displacement magnitude \( \left\| \phi(u) \right\|_{2} \) and importance \( I(u) \) is more likely to be sampled. However, very large displacements are hard to predict given the limited modeling capability of the CNN and the number of training patches. Thus, we apply the cutoff \( \tau \) to saturate all displacements over the threshold.
After the equalized active-points guided sampling, the displacement magnitudes of the whole training set \( S \) are approximately uniformly distributed over \( (0, \tau] \). In this paper, we set \( \tau = 7\;{\text{mm}} \). It is worth noting that the displacement in the final deformation field is not limited by \( \tau \). We apply the learned CNN model iteratively, such that the estimated displacements are accumulated to approximate the final deformation field.
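One generic way to obtain an approximately uniform distribution of displacement magnitudes is inverse-frequency weighting, sketched below. This is explicitly not the sampling probability \( P(u) \) defined in the paper: the parameter \( \omega \) is not modeled, and the histogram binning (`n_bins`) and the function name are assumptions introduced for the example. Only the cutoff \( \tau = 7 \) mm and the use of the importance \( I(u) \) follow the text.

```python
import numpy as np

def equalized_sampling_weights(magnitudes, importance, tau=7.0, n_bins=14):
    """Sampling weights that flatten the histogram of clipped magnitudes.

    magnitudes: displacement magnitudes ||phi(u)||_2 of the candidate points.
    importance: importance I(u) of the same points.
    tau: cutoff in mm (7 mm in the text); n_bins is a hypothetical choice.
    """
    mags = np.asarray(magnitudes, dtype=np.float64).ravel()
    imp = np.asarray(importance, dtype=np.float64).ravel()
    clipped = np.minimum(mags, tau)               # saturate at the cutoff
    edges = np.linspace(0.0, tau, n_bins + 1)
    # Map each magnitude to a bin index in 0..n_bins-1 (tau falls in the last bin).
    bins = np.clip(np.digitize(clipped, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    # Points in crowded bins are down-weighted; informative points are up-weighted.
    weights = imp / counts[bins]
    total = weights.sum()
    if total <= 0:
        raise ValueError("all candidate points have zero importance")
    return weights / total
```

The returned weights sum to 1, so they can be passed as the `p` argument of `numpy.random.Generator.choice` to draw the training locations.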
2.2 Similarity-Steered CNN Regression
To bridge the large gap between the input image pair and the output displacement, we introduce the auxiliary contextual cue to guide CNN training. As shown in Fig. 1, our CNN model consists of two parts: (1) network preparation and (2) network learning.
Network preparation.
The contextual cue is provided by the similarity map, which is the local cross-correlation between the center location of the template patch and every location of the subject patch, where each location is represented by a small image patch, as shown in Fig. 3. In our implementation, we realize it as a convolutional layer incorporated into the whole CNN architecture, in order to efficiently obtain the similarity map \( H \):
\[ H = \frac{p_{\mathcal{S}}(u) * k'_{\mathcal{T}}(u)}{\left\| k_{\mathcal{T}}(u) \right\| \cdot \left\| p_{\mathcal{S}}(u) \right\|} \qquad (2) \]
where “\( * \)” is the convolution operation, and \( k'_{\mathcal{T}}(u) \) is the reversed kernel derived from the template patch \( p_{\mathcal{T}}(u) \) at the center voxel \( u \). For each patch pair, \( k_{\mathcal{T}}(u) \) is fixed, so its \( L_{2} \)-norm \( \left\| k_{\mathcal{T}}(u) \right\| \) is a constant. \( \left\| p_{\mathcal{S}}(u) \right\| \) is the \( L_{2} \)-norm map with the same size as the subject patch \( p_{\mathcal{S}}(u) \), which we also generate efficiently by another convolution operation, i.e., convolving the self dot product \( (p_{\mathcal{S}}(u) \cdot p_{\mathcal{S}}(u)) \) with the kernel \( k_{1} \). Here, \( k_{1} \) is a kernel whose elements are all 1 and which has the same size as \( k_{\mathcal{T}}(u) \). Equation (2) can be identified as normalized cross-correlation. It is worth noting that the kernel in this convolutional layer is derived from the data, so its weights are fixed and not trainable.
The similarity map allows us to establish correspondences between the two patches. However, the choice of the kernel affects the distinctiveness of the correspondence, as shown by the example in Fig. 3. Thus, we provide a multi-scale similarity cue, corresponding to different kernel sizes, to guide the training of the CNN. In this paper, we use 4 kernel sizes, as also illustrated in Fig. 3.
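The normalized cross-correlation described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the convolutional-layer implementation used in the paper: it correlates directly instead of convolving with a reversed kernel (the two are equivalent), it returns only the region where the kernel fits entirely inside the subject patch, and the function name and the kernel sizes in the usage note are assumptions, since the four sizes are given only in Fig. 3.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def similarity_map(p_t, p_s, k):
    """NCC between the k^3 kernel at the center of p_t and each k^3 window of p_s."""
    p_t = np.asarray(p_t, dtype=np.float64)
    p_s = np.asarray(p_s, dtype=np.float64)
    r = k // 2                                    # k is assumed to be odd
    cx, cy, cz = (n // 2 for n in p_t.shape)
    # Kernel k_T(u): the small template patch around the center voxel.
    kernel = p_t[cx - r:cx + r + 1, cy - r:cy + r + 1, cz - r:cz + r + 1]
    windows = sliding_window_view(p_s, (k, k, k))  # shape: out + (k, k, k)
    # Numerator: dot product of the kernel with every window of the subject patch.
    dot = np.einsum("xyzabc,abc->xyz", windows, kernel)
    # L2-norm map: sum of squares inside each window, followed by a square root.
    norm_map = np.sqrt(np.einsum("xyzabc,xyzabc->xyz", windows, windows))
    denom = np.linalg.norm(kernel) * norm_map
    # Windows with zero norm get similarity 0 instead of a division by zero.
    return np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)
```

A multi-scale cue would then be a list such as `[similarity_map(p_t, p_s, k) for k in (3, 5, 7, 9)]`, with these kernel sizes being placeholders. Because each map has a different spatial extent in this sketch, the maps would need cropping or padding to a common size before being stacked as input channels.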
Network learning.
The CNN architecture estimates the final displacement vectors from multi-channel inputs, including the patch pair and the similarity maps. Specifically, each convolution layer is followed by a ReLU activation. The number of kernels is doubled every two convolution layers, from 64 up to 512, with a fixed kernel size of \( 3 \times 3 \times 3 \). The subsequent fully connected (FC) layers consist of 3 layers with ReLU activations, with a tanh activation for the final FC layer. The loss function is the mean squared error. It is worth noting that no padding is applied in any convolution layer, in order to avoid introducing meaningless information. The patch size therefore decreases gradually, and all neighborhood information of each sample point is incorporated in a high-dimensional space to better represent the samples. Furthermore, only one pooling layer is adopted, in order to preserve the continuity of the regression model and to keep network training efficient.
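The effect of unpadded convolutions on the patch size can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: that the channel widths 64, 128, 256 and 512 are each used for exactly two layers (eight convolution layers in total), and that the input patch has an edge length of 19 voxels in the usage note. The single pooling layer is left out of the calculation.

```python
def channel_schedule(start=64, final=512, layers_per_width=2):
    """Kernel counts per convolution layer, doubling every `layers_per_width` layers."""
    widths, width = [], start
    while width <= final:
        widths.extend([width] * layers_per_width)
        width *= 2
    return widths

def valid_conv_size(size, n_layers, kernel=3):
    """Edge length after n_layers unpadded, stride-1 convolutions."""
    # Each unpadded layer trims (kernel - 1) voxels from every axis.
    out = size - n_layers * (kernel - 1)
    if out < 1:
        raise ValueError("input patch too small for this many unpadded layers")
    return out
```

Under these assumptions, `channel_schedule()` lists eight layers, and a hypothetical 19-voxel patch shrinks to `valid_conv_size(19, 8)`, i.e. 3 voxels per axis, before the fully connected layers.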
3 Experiments
Two datasets, i.e., LONI LPBA40 and ADNI (Alzheimer’s Disease Neuroimaging Initiative), are used to evaluate our registration performance. The LONI dataset contains 40 young adult brain MR images with 54 ROI labels, and additional tissue segmentations of gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF). For ADNI, 30 brain MR images are randomly selected from the dataset, each of which has GM, WM and CSF segmentations. After preprocessing and affine registration, all the images of the two datasets are resampled to the same size (\( 220 \times 220 \times 184 \)) and the same resolution (\( 1 \times 1 \times 1\;{\text{mm}}^{3} \)).
The training image data is derived from LONI LPBA40. Specifically, we select 30 images for training, and test upon the remaining 10 images. We further randomly draw 30 image pairs from the training images. In order to obtain a highly accurate deformation field for each image pair, SyN [5] is first applied on the intensity images with careful parameter tuning, and then Diffeomorphic Demons [6] is applied on their tissue segmentations to further refine the registration accuracy. The final deformation field composed by this two-stage registration is used as the ground truth for CNN training. Then, 24K training samples are extracted from each image pair via the equalized active-points guided sampling strategy. In all, we have 720K training samples.
We train our similarity-steered CNN model on an Nvidia GPU using our modified 3D version of Caffe [7]. We start with the learning rate \( \lambda = 0.01 \) and multiply it by 0.5 after every 70K iterations. 20K samples are taken from the whole training set and used as validation data to monitor overfitting. After training, we test the CNN model on the remaining 10 images in LONI LPBA40 and 30 images in ADNI. For each to-be-registered image pair, we estimate the displacements on 0.9% of all voxels, which are selected by the active-points guided sampling strategy. The dense deformation field is then obtained by TPS interpolation [8]. We perform the above procedure for two iterations, and the incremental displacements are composed to estimate the final deformation field. Two popular state-of-the-art registration methods, i.e., SyN [5] and Demons [6], are chosen for comparison.
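The learning-rate schedule described above is a step decay, which can be written as \( \lambda_{t} = \lambda_{0} \cdot 0.5^{\lfloor t / 70000 \rfloor} \). A minimal sketch follows; the function and parameter names are chosen for this example, and the assumption is that the halving takes effect exactly at iterations 70K, 140K, and so on.

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.5, step=70_000):
    """Step decay: multiply the base rate by `gamma` once per completed `step` iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, the rate stays at 0.01 for the first 70K iterations, drops to 0.005 at iteration 70K and to 0.0025 at iteration 140K.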
3.1 LONI Dataset
For the 10 testing subjects in the LPBA40 dataset, we perform deformable image registration on every pair of images and report the averaged results in Fig. 4 and Table 1. Figure 4 shows the Dice similarity coefficient (DSC) on 54 brain ROIs. We observe that our method performs better on 36 of the 54 ROIs. Among them, 28 ROIs show statistically significant improvement (p < 0.05) over both Demons and SyN.
Table 1 provides the DSC on the labels of GM, WM and CSF. Our method achieves significant improvements on GM and WM. In terms of symmetric average surface distance (SASD) [9], we also obtain better performance on GM. Although the averaged accuracies of the competing methods are slightly higher than those of the proposed method in some regions, the differences are not significant in paired t-tests.
This means that we have achieved at least comparable performance to the state-of-the-art deformable registration methods. Note that our method uses only 0.9% of the voxels as test points to generate the whole deformation field, and reaches the reported performance without exhaustive iterative optimization or parameter tuning. This suggests that the complex mapping from the image pair to the deformation field is successfully modeled by our proposed method.
3.2 ADNI Dataset
To further evaluate the transfer capability of the learned CNN, we test 30 ADNI images by directly applying the model trained on the LONI dataset. To enlarge the appearance variation between the to-be-registered image pair, in this experiment, 3 images are randomly selected from the LONI dataset and used as templates. All 30 ADNI subjects are registered to each of those 3 templates, with results reported below.
Since only GM, WM and CSF labels are available for both datasets, we evaluate the registration performance based on these tissue labels in Table 2, and provide qualitative comparisons in Fig. 5. We observe that our proposed method achieves the best overall performance for this challenging registration task, with statistically significant improvements. Note that, even when the image pair has large appearance variation, our proposed method still obtains high performance without any parameter tuning. This indicates that the established CNN model is robust and accurate for complicated registration cases, which makes our method more flexible and applicable.
4 Conclusion
In this paper, we have proposed a novel deformable registration method that directly learns the complex mapping from the input image pair to the final deformation field via CNN regression. The equalized active-points guided sampling strategy is proposed, which facilitates training the regression model even with a small dataset. Then, a similarity-steered CNN architecture is designed, where an additional convolutional layer is established in the whole network to provide similarity guidance during model learning. Experimental results show promising registration performance compared with the state-of-the-art methods on different datasets.
References
1. Wang, Q., et al.: Predict brain MR image registration via sparse learning of appearance and transformation. Med. Image Anal. 20(1), 61–75 (2015)
2. Yang, X., Kwitt, R., Niethammer, M.: Fast predictive image registration. In: Carneiro, G., et al. (eds.) LABELS/DLMIA 2016. LNCS, vol. 10008, pp. 48–57. Springer, Cham (2016). doi:10.1007/978-3-319-46976-8_6
3. Kim, M., et al.: A general fast registration framework by learning deformation–appearance correlation. IEEE Trans. Image Process. 21(4), 1823–1833 (2012)
4. Gutiérrez-Becker, B., Mateus, D., Peter, L., Navab, N.: Learning optimization updates for multimodal registration. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9902, pp. 19–27. Springer, Cham (2016). doi:10.1007/978-3-319-46726-9_3
5. Avants, B.B., et al.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12(1), 26–41 (2008)
6. Vercauteren, T., et al.: Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45(1), S61–S72 (2009)
7. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia. ACM (2014)
8. Zhang, J., et al.: Alzheimer’s disease diagnosis using landmark-based features from longitudinal structural MR images. IEEE J. Biomed. Health Inform. (2017). doi:10.1109/JBHI.2017.2704614
9. Cao, X., et al.: Dual-core steered non-rigid registration for multi-modal images via bi-directional image synthesis. Med. Image Anal. 41, 18–31 (2017)
Copyright information
© 2017 Springer International Publishing AG
Cao, X. et al. (2017). Deformable Image Registration Based on Similarity-Steered CNN Regression. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D., Duchesne, S. (eds) Medical Image Computing and Computer Assisted Intervention − MICCAI 2017. MICCAI 2017. Lecture Notes in Computer Science, vol 10433. Springer, Cham. https://doi.org/10.1007/978-3-319-66182-7_35
DOI: https://doi.org/10.1007/978-3-319-66182-7_35
Publisher Name: Springer, Cham
Print ISBN: 9783319661810
Online ISBN: 9783319661827