1 Introduction

Acquisition of 4D image data (3D+t images, respiration-correlated data) is an integral part of current radiation therapy (RT) workflows for RT planning and treatment of thoracic and abdominal tumors. 4D CT imaging in particular has become widespread and is currently estimated to be routinely applied in approximately 70% of RT facilities in the United States [1]. Typical clinical use cases of 4D CT data are (semi-)automated target volume and organ-at-risk contour propagation; assessment of motion effects on dose distributions (4D RT quality assurance, dose warping) [2]; and 4D CT-based lung ventilation estimation and its incorporation into RT treatment planning [1].

A key step in these applications is deformable image registration (DIR) of the phase images of the 4D CT data. Traditional DIR approaches tackle the underlying task of finding an optimal transformation mapping two phase images by minimizing a dissimilarity measure that controls local correspondences of voxel intensities [3]. Yet, such algorithms are time-consuming and risk getting stuck in local minima during optimization.

Motivated by the exceptional success of deep learning (DL) and especially convolutional neural networks (CNNs) for image segmentation and classification tasks, a number of approaches have meanwhile been proposed to also solve image registration tasks with CNNs – first in the context of optical flow estimation in computer vision [4], and later similarly for medical image registration [3, 5,6,7]. Yang et al. further extended a CNN-based DIR architecture to a probabilistic framework using dropout [5], resulting in DIR uncertainty maps that could be of great value for RT treatment planning [8].

However, Uzunova et al. noted that “dense 3D registration with CNNs is currently computationally infeasible” [6], and therefore focused on 2D (brain and cardiac) DIR only. To overcome this issue, patch-based approaches have been proposed for, e.g., 3D brain DIR [5], with the side effect that global information about the transformation to be learned might be missing [3]. Rohé et al., in turn, proposed a fully convolutional architecture; at a size of \(64\times 64\times 16\) voxels, their cardiac MR images were, however, not even close to typical sizes of 4D CT images (on the order of \(512\times 512\times 150\) voxels per phase image).

This paper is therefore dedicated to CNN-based registration suitable for fast DIR in clinical thoracic 4D CT data. Taking up the aforementioned challenges and trends in current DL-based DIR,

  1. C1

    we propose a general and efficient CNN-based framework for deep learning of dense motion fields in clinical thoracic 4D CT, called GDL-FIRE\(^\text {4D}\),

  2. C2

    build variants of GDL-FIRE\(^\text {4D}\) using common open source DIR frameworks,

  3. C3

    perform a first comprehensive evaluation thereof using publicly available 4D CT data repositories (thereby presenting first respective benchmark baseline results for DL-based DIR in 4D CT data), and

  4. C4

    compare and discuss dropout-generated registration uncertainty maps for the different GDL-FIRE\(^\text {4D}\) variants.

To the best of our knowledge, all aspects C1-C4 are novel contributions in the given application context.

The remainder of the paper is structured as follows: In Sect. 2, the problem formulation and the concept of GDL-FIRE\(^\text {4D}\) are detailed. Applied data sets and performed experiments are described in Sect. 3 and respective results given and discussed in Sect. 4. The paper closes with concluding remarks in Sect. 5.

2 Methods: DL-Based Deformable Image Registration

A 4D CT image is a series \(\left( I_i\right) _{i\in \{1,\dots , n_\text {ph}\}}\) of 3D CT images \(I_i:\varOmega \rightarrow \mathbb {R}\), \(\varOmega \subset \mathbb {R}^3\), representing the patient geometry at different breathing phases i, with \(n_\text {ph}\) denoting the number of available images and breathing phases, respectively. The phases i sample the patient’s breathing cycle in time and are usually denoted by cycle fractions, i.e. \(\{1,\dots ,n_\text {ph}\}\equiv \{0\%,\dots ,50\%,\dots \}\) with \(0\%\) as end-inspiration and \(50\%\) as end-expiration phase. Deformable registration in 4D CT data then aims to estimate a corresponding series of transformations \(\left( \varphi _i\right) _{i\in \{1,\dots ,n_\text {ph}\}}\) between the \(I_i\) and a reference image \(I_\text {ref}\), with \(\varphi _i:\varOmega \rightarrow \varOmega \). For the applications outlined in Sect. 1, \(I_\text {ref}\) usually represents one of the phase images \(I_i\), and the transformations \(\varphi _i\) and vector fields \(u_i:\varOmega \rightarrow \mathbb {R}^3\), \(u_i=\varphi _i-\text {id}\) (\(\text {id}\): identity map) represent the respiration-induced motion of the image structures between phase i and the reference phase.
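The relation \(u_i=\varphi _i-\text {id}\) between transformations and vector fields can be made concrete with a small NumPy sketch (array layout and function names are illustrative, not part of the paper’s implementation):

```python
import numpy as np

def identity_grid(shape):
    """Voxel-coordinate identity map id(x) = x for a 3D domain."""
    axes = [np.arange(s, dtype=np.float64) for s in shape]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (X, Y, Z, 3)

def displacement_from_transform(phi):
    """u = phi - id: subtract the voxel coordinates from the mapped positions."""
    return phi - identity_grid(phi.shape[:-1])

# A transformation that shifts every voxel by (1, 0, 2) voxels
# has the constant displacement field u(x) = (1, 0, 2).
shape = (4, 4, 4)
phi = identity_grid(shape) + np.array([1.0, 0.0, 2.0])
u = displacement_from_transform(phi)
```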

2.1 Traditional Deformable Image Registration (DIR) Formulation

In a traditional 4D CT DIR setting, the reference image is considered the fixed image, \(I_\text {ref}\equiv I_\text {F}\), and the phase images are considered the moving images, \(I_i\equiv I_\text {M}\), which are sequentially registered to \(I_\text {F}\) via \(\varphi _i = \arg \min _{\varphi _i^*\in \mathcal {C}^2[\varOmega ]} \mathcal {J}\left[ I_\text {F},I_\text {M};\varphi _i^*\right] \) to compute the sought transformations \(\left( \varphi _i\right) _{i\in \{1,\dots ,n_\text {ph}\}}\). The exact functional \(\mathcal {J}\) – i.e. the dissimilarity measure, the applied regularization approach and considered transformation model, and the optimization strategy – varies in the community; see [9] for details.
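As a concrete example, a common instance of such a functional – not necessarily the one used by the DIR frameworks considered in Sect. 3.2 – combines a sum-of-squared-differences dissimilarity term with a diffusion regularizer on the displacement field \(u=\varphi -\text {id}\), weighted by \(\alpha >0\):

```latex
\mathcal{J}\left[I_\text{F}, I_\text{M}; \varphi\right]
  = \underbrace{\int_{\varOmega} \left(I_\text{F}(x) - I_\text{M}\big(\varphi(x)\big)\right)^2 \,\mathrm{d}x}_{\text{dissimilarity (SSD)}}
  \;+\; \alpha \underbrace{\int_{\varOmega} \sum_{l=1}^{3} \left\lVert \nabla u_l(x) \right\rVert^2 \,\mathrm{d}x}_{\text{diffusion regularization}}
```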

2.2 Convolutional Neural Networks (CNNs) for DIR

In contrast to traditional DIR, we now assume a database of \(n_\text {pat}\) training tuples \(\left( I_i^p,I_j^p,\varphi _{ij}^p\right) \), \(i,j\in \{1,\dots ,n_\text {ph}\}\), \(p\in \{1,\dots ,n_\text {pat}\}\) to be given; \(\varphi _{ij}^p = \text {id} + u_{ij}^p\) represents a DIR result for the phase images \(I_i^p\equiv I_\text {F}\) and \(I_j^p\equiv I_\text {M}\) of patient p. The goal is to learn the relationship between the input data \(\left( I_i^p,I_j^p\right) \) and \(u_{ij}^p\) by a convolutional neural network.

As noted by Uzunova et al. [6], it is currently computationally infeasible to directly feed entire images and vector fields into a CNN given current GPU memory. Instead, we propose a slab-based approach: Let \(I|_{\hat{x}}:=I|_{\varOmega _{\hat{x}}}\) be the restriction of image I to \(\varOmega _{\hat{x}}=\{\left( x,y,z\right) \in \varOmega \ | \ x={\hat{x}}\}\), i.e. the sagittal slice of I at x-position \(\hat{x}\). Similarly, let \(I|_{[\hat{x}_1,\hat{x}_2]}\) be the restriction of I to \(\varOmega _{[\hat{x}_1,\hat{x}_2]}=\{\left( x,y,z\right) \in \varOmega \ | \ \hat{x}_1\le x \le \hat{x}_2\}\), i.e. an image slab comprising the sagittal slices \(\hat{x}_1,\dots ,\hat{x}_2\) of I. Using this notation, the aforementioned training tuples were converted to slab-based training samples \(\left( I_i^p|_{[x-2,x+2]},I_j^p|_{[x-2,x+2]},u_{ij}^p|_x\right) \) with \(x\in \{x_{\min },\dots ,x_{\max }\}\) covering all sagittal slices of I. The rationale was to represent maximum information along the main motion directions (inferior-superior and anterior-posterior) in each training sample, while also providing some anatomical context in the lateral direction.
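The slab extraction described above can be sketched as a minimal NumPy generator (function and variable names are illustrative; we assume axis order \((x, y, z)\) with x as the sagittal position):

```python
import numpy as np

def slab_samples(I_i, I_j, u_ij, half_width=2):
    """Yield slab-based training samples (fixed slab, moving slab, label).

    I_i, I_j: 3D phase images with axes (x, y, z), x = sagittal position.
    u_ij:     displacement field of shape (X, Y, Z, 3).
    The label is the displacement of the slab's CENTER sagittal slice only.
    """
    X = I_i.shape[0]
    for x in range(half_width, X - half_width):
        sl = slice(x - half_width, x + half_width + 1)  # 5 sagittal slices
        yield I_i[sl], I_j[sl], u_ij[x]
```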

Furthermore, the image intensities were rescaled to [0, 1], the slabs resampled to an isotropic resolution of 2 mm and cropped/zero-padded to identical size, and the non-patient background intensity set to zero. Similar pre-processing was applied to the displacement fields (resampling and resizing of the sagittal slices, background set to zero). In addition, the x-, y- and z-displacement components were z-transformed on a voxel level to avoid unintended suppression of small displacements during CNN training. The CNN thus learned normalized 3D vectors for the individual voxels of the sagittal slices, which are back-transformed to actual motion fields during final reconstruction of the fields. The pre-processed slab-based samples \((\tilde{I}_i^p|_{[x-2,x+2]},\tilde{I}_j^p|_{[x-2,x+2]},\tilde{u}_{ij}^p|_x)\) with \(x\in \{x_{\min },\dots ,x_{\max }\}\) of the \(n_\text {pat}\) patients were finally shuffled and used for CNN training.
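The voxel-wise z-transform of the displacement components and its inversion can be sketched as follows (illustrative NumPy code; the `eps` guard against zero-variance voxels is our assumption, not from the paper):

```python
import numpy as np

def z_transform(u_stack, eps=1e-8):
    """Voxel-wise z-transform of displacement components across training samples.

    u_stack: (N, Y, Z, 3) array of sagittal displacement slices.
    Returns the normalized stack plus the (mean, std) statistics
    needed to invert the transform after prediction.
    """
    mean = u_stack.mean(axis=0)
    std = u_stack.std(axis=0) + eps  # eps avoids division by zero in static regions
    return (u_stack - mean) / std, mean, std

def back_transform(u_norm, mean, std):
    """Invert the normalization when reconstructing actual motion fields."""
    return u_norm * std + mean
```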

We tested different CNN architectures, including the classical U-Net [10]. Due to its increased robustness for DL-based DIR compared to the U-Net, we finally used an iterative CNN architecture with an Inception-ResNet-v2 [11] embedded in the encoder part of a pre-trained CT autoencoder (see Fig. 1), trained with an MSE (mean squared error) loss and the NADAM optimizer (implemented in TensorFlow). Iterative means that we cascaded copies of the trained network for improved coverage of large motion patterns.

Fig. 1.

CNN architecture implemented for DL-based DIR.
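The cascading idea – feeding the moving image, warped with the current estimate, back into a copy of the trained network – can be sketched as follows (illustrative Python; `predict` stands in for a trained network, and the nearest-neighbor warp and additive accumulation of displacements are simplifications of a proper interpolating warp and transformation composition):

```python
import numpy as np

def warp(image, u):
    """Nearest-neighbor warp of a 3D image with displacement field u
    (pure-NumPy stand-in for a proper interpolating warp)."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in image.shape],
                                indexing="ij"), axis=-1)
    coords = np.rint(grid + u).astype(int)
    for k, s in enumerate(image.shape):
        coords[..., k] = np.clip(coords[..., k], 0, s - 1)
    return image[coords[..., 0], coords[..., 1], coords[..., 2]]

def cascaded_dir(predict, fixed, moving, n_iter=4):
    """Iterative CNN-based DIR: re-apply the trained predictor to the
    warped moving image and accumulate the displacement estimates."""
    u_total = np.zeros(fixed.shape + (3,))
    warped = moving
    for _ in range(n_iter):
        u = predict(fixed, warped)   # one copy of the trained network
        u_total = u_total + u        # composition approximated additively
        warped = warp(moving, u_total)
    return u_total, warped
```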

2.3 Probabilistic CNN-Based DIR

As detailed by Yang et al. [5] and references therein, deterministic CNN architectures can be extended to probabilistic frameworks using dropout [12]. Briefly speaking, the dropout layers incorporated into the CNN architecture to prevent overfitting during model training remain enabled during motion prediction. Repeated motion prediction, each time with a newly sampled set of dropped connections, then allows computing the sought motion field as the mean of the sampled predicted fields; furthermore, the corresponding voxel-wise variances can be interpreted as local registration uncertainty estimates [5].
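This Monte-Carlo dropout scheme can be sketched minimally as follows (illustrative NumPy code; `predict_stochastic` stands in for a network call with dropout kept active, e.g. calling a Keras model with `training=True`):

```python
import numpy as np

def mc_dropout_dir(predict_stochastic, fixed, moving, n_samples=20):
    """Probabilistic CNN-based DIR via Monte-Carlo dropout sampling.

    predict_stochastic: callable returning one sampled displacement field,
    with dropout enabled at prediction time so each call differs.
    """
    samples = np.stack([predict_stochastic(fixed, moving)
                        for _ in range(n_samples)])
    mean_field = samples.mean(axis=0)   # sought motion field
    uncertainty = samples.var(axis=0)   # voxel-wise registration uncertainty
    return mean_field, uncertainty
```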

3 Materials and Study Design

All experiments were run on a desktop computer with an Intel Xeon CPU E5-1620 and an Nvidia Titan Xp GPU. The required models and scripts can be found at

3.1 Training and Testing 4D CT Data Cohorts

For CNN training and model optimization, a cohort of 69 in-house acquired RT treatment planning ten-phase 4D CT data sets of patients with small lung and liver tumors was used (image size: \(512\times 512\times 159\) voxels), with an 85%/15% split into training and testing data. The 4D CT images of the open data repositories DIRLAB [13] and CREATIS [14] served as an external evaluation cohort for the trained CNNs (i.e., no model optimization was performed on the external 4D CT cohorts).

Fig. 2.

Motion fields estimated by the original DIR algorithms (left column); GDL-FIRE\(^\text {4D}\) with only a single iteration (2nd column); GDL-FIRE\(^\text {4D}\) n iterations (3rd column); and GDL-FIRE\(^\text {4D}\) variant-specific registration uncertainty maps (right column). Data set: DIRLAB case 08, DIR of 0% and 50% phase images.

3.2 Applied DIR Frameworks and Algorithms

To provide motion field training data, the in-house 4D CT data were registered using three common open source DIR frameworks: PlastiMatch [15], NiftyReg [16], and VarReg [17]. All approaches have been proven suitable for 4D CT registration [9]; the applied parameters were similar to the respective EMPIRE10 parameters [9]. However, the algorithms were applied in a plug-and-play manner (no data pre-processing or pre-registration, no masks used). For each DIR algorithm, motion fields were computed between the 20% phase image (serving as \(I_\text {F}\)) and all other phase images.

3.3 Experiments and Evaluation Measures

For each DIR algorithm, a respective probabilistic GDL-FIRE\(^\text {4D}\) variant was built (up to 4 cascaded CNNs, 20% dropout rate). DIR accuracy was evaluated by the target registration error (TRE), computed using the landmarks publicly available for the DIRLAB and CREATIS data. In addition, the smoothness of the transformations of the different DIR approaches and GDL-FIRE\(^\text {4D}\) variants was analyzed in terms of the standard deviation of the transformation Jacobian determinant values over the lung voxels of the evaluation data.
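The two evaluation measures can be sketched in NumPy as follows (illustrative code; landmark handling and masking details are our assumptions, not the paper’s evaluation scripts):

```python
import numpy as np

def tre(landmarks_fixed, landmarks_moving, u_at_landmarks, spacing=(1.0, 1.0, 1.0)):
    """Target registration error: distance between transformed fixed-image
    landmarks and their moving-image correspondences (voxel coordinates)."""
    mapped = landmarks_fixed + u_at_landmarks
    diff = (mapped - landmarks_moving) * np.asarray(spacing)
    return np.linalg.norm(diff, axis=1)  # per-landmark TRE in mm

def jacobian_det_std(u, mask):
    """Std. dev. of Jacobian determinant values of phi = id + u inside a mask
    (e.g. the lung voxels); determinants near 1 indicate smooth transformations."""
    J = np.zeros(u.shape[:3] + (3, 3))
    for l in range(3):  # d(phi_l)/d(x_k) = delta_lk + d(u_l)/d(x_k)
        grads = np.gradient(u[..., l])
        for k in range(3):
            J[..., l, k] = grads[k] + (1.0 if l == k else 0.0)
    det = np.linalg.det(J)
    return det[mask].std()
```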

Fig. 3.

From left to right: CT image serving as reference image with artifact in liver; difference of motion amplitudes estimated by the NiftyReg and the VarReg GDL-FIRE\(^\text {4D}\) variants, illustrating large across-DIR approach differences; NiftyReg and VarReg GDL-FIRE\(^\text {4D}\) uncertainty maps, showing negligible uncertainties for both variants.

4 Results and Discussion

Motion fields estimated by the original DIR algorithms and the respective GDL-FIRE\(^\text {4D}\) variants, as well as the corresponding registration uncertainty maps, are shown in Fig. 2 for DIRLAB case 08 (the DIRLAB case with the maximum motion amplitude) and DIR of the 50% and 0% phase images. The similarity of the original and the GDL-FIRE\(^\text {4D}\)-predicted fields is striking, indicating that the CNNs indeed learned the DIR-specific transformation properties. This includes the observation that the NiftyReg GDL-FIRE\(^\text {4D}\) variant (similar to the original DIR) has problems directly covering larger motion amplitudes, which motivates cascading several trained models for iterative CNN-based DIR. The success of this strategy can be seen in Table 1, where the NiftyReg GDL-FIRE\(^\text {4D}\) variant outperforms the original NiftyReg DIR in terms of accuracy, especially for cases with larger motion.

Still, GDL-FIRE\(^\text {4D}\) DIR accuracy as well as the transformation properties for the other DIR approaches closely resemble the respective values of the traditional registration algorithms; at the same time, GDL-FIRE\(^\text {4D}\) reduces the runtime from approximately 15 min to a few seconds (a speedup factor of approximately 60).

Finally, it can be seen that the computed DIR uncertainty maps differ greatly between the GDL-FIRE\(^\text {4D}\) variants. Figure 3 shows a data set of our internal testing cohort that exhibits an artifact in the liver. This artifact led to very different motion patterns estimated by the NiftyReg and the VarReg GDL-FIRE\(^\text {4D}\) variants, but almost no measurable uncertainty for either DIR approach. While this is a direct consequence of the concept of probabilistic CNN-based DIR, it does not match our understanding of DIR uncertainty and raises doubts regarding its applicability for RT planning and the estimation of uncertainties therein.

Table 1. TRE values (in mm) and transformation smoothness (measured by standard deviation of lung voxel Jacobian determinant values), listed for the DIRLAB and CREATIS data, the individual DIR algorithms, and respective GDL-FIRE\(^\text {4D}\) variants (PM: PlastiMatch; NR: NiftyReg; VR: VarReg). Landmark distance before registration: \((8.46\pm 6.58)\) mm for the DIRLAB and \((8.11\pm 4.76)\) mm for the CREATIS data.

5 Conclusions

The presented GDL-FIRE\(^\text {4D}\) framework demonstrates the feasibility and potential of deep learning of dense vector fields for motion estimation in clinical thoracic 4D CT image data (TRE values of the CNN-based DIR were of the same order as those of the underlying DIR algorithms, accompanied by a speedup factor of approximately 60), and thereby motivates continued optimization of the framework.