In this section, we evaluate the performance of our network. The “Experimental settings” section describes how the different datasets were generated and details the evaluation and training procedures. The “Experimental results” section presents the registration results for AP and lateral radiographs and compares them with other methods. We also report the sensitivity to inaccurate input parameters and the accuracy for non-orthogonal projections. An ablation study is presented in “Ablation study”.
Experimental settings
CT-data preprocessing and augmentation
A total of 315 angio-CT images were acquired and split into a training set of 235 subjects, a validation set of 40 subjects for model selection, and a test set of 40 subjects used to report performance. From each CT image, the left and right femurs were extracted and rotated to a reference system that aligns the anterior-posterior and lateral views of that femur with the x- and y-axes of the image. The femur reference frame of each image was defined based on the neck and shaft axes of the femur. To allow some pose variation around this canonical reference pose, we applied random affine transformations to the image under strict constraints. The randomised angles were allowed within a range of 10\(^{\circ }\) for extension/flexion, 10\(^{\circ }\) for abduction/adduction and 10\(^{\circ }\) for internal/external rotation.
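The pose augmentation can be illustrated with a minimal sketch that samples only the rotational part of such a transformation. It assumes that each angle is drawn uniformly from a symmetric interval of ±10° and that the x, y and z axes correspond to flexion/extension, abduction/adduction and internal/external rotation, respectively. Neither the sampling distribution nor the axis assignment is specified in the text, and the function name `random_pose_rotation` is hypothetical.

```python
import numpy as np

def random_pose_rotation(rng, max_deg=10.0):
    """Sample a small random 3D rotation from three Euler angles.

    Assumptions (not taken from the text): angles are uniform in
    [-max_deg, +max_deg]; x = flexion/extension, y = abduction/adduction,
    z = internal/external rotation.
    """
    ax, ay, az = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(ax), -np.sin(ax)],
                   [0.0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0.0, np.sin(ay)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(ay), 0.0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0.0],
                   [np.sin(az), np.cos(az), 0.0],
                   [0.0, 0.0, 1.0]])
    # Compose the three elementary rotations into one rotation matrix.
    return rz @ ry @ rx, np.rad2deg([ax, ay, az])

rng = np.random.default_rng(0)
R, angles_deg = random_pose_rotation(rng)
```

A full affine augmentation would additionally sample translation and possibly scaling, which are left out here because the text gives no ranges for them.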
After transforming the images to a pose close to the reference pose, the images were cropped around the femoral heads and resized so as to maintain the highest possible resolution. The left femur images were flipped to resemble right ones. The final CT volumes have a size of \((192 \times 128 \times 192)\) and a resolution of \((0.664 \times 0.664 \times 1)\) mm\(^3\). Each image has a corresponding segmentation map S, obtained by a graph-cut segmentation method followed by manual corrections [13].
Generating DRR
Digitally reconstructed radiographs (DRRs) were simulated from the femur-centred CT volumes with the DeepDRR software [14]. DRRs were created with an image size of (422 \(\times \) 640), and downsampled to (160 \(\times \) 224) to fit the network’s input size. The source-detector distance and the isocentre distance of the projection geometry were fixed to 1000 mm and 925 mm, respectively. Two different datasets of DRRs were generated:
- A dataset with orthogonal projections. The projection geometry was fixed to provide lateral and AP projections. The acquisition geometry of this dataset most closely resembles the experimental settings in the literature.
- A dataset with generalised projection geometries. Projection matrices were parameterised by the left/right anterior oblique (LAO/RAO) angle \(\theta \), which was randomly varied between −30\(^{\circ }\) and \(+\)30\(^{\circ }\) around the perfect lateral and AP views. The cranio-caudal angle was set to a constant value of 0\(^{\circ }\). Different combinations of LAO/RAO angles were made for biplanar experiments.
For both datasets, the CT label maps were projected along with the CT images to obtain a 2D labelmap for the DRRs. The DRRs were masked by these labelmaps before feeding them into the network. Note that other structures, in front and behind the femur, are still visible in the masked DRRs.
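The projection geometry described above can be sketched as follows. This is an illustration under stated assumptions, not the DeepDRR configuration used by the authors. It assumes that the isocentre distance is measured from the source, that the isocentre is the origin, and that the LAO/RAO angle rotates the C-arm about the z-axis (taken here as the patient's longitudinal axis). The function name `carm_geometry` and the parameter `base_deg` are hypothetical.

```python
import numpy as np

SDD_MM = 1000.0  # source-detector distance (from the text)
SID_MM = 925.0   # isocentre distance; ASSUMED to be source-to-isocentre

def carm_geometry(theta_deg, base_deg=0.0):
    """Return source position, detector centre and unit viewing direction.

    Assumptions (not from the text): isocentre at the origin, rotation
    about the z-axis, base_deg=0 for the AP view, base_deg=90 for lateral.
    theta_deg is the LAO/RAO offset around that base view.
    """
    phi = np.deg2rad(base_deg + theta_deg)
    direction = np.array([np.sin(phi), np.cos(phi), 0.0])  # source -> detector
    source = -SID_MM * direction
    detector_centre = (SDD_MM - SID_MM) * direction
    return source, detector_centre, direction

# Geometric magnification of structures located at the isocentre.
magnification = SDD_MM / SID_MM

src_ap, det_ap, dir_ap = carm_geometry(0.0, base_deg=0.0)
src_lat, det_lat, dir_lat = carm_geometry(0.0, base_deg=90.0)
```

Under these assumptions, the orthogonal dataset corresponds to `theta_deg = 0` for both views, while the generalised dataset draws `theta_deg` from the ±30° interval independently for each view.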
Evaluation metrics
The registration accuracy of the network is evaluated by means of the Dice score and the Jaccard coefficient, which measure the overlap between the warped atlas label map and the ground-truth label map [15]:
$$\begin{aligned} Dice(A,B)&= 2\frac{\vert {A\cap B}\vert }{\vert {A}\vert +\vert {B}\vert } \end{aligned} \tag{5}$$
$$\begin{aligned} Jac(A,B)&= \frac{\vert {A\cap B}\vert }{\vert {A \cup B}\vert } \end{aligned} \tag{6}$$
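Both overlap measures can be computed directly from binary label maps. The following minimal sketch applies Eqs. (5) and (6) to two small toy volumes. The function names and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
import numpy as np

def dice(a, b):
    """Dice score of two binary masks, Eq. (5)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    # Convention assumed here: two empty masks overlap perfectly.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def jaccard(a, b):
    """Jaccard coefficient of two binary masks, Eq. (6)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Toy example: two 32-voxel slabs sharing 16 voxels.
a = np.zeros((4, 4, 4), dtype=bool)
b = np.zeros((4, 4, 4), dtype=bool)
a[:2] = True
b[1:3] = True

d = dice(a, b)      # 2 * 16 / (32 + 32) = 0.5
j = jaccard(a, b)   # 16 / 48 = 1/3
```

The two measures are monotonically related through \(Jac = Dice/(2-Dice)\), so they rank methods identically and differ only in scale.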
We also report the average symmetric surface distance (ASSD), which measures the average geometric distance between the ground-truth and registered bone surfaces. The similarity between the warped atlas image and the ground-truth image volume is quantified by the structural similarity index (SSIM), which takes luminance, contrast and structure into account. Because our method is a registration method, it can only warp an atlas with fixed intensity values, which limits its ability to estimate the correct intensity values of the image volume.
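A voxel-based approximation of the ASSD is sketched below. It assumes that surfaces are represented by the boundary voxels of binary masks rather than by meshes, and that the voxel spacing is passed in the same axis order as the array. The paper does not state which implementation was used, so this is one common variant, not necessarily the authors' one. The function names are hypothetical.

```python
import numpy as np
from scipy import ndimage

def surface_voxels(mask):
    """Boundary voxels of a binary mask (mask minus its erosion)."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~ndimage.binary_erosion(mask)

def assd(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance between two binary masks."""
    surf_a = surface_voxels(mask_a)
    surf_b = surface_voxels(mask_b)
    # Distance from every voxel to the nearest surface voxel of each mask.
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    d_ab = dist_to_b[surf_a]  # surface of A -> surface of B
    d_ba = dist_to_a[surf_b]  # surface of B -> surface of A
    return (d_ab.sum() + d_ba.sum()) / (d_ab.size + d_ba.size)

# Toy example: a cube and the same cube shifted by one voxel along axis 0.
a = np.zeros((12, 12, 12), dtype=bool)
b = np.zeros((12, 12, 12), dtype=bool)
a[3:8, 3:8, 3:8] = True
b[4:9, 3:8, 3:8] = True

# Spacing taken from the CT resolution in the text; the axis order is assumed.
value = assd(a, b, spacing=(0.664, 0.664, 1.0))
```

Because the shift is a single voxel along an axis with 0.664 mm spacing, every surface voxel lies within 0.664 mm of the other surface, so the resulting ASSD is bounded by that value.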
Training details
We implemented our network using the TensorFlow library. The network was trained for 300 epochs on an NVIDIA Tesla A100 graphics card. The model requires 18.7 GB of memory when trained with a batch size of one, and has a computational complexity of 722 GFLOPs. The loss function was minimised using the Adam optimizer, with the learning rate set to \(10^{-5}\).
Experimental results
Comparison with other methods
This section describes the results of the registration to AP and lateral DRRs by our proposed network and by two other networks for comparison. The evaluation metrics are listed in Table 1. Figure 2 illustrates the qualitative performance of the network with several registration examples.
The first comparison method registers a B-spline-based statistical deformation model (SDM) to a pair of radiographs by regressing its principal component weights [10]. This is a deep-learning implementation of the classical method of Yu et al. (2017) [4]. The SDM guarantees plausible shapes and provides smoother deformation fields than our proposed method, as can be seen in Fig. 2. Nevertheless, it is outperformed by our method in terms of registration accuracy (\(p=10^{-30}\)), as reported in Table 1. This indicates that the constraint imposed on the deformation field by the SDM is too strong to correct for small-scale deformations. The lower SSIM value is due to the different atlas image used for the SDM-based method. This atlas has an average intensity profile, which cancels out the more subtle local intensity variations.
The second comparison method is a re-implementation of the work of Kasten et al. [7], in which the 3D binary labelmap of the femur is regressed directly from the biplanar radiographs, without deforming an atlas image. This method achieves a larger Dice score than our method (\(p=4\cdot 10^{-4}\)), but lacks information about the internal structures. As it does not regress the 3D intensity values, the problem is considerably simplified.
Table 1 Registration accuracy of our proposed method and comparison methods [7, 10]
Figure 2 shows a good alignment for our method between the input DRRs and the simulated perspective projections of the registered atlas images, including the cortical bone. The geometric distance error between the estimated and ground-truth surface model highlights the lesser trochanter as a challenging region to register accurately for all methods, while global structures like the femoral neck and shaft are more accurately reconstructed.
Sensitivity to inaccurate input
Our network requires calibrated radiographs as input, meaning that the corresponding projection matrix, parameterised by the intrinsic and extrinsic parameters, needs to be known. However, the orientation of an imaging system, such as a C-arm system, can never be determined exactly in practice, especially if the two projections are taken at different times and the patient moves between the acquisitions. In this experiment, we study how the uncertainty on the LAO/RAO projection angle affects the registration accuracy for projections that are in reality orthogonal. Figure 3 shows the evaluation metrics with respect to the difference between the ground-truth and input projection angle. For a discrepancy of 5\(^{\circ }\), the average Dice score drops from 0.94 to 0.90.
Generalised projection geometries
We retrained and evaluated the registration network on the DRR dataset with generalised projection angles. Instead of perfect AP and lateral DRRs, projections were randomly generated within a range of 60\(^{\circ }\) (±30\(^{\circ }\)) around the AP and lateral views. By training the network on such a generalised dataset, the network can be reused for any projection geometry within this range.
The overall average Dice score on the generalised validation dataset (\(N=2880\)) equals \(0.923\pm 0.033\). Figure 4 shows the median Dice scores for different combinations of LAO/RAO projection angles. The Dice score is maximal for near-orthogonal projection geometries, where the angle between the two projection directions is between 80\(^{\circ }\) and 110\(^{\circ }\). It is interesting to note that the projections do not necessarily need to correspond to perfect AP and lateral views.
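The angle between the two projection directions follows directly from the two LAO/RAO offsets. The relation below assumes that both offsets are measured in the same rotational sense about the same axis, with the cranio-caudal angle fixed at 0\(^{\circ }\). The symbols \(\alpha \), \(\theta _{AP}\) and \(\theta _{lat}\) are introduced here for illustration only:

$$\begin{aligned} \alpha&= \left| 90^{\circ } + \theta _{lat} - \theta _{AP} \right| , \qquad \theta _{AP},\, \theta _{lat} \in [-30^{\circ }, +30^{\circ }] \;\Rightarrow \; \alpha \in [30^{\circ }, 150^{\circ }] \end{aligned}$$

Under this assumption, the near-orthogonal band of 80\(^{\circ }\) to 110\(^{\circ }\) corresponds to \(-10^{\circ } \le \theta _{lat} - \theta _{AP} \le 20^{\circ }\), which can be satisfied by many angle pairs other than \(\theta _{AP} = \theta _{lat} = 0^{\circ }\).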
Ablation study
To study the effectiveness of individual components in our registration network, we re-trained our network, omitting some modules. We used the same dataset as in “Experimental results” section for training, validation, and testing. The evaluation metrics, listed in Table 2, are compared to the original results of “Experimental results” section by means of a two-sided paired t-test.
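The statistical comparison can be sketched with SciPy's paired t-test, which is two-sided by default. The per-subject Dice scores below are synthetic placeholders generated purely for illustration and are not results from this study. The sample size and variable names are likewise hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# HYPOTHETICAL per-subject Dice scores, NOT data from the paper.
dice_full = np.clip(rng.normal(0.94, 0.02, size=40), 0.0, 1.0)
dice_ablated = np.clip(dice_full - rng.normal(0.01, 0.01, size=40), 0.0, 1.0)

# Paired test: each subject is evaluated with both network variants.
result = stats.ttest_rel(dice_full, dice_ablated)

# The same statistic computed by hand from the paired differences.
diff = dice_full - dice_ablated
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
```

A paired test is appropriate here because both network variants are evaluated on the same test subjects, so the per-subject differences remove the between-subject variability from the comparison.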
Table 2 Quantitative results for the effectiveness of different network components
Effectiveness of affine network structure
In this experiment, the affine network of “Affine registration module” section was modified by removing the intermediate concatenations of AP and lateral feature maps. Instead, they were only combined at the end of the affine module, right before regressing the affine parameters. While this significantly worsens the affine initialisation, the local registration remains unaffected. This shows that the local registration has a large enough capture range to correct for variations left uncorrected by the affine initialisation.
Effectiveness of skip-connections
Removing the skip connections in the local network significantly reduces the registration accuracy \((p=10^{-3})\). It also increases the training time from 300 to 700 epochs, mainly because the affine network trains more slowly. This mismatch in training speed between the affine and local networks can be explained by the vanishing gradient problem: in deep neural networks, the gradient can become very small for the early layers, resulting in negligible parameter updates. The skip connections provide an alternative path for back-propagating the gradient of the loss function, which is essential for updating the early network layers.
Effectiveness of two separate 3D decoders
Instead of treating the AP and lateral feature maps separately by two distinct encoder-decoder modules, this network variation combines both feature maps at each level of the 2D encoder, similar to the affine network structure, and only contains one 3D decoder. Skip connections are included between the combined 2D feature maps and 3D decoder. The affine registration module remains the same as depicted in Fig. 1. The results in Table 2 show a highly significant reduction in the affine and local registration accuracy, indicating the preference to decode the 3D feature maps for each projection direction separately.
Effectiveness of inv-ProST layer
The inv-ProST layer is responsible for spatially aligning the decoded 3D feature maps into a common coordinate system, before regressing the deformation field. If the inv-ProST layer is left out and the 3D feature maps are directly concatenated instead, the registration accuracy is significantly reduced \((p<10^{-16})\).