Improved distinct bone segmentation in upper-body CT through multi-resolution networks

Purpose: Automated distinct bone segmentation from CT scans is widely used in planning and navigation workflows. U-Net variants are known to provide excellent results in supervised semantic segmentation. However, distinct bone segmentation from upper-body CTs requires both a large field of view and a computationally taxing 3D architecture. This leads either to low-resolution results lacking detail or, when using high-resolution inputs, to localisation errors due to missing spatial context.
Methods: We propose to solve this problem by using end-to-end trainable segmentation networks that combine several 3D U-Nets working at different resolutions. Our approach, which extends and generalizes HookNet and MRN, captures spatial information at a lower resolution and passes the encoded information via skip connections to the target network, which operates on smaller high-resolution inputs. We evaluated our proposed architecture against single-resolution networks and performed an ablation study on information concatenation and the number of context networks.
Results: Our proposed best network achieves a median DSC of 0.86 taken over all 125 segmented bone classes and reduces the confusion among similar-looking bones in different locations. These results outperform our previously published 3D U-Net baseline results on the task and distinct bone segmentation results reported by other groups.
Conclusion: The presented multi-resolution 3D U-Nets address current shortcomings in bone segmentation from upper-body CT scans by capturing a larger field of view while avoiding the cubic growth of the input pixels and intermediate computations that quickly outgrows the computational capacities in 3D. The approach thus improves the accuracy and efficiency of distinct bone segmentation from upper-body CT.


Introduction
Segmentation of bones is used in bone disease diagnosis, in image-based assessment of fracture risk [1] and bone density [2], for planning and navigation of interventions [3], and for post-treatment assessment.
Bone tissue segmentation from CT has been shown to work well using slicewise 2D CNN-based segmentation algorithms [4][5][6]. The tasks and solutions become more varied when moving from bone-tissue segmentation to distinct bone segmentation (our task), where we distinguish individual bones. Vertebrae segmentation has gained much attention, with many of the algorithms using multi-stage approaches and leveraging the sequential structure of the spine [7]. Rib segmentation has been tackled by [8], who use a point cloud approach targeted at leveraging their dataset's spatial sparsity. Carpal bone segmentation is performed from X-rays of hands that were placed on a flat surface [9].
Simultaneous segmentation of distinct bones of multiple groups is still relatively little studied. [10] segment 62 different bones from upper-body CT using an atlas-based approach and kinematic joint models. [11] use a multi-stage approach with a localisation network, shape models, and a segmentation network to segment 49 distinct bones of the upper body. Segmentation of bones of different groups in one shot can be used as a starting point for more fine-grained atlas segmentations [10], or as a guide for a follow-up inner organ segmentation [12]. Segmenting multiple structures at once can also benefit the segmentation accuracy: [13] found that their network trained on multiple bone classes outperformed the one-class networks.
The region of interest in upper-body or full-body CT scans is typically larger than the possible input sizes of 3D convolutional neural networks (CNNs). As a result, the input needs to be sampled as patches, restricting the input field of view to the patch size. This problem is exacerbated by the development of CT scanners that produce ever more highly resolved images. While a higher resolution allows for capturing more fine-grained details, a fixed-size input patch then covers a smaller body area.
In order to extend the field of view, larger input patches can be sampled. Using bigger patches, i.e. more input pixels, does not increase the number of trainable parameters in a fully convolutional network, but it does increase the number of necessary intermediate computations. Doubling the patch size in all three dimensions leads to at least eight times more forward and backward computations, which are taxing for the generally scarce GPU memory.
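The cubic growth mentioned above can be made concrete with a few lines of arithmetic. The following is a minimal sketch; the helper name `voxels` and the chosen patch sizes are illustrative assumptions, and the count only covers input voxels, not the full set of intermediate feature maps.

```python
def voxels(patch_size):
    """Number of voxels in a cubic patch with the given edge length."""
    return patch_size ** 3


# Compare a few cubic patch sizes against a 32^3 reference patch.
for size in (32, 64, 128):
    ratio = voxels(size) / voxels(32)
    print(f"{size}^3 patch: {voxels(size):>9,d} voxels ({ratio:.0f}x the 32^3 patch)")
```

Each doubling of the edge length multiplies the voxel count by eight, so two doublings already cost a factor of 64.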
Countermeasures fall into two categories. A) Keeping the resolution and input pixel size high, but reducing the computational load elsewhere. Those measures include reducing the batch size (not to be confused with the patch size), using a simpler model, or reducing the output size. All of these measures can hamper training and inference. B) Keeping a large field of view by using small patches of down-sampled inputs. This approach allows for a wider field of view at a constant input size, at the cost of losing detail information.
To decide upon the better of the two approaches presented above, the requirements for the task at hand need to be considered. A suitable network for our task of complete distinct bone segmentation from upper-body CT scans (see Figure 1) should provide the following: its field of view should be sufficiently large to distinguish similar bones at different body locations, e.g. left from right humerus or the fourth from the eighth rib, while keeping the computational burden feasible.
The merits of high-resolution inputs (accurate details) and low-resolution inputs (a larger field of view) can be combined in many ways. Cascaded U-Nets consist of two or more individual U-Nets that are trained consecutively. A first model is trained on downsampled input. Its one-hot encoded segmentation results are then upsampled, potentially cropped, and used as additional input channels for the following model at higher resolution [14]. These approaches all have the downside of requiring the training and sequential inference of multiple models. Instead, we focus on end-to-end trainable models here.
End-to-end trained multi-resolution architectures have been proposed in histopathology whole-slide segmentation. For example, MRN [15] combines a 2D target U-Net and one context encoder with drop-skip-connections crossing over at every level. MRN does not contain a context decoder or context loss and is studied on a binary segmentation problem. Another such architecture is HookNet [16], which contains both a target and a context 2D U-Net and two individual losses, but only uses skip connections in the bottleneck layer.
The purpose of our work is to address common segmentation errors that originate from a lack of global context while using 3D U-Nets for distinct bone segmentation. We propose to use a multi-resolution approach and present SneakyNet, an expansion and generalization of the MRN and HookNet architectures. We compare the segmentation accuracy, complexity, and run-time of baseline 3D U-Nets with the SneakyNet. We ablate the model components and find that the use of our generalized architecture improves the results over the HookNet and MRN variants. We will use our bone segmentation in conjunction with 3D rendering of anatomical images in augmented and virtual reality applications, where segmentations can be used on top of or in conjunction with existing transfer functions [17,18].

Materials and Methods
To assess the performance of SneakyNet on upper-body distinct bone segmentation, we train it on our in-house upper-body CT dataset, which has been described in [19]. We perform ablation studies on the combination of context and target information and on the optimal number of context networks.

SneakyNet Architecture
In general, SneakyNet consists of one target network and one or more context networks. The target network operates on high-resolution data and eventually produces the desired segmentation maps. The context networks operate on lower-resolution inputs spanning a larger field of view. Information is propagated from the context networks to the target network using crop-skip connections presented in Section 2.1.1. We present a detailed visual overview of the architecture with one context network in Figure 1.
In our previous work [20], we have explored the suitability of different 2D and 3D network architectures and parameter configurations for upper-body distinct bone segmentation. We found that there is little leeway in architectural choices due to the task's large required field of view and the many classes that are segmented in parallel. A lean 3D U-Net variant was found to work best [20]. We use this variant's architecture for our target and context U-Nets here. In our baseline computations, where we have only a target network and omit the context networks, we select the number of channels such that our variants and the baselines have approximately the same number of trainable parameters, to facilitate comparison. Input sizes to the network are required to be multiples of $2^{M-1}$, where $M$ denotes the number of levels of the U-Net. We use the basic architecture with $M = 5$ and therefore need multiples of 16 pixels in every dimension as input.
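The input-size constraint can be checked with a small helper. This is an illustrative sketch only; the function names `required_multiple` and `round_up_to_valid` are assumptions and not part of the published implementation.

```python
def required_multiple(num_levels):
    """Input sizes must be multiples of 2^(M-1) for a U-Net with M levels."""
    return 2 ** (num_levels - 1)


def round_up_to_valid(size, num_levels=5):
    """Smallest valid input size that is greater than or equal to `size`."""
    multiple = required_multiple(num_levels)
    # Ceiling division written with integer arithmetic only.
    return -(-size // multiple) * multiple


print(required_multiple(5))    # multiple required for M = 5
print(round_up_to_valid(100))  # next valid size for an edge length of 100
```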
For the target network we use inputs of size $(S_x, S_y, S_z)$ at full resolution. For each of the context networks we use that input plus its surrounding area, which together span a field of view of $2^{\kappa} \cdot (S_x, S_y, S_z)$. We display the case of $\kappa = 1$ in Figure 1, but also use context networks with $\kappa = 2$ and $\kappa = 3$ in our ablation studies. The context network inputs are down-sampled to reduce their size to $(S_x, S_y, S_z)$. We perform the down-sampling using $(2^{\kappa} \times 2^{\kappa} \times 2^{\kappa})$ average-pooling with a stride of $2^{\kappa}$. Both target and context network inputs eventually have a size of $(S_x, S_y, S_z)$, but at different resolutions and fields of view.
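The sampling of a target patch and a matching context patch can be sketched in NumPy as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, cubic patches are assumed, and the patch is assumed to lie fully inside the (already padded) image.

```python
import numpy as np


def average_pool_3d(volume, kappa):
    """Average-pool a 3D volume with a (2^k x 2^k x 2^k) window and stride 2^k."""
    f = 2 ** kappa
    sx, sy, sz = volume.shape
    assert sx % f == 0 and sy % f == 0 and sz % f == 0, "size must be divisible by 2^kappa"
    # Split every axis into (number of blocks, block size) and average within blocks.
    blocks = volume.reshape(sx // f, f, sy // f, f, sz // f, f)
    return blocks.mean(axis=(1, 3, 5))


def sample_target_and_context(image, centre, size, kappa):
    """Return a full-resolution target patch and a pooled context patch.

    Both outputs have shape (size, size, size) and share the same centre;
    the context patch spans a field of view of 2^kappa * size per dimension.
    """
    def crop(edge):
        starts = [c - edge // 2 for c in centre]
        return image[tuple(slice(s, s + edge) for s in starts)]

    target = crop(size)
    context = average_pool_3d(crop(size * 2 ** kappa), kappa)
    return target, context


image = np.random.rand(64, 64, 64)
target, context = sample_target_and_context(image, centre=(32, 32, 32), size=16, kappa=1)
print(target.shape, context.shape)
```

Because both patches share their centre, the central half of the context patch covers exactly the same region as the target patch, only at a coarser resolution.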

Crop-skip connections
We use crop-skip connections to transfer information from the context to the target branch. We crop the encoder output at the desired level m such that only the centre cube of half the size per dimension remains. This centre cube is now spatially aligned to the input of the target branch. We concatenate the centre cube to the next lower level m + 1 of the target decoder to match the spatial size. We refer to the central cropping and subsequent concatenation into a lower level of the target branch as a crop-skip connection. A detailed schematic of the crop-skip connection is depicted in Figure 2. We explore three network configurations, which differ in their number of crop-skip connections and their use of a context loss, and compare them to a baseline U-Net. A visual comparison of the architectures is given in Figure 3 and the parameters are provided in Table 1.
• A - Baseline: 3D U-Net with optimal configuration found for the task [20].
• B - HookNet: One context network with a single crop-skip connection is added to the target network. The crop-skip connection enters the target network at its bottleneck layer. This configuration is used in [16].
• C - MRN: Crop-skip connections connect the context encoder and the target decoder at every level. There is neither a context decoder nor a context loss function. This configuration is used in [15].
• D - proposed SneakyNet: Crop-skip connections connect all levels of the context and target networks. The context network has a decoder with its own loss function.
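The crop-skip connection used in configurations B to D can be sketched on plain arrays as follows. This is a simplified illustration under stated assumptions: a channels-last layout (X, Y, Z, C), a single context network with κ = 1, and hypothetical function names; the convolutions that follow the concatenation are omitted.

```python
import numpy as np


def centre_crop(features):
    """Keep the centre cube of half the size per spatial dimension."""
    spatial = features.shape[:3]
    # Slicing only the three spatial axes leaves the channel axis untouched.
    slices = tuple(slice(s // 4, s // 4 + s // 2) for s in spatial)
    return features[slices]


def crop_skip(context_encoder_m, target_encoder_m1, target_decoder_m1):
    """Concatenate the cropped context features into target decoder level m + 1."""
    cropped = centre_crop(context_encoder_m)
    assert cropped.shape[:3] == target_decoder_m1.shape[:3], "spatial sizes must match"
    return np.concatenate([target_decoder_m1, target_encoder_m1, cropped], axis=-1)


# Context encoder output at level m, and target features at level m + 1.
context_m = np.random.rand(16, 16, 16, 8)
target_enc = np.random.rand(8, 8, 8, 16)
target_dec = np.random.rand(8, 8, 8, 16)
print(crop_skip(context_m, target_enc, target_dec).shape)
```

Cropping to half the size at level m yields the same spatial size as the target features one level lower, which is why the concatenation happens at level m + 1.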

Training
Our dataset is split into 11 scans for training, 2 for validation and 3 for testing. We use 5-fold cross-validation, ensuring that every scan appears in the test set of precisely one cross-validation fold. The loss is an unweighted sum of the target network's loss and the losses of the K context networks. For each network, we use the sum of the cross-entropy loss $L_{\text{X-Ent}}$ and the Dice loss $L_{\text{DSC}}$ [21]. As in [20], we compute the Dice loss for every class separately, sum over the classes, and normalize by the number of classes. We optimized the network weights using the Adam optimizer with an initial learning rate of 0.001. We trained our networks for 100,000 iterations, until convergence was observed.

Fig. 4 Confusion matrix among the long bones of the arms and legs. With our method, there is considerably less confusion between the left and right sides of the body and between arm and leg bones.
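The combined training loss described in the Training section could be sketched as follows. This is a NumPy illustration rather than the TensorFlow implementation used for training; the soft Dice formulation, the smoothing constant `eps`, and all function names are assumptions, since the exact form from [21] is not restated here.

```python
import numpy as np


def cross_entropy(probs, onehot, eps=1e-7):
    """Mean categorical cross-entropy; the last axis holds the classes."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=-1)))


def dice_loss(probs, onehot, eps=1e-7):
    """Soft Dice loss computed per class, then averaged over the classes."""
    axes = tuple(range(probs.ndim - 1))  # all axes except the class axis
    intersection = np.sum(probs * onehot, axis=axes)
    denominator = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(np.mean(1.0 - dice_per_class))


def total_loss(predictions, labels):
    """Unweighted sum over the target network and the K context networks.

    `predictions` and `labels` are lists ordered as
    [target, context_1, ..., context_K].
    """
    return sum(cross_entropy(p, y) + dice_loss(p, y)
               for p, y in zip(predictions, labels))


labels = np.eye(3)[np.array([0, 1, 2, 0])]   # four voxels, three classes
uniform = np.full_like(labels, 1.0 / 3.0)    # uninformative prediction
print(total_loss([uniform, uniform], [labels, labels]))
```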
Our input images are padded by $(S - S_{\text{target}})/2$ all around using edge value padding. The padding step ensures that we can sample high-resolution patch centres right up to the image's border.
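Edge-value padding of this kind is available in NumPy through `np.pad` with `mode="edge"`. The sketch below assumes that S refers to the field of view of the largest context network in pixels, which is an interpretation and not stated explicitly; the function name `pad_for_context` is likewise hypothetical.

```python
import numpy as np


def pad_for_context(image, target_size, context_fov):
    """Pad every side of the image by (S - S_target) / 2 using edge values."""
    margin = (context_fov - target_size) // 2
    return np.pad(image, margin, mode="edge")


image = np.random.rand(40, 40, 40)
padded = pad_for_context(image, target_size=32, context_fov=64)
print(image.shape, "->", padded.shape)
```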
We implemented and trained our networks using TensorFlow Keras 2.5. All training and inference were conducted on NVIDIA Quadro RTX 6000 GPUs with 24 GB of memory.

Evaluation
We evaluate the performance of our models using a class-wise Dice Score Coefficient (DSC). To indicate the performance over all classes, we give the median and the 16th and 84th percentiles (1σ) over all classes c. To avoid giving a distorted impression of the distribution, we exclude classes where no true positives of c have been detected and therefore $\mathrm{DSC}_c = 0$. To account for this omission, we report the percentage of included classes as 'non-zero DSC' in Table 2 and Table 3.
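The evaluation protocol can be sketched as follows. This is an illustrative implementation with hypothetical function names; treating classes that occur in neither prediction nor ground truth as undefined (NaN), and computing the 'non-zero DSC' percentage relative to the remaining classes, are assumptions not spelled out in the text.

```python
import numpy as np


def classwise_dsc(pred, truth, num_classes):
    """Dice score per class for integer label maps of identical shape."""
    scores = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p, t = pred == c, truth == c
        denominator = p.sum() + t.sum()
        if denominator > 0:  # class occurs in prediction or ground truth
            scores[c] = 2.0 * np.logical_and(p, t).sum() / denominator
    return scores


def summarise(scores):
    """Median, 16th and 84th percentiles over classes with non-zero DSC."""
    defined = scores[~np.isnan(scores)]
    nonzero = defined[defined > 0]
    low, high = np.percentile(nonzero, [16, 84])
    percent_nonzero = 100.0 * nonzero.size / defined.size
    return float(np.median(nonzero)), float(low), float(high), percent_nonzero


truth = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 0, 1, 1])
print(summarise(classwise_dsc(pred, truth, num_classes=3)))
```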

Results and Discussion
Our experiments show how automated distinct bone segmentation can be improved using a multi-resolution approach. We evaluate our results on multiple target resolutions with different numbers of context networks and field of view sizes and perform an ablation study to determine the most beneficial way to combine context and target network information.
We evaluated some of the most common errors when using a baseline segmentation method. We found that the missing context information leads to similar-looking bones in different body regions being mistaken for one another. In the confusion matrix presented in Figure 4, we observe that when using a baseline 3D U-Net, humerus pixels were predicted as femur, and the left and right humerus were confused for one another (right confusion matrix). When using context information, these errors are reduced almost entirely (left confusion matrix).
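A confusion matrix of this kind can be computed from integer label maps in a few lines. The sketch below is a generic illustration, not the evaluation code behind Figure 4; the function name and the convention of rows for true classes and columns for predicted classes are assumptions.

```python
import numpy as np


def confusion_matrix(truth, pred, num_classes):
    """Count voxels per (true class, predicted class) pair.

    Both inputs must be integer arrays with labels in [0, num_classes).
    """
    # Encode each (true, predicted) pair as a single index, then count.
    index = truth.ravel() * num_classes + pred.ravel()
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


truth = np.array([0, 0, 1, 1, 2])
pred = np.array([0, 1, 1, 1, 0])
print(confusion_matrix(truth, pred, num_classes=3))
```

Off-diagonal entries correspond to confusions such as humerus voxels predicted as femur.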
We performed an ablation study to see how different strategies of combining the context and target information within the network perform. In Table 2 we present the quantitative results. For both target patch sizes, 32 and 64, all strategies (B-D) improve upon the baseline 3D U-Net (A). The observed effect is substantially bigger when using the smaller target patch size of $32^3$, where the median DSC rises from 0.64 to 0.75. On the bigger target patches, the median DSC still increases from 0.83 to 0.86.
The combination of skip connections at every level and a context loss function in our proposed architecture increases the accuracy further, as compared to the HookNet [16] and the MRN [15].
In Table 3 we ablate the influence of different numbers of context networks and input patch sizes. Qualitative results are depicted in Figure 5. Comparing the baseline 3D U-Nets with the SneakyNet results, we see that adding context networks to very small target patches of $32^3$ pixels almost reaches the performance of our baseline networks operating on $64^3$ patches. Going up, the SneakyNet operating on patch size $64^3$ even outperforms the baseline 3D U-Net of patch size $128^3$. We recall that we had to reduce the number of channels in the baseline $128^3$ network due to memory constraints. Our ablation results suggest that the addition of context networks is more valuable when memory limits are reached. When considering the different FOVs of the context networks, we observe the best results when including context FOVs of $128^3$. This covers roughly half of the L-R and A-P dimensions of the scans and seems to contain the necessary information to correctly locate bones, see e.g. the purple lumbar vertebra in Figure 5, which is correctly located in cases where the context FOV reaches $128^3$.
We provide a comparison to other results published on distinct bone segmentation in Table 4. While a direct comparison is difficult due to different datasets, our results compare favourably to both the convolutional neural network and shape model approach by [11], and to the hierarchical atlas segmentation by [10].

Conclusion
This work presents improvements in distinct bone segmentation from upper-body CT. The proposed multi-resolution networks use additional inputs at a lower resolution but with a larger field of view to provide the necessary context information to assign the proper bone classes. We compared three different ways of combining the context and target information and evaluated the results using zero to three context networks. Using context networks improves the segmentation results on all target patch sizes.

Fig. 1 Task overview: We segment 125 distinct bones from upper-body CT scans using SneakyNet, a multi-encoder-decoder network which incorporates inputs at various resolutions. The example here features one context network, but multiple are possible.

Fig. 2 Detailed view of the architecture. Displayed are only two out of five levels of the U-Nets. Left: The context U-Net working on low-resolution data with a larger field of view. Right: The U-Net working with the central cropped high-resolution data. After all encoder convolutions of level m, a cropped copy of the output is skipped to the target decoder at level m + 1. The decoder receives skip connections from its own encoder and the context network. The intermediate results of the decoder and both skip connections are concatenated along the channel axis before undergoing further convolutions.

Fig. 3 Schematic of the four network configurations used in our ablation study. A shows a baseline U-Net, while B, C, and D show different possibilities of how to insert information into the target network; see also Section 2.1.1 for a written description.

Fig. 5 Qualitative prediction results from our ablation study comparing different numbers of context networks at various resolutions. The first four results from the left were obtained using a target patch size of 32 px per dimension (turquoise), and the remaining three scans with target patch sizes of 64 px per dimension (light blue). The grey areas indicate the field of view of the context networks. The sizes of the squares are proportional to the prediction sizes.

Table 1
Comparison of architectures with different field of view (FOV) of their target and context network(s). Operating the full 3D U-Net on patches of size $128^3$ exceeds the available GPU memory. *

Table 2
Ablation results in DSC for different model configurations.

Table 3
Ablation results for the number of context networks in the SneakyNet architecture (D). Zero context networks corresponds to the baseline 3D U-Nets (A) with different input patch sizes.

Table 4
Comparison of our best-performing SneakyNet (D, target patch size of $64^3$ and one context network with a FOV of $128^3$ pixels) to other work on distinct bone segmentation from upper-body CT. Results are in DSC.