1 Introduction

With the increasing demand for fine-grained medical care, computer-assisted diagnosis (CAD) has attracted more and more attention in the past decade. An important prerequisite of CAD is an intelligent system to process and analyze medical data, such as CT and MRI scans. In the area of medical imaging analysis, organ segmentation is a traditional and fundamental topic [2]. Researchers often design a specific system for each organ to capture its properties. In comparison to large organs (e.g., the liver, the kidneys, the stomach, etc.), small organs such as the pancreas are more difficult to segment, which is partly caused by their highly variable geometric properties [9].

In recent years, with the arrival of the deep learning era [6], powerful models such as convolutional neural networks [7] have been transferred from natural image segmentation to organ segmentation. One key difference is that organ segmentation requires dealing with volumetric data, for which two types of solutions have been proposed. The first one trains 2D networks on three orthogonal planes and fuses the segmentation results [9, 17, 18], and the second one trains a 3D network directly [4, 8, 19]. However, 3D networks are more computationally expensive and less stable when trained from scratch, and it is difficult to find a pre-trained model for medical purposes. In the scenario of limited training data, fine-tuning a pre-trained 2D network [7] is a safer choice [14].

This paper presents an alternative framework, which trains 2D segmentation models and uses a light-weighted 3D network, named Volumetric Fusion Net (VFN), to fuse the 2D segmentation results at a late stage. A similar idea has been studied before based on either the EM algorithm [1] or pre-defined operations in a 2D scenario [16], but we propose instead to construct generalized linear operations (convolution) and allow them to be learned from training data. Because it is built on top of reasonable 2D segmentation results, VFN is relatively shallow and does not use fully-connected layers (which contribute a large fraction of network parameters) to improve its discriminative ability. In the training process, we first optimize 2D segmentation networks on different viewpoints individually (this strategy was studied in [12, 13, 18]), and then use the validation set to train VFN. When the amount of training data is limited, we suggest a cross-cross-augmentation strategy that reuses the data to train both the 2D segmentation and the 3D fusion networks.

We first apply our system to a public dataset for pancreas segmentation [9]. Based on the state-of-the-art 2D segmentation approaches [17, 18], VFN produces a consistent accuracy gain and outperforms other fusion methods, including majority voting and statistical fusion [1]. In comparison to 3D networks such as [19], our framework achieves comparable segmentation accuracy using fewer computational resources, e.g., using only \(10\%\) of the parameters and being 3\(\times \) faster at the testing stage (it adds only \(10\%\) computation beyond the 2D baselines). We also generalize our framework to other small organs such as the adrenal glands and the duodenum, and verify its favorable performance.

2 Our Approach

2.1 Framework: Fusing 2D Segmentation into a 3D Volume

We denote an input CT volume by \(\mathbf {X}\). This is a \(W\times H\times L\) volume, where W, H and L are the numbers of voxels along the coronal, sagittal and axial directions, respectively. The i-th voxel of \(\mathbf {X}\), \(x_i\), is the intensity (Hounsfield Unit, HU) at the corresponding position, \({i}={\left( 1,1,1\right) ,\ldots ,\left( W,H,L\right) }\). The ground-truth segmentation of an organ is denoted by \(\mathbf {Y}^\star \), which has the same dimensionality as \(\mathbf {X}\). If the i-th voxel belongs to the target organ, we set \({y_i^\star }={1}\), otherwise \({y_i^\star }={0}\). The goal of organ segmentation is to design a function \(\mathbf {g}\!\left( \cdot \right) \), so that \({\mathbf {Y}}={\mathbf {g}\!\left( \mathbf {X}\right) }\), with all \({y_i}\in {\left\{ 0,1\right\} }\), is close to \(\mathbf {Y}^\star \). We measure the similarity between \(\mathbf {Y}\) and \(\mathbf {Y}^\star \) by the Dice-Sørensen coefficient (DSC): \({\mathrm {DSC}\!\left( \mathbf {Y},\mathbf {Y}^\star \right) }={\frac{2\,\times \,\left| \mathcal {Y}\,\cap \,\mathcal {Y}^\star \right| }{\left| \mathcal {Y}\right| \,+\,\left| \mathcal {Y}^\star \right| }}\), where \({\mathcal {Y}^\star }={\left\{ i\mid y_i^\star =1\right\} }\) and \({\mathcal {Y}}={\left\{ i\mid y_i=1\right\} }\) are the sets of foreground voxels.
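As an illustration of the DSC definition above, the following minimal sketch computes the coefficient directly from the two sets of foreground voxel indices. The function name `dice_coefficient`, the toy voxel sets, and the convention for two empty masks are assumptions made for this example only.

```python
def dice_coefficient(pred, truth):
    """DSC between two collections of foreground voxel indices (i, j, k)."""
    pred, truth = set(pred), set(truth)
    denom = len(pred) + len(truth)
    if denom == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * len(pred & truth) / denom


# Toy example: 4 ground-truth voxels, 3 predicted voxels, 2 in common.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice_coefficient(pred, truth)  # 2 * 2 / (3 + 4) = 4/7
```

Because the denominator sums the two set sizes, a small organ with few foreground voxels is penalized heavily for each missed or spurious voxel, which is one reason small organs tend to report lower DSC.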

There are, in general, two ways to design \(\mathbf {g}\!\left( \cdot \right) \). The first one trains a 3D model to deal with volumetric data directly [4, 8], and the second one works by cutting the 3D volume into slices, and using 2D networks for segmentation. Both 2D and 3D approaches have their advantages and disadvantages. We appreciate the ability of 3D networks to take volumetric cues into consideration (radiologists also exploit 3D information to make decisions), but, as shown in Sect. 3.2, 3D networks are sometimes less stable, arguably because we need to train all weights from scratch, while the 2D networks can be initialized with pre-trained models from the computer vision literature [7]. In addition, processing volumetric data (e.g., 3D convolution) often requires heavier computation in both training and testing (e.g., requiring \(3\times \) testing time, see Table 1).

In mathematical terms, let \(\mathbf {X}_l^\mathrm {A}\), \({l}={1,2,\ldots ,L}\) be a 2D slice (of \(W\times H\)) along the axial view, and \({\mathbf {Y}_l^\mathrm {A}}={\mathbf {s}^\mathrm {A}\!\left( \mathbf {X}_l^\mathrm {A}\right) }\) be the segmentation score map for \(\mathbf {X}_l^\mathrm {A}\). \(\mathbf {s}^\mathrm {A}\!\left( \cdot \right) \) can be a 2D segmentation network such as FCN [7], or a multi-stage system such as a coarse-to-fine framework [18]. Stacking all \(\mathbf {Y}_l^\mathrm {A}\)’s yields a 3D volume \({\mathbf {Y}^\mathrm {A}}={\mathbf {s}^\mathrm {A}\!\left( \mathbf {X}\right) }\). This slicing-and-stacking process can be performed along each axis independently. Due to the large image variation in different views, we train three segmentation models, denoted by \(\mathbf {s}^\mathrm {C}\!\left( \cdot \right) \), \(\mathbf {s}^\mathrm {S}\!\left( \cdot \right) \) and \(\mathbf {s}^\mathrm {A}\!\left( \cdot \right) \), respectively. Finally, a fusion function \(\mathbf {f}\!\left[ \cdot \right] \) integrates them into the final prediction:

$$\begin{aligned} {\mathbf {Y}}={\mathbf {f}\!\left[ \mathbf {X},\mathbf {Y}^\mathrm {C},\mathbf {Y}^\mathrm {S},\mathbf {Y}^\mathrm {A}\right] }={\mathbf {f}\!\left[ \mathbf {X},\mathbf {s}^\mathrm {C}\!\left( \mathbf {X}\right) ,\mathbf {s}^\mathrm {S}\!\left( \mathbf {X}\right) ,\mathbf {s}^\mathrm {A}\!\left( \mathbf {X}\right) \right] }. \end{aligned} \tag{1}$$

Note that we allow the image \(\mathbf {X}\) to be incorporated. This is related to the idea known as auto-contexts [15] in computer vision. As we shall see in experiments, adding \(\mathbf {X}\) improves the quality of fusion considerably. Our goal is to equip \(\mathbf {f}\!\left[ \cdot \right] \) with partial abilities of 3D networks, e.g., learning simple, local 3D patterns.

2.2 Volumetric Fusion Net

The VFN approach is built upon the 2D segmentation volumes from three orthogonal (coronal, sagittal and axial) planes. Powered by state-of-the-art deep networks, these results are generally accurate (e.g., an average DSC of over 82% [18] on the NIH pancreas segmentation dataset [9]). But, as shown in Fig. 2, some local errors still occur because 2 out of 3 views fail to detect the target. Our assumption is that these errors can be recovered by learning and exploiting the 3D image patterns in the surrounding region.

Regarding other choices, majority voting obviously cannot take image patterns into consideration. The STAPLE algorithm [1], while being effective in multi-atlas registration, does not have a strong ability of fitting image patterns from training data. We shall see in experiments that STAPLE is unable to improve segmentation accuracy over majority voting.
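For reference, the majority-voting baseline discussed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each view's score map has already been binarized into a set of foreground voxel indices, and the name `majority_vote` and the toy voxel sets are hypothetical.

```python
from collections import Counter


def majority_vote(*views):
    """Keep the voxels marked as foreground by more than half of the views."""
    votes = Counter(voxel for view in views for voxel in set(view))
    needed = len(views) // 2 + 1  # 2 votes when fusing 3 views
    return {voxel for voxel, count in votes.items() if count >= needed}


# Toy binarized predictions from the coronal, sagittal and axial views.
coronal = {(1, 1, 1), (1, 1, 2), (1, 2, 2)}
sagittal = {(1, 1, 1), (1, 2, 2), (2, 2, 2)}
axial = {(1, 1, 1), (1, 1, 2), (3, 3, 3)}
fused = majority_vote(coronal, sagittal, axial)
```

In this toy case the voxel `(3, 3, 3)` is detected by the axial view only and is therefore discarded, regardless of what the image looks like around it. This is the failure mode described above: when 2 out of 3 views miss the target, voting has no access to image patterns that could recover it.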

Motivated by the need to learn local patterns, we equip VFN with a small input region (\(64^3\)) and a shallow structure, so that each neuron has a small receptive field (the largest region seen by an output neuron is \(50^3\)). In comparison, in the 3D network VNet [8], these numbers are \(128^3\) and \(551^3\), respectively. This brings twofold benefits. First, we can sample more patches from the training data and the number of parameters is much smaller, so the risk of over-fitting is alleviated. Second, VFN is more computationally efficient than 3D networks, e.g., even with the 2D segmentation time included, it needs only half the testing time of [19].

Fig. 1. The network structure of VFN. We only display one down-sampling stage and one up-sampling stage, but there are 3 of each. Each down-sampling stage shrinks the spatial resolution by 1/2 and doubles the number of channels. We build 3 highway connections (2 are shown). We perform batch normalization and ReLU activation after each convolutional and deconvolutional layer.

The architecture of VFN is shown in Fig. 1. It has three down-sampling stages and three up-sampling stages. Each down-sampling stage is composed of two \(3\times 3\times 3\) convolutional layers and a \(2\times 2\times 2\) max-pooling layer with a stride of 2, and each up-sampling stage is implemented by a single \(4\times 4\times 4\) deconvolutional layer with a stride of 2. Following other 3D networks [8, 19], we also build a few residual connections [5] between hidden layers of the same scale. For our problem, this enables the network to preserve a large fraction of 2D network predictions (which are generally of good quality) and focus on refining them (note that if all weights in convolution are identical, then VFN is approximately equivalent to majority voting). Experiments show that these highway connections lead to faster convergence and higher accuracy. A final convolution of a \(1\times 1\times 1\) kernel reduces the number of channels to 1.
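The spatial sizes implied by this architecture can be checked with simple shape arithmetic. The sketch below is illustrative only: the text does not state the padding of each layer, so it assumes a padding of 1 for the \(3\times 3\times 3\) convolutions and for the \(4\times 4\times 4\) deconvolutions, which are the values that keep same-scale layers aligned. All function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel


def vfn_spatial_sizes(size=64, stages=3):
    """Edge length of the feature cube after each down- and up-sampling stage."""
    sizes = [size]
    for _ in range(stages):
        size = conv_out(size, 3, pad=1)     # first 3x3x3 convolution (assumed pad 1)
        size = conv_out(size, 3, pad=1)     # second 3x3x3 convolution (assumed pad 1)
        size = conv_out(size, 2, stride=2)  # 2x2x2 max-pooling with stride 2
        sizes.append(size)
    for _ in range(stages):
        size = deconv_out(size, 4, 2, pad=1)  # 4x4x4 deconvolution (assumed pad 1)
        sizes.append(size)
    return sizes


sizes = vfn_spatial_sizes()
```

Under these assumptions a \(64^3\) input is reduced to \(8^3\) at the bottleneck and restored to \(64^3\) at the output, so each connection between layers of the same scale joins feature maps of matching size.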

The input layer of VFN consists of 4 channels, 1 for the original image and 3 for 2D segmentations from different viewpoints. The input values in each channel are normalized into \(\left[ 0,1\right] \). By this we provide equally-weighted information from the original image and 2D multi-view segmentation results, so that VFN can fuse them at an early stage and learn from data automatically. We verify in experiments that image information is important – training a VFN without this input channel shrinks the average accuracy gain by half.

2.3 Training and Testing VFN

We train VFN from scratch, i.e., all convolutional weights are initialized with random white noise. Note that setting all weights to 1 mimics majority voting, and we find that both ways of initialization lead to similar testing performance. All \(64\times 64\times 64\) volumes are sampled from the region-of-interest (ROI) of each training case, defined as the bounding box covering all foreground voxels padded by 32 voxels in each dimension. We introduce data augmentation by performing random \(90^\circ \)-rotation and flip in 3D space (each cube has 24 variants). We use a Dice loss to avoid background bias (a voxel is more likely to be predicted as background, because background voxels form the majority in training). We train VFN for \(30\mathrm {,}000\) iterations with a mini-batch size of 16. We start with a learning rate of 0.01, and divide it by 10 after \(20\mathrm {,}000\) and \(25\mathrm {,}000\) iterations, respectively. The entire training process requires approximately 6 h on a Titan-X-Pascal GPU. In the testing process, we use a sliding window with a stride of 32 in the ROI (the minimal 3D box covering all foreground voxels of multi-plane 2D segmentation fused by majority voting). For an average pancreas in the NIH dataset [9], testing VFN takes around 5 s.
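The ROI and sliding-window geometry described above can be sketched as follows. This is a minimal illustration under stated assumptions: the padded box is clipped to the volume boundary, and the last window along each axis is shifted back so that it ends at the ROI border. Neither detail is specified in the text, and the names `padded_roi` and `window_starts` are hypothetical.

```python
def padded_roi(voxels, shape, pad=32):
    """Bounding box of the foreground voxels, padded and clipped to the volume.

    Returns (lo, hi) describing the half-open box [lo, hi) along each axis.
    """
    lo = [max(min(v[d] for v in voxels) - pad, 0) for d in range(3)]
    hi = [min(max(v[d] for v in voxels) + 1 + pad, shape[d]) for d in range(3)]
    return lo, hi


def window_starts(lo, hi, size=64, stride=32):
    """Window origins along one axis (assumed: last window ends at hi)."""
    last = max(hi - size, lo)  # if the ROI is shorter than one window, start at lo
    starts = list(range(lo, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts


# Toy case: two foreground voxels that span the organ's extent.
voxels = {(100, 120, 40), (140, 200, 90)}
lo, hi = padded_roi(voxels, shape=(512, 512, 240))
starts = [window_starts(lo[d], hi[d]) for d in range(3)]
```

Calling `padded_roi` with `pad=0` gives the minimal box used at testing time. With a stride of 32 and a window of 64, neighboring windows overlap by half, so most voxels in the ROI are covered by more than one window and the overlapping outputs need to be combined, e.g., by averaging (also an assumption here).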

An important issue in optimizing VFN is how to construct the training data. Note that we cannot reuse the data used for training the segmentation networks to train VFN, because the input channels would then contain very accurate segmentations, which prevents VFN from learning meaningful local patterns and generalizing to the testing scenarios. So, we further split the training set into two subsets, one for training the 2D segmentation networks and the other for training VFN on the segmentation results that these networks produce for it (i.e., on cases they have not seen during training).

Table 1. Comparison of segmentation accuracy (DSC, %) and testing time (in minutes) between our approach and the state of the art on the NIH dataset [9]. Both [18] and [17] are reimplemented by ourselves, and the default fusion is majority voting.

However, under most circumstances, the amount of training data is limited. For example, in the NIH pancreas segmentation dataset, each fold in cross-validation has only 60 training cases. Partitioning them into two subsets harms the accuracy of both 2D segmentation and fusion. To avoid this, we suggest a cross-cross-augmentation (CCA) strategy, described as follows. Suppose we split the data into K folds for cross-validation, and the \(k_1\)-th fold is left for testing. For all \({k_2}\ne {k_1}\), we train 2D segmentation models on the folds in \(\left\{ 1,2,\ldots ,K\right\} \setminus \left\{ k_1,k_2\right\} \), and test on the \(k_2\)-th fold to generate training data for the VFN. In this way, all data are used for training both the segmentation model and the VFN. The price is that a total of \(K\left( K-1\right) /2\) extra segmentation models need to be trained (one per unordered pair \(\left\{ k_1,k_2\right\} \)), which is more costly than training K models in a standard cross-validation. In practice, this strategy improves the average segmentation accuracy by \(\sim \)1% in each fold. Note that we perform CCA only on the NIH dataset due to the limited amount of data; in our own dataset, we perform a standard training/testing split, requiring \({<}\)10% extra training time and negligible extra testing time.
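The bookkeeping behind CCA can be made concrete with a short sketch. It enumerates, for K folds, the set of 2D models that have to be trained and the folds each one is trained on. The function names `cca_plan` and `vfn_training_folds` are hypothetical, and the sketch covers only the fold assignment, not the training itself.

```python
from itertools import combinations


def cca_plan(K):
    """Map each held-out pair (k1, k2) to the folds its 2D models are trained on.

    The pair is unordered: the model trained without folds k1 and k2 serves
    both the case where k1 is the test fold and the case where k2 is.
    """
    folds = set(range(1, K + 1))
    return {pair: sorted(folds - set(pair))
            for pair in combinations(sorted(folds), 2)}


def vfn_training_folds(K, k1):
    """Folds whose held-out 2D predictions train the VFN when k1 is the test fold."""
    return [k2 for k2 in range(1, K + 1) if k2 != k1]


plan = cca_plan(4)  # K = 4, as used for the NIH dataset
```

For K = 4 this gives 6 extra models, matching \(K\left( K-1\right) /2\); for example, the model for the pair (1, 2) is trained on folds 3 and 4. When fold 1 is held out for testing, the VFN is trained on the predictions for folds 2, 3 and 4, each produced by a model that has seen neither that fold nor fold 1.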

3 Experiments

3.1 The NIH Pancreas Segmentation Dataset

We first evaluate our approach on the NIH pancreas segmentation dataset [9] containing 82 abdominal CT volumes. The width and height of each volume are both 512, and the number of slices along the axial axis varies from 181 to 466. We split the dataset into 4 folds of approximately the same size, and apply cross-cross-augmentation (see Sect. 2.3) to improve segmentation accuracy.

Results are summarized in Table 1. We use two recent 2D segmentation approaches as our baselines, and compare VFN with two other fusion approaches, namely majority voting and non-local STAPLE (NLS) [1]. The latter has been shown to be more effective than its earlier local version. We measure segmentation accuracy using DSC and report the average accuracy over 82 cases. Based on [18], VFN improves on majority voting significantly, by an average of \(1.69\%\). The improvement over 82 cases is consistent (Student's t-test reports a p-value of \(6.9\times 10^{-7}\)), although the standard deviation over 82 cases is relatively large; this is mainly caused by the difference in difficulty from case to case. Figure 2 shows an example on which VFN produces a significant accuracy gain. VFN does not improve [17] significantly, arguably because [17] has almost reached human-level agreement (we invited a radiologist to segment this dataset independently, and she achieved an average accuracy of \(\sim \)86%). Note that the other approaches without CCA used both the training and validation folds for training, and so all numbers in Table 1 are comparable.

Fig. 2. Two typical examples, each with the original image, segmentation results from three viewpoints, and different fusion results. In each label map, red, green and yellow indicate ground-truth, prediction and overlap, respectively.

Consistent with our analysis in Sect. 2.2, NLS does not produce any accuracy gain over either [18] or [17]. NLS is effective in multi-atlas registration, where the labels come from different images and the annotation is relatively accurate [1]. But in our problem, segmentation results from 2D networks can be noisy, so recovering from these errors requires learning local image patterns from training data, which is what allows VFN to outperform NLS.

To reveal the importance of image information, we train a VFN without the image channel in the input layer. Based on [18], this version produces approximately half of the improvement (\(1.69\%\)) by the full model. We show an example in Fig. 2, in which the right part of the pancreas is missing in both sagittal and axial planes, but the high confidence in the coronal plane and the continuity of image intensities suggest its presence in the final segmentation.

3.2 Our Multi-organ Dataset

The radiologists in our team collected a dataset of 300 high-resolution CT scans. These scans were performed on potential renal donors. Four experts in abdominal anatomy annotated 11 abdominal organs, taking 3–4 h for each scan, and all annotations were verified by an experienced board-certified abdominal radiologist. In addition to the pancreas, we choose several challenging targets, including the adrenal glands, the duodenum, and the gallbladder (easy cases such as the liver and the kidneys are not considered). We use 150 cases for training 2D segmentation models, 100 cases for training VFN, and test on the remaining 50 cases. The data split is random but identical for different organs.

Table 2. Comparison of segmentation accuracy (DSC, \(\%\)) on our multi-organ dataset. The baseline for [18] and [17] is majority voting. The numbers of [17] are different from those in their original paper, because we are using a different dataset.

Results are shown in Table 2. Again, our approach consistently improves 2D segmentation, which demonstrates the transferability of our methodology. For the pancreas, based on [17], we obtain a p-value of \(2.7\times 10^{-5}\) over 50 testing cases. For the adrenal glands, although the average accuracy gains are not large, the improvement is significant in some badly segmented cases, e.g., Fig. 2 shows two examples with more than \(20\%\) accuracy boosts. Refining bad segmentations makes our segmentation results more reliable. By contrast, the 3D network [19] produces unstable performance ([19] was designed for pancreas segmentation, and thus works reasonably well on the pancreas), which is mainly caused by the limited training data, especially for small organs such as the adrenal glands and the gallbladder.

Therefore, we conclude that 2D segmentation followed by 3D fusion is currently a very promising idea to bridge the gap between 2D and 3D segmentation approaches, particularly if there is limited training data.

4 Conclusions

In this paper, we discuss an important topic in medical imaging analysis, namely bridging the gap between 2D and 3D organ segmentation approaches. We propose to train more stable 2D segmentation networks, and then use a light-weighted 3D fusion module to fuse their results. In this way, we enjoy the benefits of exploiting 3D information to improve segmentation, as well as avoiding the risk of over-fitting caused by tuning 3D models (which have 10\(\times \) more parameters) on a limited amount of training data. We verify the effectiveness of our approach on two datasets, one of which contains several challenging organs.

Based on our work, a promising direction is to train the segmentation and fusion modules in a joint manner, so that the 2D networks can incorporate 3D information in the training process by learning from the back-propagated gradients of VFN. Another issue involves training VFN more efficiently, e.g., using hard example mining. These topics are left for future research.