
1 Introduction

For clinical applications, accurate segmentation of tumors is essential for diagnosis and surgical treatment. Segmenting brain tumors from multimodal magnetic resonance imaging (MRI) is challenging: although such segmentation plays an important role in computer-aided diagnosis of brain tumor disease, there is no clear standard for delineating tumor boundaries. Moreover, the varied shapes of brain tumors and the similarity in intensity between tumor tissue and neighboring tissue degrade segmentation performance. How to segment brain tumors accurately and efficiently has therefore become a hot topic in medical image analysis.

Many methods have been proposed to segment brain tumors, such as level sets, region growing, and fuzzy clustering. Some of them require manual intervention; for example, the region growing method [12] needs the user to manually select a seed point in the image, although automatic seed-point selection [4] has recently been proposed. The level set is another segmentation method, based on the active contour model, for which selecting a good initial contour is essential. R. Rana employed a fast bounding algorithm to place the initial contour in the tumor area and then used the level set method to extract the tumor boundary accurately [17]. Fuzzy clustering is usually combined with other methods, such as K-means or C-means [20]; these methods also need prior knowledge of the data distribution. Yet another approach classifies voxels into different tissues using hand-crafted features and then applies a conditional random field (CRF) model to smooth the classification results and maximize label consistency between neighboring pixels [15, 23].

Recently, convolutional neural networks (CNNs) have achieved breakthroughs in various visual tasks such as image classification [11], object detection [6], and natural image semantic segmentation [5, 14]. CNNs have also gradually been applied to brain tumor segmentation with good results. One popular method is to extract image patches from the MRI with a sliding window and assign a label to the central pixel [16, 19]. As stated in [19], such patches can be processed with different cascading architectures so that the model simultaneously extracts local and global information. All of these methods are patch based, but they require a large amount of training data and are time-consuming.

The fully convolutional network (FCN) [14] achieved good performance in natural image segmentation. It replaces the fully connected layers of a traditional CNN with convolutional kernels and adds upsampling to restore the resolution of the input image. FCNN [2] and DUNet [9] use a fully convolutional approach to build end-to-end segmentation models. These models are all similar to the U-Net model [18], but their internal blocks differ. In this paper, we propose the hybrid pyramid U-Net (HPU-Net) model for brain tumor segmentation. Our main contributions are as follows:

  • A feature pyramid is introduced into the U-Net model, combining information at multiple scales to complete the segmentation.

  • Hybridizing multi-scale information with semantic and location information improves segmentation performance.

2 Methodology

In this section, we present the hybrid pyramid U-Net (HPU-Net) model for brain tumor segmentation. The proposed network processes multimodal MRI and combines multi-scale information from different stages for efficient and accurate image segmentation.

Fig. 1. HPU-Net structure. It contains a downsampling path, an upsampling path, and a feature hybrid pyramid path.

2.1 HPU-Net Model

The architecture of the model is shown in Fig. 1. It consists of three modules: a downsampling path with convolution and max pooling layers, an upsampling path with convolution and upsampling layers, and an auxiliary segmentation path based on a feature pyramid. The downsampling path mainly extracts high-level, global contextual features of the tumor, while the upsampling path reconstructs object details. Since high-level features carry rich semantic information and low-level features carry rich location information, the auxiliary path extracts multi-scale information and combines it with the semantic and location information in the upsampling path, helping the model segment objects of different scales.

The downsampling path is similar to U-Net's, with a slight difference: we add a batch normalization (BN) [8] layer inside each block, so that each block has two convolutional layers with \(3 \times 3\) kernels, two BN layers, and one max-pooling layer with \(2 \times 2\) strides, as sketched below. There are two main reasons for these changes: (1) as the model goes deeper, vanishing gradients during back-propagation can stall training, so we add BN layers to speed up convergence; (2) in medical images, some lesions occupy only a small proportion of the entire image, and repeated convolution and downsampling can cause the lesion area to vanish, so we use two convolutional layers in each block to extract high-level information.
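The following is a minimal Keras sketch of one such downsampling block; the helper names and the filter argument are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of one downsampling block, assuming the Keras
# functional API; the filter count is left as a parameter.
from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    # 3x3 convolution, then batch normalization, then ReLU
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def down_block(x, filters):
    # Two CONV-BN-RELU units followed by 2x2 max pooling; the
    # pre-pooling map is kept for the skip connection.
    x = conv_bn_relu(x, filters)
    skip = conv_bn_relu(x, filters)
    pooled = layers.MaxPooling2D(pool_size=2)(skip)
    return pooled, skip
```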

The upsampling path uses a structure symmetric to the downsampling path. Each block contains two convolutional layers with \(3 \times 3\) kernels, two BN layers, and one upsampling layer. The feature map produced by upsampling is concatenated with the pre-max-pooling feature map of the symmetric block in the downsampling path, which combines semantic and location information. Note that we use bilinear interpolation for upsampling rather than a transposed (deconvolutional) layer, as the latter would introduce more parameters and computation. After concatenation, the new feature map contains both semantic and location information, yielding better results. A sketch of one such block follows.
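A corresponding sketch of one upsampling block, reusing the conv_bn_relu helper from the previous sketch; the bilinear upsampling matches the choice described above.

```python
from tensorflow.keras import layers

def up_block(x, skip, filters):
    # Bilinear 2x upsampling adds no learnable parameters, unlike a
    # transposed convolution.
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    # Concatenate with the pre-pooling map of the symmetric encoder
    # block to combine semantic and location information.
    x = layers.Concatenate()([x, skip])
    x = conv_bn_relu(x, filters)  # defined in the previous sketch
    return conv_bn_relu(x, filters)
```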

2.2 Hybrid Pyramid Network

In both object detection and image segmentation, network depth and stride are usually in tension: commonly used architectures have relatively large strides, and small objects in the image may be smaller than the stride itself, so segmentation performance decreases for small objects. A common remedy is multi-scale training and testing, also known as image pyramids [1], but this approach incurs a high time and computational cost. In object detection, Tsung-Yi Lin [13] proposed a feature pyramid method to detect small targets. In our method, we therefore use a feature pyramid to integrate multi-scale information with semantic and location information. Figure 1 illustrates the layout of HPU-Net schematically.

In the upsampling path, if we only upsample the feature maps block by block, the segmentation results contain holes, especially in small tumor regions that the model may ignore, which greatly degrades performance. Since tumors come in many shapes and sizes, we employ a feature pyramid to effectively exploit the multi-scale information of the objects. We upsample the feature map from each block in the upsampling path to the size of the original input image by bilinear interpolation. After upsampling, the feature maps are merged by element-wise addition, with a \(1 \times 1\) convolutional layer attached to reduce the channel dimensions, and a softmax layer is applied to produce the final classification. The softmax layer thus receives the output feature maps of all processing blocks in the upsampling path, \(x_{0}, x_{1}, \ldots , x_{l-1}\), as inputs:

$$\begin{aligned} X_{\mathrm{in\_softmax}}=H(x_{0})+H(x_{1})+\cdots +H(x_{l-1}) \end{aligned}$$
(1)

where \(x_{i}\) denotes the feature maps of each block in the upsampling path, \(X_{\mathrm{in\_softmax}}\) is the input feature map of the softmax layer, and \(H(\cdot)\) denotes the upsampling and convolution operations. The feature map used in the final prediction therefore combines features of different scales and different semantic strengths. This not only exploits multi-scale information but also combines the semantic information from the downsampling path with the location information in the upsampling path to achieve the best segmentation results. Compared with U-Net, this approach adds only four convolutional layers and a small number of parameters, yet it improves segmentation performance significantly.
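As an illustration, the pyramid head of Eq. (1) could be sketched as follows, with each \(H(x_{i})\) implemented as bilinear upsampling followed by a \(1 \times 1\) convolution; the ordering convention and the assumption that spatial sizes halve per block are ours, not taken from the paper.

```python
from tensorflow.keras import layers

def hybrid_pyramid_head(decoder_outputs, num_classes):
    # decoder_outputs: feature maps x_0, ..., x_{l-1} from the upsampling
    # path, assumed ordered from coarsest (smallest) to finest (input-sized).
    branches = []
    for i, x in enumerate(decoder_outputs):
        scale = 2 ** (len(decoder_outputs) - 1 - i)  # factor back to input size
        if scale > 1:
            x = layers.UpSampling2D(size=scale, interpolation="bilinear")(x)
        # 1x1 convolution reduces the channel dimension: H(x_i) in Eq. (1)
        x = layers.Conv2D(num_classes, 1, padding="same")(x)
        branches.append(x)
    fused = layers.Add()(branches)  # element-wise sum of Eq. (1)
    return layers.Activation("softmax")(fused)
```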

In each block of the network, we use the CONV-BN-RELU combination: ReLU [11] as the activation function to ensure a non-linear mapping, and batch normalization to reduce internal covariate shift. With the BN layers, we can increase the learning rate to accelerate the convergence of the model and mitigate vanishing gradients.

Fig. 2. A brain tumor example with the doctors' delineation. From left to right: Flair, T1, T2, T1c, and GroundTruth. The internal tumor has four colors: necrosis (blue), edema (green), non-enhancing (orange), and enhancing tumor (dark red). (Color figure online)

3 Evaluation

The BRATS2015 [10, 15] and BRATS2017 [3, 15] challenge datasets are used for training and validation in our experiments. The BRATS2015 training set includes 290 samples, 220 from the high-grade glioma (HGG) category and 70 from the low-grade glioma (LGG) category. The BRATS2017 training set consists of 210 HGG samples and 75 LGG samples.

Every subject has multimodal MRI, namely T1, T1-contrast (T1c), T2, and Flair, which are skull-stripped and co-registered. Figure 2 shows a glioma with the doctors' delineation of the internal regions. The evaluation of segmentation results consists of three parts: (1) the complete tumor region; (2) the core tumor region (all tumor areas except edema); and (3) the enhancing tumor region (only the enhancing tumor area). For each part, the Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), and Sensitivity are computed. The DSC measures the overlap between the manual and the automatic segmentation. It is defined as

$$\begin{aligned} DSC=\frac{2TP}{FP+2TP+FN}, \end{aligned}$$
(2)

where FN, FP, and TP are the numbers of false negative, false positive, and true positive detections, respectively. Sensitivity evaluates the numbers of TP and FN detections and is defined as

$$\begin{aligned} Sensitivity=\frac{TP}{TP+FN}. \end{aligned}$$
(3)

Finally, PPV measures the numbers of TP and FP detections and is defined as

$$\begin{aligned} PPV=\frac{TP}{TP+FP}. \end{aligned}$$
(4)
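For reference, Eqs. (2)-(4) can be computed directly from binary masks. The following NumPy sketch assumes pred and gt are boolean arrays of the same shape; a small epsilon could be added to the denominators to guard against empty regions.

```python
import numpy as np

def region_scores(pred, gt):
    # Count true positives, false positives, and false negatives.
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    dsc = 2 * tp / (fp + 2 * tp + fn)   # Eq. (2)
    sensitivity = tp / (tp + fn)        # Eq. (3)
    ppv = tp / (tp + fp)                # Eq. (4)
    return dsc, sensitivity, ppv
```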

3.1 Implementation

We normalized each subject's data to zero mean and unit standard deviation, then removed the slices containing no tumor information. All images were cropped to \(160 \times 160\) as input to the model. In the end, the BRATS2015 dataset retained 15,000 slices and the BRATS2017 dataset retained 17,800 slices. We augmented the dataset by left-rotating the first half of the slices and right-rotating the other half, constructing a new dataset twice as large as the original.
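A sketch of this preprocessing pipeline follows; the center-crop placement and the rotation angle are assumptions, as the exact values are not stated in the text.

```python
import numpy as np
from scipy.ndimage import rotate

def normalize(volume):
    # Zero mean, unit standard deviation per modality volume.
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def center_crop(slice_2d, size=160):
    # Crop a 2D slice to size x size around its center.
    h, w = slice_2d.shape
    top, left = (h - size) // 2, (w - size) // 2
    return slice_2d[top:top + size, left:left + size]

def augment(slices, angle=10):
    # Left-rotate the first half of the slices, right-rotate the second
    # half, and append them, doubling the dataset size.
    half = len(slices) // 2
    rotated = [rotate(s, angle, reshape=False) for s in slices[:half]]
    rotated += [rotate(s, -angle, reshape=False) for s in slices[half:]]
    return list(slices) + rotated
```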

We use the Keras library with TensorFlow as the backend. The model was trained with standard back-propagation using the Adam optimizer, and all parameters were initialized with he_normal. Training on the augmented data takes about ten hours for 70 epochs on a standard computer with an NVIDIA Titan X GPU.
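A minimal training configuration consistent with this description might look as follows; the 70 epochs and the Adam optimizer follow the text, while the loss, learning rate, batch size, and the model and data variables are assumed placeholders.

```python
from tensorflow.keras.optimizers import Adam

# model, x_train, y_train, x_val, y_val are assumed to be built and
# loaded elsewhere (e.g., the HPU-Net and preprocessed BRATS slices).
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=16, epochs=70,
          validation_data=(x_val, y_val))
```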

Fig. 3. The performance curves for 3 blocks and 4 blocks. From left to right: complete, core, and enhancing. The vertical axis is the Dice coefficient and the horizontal axis is the number of epochs.

Fig. 4. The performance curves with and without the hybrid pyramid network. From left to right: complete, core, and enhancing. The vertical axis is the Dice coefficient and the horizontal axis is the number of epochs.

3.2 Cross Validation

We performed 5-fold cross-validation on the augmented data and ran two experiments, one evaluating the deeper model and one evaluating the hybrid pyramid.
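A sketch of the fold generation, assuming scikit-learn and splitting at the subject level (the paper does not state the splitting granularity):

```python
from sklearn.model_selection import KFold

# subjects is an assumed list of preprocessed subject volumes.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(subjects)):
    train_set = [subjects[i] for i in train_idx]
    val_set = [subjects[i] for i in val_idx]
    # Build a fresh HPU-Net, train on train_set, evaluate Dice on val_set.
```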

First, we compared four blocks against three blocks in the downsampling path to verify whether a deeper model improves segmentation accuracy. We plotted the Dice coefficients for the three tasks over the training epochs. As shown in Fig. 3, the four-block model significantly improves the Dice coefficients on all three tasks compared with the three-block model. This is because increasing the depth of the model helps extract more high-level features, and a deeper model also provides the pyramid module with more multi-scale information. For the core and enhancing tumor regions in particular, the Dice coefficients improved by at least 7%, since these two regions are relatively small and the deeper model integrates more multi-scale information.

We also explored the impact of the hybrid pyramid on model accuracy. Figure 4 shows the Dice coefficients on the validation set for models with and without the pyramid. For each task, it is clear that introducing the pyramid improves segmentation performance; without the hybrid pyramid network, the model degrades on the core and enhancing tumor regions. In our experiments, the hybrid pyramid network improves the Dice coefficients by at least 5%, confirming the benefit of the feature pyramid module in our proposed model.

Fig. 5. Brain tumor segmentation results of all networks. From left to right: GroundTruth, DUNet, FCNN, FCDenseNet, VGG, and our proposed model.

3.3 Results Analysis

We compared the proposed method with state-of-the-art methods on the BRATS2017 dataset. Since it contains both HGG and LGG images, we use 3,560 slices not involved in training as the test set. The proposed method ranks among the top state-of-the-art methods (see Table 1).

Specifically, FCNN and DUNet achieved good performance in the BRATS2017 challenge, yet our model performs better by a large margin (e.g., 0.80 vs. 0.67 and 0.80 vs. 0.70 in terms of Dice for core tumor segmentation). FCDenseNet [21] builds on the dense block of DenseNet [7], which achieved state-of-the-art performance in image classification; nevertheless, its Dice and sensitivity on the enhancing region are lower than HPU-Net's (0.59 vs. 0.76 and 0.59 vs. 0.67), and FCDenseNet requires more memory and a longer training time than our method.

Table 1. Comparison with state-of-the-art methods on the BRATS2017 test set
Table 2. Comparison with state-of-the-art methods on the BRATS2015 test set

To further confirm the performance of our model, we also evaluated the proposed method on the BRATS2015 dataset. For the same test setting, the performance of the baseline systems and our proposed method on BRATS2015 is shown in Table 2. These experimental results show that our model also achieves state-of-the-art performance on this dataset. The HPU-Net structure is simple and effective in combining multi-scale features.

Figure 5 shows the segmentation results of the ground truth, DUNet, FCNN, FCDenseNet, VGG, and our proposed HPU-Net model from left to right. It is clear that DUNet assigned some necrosis regions (blue) to non-enhancing regions (orange), FCNN ignored the non-enhancing regions (orange) entirely, FCDenseNet assigned some enhancing regions (dark red) to edema regions (green), and VGGNet assigned some enhancing regions (dark red) to non-enhancing regions (orange). These segmentation errors result from the loss of the multi-scale information of the data. In contrast, the HPU-Net model performs better thanks to its effective fusion of multi-scale features.

4 Conclusion

We have proposed the hybrid pyramid U-Net model, an end-to-end brain tumor segmentation model. It includes a downsampling path, an upsampling path, and a hybrid pyramid path that extracts multi-scale information. Making the model deeper improved the Dice coefficients, and the introduction of the feature pyramid further improved the segmentation results. Our model achieves significantly better results, and in the future we plan to apply it to natural image segmentation.