1 Introduction

Gliomas are the most common primary brain tumors in humans. They are characterized by different levels of aggressiveness, which directly influence the prognosis. Due to the gliomas’ heterogeneity (in terms of shape and appearance) manifested in multi-modal magnetic resonance imaging (MRI), their accurate delineation is an important yet challenging medical image analysis task. Moreover, manual segmentation of such brain tumors is very time-consuming and prone to human error. It also lacks reproducibility, which adversely affects the effectiveness of patient monitoring and can ultimately lead to inefficient treatment.

Therefore, automatic brain tumor detection (i.e., determining which pixels in an input image are tumorous) and classification (determining the type of a tumor and/or which part of a tumor, e.g., edema, non-enhancing solid core, or enhancing structures, a given pixel belongs to; see examples in Fig. 1) from MRI are vital research topics in the pattern recognition and medical image analysis fields. Such techniques have a very wide range of practical applications, encompassing computer-aided diagnosis, prognosis, staging, and monitoring of a patient. In this paper, we propose a deep learning technique which detects and segments gliomas from MRI in a cascaded processing pipeline. The detected gliomas are further segmented into the enhancing tumor (ET), tumor core (TC), and the whole tumor (WT).

Fig. 1.

Different parts of a brain tumor (detection is presented in the second row—green parts show the agreement with a human reader) segmented using the proposed method (third row) alongside original images (first row): red—peritumoral edema, yellow—necrotic and non-enhancing tumor core, green—Gd-enhancing tumor. (Color figure online)

1.1 Contribution

The contribution of this work is multi-fold:

  • We propose a deep learning technique for detection and segmentation of brain tumors from MRI. Our deep neural networks (DNNs) are inspired by the U-Nets [28] with considerable changes to the architecture, and they are cascaded—the first DNN performs detection, whereas the second segments a tumor into the enhancing tumor, tumor core, and the whole tumor (Fig. 1).

  • To improve generalization capabilities of our segmentation models, we build an ensemble of DNNs trained over different folds of a training set, and average the responses of the base classifiers.

  • We show that our approach can be seamlessly applied to the multi-modal MRI analysis, and allows for introducing separate processing pathways for each modality.

  • We validate our techniques over the newest release of the Brain Tumor Segmentation dataset (BraTS 2018), and show that they provide high-quality detection and segmentation while offering fast, near real-time inference.

1.2 Paper Structure

This paper is organized as follows. In Sect. 2, we discuss the current state of the art in brain-tumor delineation. The proposed deep learning-based techniques are presented in Sect. 3. The results of our experiments are analyzed in Sect. 4. Section 5 concludes the paper and highlights the directions of our future work.

2 Related Literature

Approaches for automated brain-tumor delineation can be divided into atlas-based, unsupervised, supervised, and hybrid techniques (Fig. 2). In the atlas-based algorithms, manually segmented images (referred to as atlases) are used to segment incoming (previously unseen) scans [25]. These atlases model the anatomical variability of the brain tissue [22]. Atlas images are propagated to new scans by warping them with non-rigid registration techniques. An important drawback of such techniques is the necessity of creating large (and representative) annotated reference sets, which is time-consuming and error-prone in practice, and may lead to atlases which do not encompass certain types of brain tumors and hence cannot be applied to them [1, 6].

Fig. 2.

Automated delineation of brain tumors from MRI—a taxonomy.

Unsupervised algorithms search for hidden structures within unlabeled data [9, 19]. In various meta-heuristic approaches, e.g., in evolutionary algorithms [33], brain segmentation is understood as an optimization problem in which groups of pixels (or voxels) with similar characteristics are sought. It is tackled in a biologically-inspired manner, in which a population of candidate solutions (the pixel or voxel labels) evolves over time [7]. Other unsupervised algorithms encompass clustering-based techniques [14, 29, 35] and Gaussian modeling [30]. In supervised techniques, manually segmented image sets are utilized to train a model. Such algorithms include, among others, decision forests [10, 39], conditional random fields [36], support vector machines [17], and extremely randomized trees [24].

Deep neural networks, which established the state of the art in a plethora of image-processing and image-recognition tasks, have been successful in segmenting different kinds of brain tissue as well [12, 16, 21], although they very often require computationally intensive data pre-processing. Holistically nested neural nets for MRI were introduced in [38]. White matter was segmented in [11], and convolutional neural networks were applied to segment tumors in [13]. Interestingly, the winning BraTS’17 algorithm used ensembles of deep neural nets [15]. However, the authors reported neither the training nor the inference times of their algorithm, which may prevent its use in clinical practice. Hybrid algorithms couple together methods from other categories [26, 31, 37].

We address the aforementioned issues and propose a deep learning algorithm for automated brain tumor segmentation which exploits a new multi-modal fully-convolutional neural network based on U-Nets. The experimental evidence (presented in Sect. 4) obtained over the newest release of the BraTS dataset (BraTS 2018) shows that it can effectively deal with multi-class classification, and it delivers high-quality tumor segmentation in real time.

3 Methods

In this work, we propose an algorithm which utilizes cascaded U-Net-based deep neural networks for detecting and segmenting brain tumors. Our approach is driven by the assumption that the most salient features of a lesion are not contained in a single image modality.

There are multiple ways to exploit all the modalities in deep learning-based engines. One way is to store the three (or four) modalities as channels of a single image, analogously to RGB (RGBA), and process it as a standard color image. Although this approach can be successfully applied to easier computer-vision and image-processing tasks, it has a significant downside: only the first layer (which extracts the most basic features) has access to the modalities as separate inputs. All consecutive layers process the outputs of the previous layers, i.e., a mix of features from all the modalities.

To fully benefit from the information manifested in each modality, Hu and Xia processed each modality separately and merged them only at the very end of the processing chain to produce the final segmentation mask [8]. In this work, we combine both techniques: we use merged modalities for brain-tumor detection, and separate processing pathways for further segmentation of a tumor.
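To make the two input strategies concrete, the snippet below sketches how the channel-stacked input for the first (detection) stage and the per-modality inputs for the second (segmentation) stage could be prepared from co-registered slices. It is a minimal illustration; the array names, the slice-wise (2D) layout, and the shapes are assumptions rather than the exact preprocessing used in our pipeline.

```python
import numpy as np

# Illustrative co-registered 2D slices of the same spatial size (assumed shapes).
flair = np.random.rand(240, 240).astype("float32")
t1c = np.random.rand(240, 240).astype("float32")
t2 = np.random.rand(240, 240).astype("float32")

# Stage 1 (detection): modalities stacked as channels of a single image,
# so only the first convolutional layer sees them as separate inputs.
stacked = np.stack([flair, t1c, t2], axis=-1)   # shape: (240, 240, 3)
detection_batch = stacked[np.newaxis, ...]      # shape: (1, 240, 240, 3)

# Stage 2 (segmentation): each modality kept as a separate single-channel
# input which feeds its own contracting pathway of the network.
segmentation_inputs = [m[np.newaxis, ..., np.newaxis]  # each: (1, 240, 240, 1)
                       for m in (flair, t1c, t2)]
```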

3.1 Detection of Brain Tumors from MRI

The first stage of our image analysis approach takes the whole image as an input (i.e., the different modalities are stacked together as the channels of an image), and produces a binary mask of the region of interest (therefore, it performs detection of a tumor). This binary mask is used to select the voxels of all modalities from the original images (rendering the remaining pixels as background). The masked region is then passed to the second-stage U-Net for the final multi-class segmentation.

The architecture of our DNN used for detection is visualized in Fig. 3 (note that we present multiple processing pathways which are exploited in segmentation; for detection, only one pathway is used, and the sigmoid function is applied as the output non-linearity). The DNN prediction is binarized using a threshold \(\varvec{T_b}\). The binary mask is post-processed using a 3D connected components analysis: the size of each connected component is calculated, and the one with the largest volume is kept. If the next (second) largest connected component is at least \(\varvec{T_{cc}}\) (in %) of the volume of the largest one, it is kept as well. The binary masks resulting from the first stage are used to produce the input to the second stage. More details on the architecture of our deep network itself are presented in the following subsection.
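A possible implementation of this post-processing step (binarizing the detection output with \(\varvec{T_b}\), keeping the largest 3D connected component plus the second largest one if it exceeds \(\varvec{T_{cc}}\) of its volume, and masking the input modalities) is sketched below, assuming SciPy is available; the function name and the array layout are illustrative.

```python
import numpy as np
from scipy import ndimage

def postprocess_detection(prob_map, modalities, t_b=0.5, t_cc=0.20):
    """Binarize a detection probability map, keep the largest 3D connected
    component (and the second largest if its volume is at least t_cc of the
    largest one), and mask the input modalities with the resulting mask.

    prob_map:   3D array of tumor probabilities (one value per voxel).
    modalities: list of 3D arrays co-registered with prob_map.
    """
    binary = prob_map > t_b

    # 3D connected-component analysis.
    labels, n_components = ndimage.label(binary)
    if n_components == 0:
        return binary, [np.zeros_like(m) for m in modalities]

    # Volumes of all components (index 0 is the background and is ignored).
    volumes = np.bincount(labels.ravel())
    volumes[0] = 0
    order = np.argsort(volumes)[::-1]

    keep = [order[0]]
    if n_components > 1 and volumes[order[1]] >= t_cc * volumes[order[0]]:
        keep.append(order[1])
    mask = np.isin(labels, keep)

    # Render everything outside the mask as background in every modality.
    masked_modalities = [np.where(mask, m, 0) for m in modalities]
    return mask, masked_modalities
```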

3.2 Segmentation of Detected Brain Tumors

Our DNN for brain tumor segmentation separates the processing pathways and merges them at the very bottom of the network, where the feature space is compacted the most, and at each bridged connection (Fig. 3). By doing that, we ensure that the low- and high-level features are extracted separately for all modalities in the contracting path. These features can “interact” with each other in the expanding path, producing high-quality segmentations. Our preliminary experiments showed that the pre-contrast T1 modality carries the smallest amount of information; therefore, to reduce the segmentation time and resource usage (and make our method easily applicable in a real-life clinical setting), we did not use this modality in our pipeline. However, the proposed U-Net-based architecture is fairly flexible and allows for using any number of input modalities.

Fig. 3.

The proposed deep neural network architecture. Three separate pathways (e.g., for FLAIR, T1c, and T2) are shown as a part of the contracting path. At each level (each set of down blocks) the output is concatenated and sent to a corresponding up block. At the bottom, there is a merging block, where all the features are merged before entering the expanding path. The output layer is a 1 \(\times \) 1 convolution with one filter for the first stage (detection), and three filters for the second stage (segmentation).

Our models are based on the well-known U-Net [28], with considerable changes to the architecture. First, there are separate pathways for each modality, effectively making three contracting paths. In the original architecture, the number of filters was doubled at each down-block, whereas in our model it is kept constant everywhere except in the very bottom part of the network (where the concatenation and merging of the paths take place), where it is doubled. The down-block in our model consists of three convolutional layers (48 filters of size 3 \(\times \) 3 each, with stride 1). The second alteration to the original U-Net is the bridged connections, which join (concatenate) the activations from each contracting pathway with the corresponding activations from the expanding path, where they become merged. This procedure allows the DNN to extract high-level features while preserving the context stored earlier. The expanding path is standard: each up-block doubles the size of the activation map by upsampling, which is followed by two convolutions (48 filters of size 3 \(\times \) 3 each, with stride 1). The last layer is a 1 \(\times \) 1 convolution with one filter in the detection stage and three filters in the multi-class segmentation stage.
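The following Keras-style sketch illustrates our reading of this architecture for the segmentation stage (three contracting pathways, a constant filter count, a merging block at the bottom, and bridged connections into a shared expanding path). It is a simplified 2D sketch rather than our exact implementation: the network depth, the use of max-pooling between levels, the ReLU activations, and the input size are assumptions.

```python
from tensorflow.keras import Input, Model, layers

N_FILTERS = 48   # constant number of filters (doubled only at the bottom)
N_LEVELS = 3     # depth of the contracting/expanding paths (assumed)

def down_block(x):
    # Three 3x3 convolutions with stride 1, as described for the down-block.
    for _ in range(3):
        x = layers.Conv2D(N_FILTERS, 3, padding="same", activation="relu")(x)
    return x

def up_block(x, skips):
    # Upsampling doubles the activation map; the bridged connections (one per
    # modality pathway) are then concatenated, followed by two convolutions.
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x] + skips)
    for _ in range(2):
        x = layers.Conv2D(N_FILTERS, 3, padding="same", activation="relu")(x)
    return x

def build_segmentation_model(input_shape=(240, 240, 1), n_modalities=3, n_classes=3):
    inputs = [Input(input_shape) for _ in range(n_modalities)]

    # Separate contracting path per modality; skips[level] stores the
    # per-pathway activations used by the bridged connections.
    paths = list(inputs)
    skips = [[] for _ in range(N_LEVELS)]
    for level in range(N_LEVELS):
        for p in range(n_modalities):
            paths[p] = down_block(paths[p])
            skips[level].append(paths[p])
        paths = [layers.MaxPooling2D(2)(x) for x in paths]  # pooling assumed

    # Merging block at the bottom: concatenate all pathways, double the filters.
    x = layers.Concatenate()(paths)
    x = layers.Conv2D(2 * N_FILTERS, 3, padding="same", activation="relu")(x)

    # Expanding path with bridged connections from every pathway.
    for level in reversed(range(N_LEVELS)):
        x = up_block(x, skips[level])

    # 1x1 output convolution: three filters with softmax for segmentation
    # (the detection variant uses one filter with a sigmoid instead).
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, out)
```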

The output of the second stage is an activation map of the size \(I_w \times I_h \times 3\), where the last dimension represents the number of classes, and \(I_w\) and \(I_h\) are the image width and height, respectively. The activation is then passed through a softmax operation, which performs the final multi-class classification.
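As a trivial illustration of this final step, the per-pixel class probabilities and labels can be obtained from such an activation map as follows (the mapping of class indices to the ET, ED, and NCR/NET labels is an assumption of this sketch).

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class dimension.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def classify(activation_map):
    """activation_map: array of shape (I_h, I_w, 3) from the second stage.
    Returns per-pixel class probabilities and the most probable class index."""
    probs = softmax(activation_map)
    return probs, np.argmax(probs, axis=-1)
```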

4 Experimental Validation

4.1 Data

The Brain Tumor Segmentation (BraTS) dataset [2,3,4,5, 20] encompasses multi-modal MRI data of 285 patients with diagnosed gliomas: 210 high-grade glioblastomas (HGG), and 75 low-grade gliomas (LGG). Each study was manually annotated by one to four experienced readers. The data comes in four co-registered modalities: native pre-contrast (T1), post-contrast T1-weighted (T1c), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR). Each pixel has one of four labels attached: healthy tissue, Gd-enhancing tumor (ET), peritumoral edema (ED), or the necrotic and non-enhancing tumor core (NCR/NET).

The data was acquired with different clinical protocols, various scanners, and at 19 institutions; therefore, the pixel intensity distribution may vary significantly. The studies were interpolated to the same shape (\(240 \times 240 \times 155\), hence 155 images of size \(240 \times 240\), with voxel size 1 mm\(^3\)), and they were pre-processed (skull-stripping was applied). Overall, there are 285 patients in the training set \(\varvec{T}\) (210 HGG, 75 LGG), 66 patients in the validation set \(\varvec{V}\) (without ground-truth data provided by the BraTS 2018 organizers), and 191 in the test set \(\varPsi \) (unseen data used for the final verification of the trained models).

4.2 Experimental Setup

The DNN models were implemented using Python3 with the Keras library over CUDA 9.0 and CuDNN 5.1. The experiments were run on a machine equipped with an Intel i7-6850K (15 MB Cache, 3.80 GHz) CPU with 32 GB RAM and NVIDIA GTX Titan X GPU with 12 GB VRAM. The training metric was the DICE score for both stages (detection and segmentation), which is calculated as

$$\begin{aligned} \mathrm{DICE(A,B)}=\frac{2 \cdot \left| A\cap B\right| }{\left| A\right| +\left| B\right| }, \end{aligned}$$
(1)

where A and B are the two segmentations under comparison, i.e., manual and automated. DICE ranges from zero to one (one being the perfect score). The optimizer was Nadam (Adam with Nesterov momentum) with the initial learning rate \(10^{-5}\) and the optimizer parameters \(\beta _1 = 0.9\), \(\beta _2 = 0.999\). The training ran until the DICE score over the validation set had not increased by at least 0.002 for 10 epochs. The training time for one epoch is around 10 min (similar for both stages). The networks converge in around 20–30 epochs (the complete training for each fold takes 7–8 h). For detection, we used the manually-tuned thresholds \(\varvec{T_b}=0.5\) and \(\varvec{T_{cc}}=20\%\).
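A minimal Keras-style sketch of this training setup is given below; it assumes that a soft (differentiable) DICE is used both as the loss and as the monitored metric, and the smoothing constant as well as the monitored metric name are illustrative choices rather than our exact configuration.

```python
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Nadam

def dice_coef(y_true, y_pred, smooth=1.0):
    # Soft DICE: 2*|A ∩ B| / (|A| + |B|); the smoothing term avoids a
    # division by zero (its value is an assumption).
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)

# Nadam with the hyper-parameters listed above.
optimizer = Nadam(learning_rate=1e-5, beta_1=0.9, beta_2=0.999)

# Stop training when the validation DICE has not improved by at least 0.002
# for 10 consecutive epochs.
early_stopping = EarlyStopping(monitor="val_dice_coef", min_delta=0.002,
                               patience=10, mode="max")

# model.compile(optimizer=optimizer, loss=dice_loss, metrics=[dice_coef])
# model.fit(..., callbacks=[early_stopping])
```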

Both networks are relatively small, which directly translates to low computational requirements during inference: one volume can be processed and classified end-to-end within around 5 s. To exploit the training set completely, and still be able to use a validation subset to avoid over-fitting, the final prediction was performed with an ensemble of five models trained on different folds of the training set (we followed the 5-fold cross-validation setting over the training set). Using an ensemble of five models (and averaging their outputs to obtain the final prediction) was shown to improve the performance, while extending the inference time to around 18 s per full volume.
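The ensembling itself amounts to averaging the soft outputs of the five fold models before the final thresholding (detection) or argmax (segmentation), roughly as follows; the model paths and the prediction interface are hypothetical.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Hypothetical paths to the five models trained on different training folds.
FOLD_MODELS = [f"models/fold_{i}.h5" for i in range(5)]

def ensemble_predict(inputs, custom_objects=None):
    """Average the soft outputs of the fold models to form the final prediction."""
    predictions = []
    for path in FOLD_MODELS:
        model = load_model(path, custom_objects=custom_objects, compile=False)
        predictions.append(model.predict(inputs))
    return np.mean(predictions, axis=0)
```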

4.3 Experimental Results

In Table 1, we gather the results (DICE) obtained over the training and validation BraTS 2018 datasets (in the 5-fold cross-validation setting). The whole tumor class reflects the performance of the first stage of our classification system (evaluated with all the classes merged into one, exactly as the first-stage model is trained). Here, we report the average DICE across the 5 non-overlapping folds of the training set \(\varvec{T}\), and the final DICE for the validation \(\varvec{V}\) and test \(\varPsi \) sets obtained using an ensemble of 5 base deep classifiers, each learned using the training examples belonging to a different fold. Note that the ground-truth data (pixel labels) for \(\varvec{V}\) and \(\varPsi \) were not known to the participants during the BraTS 2018 challenge, hence they could not be exploited to improve the models.

Table 1. Segmentation performance (DICE) over the BraTS 2018 training and validation sets obtained using our DNNs trained with T1c, T2, and FLAIR images. The scores are presented for the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) classes. For the training set, we report the average across the 5 non-overlapping folds, whereas for the validation set we report the results returned automatically by the BraTS competition server (for validation, we used an ensemble of 5 DNNs trained over different folds).

The results show that an ensemble of DNNs manifests fairly good generalization capabilities over the unseen data, and it consistently obtains high-quality classification. Interestingly, we did not use any data augmentation (which can be perceived as an implicit regularizer) in our approach, and even without increasing the size and heterogeneity of the training data, the ensembles were able to accurately delineate brain tumors in unseen scans. This also indicates that data augmentation could potentially further improve the capabilities (both detection and segmentation) of our deep models by providing a large number of artificially created (but visually plausible and anatomically correct) training examples generated from the original \(\varvec{T}\).

Table 2. The results reported for the unseen test set \(\varPsi \) show that our models can generalize fairly well over the unseen data (however, the results are worse when compared to the validation set). We report DICE alongside the Hausdorff distance (HD).

In Table 2, we gather the results obtained over the unseen test set \(\varPsi \)—we report not only DICE, but also the Hausdorff distance (HD) given as

$$\begin{aligned} \mathrm{HD(A,B)}=\max \left( h(A,B), h(B,A)\right) , \end{aligned}$$
(2)

where \(h(A,B)\) is the directed Hausdorff distance:

$$\begin{aligned} h(A,B)=\max _{a\in A}\min _{b\in B}\left| \left| a-b\right| \right| , \end{aligned}$$
(3)

and \(\left| \left| \cdot \right| \right| \) is a norm (e.g., the Euclidean distance) [32]. It can be noted that this metric is quite sensitive to outliers (the lower the HD, the higher the quality of the segmentation in terms of contour similarity). The results show that our deep-network ensemble can generalize quite well over the unseen data, however the DICE values are slightly lower when compared to the validation set. We attribute this to the heterogeneity of the testing data (as mentioned earlier, we did not apply any data augmentation to increase the representativeness of the training set). Interestingly, the whole-tumor segmentation remained at the very same level (see DICE in Table 2), and our method delivered high-quality whole-tumor delineation (we observe the largest decrease of accuracy for the enhancing part of a tumor, amounting to more than 0.08 DICE on average). This also leads us to the conclusion that for tumor segmentation (differentiating between different parts of a lesion), the deep models require larger and more diverse sets (perhaps due to subtle tissue differences which cannot be learnt from a limited number of brain-tumor examples) and potentially better regularization.
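For reference, the symmetric Hausdorff distance from Eqs. (2)–(3) can be computed between two binary segmentation masks with SciPy as sketched below; using all foreground voxel coordinates as the point sets (rather than, e.g., surface voxels only) is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance (Euclidean norm) between two binary masks.

    mask_a, mask_b: boolean arrays of the same shape; the coordinates of their
    foreground voxels play the roles of the point sets A and B from Eq. (2).
    """
    a = np.argwhere(mask_a)              # foreground voxel coordinates of A
    b = np.argwhere(mask_b)              # foreground voxel coordinates of B
    h_ab = directed_hausdorff(a, b)[0]   # h(A, B)
    h_ba = directed_hausdorff(b, a)[0]   # h(B, A)
    return max(h_ab, h_ba)
```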

5 Conclusions

In this paper, we presented an approach for effective detection and segmentation (into different parts of a tumor) of brain lesions from magnetic resonance images which exploits cascaded multi-modal fully-convolutional neural networks inspired by the U-Net architecture. The first deep network in our pipeline performs tumor detection, whereas the second—multi-class tumor segmentation. We cross-validated the proposed technique (in the 5-fold cross-validation setting) over the newest release of the BraTS dataset (BraTS 2018), and the experimental results showed that:

  • Our cascaded multi-modal U-Nets deliver accurate segmentation, and ensembling the models (and averaging the response of base classifiers) trained across separate folds allows us to build the final model which generalizes well over the unseen testing data.

  • We showed that our networks can be trained fairly fast (7–8 h using 1 GPU), and deliver real-time inference (around 18 s per volume).

  • We showed that our models can be seamlessly applied to both two- and multi-class classification (i.e., tumor detection and segmentation, respectively).

Our current research is focused on applying our techniques to different organs and modalities (e.g., lung PET/CT imaging [23]), and developing data augmentation approaches for medical images. Such algorithms (which ideally generate artificial but visually plausible and realistic images) can be perceived as implicit regularizers which help improve the performance of models over the unseen data by introducing new examples into a training set [18, 27, 34].