Background

Medical image segmentation is one of the most common applications of deep learning in medical image analysis. Semantic segmentation is typically required to automatically partition regions of interest such as organs and lesions, which supports computer-assisted diagnosis [1], tissue-specific measurement [2], three-dimensional reconstruction [3], and visual enhancement [4].

Traditional image segmentation methods include threshold-based [5], deformable surface model based [6], and active surface model based [7] approaches, among others. The performance of these methods is limited because of the similarity between the regions of interest and their surroundings. Moreover, determining the regions of interest usually depends strongly on handcrafted features, which suffer from limited feature representation ability [8]. Deep learning is constantly setting new records in computer vision and pattern recognition. In some natural image classification tasks, the performance of deep learning based approaches even surpasses human judgment [9]. The good performance of state-of-the-art deep learning techniques is mainly attributed to the ability of the convolutional neural network (CNN) to learn hierarchical representations of images, so that it does not depend on handcrafted features and overcomes their limitations in revealing the characteristics of complex objects [10]. The strong feature learning ability of CNNs opens up a new direction for medical image segmentation.

CNNs are typically used for classification, where the output for an image is usually only a category label. In the task of medical image segmentation, the desired output must also include location, that is, a classification of each pixel is necessary. Patch-based methods [11,12,13] determine the class of each pixel by predicting the label of the local area around it (using a sliding window). However, training such methods is very slow, and it is difficult to determine the most appropriate size of the local area: a larger patch tends to hurt accuracy, while a smaller patch can hardly capture context information. Fully convolutional networks (FCNs) [14] solve these two problems efficiently and elegantly. Unlike a classical CNN, which uses fully connected layers after the convolution layers to obtain fixed-length vectors for classification, an FCN uses deconvolution to up-sample the feature map and restore it to the same size as the input image, so that every pixel can be predicted. On this basis, U-Net [15] designs a network structure consisting of an encoder path that contains multiple convolutions for down-sampling and a decoder path with several deconvolution layers that up-sample the features. Furthermore, it combines high-resolution features with up-sampled features through skip connections to improve localization accuracy. This encoder-decoder structure has become the basic structure of many segmentation methods, including methods for segmenting 3D medical images that can make better use of depth information [16,17,18,19]. However, because regions of interest are often hard to distinguish from surrounding tissue, for example tumors from the surrounding normal tissue, establishing effective methods for medical image semantic segmentation remains a big challenge.

In clinical practice, radiologists make a diagnosis report by synthesizing clues from multiple medical imaging modalities. For example, four different MRI (Magnetic Resonance Imaging) modalities are used in brain tumor surgery: T1 (spin-lattice relaxation), T1c (T1-contrasted), T2 (spin-spin relaxation), and Flair (fluid attenuation inversion recovery). Enhancing and non-enhancing structures are segmented by evaluating the hyper-intensities in T1c, T2 highlights the edema, and Flair is used to cross-check the extent of the edema. Each modality has distinct responses for the different sub-regions of gliomas, and the final diagnosis is usually determined from multiple modalities together. Because the information provided by a single modality is very limited, it is difficult to meet high-precision clinical needs. Multi-modal images provide more information about the patient’s lesion and its surrounding areas, and the different modalities complement each other in revealing the lesion characteristics from different perspectives. How to make good use of this complementary information has become a key direction for improving segmentation accuracy. Existing methods often treat the modalities as different channels of the input data [20, 21], but the correlations between them are not well exploited. Drawing inspiration from the recent success of SKNet [22] and from this clinical diagnosis experience, we propose a multi-modality self-attention aware deep network for 3D biomedical segmentation. By using the Multi-Modality Self-Attention Aware (MMSA) convolution to realize self-weighted fusion of multi-modal data, it achieves state-of-the-art performance for multi-modal brain tumor segmentation.

Methods

Multi-path encoder and decoder

To process multi-modal 3D medical images, we construct a 3D segmentation network with multi-path input. The network adopted in this paper is an encoder-decoder structure similar to U-Net, as shown in Fig. 1. Here, the encoder extracts a deep representation of each modality, while the decoder up-samples the learned feature maps level by level and restores the feature map at the last level to the original resolution for pixel-wise semantic label prediction.

Fig. 1 Comparison of encoder and decoder structures

To handle the multi-modal data at the encoder end, there are usually two solutions: a single-path encoder that concatenates the multi-modal images at the data level, and a multi-path encoder that concatenates the multi-modal features at the feature level. The structures of the two fusion methods are illustrated in Fig. 1. Because the multi-path structure facilitates processing the information of each modality separately, we take the multi-path design as the base structure of the encoder.
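
As a rough illustration of the difference between the two options, the following PyTorch sketch contrasts data-level concatenation with per-modality encoders followed by feature-level concatenation; the toy tensor shapes and the single-convolution "encoders" are placeholders, not the actual network used in this paper.

```python
import torch
import torch.nn as nn

# Toy example with 4 MRI modalities, each of shape (batch, 1, D, H, W).
modalities = [torch.randn(1, 1, 8, 64, 64) for _ in range(4)]

# (a) Single-path: concatenate modalities as input channels (data-level fusion).
single_path_encoder = nn.Conv3d(in_channels=4, out_channels=32, kernel_size=3, padding=1)
x = torch.cat(modalities, dim=1)            # (1, 4, 8, 64, 64)
single_path_feat = single_path_encoder(x)   # (1, 32, 8, 64, 64)

# (b) Multi-path: one encoder per modality (weights may be shared),
# then concatenate or fuse the resulting features at the feature level.
shared_encoder = nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
per_modality_feats = [shared_encoder(m) for m in modalities]
multi_path_feat = torch.cat(per_modality_feats, dim=1)   # (1, 128, 8, 64, 64)
```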

More specifically, at the encoder end, we adopt a ResNet [23] as the backbone network, which consists of one input layer and four down-sampling layers. 3D convolutions with kernel sizes of 3 × 3 and 7 × 7 are used for the input and down-sampling layers, respectively.

The structure of the decoder mirrors the encoder and includes four up-sampling layers and one output layer. In each up-sampling layer, a 3D transposed convolution with kernel size 3 × 3 is used to up-sample the feature map, which is then combined with the corresponding high-resolution features from the encoder. All of the above convolutions are followed by an element-wise rectified linear unit (ReLU). After the feature maps are up-sampled to the original resolution, a 1 × 1 convolution produces the class probabilities for each pixel.
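
The following PyTorch sketch shows one such decoder stage under the description above; the channel counts and the exact transposed-convolution parameters are illustrative assumptions, not necessarily those of our implementation.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder stage: transposed 3D convolution for up-sampling,
    concatenation with the corresponding encoder feature (skip connection),
    then a 3x3x3 convolution followed by ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # kernel_size=3 with stride=2, padding=1, output_padding=1 doubles each spatial dimension
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                                     padding=1, output_padding=1)
        self.conv = nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.up(x)                      # restore a higher resolution
        x = torch.cat([x, skip], dim=1)     # fuse with the high-resolution encoder feature
        return self.relu(self.conv(x))

# Final 1x1x1 convolution maps the full-resolution feature map to per-voxel
# class scores (5 classes here: background plus 4 tumor sub-regions).
classifier = nn.Conv3d(32, 5, kernel_size=1)
```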

Radiologists, in clinical diagnosis, weigh significant findings that appear across several modal images together. Referring to this experience, we introduce an attention mechanism that improves segmentation performance by paying different amounts of attention to different features and different modal images. The new self-attention aware mechanism is described in the following section.

Multi-modality self-attention aware convolution

Recently, attention mechanisms have been used in a variety of tasks [24, 25]; they bias the allocation of resources toward the most informative feature expressions while suppressing less useful ones. Furthermore, SENet [26] introduces a gating mechanism that self-recalibrates the feature map via channel-wise importance. Building on these ideas, SKNet [22] was proposed to adaptively adjust the receptive field sizes of its neurons. Similarly, we propose the Multi-Modality Self-Attention Aware (MMSA) convolution to fuse multi-modal features, which adaptively adjusts the fusion weights according to the contribution of each modality (see Fig. 2).

Fig. 2 Multi-Modality Self-Attention Aware Convolution

For the obtained multi-modal features \( {U}_m\in {R}^{W\times H\times D\times C} \), we first fuse them via an element-wise summation to integrate information:

$$ U=\sum_{m=1}^{M}{U}_m $$
(1)

where W, H, and D are the spatial dimensions of the feature map, C is the number of channels, and m indexes the M modalities.

Then we squeeze each channel of the fused feature map by 3D global average pooling to generate the channel-wise statistics \( z\in {R}^C \). Specifically, the c-th element of z is calculated as:

$$ {z}_c={F}_{gp}\left({U}_c\right)=\frac{1}{W\times H\times D}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{k=1}^{D}{U}_c\left(i,j,k\right) $$
(2)

To realize adaptive weighting of the multi-modal feature maps, M fully connected layers are used, under the guidance of the feature descriptor z, to generate one channel-wise logit vector \( {z}^m\in {R}^C \) per modality. A softmax operator is then applied across modalities to obtain the weights \( {w}^m\in {R}^C \) that adaptively select information from the different modalities:

$$ {w}_c^m=\frac{e^{z_c^m}}{\sum_{m^{\prime}=1}^{M}{e}^{z_c^{m^{\prime}}}},\quad \sum_{m=1}^{M}{w}_c^m=1 $$
(3)

The final feature map \( \tilde{U}\in {R}^{W\times H\times D\times C} \) is obtained as the sum of the modality features weighted by their attention weights:

$$ \tilde{U}=\sum_{m=1}^{M}{w}^m\cdot {U}_m $$
(4)
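
A minimal PyTorch sketch of the fusion defined by Eqs. (1)-(4) is given below; the layer sizes and the toy input shapes are assumptions for illustration only, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class MMSAFusion(nn.Module):
    """Sketch of multi-modality self-attention aware fusion, Eqs. (1)-(4)."""
    def __init__(self, num_modalities, channels):
        super().__init__()
        # One fully connected layer per modality maps the descriptor z
        # to channel-wise logits for that modality.
        self.fcs = nn.ModuleList(
            [nn.Linear(channels, channels) for _ in range(num_modalities)]
        )

    def forward(self, feats):
        # feats: list of M tensors, each of shape (B, C, D, H, W)
        stacked = torch.stack(feats, dim=0)                # (M, B, C, D, H, W)
        U = stacked.sum(dim=0)                             # Eq. (1): element-wise sum
        z = U.mean(dim=(2, 3, 4))                          # Eq. (2): 3D global average pooling -> (B, C)
        logits = torch.stack([fc(z) for fc in self.fcs])   # (M, B, C), one logit vector per modality
        w = torch.softmax(logits, dim=0)                   # Eq. (3): softmax across modalities
        w = w[..., None, None, None]                       # broadcast to (M, B, C, 1, 1, 1)
        return (w * stacked).sum(dim=0)                    # Eq. (4): weighted fusion

# Usage: fuse features from 4 modalities with 64 channels each.
fusion = MMSAFusion(num_modalities=4, channels=64)
feats = [torch.randn(1, 64, 8, 18, 18) for _ in range(4)]
fused = fusion(feats)   # (1, 64, 8, 18, 18)
```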

The system overview of our method is shown in Fig. 3.

Fig. 3 Multi-Modality Self-Attention Aware Deep Network

Results

Dataset and data preprocessing

The dataset for this study comes from BRATS-2015 [27]. The training set consists of 220 subjects with high-grade gliomas and 54 subjects with low-grade gliomas, and the testing set contains images of 110 patients. Each patient was scanned with four sequences: T1, T1c, T2, and FLAIR. The size of each MRI volume is 155 × 240 × 240. All of the images were skull-stripped and re-sampled to an isotropic 1 mm³ resolution, and the four sequences of each patient were co-registered. All ground truth annotations were carefully prepared under the supervision of expert radiologists. The ground truth contains five labels: non-tumor (0), necrosis (1), edema (2), non-enhancing tumor (3), and enhancing tumor (4). Because the original testing set comes without ground truth, we split the training data into two parts: 195 high-grade and 49 low-grade gliomas for training, and the remaining 30 subjects for testing. For data preprocessing, we first extract the region of interest from the original image to prevent the model from focusing on zero regions and getting trapped in a local minimum. Then we resize each axial plane of a volume to 144 × 144 and normalize the intensity of each volume based on its mean and standard deviation (std).
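
A simplified sketch of this preprocessing is given below, assuming a plain bounding-box crop of the non-zero region and per-volume z-score normalization; the exact cropping and resampling rules used in our pipeline may differ in detail.

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess(volume):
    """Rough preprocessing sketch for one modality.
    volume: numpy array of shape (155, 240, 240)."""
    # 1. Crop the bounding box of the non-zero (brain) region of interest.
    nz = np.nonzero(volume)
    vmin = [int(idx.min()) for idx in nz]
    vmax = [int(idx.max()) + 1 for idx in nz]
    roi = volume[vmin[0]:vmax[0], vmin[1]:vmax[1], vmin[2]:vmax[2]]

    # 2. Resize each axial plane to 144 x 144 (depth is kept as-is here).
    t = torch.from_numpy(roi.astype(np.float32))[None, None]   # (1, 1, D, H, W)
    t = F.interpolate(t, size=(int(roi.shape[0]), 144, 144),
                      mode="trilinear", align_corners=False)

    # 3. Z-score normalization with the volume's own mean and std.
    t = (t - t.mean()) / (t.std() + 1e-8)
    return t[0, 0]   # (D, 144, 144)
```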

The evaluation was done for three different tumor sub-compartments (a short sketch of how the labels map to these regions follows the list):

  • Enhancing Tumor (ET): only the active tumor region (label 4, for high-grade cases only)

  • Whole Tumor (WT): all tumor areas (labels 1, 2, 3, and 4)

  • Tumor Core (TC): the tumor core region, excluding edema (labels 1, 3, and 4)
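
For clarity, the grouping of ground-truth labels into these three regions can be written as the following short sketch (label indices follow the dataset description above).

```python
import numpy as np

def region_masks(label_map):
    """Map a BRATS label volume (0=non-tumor, 1=necrosis, 2=edema,
    3=non-enhancing tumor, 4=enhancing tumor) to the three evaluation regions."""
    et = (label_map == 4)                      # Enhancing Tumor
    wt = np.isin(label_map, [1, 2, 3, 4])      # Whole Tumor
    tc = np.isin(label_map, [1, 3, 4])         # Tumor Core (excludes edema)
    return et, wt, tc
```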

Training setup

The training patch size was 144 × 144 × 16, which means that 16 slices of a volume are fed into the network at a time. Our networks were implemented in PyTorch. We use the stochastic gradient descent (SGD) optimizer for training, with an initial learning rate of 10^−3, momentum of 0.9, weight decay of 5 × 10^−4, a batch size of 1, and a maximum of 400 iterations. Network parameters are initialized with Kaiming initialization. The sum of the cross-entropy loss and the Dice loss is used for training.
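
The combined loss can be sketched as follows; the soft Dice formulation shown here is a generic variant and is not necessarily identical to the one used in our implementation, and the optimizer line simply restates the settings above.

```python
import torch
import torch.nn as nn

def dice_loss(probs, target_onehot, eps=1e-5):
    """Soft Dice loss averaged over classes.
    probs, target_onehot: (B, C, D, H, W)."""
    dims = (0, 2, 3, 4)
    intersection = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()

def total_loss(logits, target):
    """Cross-entropy plus Dice. target: (B, D, H, W) int64 labels."""
    ce = nn.functional.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = nn.functional.one_hot(target, num_classes=logits.shape[1])
    onehot = onehot.permute(0, 4, 1, 2, 3).float()
    return ce + dice_loss(probs, onehot)

# Optimizer settings as described above ("model" is the segmentation network):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
#                             momentum=0.9, weight_decay=5e-4)
```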

Evaluation criteria

Three metrics are commonly used in biomedical segmentation: Dice, sensitivity, and positive predictive value.

$$ Dice=\frac{2\times TP}{2\times TP+FP+FN} $$
(5)
$$ Sensitivity=\frac{TP}{TP+FN} $$
(6)
$$ Positive\ Predictive\ Value=\frac{TP}{TP+FP} $$
(7)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Dice (the Dice similarity index) measures how similar the prediction and the ground truth are. A high sensitivity implies that most lesions were segmented successfully. The positive predictive value indicates what proportion of the predicted lesion regions are truly lesions.
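
These three metrics can be computed from binary prediction and ground-truth masks as in the following sketch; the small eps term only guards against division by zero and is not part of the definitions above.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Dice, sensitivity, and positive predictive value for one binary region
    (e.g. whole tumor), from boolean masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-8
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    sensitivity = tp / (tp + fn + eps)
    ppv = tp / (tp + fp + eps)
    return dice, sensitivity, ppv
```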

Experimental results

In Table 1, we compare the performance of the single-path and multi-path encoders on the testing set using the simple structure shown in Fig. 1. The results show that the single-path encoder makes better use of multi-modal information, because concatenating the input data along the channel dimension allows the convolution kernels of the encoder layers to learn and integrate multi-modal information simultaneously. Although a simple multi-path input cannot learn the complementary information of the multi-modal data on its own, sharing parameters across the paths alleviates this problem to a certain extent.

Table 1 Comparison of segmentation results between single and multi-path encoder

In Table 2, we compare the performance of two attention mechanisms. Building on the previous experiment, we added an SE block [26] to each convolution layer of the U-Net [15] structure to weight the multi-modal information along the channel dimension. We then added our MMSA structure to the multi-path network to realize the self-weighted fusion of multi-modal information. The experimental results show that both attention mechanisms improve the performance of the original network; furthermore, our method achieves the best results.

Table 2 Segmentation result of two attention mechanisms

Figure 4 shows examples of segmentation results. For simplicity of visualization, only the Flair image is shown. Different colors represent different categories: green for edema, red for necrosis, yellow for enhancing tumor core, and blue for non-enhancing tumor core. As shown in Fig. 4, our method segments the lesions more accurately and produces fewer misclassified areas than the single-path approach with the SE block, and its results are closer to the ground truth.

Fig. 4 Segmentation result of the brain tumor from a training image

Discussion

On the independent testing set, the model obtains similar results, which shows that it has a certain generalization ability for glioma segmentation. To verify the effectiveness of the self-attention aware convolution, the comparative experiments were carried out under the same training parameters. The starting point of this paper is to study how to make better use of multi-modal data; the brain glioma segmentation task here only serves to verify the performance of the model, and the method can be applied to other multi-modal image segmentation tasks.

To support the proposed multi-modal data fusion scheme, we adopt a multi-path input design. As a consequence, a missing modality or a change in the input order will seriously affect the test results, which limits the flexibility of the model in practice.

Conclusions

In this paper, we introduce an attention mechanism architecture for 3D multi-modal biomedical image segmentation. With the proposed multi-modality self-attention aware convolution, the segmentation result is improved by accounting for the different impacts of features from different modalities. The self-attention aware deep network provides an effective solution to the multi-modal fusion problem through an adaptive weighting and fusion mechanism learned from data. Experimental results on the BRATS-2015 dataset demonstrate that our method is effective and achieves better segmentation results than a single-path network that simply concatenates the modalities without taking their differences into account. In the future, we will further apply the proposed MMSA network to medical segmentation based on multi-parametric MRI in more complex settings, such as liver diagnosis, where lesions closely resemble their surroundings while, at the same time, lesions of the same type show large diversity.