1 Introduction

In medical image analysis, image segmentation is an important task and is essential for visualizing and quantifying the severity of pathology in clinical practice. Multi-modality imaging provides complementary information to discriminate specific tissues, anatomies and pathologies. Numerous automatic approaches have been developed to speed up medical image segmentation, such as multi-atlas based approaches [4] and model-based approaches [12].

Both strategies are typically optimized for a specific set of multi-modal images and usually require these modalities to be available. In clinical settings, image acquisition and patient artifacts, among other hurdles, make it difficult to fully exploit all the modalities; as such, it is common for one or more modalities to be missing for a given instance. This problem is not new, and the subject of missing data analysis has spawned an immense literature in statistics (e.g. [13]). In medical imaging, a number of approaches have been proposed, some of which require re-training a specific model with the missing modalities or synthesizing them [6]. Synthesis can improve multi-modal classification by adding information about the missing modalities in the context of a simple classifier such as random forests [11]. Approaches that imitate, with fewer features, a classifier trained with a complete set of features have also been proposed [7]. Nevertheless, it stands to reason that a sufficiently expressive model should be capable of extracting relevant features from just the available modalities, without relying on artificial intermediate steps such as imputation or synthesis.

This paper proposes a deep learning framework (HeMIS) that can segment medical images from incomplete multi-modal datasets. Deep learning [3] has become increasingly popular in medical image processing, both for segmentation and for synthesizing missing modalities [11]. Here, the proposed approach learns, separately for each modality, an embedding of the input image into a latent space. In this space, arithmetic operations (such as computing first and second moments of a collection of vectors) are well defined and can be taken over the different modalities available at inference time. These computed moments can then be further processed to estimate the final segmentation. This approach has the advantage of being robust to any combinatorial subset of available modalities provided as input, without the need to learn a combinatorial number of imputation models.

2 Method

2.1 Hetero-Modal Image Segmentation

Typical convolutional neural network (CNN) architectures take a multiplane image as input and process it through a sequence of convolutional layers (followed by nonlinearities such as \(\text {ReLU}(\cdot ) \equiv \max (0,\cdot )\)), alternating with optional pooling layers, to yield a per-pixel or per-image output [3]. In such networks every input plane is assumed to be present within a given instance: since the very first convolutional layer mixes input values coming from all planes, any missing plane introduces a bias in the computation that the network is not equipped to deal with.

We propose an approach wherein each modality is initially processed by its own convolutional pipeline, independently of all others. After a few independent stages, feature maps from all available modalities are merged by computing mapwise statistics such as the mean and the variance, quantities whose expectation does not depend on the number of terms (i.e. modalities) that are provided. After merging, the mean and variance feature maps are concatenated and fed into a final set of convolutional stages to obtain network output. This is illustrated in Fig. 1. In this procedure, each modality contributes a separate term to the mean and variance; in contrast to a vanilla CNN architecture, a missing modality does not throw this computation off: the mean and variance terms simply become estimated with larger uncertainty. In seeking to be robust to any subset of missing modalities, we call this approach hetero-modal rather than multi-modal, recognizing that in addition to taking advantage of several modalities, it can take advantage of a diverse, instance-varying, set of modalities. In particular, it does not require that a “least common denominator” modality be present for every instance, as sometimes needed by common imputation methods.
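To make this invariance concrete, the following sketch (plain NumPy, with purely illustrative feature-map counts and image sizes) computes the abstraction statistics over whichever modality feature maps happen to be available; the fused representation has the same shape whether one or all modalities contribute.

import numpy as np

def fuse_feature_maps(maps):
    """Fuse per-modality feature maps into mean and variance maps.

    maps: list of arrays, one per *available* modality,
          each of shape (n_features, H, W).
    Returns an array of shape (2 * n_features, H, W), independent
    of how many modalities were provided.
    """
    stack = np.stack(maps, axis=0)        # (n_modalities, n_features, H, W)
    mean = stack.mean(axis=0)
    # Unbiased variance; defined as zero when only one modality is present.
    var = stack.var(axis=0, ddof=1) if len(maps) > 1 else np.zeros_like(mean)
    return np.concatenate([mean, var], axis=0)

# Two instances with different modality subsets yield same-shaped outputs.
full = fuse_feature_maps([np.random.randn(48, 64, 64) for _ in range(4)])
partial = fuse_feature_maps([np.random.randn(48, 64, 64)])
assert full.shape == partial.shape == (96, 64, 64)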

Fig. 1. Illustration of the Hetero-Modal Image Segmentation architecture. Modalities available at inference time, \(M_k\), are provided to independent modality-specific convolutional layers in the back end. Feature map statistics (first and second moments) are computed in the abstraction layer and, after concatenation, are processed by further convolutional layers in the front end, yielding pixelwise classification outputs.

Let \(k \in \mathcal {K} \subseteq \{1, \ldots , N\}\) denote a modality within the set of available modalities for a given instance, and let \(M_k\) represent the image of the k-th modality. For simplicity, in this work we assume 2D data (e.g. a single slice of a tomographic image), but the approach extends in an obvious way to full 3D volumes. As shown in Fig. 1, HeMIS proceeds in three stages:

1. Back End: In our implementation, this consists of two convolutional layers with ReLU, the second followed by a (2, 2) max-pooling layer, denoted respectively \(C_k^{(1)}\) and \(C_k^{(2)}\). To ensure that the output layer has the same number of pixels as the input image, the convolutions are zero-padded and the stride for all operations (including max-pooling) is 1. In particular, pooling with a stride of 1 does not downsample, but simply “thickens” the feature maps; this is found to add some robustness to the results. The number of feature maps in each layer is given in Fig. 1. Let \(C_{k,\ell }^{(j)}\) be the \(\ell \)-th feature map of \(C_k^{(j)}\).

2. Abstraction Layer: Modality fusion is computed here, as first and second moments across available modalities in \(C^{(2)}\), separately for each feature map \(\ell \),

$$ \widehat{\mathrm {E}}_{\ell }\left[ C^{(2)}\right] = \frac{1}{|\mathcal {K}|} \sum _{k \in \mathcal {K}} C_{k,\ell }^{(2)} \quad \text {and}\quad \widehat{\mathrm {Var}}_{\ell }\left[ C^{(2)}\right] = \frac{1}{|\mathcal {K}|-1} \sum _{k \in \mathcal {K}} \left( C_{k,\ell }^{(2)} - \widehat{\mathrm {E}}_{\ell }\left[ C^{(2)}\right] \right) ^2\!\!, $$

with \(\widehat{\mathrm {Var}}_{\ell }[C^{(2)}]\) defined to be zero if \(|\mathcal {K}|=1\) (a single available modality).

3. Front End: Finally, the front end combines the merged modalities to produce the final model output. In our implementation, we concatenate all \(\widehat{\mathrm {E}}\left[ C^{(2)}\right] \) and \(\widehat{\mathrm {Var}}\left[ C^{(2)}\right] \) feature maps, pass them through a convolutional layer \(C^{(3)}\) with ReLU activation, and finish with a final layer \(C^{(4)}\) that has as many feature maps as there are target segmentation classes. The pixelwise posterior class probabilities are given by applying a softmax function across the \(C^{(4)}\) feature maps, and a full image segmentation is obtained by taking the pixelwise most likely posterior class. No further postprocessing of the resulting segment classes (such as smoothing) is done. A code sketch of all three stages is given below.
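The following is a minimal PyTorch sketch of the three stages. Kernel sizes, feature-map counts and the 3×3 stride-1 pooling window (chosen here so the spatial size is exactly preserved) are illustrative assumptions rather than the published configuration; the class and attribute names (HeMISSketch, backends, frontend) are likewise hypothetical.

import torch
import torch.nn as nn

class HeMISSketch(nn.Module):
    """Illustrative sketch of the three HeMIS stages (sizes are assumptions)."""

    def __init__(self, n_modalities=4, n_classes=2, feat=48):
        super().__init__()
        # Back end: one independent convolutional pipeline per modality.
        self.backends = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, feat, kernel_size=5, padding=2), nn.ReLU(),     # C_k^(1)
                nn.Conv2d(feat, feat, kernel_size=5, padding=2), nn.ReLU(),  # C_k^(2)
                # Stride-1 pooling "thickens" rather than downsamples; a 3x3
                # window with padding 1 is used here so the spatial size is
                # preserved (the paper describes a (2, 2) window).
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            )
            for _ in range(n_modalities)
        ])
        # Front end: operates on the concatenated [mean, variance] maps.
        self.frontend = nn.Sequential(
            nn.Conv2d(2 * feat, feat, kernel_size=5, padding=2), nn.ReLU(),  # C^(3)
            nn.Conv2d(feat, n_classes, kernel_size=1),                        # C^(4)
        )

    def forward(self, images, available):
        """images: dict {modality index: tensor of shape (B, 1, H, W)};
        available: list of modality indices present for this instance."""
        maps = torch.stack([self.backends[k](images[k]) for k in available])  # (K, B, feat, H, W)
        mean = maps.mean(dim=0)
        var = (maps.var(dim=0, unbiased=True) if len(available) > 1
               else torch.zeros_like(mean))
        fused = torch.cat([mean, var], dim=1)              # abstraction layer
        return torch.softmax(self.frontend(fused), dim=1)  # pixelwise class posteriors

At inference, the same weights handle any non-empty subset of modalities, e.g. model(images, available=[0]) for a single-modality case or model(images, available=[0, 1, 2, 3]) when all four are present (with a hypothetical index convention).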

3 Data and Implementation Details

We studied the HeMIS framework on two neurological pathologies: Multiple Sclerosis (MS) with the MS Grand Challenge (MSGC) and a large Relapsing Remitting MS (RRMS) cohort, as well as glioma with the Brain Tumor Segmentation (BRATS) dataset [8].

MS MSGC: The MSGC dataset [10] provides 20 training MR cases with manual ground truth lesion segmentation and 23 testing cases from the Boston Children’s Hospital (CHB) and the University of North Carolina (UNC). We downloaded the co-registered T1W, T2W and FLAIR images for all 43 cases as well as the ground truth lesion mask images for the 20 training cases. While lesion masks for the 23 testing cases are not available for download, an automated system is available to evaluate the output of a given segmentation algorithm.

RRMS: This dataset is obtained from a multi-site clinical study with 300 RRMS patients (mean age 37.5 yrs, SD 10.0 yrs). Each patient underwent an MRI that included FLAIR, T1W, T2W and T1 post-contrast (T1C) images.

BRATS: The BRATS-2015 dataset contains 220 subjects with high grade tumors and 54 subjects with low grade tumors. Each subject has four MR modalities (FLAIR, T1W, T1C and T2W) and comes with a voxel-level segmentation ground truth with 5 labels: healthy, necrosis, edema, non-enhancing tumor and enhancing tumor. As done in [8], we transform each segmentation map into 3 binary maps corresponding to 3 tumor categories, namely Complete (which contains all tumor classes), Core (which contains all tumor subclasses except “edema”) and Enhancing (which includes only the “enhancing tumor” subclass). For each binary map, the Dice Similarity Coefficient (DSC) is calculated [8].
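As an illustration of this evaluation protocol, the sketch below (NumPy) maps a label volume to the three binary tumor regions and computes the Dice Similarity Coefficient for each; the integer label encoding is an assumption made for the example, not taken from the dataset release.

import numpy as np

# Assumed label encoding (illustrative only):
# 0 = healthy, 1 = necrosis, 2 = edema, 3 = non-enhancing tumor, 4 = enhancing tumor.
def to_binary_maps(labels):
    """Map a per-voxel label volume to the three evaluated tumor regions."""
    return {
        "Complete":  np.isin(labels, [1, 2, 3, 4]),  # all tumor classes
        "Core":      np.isin(labels, [1, 3, 4]),     # everything except edema
        "Enhancing": labels == 4,                    # enhancing tumor only
    }

def dice(pred, ref, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum() + eps)

# Example: one DSC per tumor category for a predicted and a reference labeling.
# scores = {name: dice(to_binary_maps(pred)[name], m) for name, m in to_binary_maps(ref).items()}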

BRATS-2013 contains two test datasets: Challenge and Leaderboard. The Challenge dataset contains 10 subjects with high grade tumors, while the Leaderboard dataset contains 15 subjects with high grade tumors and 10 subjects with low grade tumors. No ground truth is provided for these datasets; quantitative evaluation is therefore performed via an online evaluation system [8]. In our experiments we used the Challenge and Leaderboard datasets to compare the HeMIS segmentation performance to the state of the art when trained on all modalities. To deal with class imbalance, we adopt the patch-wise training procedure described in [5]. We make the HeMIS architecture robust to missing modalities by randomly dropping any number of modalities for a given training example; we refer to this training scheme as pseudo-curriculum training.
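The per-example modality dropping can be sketched as follows (Python); the function name and the uniform sampling scheme are illustrative, and any warm-up schedule implied by the term “pseudo-curriculum” is not reproduced here.

import random

def sample_available_modalities(n_modalities=4):
    """Draw a random non-empty subset of modality indices for one training example."""
    k = random.randint(1, n_modalities)                  # keep at least one modality
    return sorted(random.sample(range(n_modalities), k))

# During training, each example is presented with only the sampled subset,
# e.g. available = sample_available_modalities(); the loss is computed as usual.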

4 Experiments and Results

We first validate HeMIS performance against state-of-the-art segmentation methods on the two challenge datasets: MSGC and BRATS. Since the test data and the ranking table for BRATS 2015 are not available, we submitted results to the BRATS 2013 challenge and leaderboard. These results are presented in Table 1. As we observe, HeMIS outperforms Tustison et al. [12], the winner of the BRATS 2013 challenge, on most tumor region categories.

The MSGC dataset illustrates a direct application of HeMIS flexibility, as only three modalities (T1W, T2W and FLAIR) are provided for a small training set. Given the small number of subjects, we first trained HeMIS on the RRMS dataset with four modalities and fine-tuned it on MSGC. Our results were submitted to the MSGC website, with a results summary appearing in Table 2. The MSGC segmentation results include three other supervised approaches; when compared to them, HeMIS obtains highly competitive results with a combined score of 83.2%, where 90.0% would represent human performance given inter-rater variability.
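A minimal sketch of this transfer step, assuming the hypothetical HeMISSketch module from the earlier example, could look as follows; the checkpoint path, optimizer and learning rate are illustrative assumptions.

import torch

# Start from weights learned on the RRMS cohort (hypothetical checkpoint name),
# then continue training on the smaller, three-modality MSGC data.
model = HeMISSketch(n_modalities=4, n_classes=2)
model.load_state_dict(torch.load("hemis_rrms_pretrained.pt"))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)   # assumed small fine-tuning rate

# MSGC examples simply never include T1C: the `available` list passed to
# model.forward(...) omits that modality's index; no other change is needed.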

Table 1. Comparison of HeMIS when trained on all modalities against BRATS-2013 Leaderboard and Challenge winners, in terms of Dice Similarity (scores from [8]).
Table 2. Results of the full dataset training on the MSGC. For each rater (CHB and UNC), we provide the volume difference (VD), surface distance (SD), true positive rate (TPR), false positive rate (FPR) and the method’s score as in [10].

The main advantage of HeMIS lies in its ability to deal with missing modalities, specifically when different subjects are missing different modalities. To illustrate the model’s flexibility in such circumstances, we compare HeMIS to two common approaches for dealing with randomly missing modalities. The first, mean-filling, replaces a missing modality with the modality’s mean value; in our case, since all means are zero by construction, replacing a missing modality with zeros amounts to imputing with the mean. The second approach is to train a multi-layer perceptron (MLP) to predict the expected value of a specific missing modality given the available ones. Since such a network is trained for a single prediction task, we need to train 28 different MLPs (one for each \(\circ \) in Table 3 for a given dataset) to account for the different possibilities of missing modalities. We used the same MLP architecture for all these models: 2 hidden layers with 100 hidden units each, trained to minimize the mean squared error.
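The two baselines can be sketched as follows (PyTorch). The input/output conventions, and the choice of predicting one voxel intensity from the co-located intensities of the available modalities, are assumptions consistent with the description above rather than the exact implementation.

import torch
import torch.nn as nn

def mean_fill(images, available, n_modalities=4):
    """Mean-filling baseline: since inputs are standardized to zero mean,
    a missing modality is imputed with an all-zero image."""
    template = images[available[0]]
    return [images[k] if k in available else torch.zeros_like(template)
            for k in range(n_modalities)]

class ImputationMLP(nn.Module):
    """Imputation MLP (one per missing-modality pattern): predicts a missing
    modality's voxel intensity from the co-located available intensities.
    Two hidden layers of 100 units, trained with nn.MSELoss()."""

    def __init__(self, n_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 1),          # predicted intensity of the missing modality
        )

    def forward(self, x):               # x: (n_voxels, n_in) available intensities
        return self.net(x)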

Table 3. Dice similarity coefficient (DSC) results on the RRMS and BRATS test sets (%). The table shows the DSC for different MRI modalities being either absent (\(\circ \)) or present (\(\bullet \)), in order of FLAIR (F), T1W (\(T_1\)), T1C (\(T_1c\)), T2W (\(T_2\)). Results are reported for HeMIS, Mean (mean-filling) and the imputation MLP.

Table 3 shows the DSC for this experiment on the test sets. On the BRATS dataset, HeMIS achieves the best segmentation for the Core category in almost all cases (14 out of 15), and leads in most cases for the Complete and Enhancing categories (10 and 9 cases out of 15, respectively). Moreover, the mean-filling approach hardly ever outperforms HeMIS or MLP-imputation. These results are consistent with the MS lesion segmentation dataset, where HeMIS outperforms the other imputation approaches in 9 out of 15 cases. In scenarios where only one or two modalities are missing, both HeMIS and MLP-imputation obtain good results, but HeMIS outperforms the latter in most cases on both datasets. On BRATS, when 3 out of 4 modalities are missing, HeMIS outperforms the MLP in a majority of cases. Moreover, whereas HeMIS performance degrades only gradually as additional modalities go missing, the drop for MLP-imputation and mean-filling is much more severe. On the RRMS cohort, MLP-imputation appears to obtain slightly better segmentations when only one modality is available.

Fig. 2. HeMIS segmentation results on BRATS and MS subjects. For BRATS (first row) the segmentation colors describe necrosis (blue), non-enhancing tumor (yellow), active core (orange) and edema (green). For the MS case, the lesions are highlighted in red. The columns present the results overlaid on a FLAIR image for different combinations of input modalities, with the ground truth in the last column.

Although it is expected that tumor sub-label segmentations should be less accurate with fewer modalities, we should still hope for the model to report a sensible characterization of the tumor “footprint”. While MLP-imputation and mean-filling fail in this respect, HeMIS achieves this goal well, outperforming the alternatives in almost all cases for the Complete and Core tumor categories. This can also be seen in Fig. 2, where we show how adding modalities to HeMIS improves its ability to achieve a more accurate segmentation. From Table 3, we can also infer that the FLAIR modality is the most relevant for identifying the Complete tumor, while T1C is the most relevant for identifying the Core and Enhancing tumor categories. On the RRMS dataset, HeMIS results also degrade more slowly than the other imputation approaches, preserving good segmentation quality as modalities go missing. Indeed, as seen in Fig. 2, even though HeMIS already produces good segmentations with FLAIR alone, it is capable of further refining its results when modalities are added, removing false positives and improving the outlines of the correctly identified lesions or tumor.

5 Conclusion

We have proposed a new fully automatic segmentation framework for heterogeneous multi-modal MRI using a specialized convolutional deep neural network. The embedding computed by the multi-modal CNN back end makes it possible to train on and segment datasets with missing modalities. We carried out an extensive validation on MS and glioma and achieved state-of-the-art segmentation results on two challenging neurological pathology image processing tasks. Importantly, we contrasted the graceful performance degradation of the proposed approach as modalities go missing with that of other popular imputation approaches, and the proposed approach achieves this without requiring specific models to be trained for every potential combination of missing modalities. Future work should concentrate on extending the approach to modalities beyond MRI, such as CT, PET and ultrasound.