1 Introduction

Multiple Sclerosis (MS) is a chronic inflammatory-demyelinating disease of the central nervous system [1]. Magnetic Resonance Imaging (MRI) is fundamental in MS to characterize and quantify lesions. The number and volume of lesions are used to diagnose MS, to track its progression and to evaluate treatments [2]. Conventional MRI in MS usually consists of Fluid-Attenuated Inversion Recovery (FLAIR), T2-weighted (T2-w) and T1-weighted (T1-w) images. Accurate identification of MS lesions in MR images is extremely difficult due to variability in lesion location, size and shape, in addition to anatomical variability between subjects. Since manual segmentation requires expert knowledge, is time-consuming and is prone to intra- and inter-expert variability, several methods have been proposed to segment lesions automatically [1, 6]. To reduce false lesion detections, segmentation algorithms have to integrate complementary information from multimodal data. Although many solutions have been proposed, e.g. 3-class tissue classification and Machine Learning (ML) approaches [1], the challenge remains to provide segmentation techniques that work regardless of the type of MS lesion or MRI protocol.

MS lesion segmentation algorithms are generally prone to false positive detections, especially voxel-wise approaches, where inference is performed directly on the voxel-wise probabilities. We propose to tackle this problem by replacing classical corrections for multiple testing, e.g. Bonferroni and FDR correction, with a locally multivariate inference: the a-contrario analysis [3].

We present a novel framework for the automated segmentation of MS lesions from multimodal MRI, based on a comparison at the voxel level between a patient and a model of healthy controls with an a-contrario approach. In Sect. 2, we present the steps of the proposed framework and the evaluation metrics. Then, in Sect. 3 we illustrate the experiments, performed on a multi-site clinical dataset. Finally, we discuss the results and conclude in Sect. 4.

2 Materials and Methods

2.1 MS Lesion Detection Framework

The a-contrario Approach. The a-contrario approach is a locally multivariate procedure which uses the size of a local excursion set as its test statistic [3]. An a-contrario framework was previously presented to extract patterns of abnormal perfusion in individual patients [4]. Its general steps can be summarized as follows: (i) a voxel-wise probability map is computed under a background model (i.e. the null hypothesis in statistical decision theory [5]), (ii) a locally multivariate probability is estimated, and (iii) a correction for multiple testing is performed. We propose to apply the a-contrario approach to the segmentation of MS lesions from multimodal MRI as follows.

(i) Voxel-Wise Probability Map. In [7], a general methodology for the comparison, at a voxel level, of a patient model with a group of models was presented. We adopted a similar approach to compute the input voxel-wise probability map of the a-contrario analysis. Precisely, at a given voxel, we compared an intensity vector \(V_{P}\in \mathbb {R}^{h}\), where h is the dimension and P indicates the patient, with a set of intensity vectors \(V_{j}\) from the control group, with \(j = 1, \ldots , N\), where N is the number of controls. These intensity vectors were created from the image modalities (in our workflow, FLAIR and T2-w). The group of controls is assumed to follow a multivariate Normal distribution \(\mathcal {N}(\overline{V},\varSigma _{V})\), where \(\overline{V}\) and \( \varSigma _{V} \) denote respectively the average and covariance matrix of the control group. Thus, the difference statistic between \(V_{P}\) and \(\overline{V}\) can be computed as a squared Mahalanobis distance \( d^{2}(V_{P}) = (V_{P} - \overline{V})^{T} \varSigma _{V}^{-1}(V_{P} - \overline{V})\). \( d^{2}(V_{P})\) varies between zero and infinity, with smaller values when the patient vector is more likely to belong to \(\mathcal {N}(\overline{V},\varSigma _{V})\). The test p-value can be computed as:

$$\begin{aligned} p(V_{P}) = 1 - F_{h,N-h}(d^{2}(V_{P})) \end{aligned}$$
(1)

where \(F_{h,N-h}\) is the cumulative distribution function of a Fisher distribution with parameters h and \( N - h \). The resulting p-value map was used as the input for the estimation of the region-based probabilities.
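As a minimal sketch of step (i) at a single voxel, the snippet below computes the squared Mahalanobis distance and the p-value of Eq. (1) using only the Python standard library. It assumes h = 2 (one FLAIR and one T2-w intensity per voxel), the unbiased sample covariance, and a non-singular covariance matrix. The function names are illustrative and do not come from the original implementation. For h = 2 the Fisher survival function has the closed form \((1 + 2x/(N-2))^{-(N-2)/2}\); for a general h, a library routine such as `scipy.stats.f.sf` would be the natural choice.

```python
def mahalanobis_sq_2d(v_p, controls):
    """Squared Mahalanobis distance of a 2-D patient vector to the control group.

    Assumption: h = 2 and the covariance is the unbiased sample covariance.
    """
    n = len(controls)
    mean = [sum(c[k] for c in controls) / n for k in range(2)]
    # Entries of the symmetric 2 x 2 sample covariance matrix.
    sxx = sum((c[0] - mean[0]) ** 2 for c in controls) / (n - 1)
    syy = sum((c[1] - mean[1]) ** 2 for c in controls) / (n - 1)
    sxy = sum((c[0] - mean[0]) * (c[1] - mean[1]) for c in controls) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumed non-zero (non-singular covariance)
    dx, dy = v_p[0] - mean[0], v_p[1] - mean[1]
    # Quadratic form with the closed-form inverse of a 2 x 2 matrix.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def f_survival_h2(x, n_controls):
    """1 - F_{2, N-2}(x); this closed form is valid only for h = 2."""
    m = n_controls - 2
    return (1.0 + 2.0 * x / m) ** (-m / 2.0)


def voxel_p_value(v_p, controls):
    """p-value of Eq. (1) for one voxel, under the h = 2 assumption."""
    return f_survival_h2(mahalanobis_sq_2d(v_p, controls), len(controls))


# Toy control group (hypothetical intensities), N = 4.
controls = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
p_typical = voxel_p_value((1.0, 1.0), controls)   # patient equals the control mean
p_atypical = voxel_p_value((3.0, 1.0), controls)  # patient far from the control mean
```

A patient vector equal to the control mean yields a distance of zero and hence a p-value of one, while vectors far from the control distribution yield small p-values.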

(ii) Region-Based Probabilities. The uncorrected p-value map was partitioned into regions, namely a grid of spheres of radius r centered at each voxel. A set of uncorrected p-value thresholds \( p = \left\{ p_{1},...,p_{T} \right\} \), i.e. a set of decision thresholds, was defined. For a threshold \( p_{i} \), the p-value map was thresholded to produce a binary map referred to as a rare event map. For each region s, the number of rare events occurring at level \( p_{i} \) was computed and denoted as \( k_{s} \). Hence, the probability \( \pi _{i}^{s} \) of having \( k_{s} \) or more rare events was calculated from the tail of the binomial distribution:

$$\begin{aligned} \pi _{i}^{s} = P (X \ge k_{s} ), \quad \text {with} \quad X \sim B(n,p_{i}) \end{aligned}$$
(2)

where n is the total number of voxels in the sphere s, i.e. the number of tests. The probability \( \pi _{i}^{s} \) associated with a region s was then assigned to its center voxel. Of all region-based probabilities, only the minimum over all p-value thresholds \( p_{i} \), \( \min _{i} \pi _{i}^{s} \), was retained per voxel.
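Step (ii) can be illustrated for a single sphere with the short sketch below. It assumes that a voxel counts as a rare event at level \( p_{i} \) when its uncorrected p-value is less than or equal to \( p_{i} \); the source does not state whether the inequality is strict. The function names are illustrative.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p), as in Eq. (2)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k, n + 1))


def region_probability(p_values_in_sphere, thresholds):
    """Minimum over thresholds p_i of the binomial tail for one sphere.

    Assumption: a rare event at level p_i is a voxel with p-value <= p_i.
    """
    n = len(p_values_in_sphere)  # number of voxels, i.e. number of tests
    best = 1.0
    for p_i in thresholds:
        k_s = sum(1 for p in p_values_in_sphere if p <= p_i)
        best = min(best, binomial_tail(k_s, n, p_i))
    return best


# Hypothetical sphere of 33 voxels, two of which have very small p-values.
sphere = [0.5] * 31 + [1e-6, 1e-6]
pi_min = region_probability(sphere, [1e-5, 1e-4, 1e-3])
```

In this toy case the two rare events are counted at every threshold, and the most stringent threshold gives the smallest tail probability, which is the value retained for the center voxel.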

(iii) Correction for Multiple Testing. The probability map from step (ii) was then corrected for multiple testing. The probability map was converted to a Number of False Alarms (NFA) map, i.e. the number of false detections in the background, as:

$$\begin{aligned} \mathrm {NFA}_{s} = N_{s} \, T \, \min _{i} \pi _{i}^{s} \end{aligned}$$
(3)

where \( {N_{s}} \) and T are the total number of regions and the number of p-value thresholds, respectively. Finally, the NFA map was thresholded: regions with \(\mathrm {NFA}>\epsilon \) were discarded to obtain the \(\epsilon \)-significant regions, where \(\epsilon \) is the detection threshold.
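The correction of step (iii) is a single multiplication followed by a threshold, as the following sketch shows. The function names and the numerical values in the example are hypothetical.

```python
def number_of_false_alarms(min_pi, n_regions, n_thresholds):
    """NFA of Eq. (3): number of regions times number of thresholds times min_i pi_i^s."""
    return n_regions * n_thresholds * min_pi


def is_epsilon_significant(nfa, epsilon=1.0):
    """A region is kept when its NFA does not exceed the detection threshold."""
    return nfa <= epsilon


# Hypothetical example: 1,000,000 spheres, 3 thresholds.
nfa_strong = number_of_false_alarms(1e-8, 1_000_000, 3)  # about 0.03, kept
nfa_weak = number_of_false_alarms(1e-5, 1_000_000, 3)    # about 30, discarded
```

With \(\epsilon = 1\), a region is retained only if fewer than one such detection would be expected by chance over all regions and thresholds under the background model.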

Post-processing. After the a-contrario analysis, the segmentation outcome may still include false positives due to e.g. registration errors, noise and artifacts. A few post-processing steps were therefore performed to reduce these false detections. A candidate lesion was discarded if any of the following conditions held: (i) it did not belong to a hyper-intensity mask, (ii) it was not sufficiently located in the white matter, (iii) its size was smaller than \( 3\,\mathrm {mm^{3}} \). The hyper-intensity mask was computed by performing Otsu’s thresholding [8] on the product of the T2-w and FLAIR images of a subject [9]. The white matter probability map was calculated from the control subjects and then thresholded at 0.7 to obtain a mask.
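The three rejection rules can be sketched as a simple filter applied to each candidate lesion. The text does not specify what "sufficiently located in the white matter" means quantitatively, so the `min_wm_fraction` parameter below is a purely hypothetical placeholder, as is the choice to summarize membership in the hyper-intensity mask by a single boolean. Only the 0.7 white matter threshold and the \( 3\,\mathrm {mm^{3}} \) minimum size come from the text.

```python
def white_matter_fraction(wm_probabilities, wm_threshold=0.7):
    """Fraction of a candidate's voxels inside the thresholded white matter mask."""
    inside = sum(1 for p in wm_probabilities if p >= wm_threshold)
    return inside / len(wm_probabilities)


def keep_candidate(volume_mm3, in_hyperintensity_mask, wm_fraction,
                   min_volume_mm3=3.0, min_wm_fraction=0.5):
    """Return True if a candidate lesion passes all three post-processing rules.

    min_wm_fraction is a hypothetical value; the source does not give one.
    """
    if not in_hyperintensity_mask:      # rule (i)
        return False
    if wm_fraction < min_wm_fraction:   # rule (ii), assumed criterion
        return False
    if volume_mm3 < min_volume_mm3:     # rule (iii)
        return False
    return True


# Hypothetical candidate: 4 voxels of 1 mm^3, three of them in white matter.
fraction = white_matter_fraction([0.9, 0.8, 0.2, 0.75])
kept = keep_candidate(4.0, True, fraction)
```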

2.2 Dataset and Pre-processing

MS Patients. We evaluated the proposed method on the MICCAI 2016 MS lesion segmentation challenge dataset [10]. It included 53 images of patients suffering from MS (15 training images and 38 testing images; evaluation on the testing images can be performed by submission to the evaluation platform\(^{1}\)). They were acquired at four different sites (Siemens 3T Verio, Siemens Aera 1.5T, Philips 3T Ingenia, GE 3T Discovery). The MR imaging protocol included 3D T1-w, T2-w and 3D FLAIR anatomical images. More details on the imaging protocol are available on the challenge website\(^{1}\). For each subject, manual delineations of MS lesions from seven trained radiologists were provided; the ground truth was computed from the seven independent manual segmentations using LOP STAPLE [11].

Group of Controls. 20 MRI datasets of healthy subjects were acquired on a Siemens 3T Verio scanner. The MR imaging protocol included: 3D T1-w (matrix size: \(256\times 256 \times 160\), resolution: \(1 \times 1 \times 1\) \( \mathrm {mm^{3}} \)); T2-w (matrix size: \(192 \times 256 \times 44\), resolution: \(1 \times 1 \times 3\) \( \mathrm {mm^{3}} \)); 3D FLAIR (matrix size: \(256 \times 256 \times 160\), resolution: \(1 \times 1 \times 1\) \( \mathrm {mm^{3}} \)).

Pre-processing. MR images were denoised [12], rigidly registered to the T1-w images [13], skull-stripped [14] and bias corrected [15]. The proposed framework relies on a voxel-wise comparison of a patient to a set of controls. Hence, it requires that patient and control images are in the same coordinate system, i.e. corresponding voxels describe the same spatial position, and that corresponding anatomical tissues show the same intensity profile. A template image was generated from the set of control images by applying a method derived from [16], which constructs an unbiased atlas representing the average intensity and shape of a number of images. Patient images were registered to the template image using a linear registration, based on a block-matching algorithm [13], followed by a dense non-linear registration [17]. In order to reduce inter-subject variability, intensities were normalized using k-means [18].

2.3 Evaluation of MS Lesion Detection

The quality of the proposed segmentation framework was assessed using three metrics:

(i) Dice Similarity Coefficient (DSC), i.e. the spatial overlap between the result R and the ground truth G:

$$\begin{aligned} DSC = 2 \frac{\mid R \cap G \mid }{\mid R \mid + \mid G \mid } \end{aligned}$$
(4)

(ii) Positive Predictive Value (PPV), i.e. the proportion of true positive lesions \( TP_{R} \) among the N segmented lesions:

$$\begin{aligned} PPV = \frac{TP_{R}}{N} \end{aligned}$$
(5)

(iii) F1 score, i.e. the harmonic mean of the lesion sensitivity \( Se_{L} \) and the positive predictive value PPV:

$$\begin{aligned} F1 = 2 \frac{Se_{L} PPV}{Se_{L} + PPV} \end{aligned}$$
(6)

The last two metrics evaluate the algorithm in terms of detection of individual lesions, independently of their contour quality, i.e. at the lesion level and not at the voxel level.
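For illustration, the three metrics can be computed as in the sketch below, where a segmentation is represented as a set of voxel coordinates. The rule used to decide which detected lesions count as true positives is not specified in the text, so the lesion-level counts are taken as inputs here. The conventions for empty inputs are assumptions.

```python
def dice(result_voxels, truth_voxels):
    """DSC between two voxel sets (voxel level)."""
    r, g = set(result_voxels), set(truth_voxels)
    if not r and not g:
        return 1.0  # assumed convention when both segmentations are empty
    return 2.0 * len(r & g) / (len(r) + len(g))


def positive_predictive_value(tp_r, n_segmented):
    """PPV: proportion of true positive lesions among the N segmented lesions."""
    return tp_r / n_segmented if n_segmented else 0.0


def f1_score(se_l, ppv):
    """F1: harmonic mean of lesion sensitivity and PPV."""
    return 2.0 * se_l * ppv / (se_l + ppv) if (se_l + ppv) else 0.0


# Toy example with hypothetical voxel coordinates and lesion counts.
dsc = dice({(0, 0, 0), (0, 0, 1), (0, 1, 1)}, {(0, 0, 1), (0, 1, 1), (1, 1, 1)})
ppv = positive_predictive_value(3, 4)  # 3 of 4 detected lesions are true lesions
f1 = f1_score(0.5, ppv)                # lesion sensitivity of 0.5
```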

Comparison with False Discovery Rate Correction. Inference in voxel-wise comparison approaches is generally performed directly on the p-value map by applying a False Discovery Rate (FDR) correction for multiple comparisons [7]. The widely applied Benjamini-Hochberg procedure controls the expected proportion of false positives among all detections, e.g. it ensures that, on average, no more than a ratio \( q = 5\% \) of detections are false positives [19]. For comparison with our method, we replaced the a-contrario analysis with the FDR correction. Hence, we applied the method in [19] to the voxel-wise probability map obtained from step (i), followed by the same post-processing steps. We evaluated the outcomes using the three metrics presented above. We assessed the significance of the differences in the scores obtained by the two approaches using the Wilcoxon test (a p-value \( < 5\% \) was considered significant).
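The baseline correction can be illustrated with a minimal implementation of the standard Benjamini-Hochberg step-up procedure, applied to a flat list of voxel p-values. This is a generic sketch of the procedure, not the code used in the experiments.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= q * k / m and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy p-values (hypothetical): only the first survives at q = 5%.
detections = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], q=0.05)
```

Because the rule is step-up, a p-value that fails its own rank threshold is still rejected when a larger p-value passes a later one, which is why each voxel is judged relative to the whole map rather than to its local neighborhood.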

3 Results

3.1 Implementation and Computation Time

The framework was implemented in Python and employed in-house tools\(^{2}\) for the pre-processing and post-processing steps. In the a-contrario framework, the radius r of a sphere was equal to two voxels, the set of p-value thresholds was \( p =\left\{ 10^{-5}, 10^{-4}, 10^{-3}\right\} \), and \(\epsilon =1\). The computation time to process a subject on a laptop with an Intel Core i7 CPU 2.40 GHz (8 cores) was approximately 10 min.
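To make the region size concrete, the sketch below enumerates the voxel offsets of a discrete sphere. It assumes that a voxel belongs to the sphere when its Euclidean distance to the center, measured in voxels, is at most r; the exact discretization used in the implementation is not stated. The function name is illustrative.

```python
def sphere_offsets(radius):
    """Integer offsets (dx, dy, dz) with dx^2 + dy^2 + dz^2 <= radius^2."""
    r = int(radius)
    return [(dx, dy, dz)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            for dz in range(-r, r + 1)
            if dx * dx + dy * dy + dz * dz <= radius * radius]


offsets = sphere_offsets(2)
n_voxels = len(offsets)  # 33 under the assumed discretization
```

Under this assumption, a sphere of radius two voxels contains n = 33 voxels away from the image border, which is the number of tests entering Eq. (2), and the three thresholds give T = 3 in Eq. (3).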

3.2 Evaluation of MS Lesion Detection

Figure 1 shows a representative case of an uncorrected p-value map from step (i) and the MS lesions detected with the proposed framework. Figure 2 reports the segmentation outcomes obtained with the two methods, i.e. the proposed method and the FDR-corrected voxel-wise probability map. From visual inspection, it appears that both methods are capable of detecting the true lesions; however, the FDR correction approach seems to be more prone to false positives than the proposed approach.

Fig. 1. (a) Original FLAIR image followed by (b) its uncorrected p-value map and superimposed MS lesion segmentations from (c) experts segmentation and (d) proposed framework.

Fig. 2. (a) Original FLAIR image followed by FLAIR image and superimposed MS lesion segmentations from: (b) experts segmentation, (c) proposed framework, (d) FDR-correction. Arrow heads show some falsely detected lesions: green arrows for false positives in both (c) and (d), red arrows for false positives in (d) only. (Color figure online)

Table 1. Average scores per metric and p-value of the Wilcoxon test on corresponding sets of scores.

For each patient and for both methods, we computed the three evaluation metrics. The average scores are reported in Table 1, together with the outcomes of the Wilcoxon test. In Fig. 3, the three scores for the proposed method are reported for increasing Total Lesion Load (TLL). Figure 4 shows the differences in scores per patient between the proposed framework and the FDR-correction approach for increasing TLL, where positive difference values indicate that the former outperforms the latter. The Wilcoxon test indicates that the scores are significantly different.

Fig. 3. Metrics as obtained with the proposed method (PM) for increasing Total Lesion Load (TLL) per patient. From the left: DSC, PPV, and F1 score. TLL varied from about 0.5 \(\mathrm {cm^{3}}\) to 70 \(\mathrm {cm^{3}}\). A log regression model is fitted to the data and a 95% confidence interval for that regression is shown.

Fig. 4. The differences in scores as obtained with the two approaches for increasing Total Lesion Load. From the left: DSC, PPV, and F1 score. A linear regression model is fitted to the data and a 95% confidence interval for that regression is shown.

Generally, the proposed method outperforms the classical approach. This is particularly evident for low lesion loads, whereas the two performances tend to converge for high lesion loads. The highest improvements of the proposed method over the FDR correction approach were 36% in DSC (TLL \(\mathrm {\approx 3\,cm^{3}}\)), 73% in PPV (TLL \(\mathrm {\approx 3\,cm^{3}}\)), and 31% in F1 score (TLL \(\mathrm {\approx 8\,cm^{3}}\)). The average improvements were about 10% in DSC, 20% in PPV, and 10% in F1 score. Overall, we observed that all the scores tend to decrease as the total lesion load decreases. This can be partially explained by the disagreement among the experts, which increases, and hence becomes more relevant, for lower lesion loads.

4 Conclusion

In this paper, we have proposed an automatic and unsupervised framework for the segmentation of MS lesions from multimodal MRI. It computes a voxel-wise probability map by comparing a patient with a group of controls, and it estimates locally multivariate probabilities using an a-contrario approach. Experiments have shown that the method outperforms the classical FDR-correction approach. Improvements increase with decreasing total lesion load, indicating that the proposed method is more specific and sensitive for patients with low lesion loads. The performance of the method relies on parameters, i.e. the size of a region and the set of thresholds, that must be carefully tuned on a set of cases.

Evaluation was performed on the MICCAI 2016 MS lesion segmentation challenge dataset, comprising clinical images acquired with different MR scanners and acquisition protocols [10]. This is an important aspect when developing techniques that are meant to be employed in clinical practice. Compared to the results from the challenge results board (see footnote 1), the accuracy of the proposed framework was similar to that of the top-ranked strategies. Compared to other multivariate approaches, such as Machine Learning techniques, it has the clear advantage of being simple and not computationally intensive. These are important benefits, as the primary objective of the proposed framework is to assist radiologists in clinical practice.