Introduction

Clinical progression and disability accumulation are highly heterogeneous in all phenotypes of multiple sclerosis (MS) [1]. The variability of the disease course is guided by inflammation, axonal degeneration and remyelination phenomena [2], but how these phenomena influence the course of the disease is not yet clear. Several disease modifying treatments are now available to improve long-term prognosis of patients with MS [3]. However, choosing the right treatment is still a challenge for clinicians who need to balance the drug safety profile with the risk of disability progression on individual basis.

Neuroimaging techniques are powerful tools to investigate MS [4] and Magnetic Resonance Imaging (MRI) findings are considered good predictors of conversion from clinically isolated syndrome (CIS) to clinically definite MS [5], as well as of long-term disability [6, 7]. Several MRI studies have highlighted various aspects of tissue damage in MS [8] demonstrating a prognostic role of T2-hyperintense lesions, global and cortical atrophy [9], as well as that of damage to some key structures, such as thalamus [10] and cerebellum [11, 12]. Among these, lesion burden appeared to be a relevant predictor of long-term cognitive outcome [13] and disease progression [14]. Further factors anticipating disability progression in MS have been shown to be structural and microstructural damage in the cerebellum [15], thalamus [16] and normal appearing white matter (NAWM) [17].

Recently, machine learning (ML) techniques have been applied to analyze clinical and radiological data in MS. Indeed, ML classifications represent a valuable tool for predicting conversion from CIS to MS [18] and to reliably distinguish patients with MS from healthy subjects [19]. Classifiers were also able to identify brain regions, and both functional and structural connections relevant to better understand the disease [19].

Till now, ML techniques have been applied mainly for diagnostic purpose and the utility of those tools for the prediction of disability progression has not been explored yet. This point is crucial in MS management, considering that identifying patients with higher risk of disability progression might promptly recognize subjects who may benefit from a more aggressive therapeutic approach.

We hypothesized that a data-driven approach on clinical and MRI data may predict disability progression in single subjects with MS. We tested this hypothesis in this pilot study, applying ML classifiers built on clinical data and neuroradiological features. Moreover, we investigated among clinical and neuroradiological features what ML classifiers are able to identify the most important factors in predicting disability progression in MS.

Materials and methods

A flowchart showing the methods of this work is proposed in Fig. 1. A detailed step-by-step description of all the procedures follows.

Fig. 1
figure 1

Flowchart of the methods. a MRI preprocessing. From 3D-T1-weighted images cerebellar volume was calculated via SUIT, thalamic volume using FSL’s FIRST and gray matter and white matter volumes via FSL’s SIENAX. On the T2-weighted images lesions were identified and segmented using JIM, to calculate the lesion load. By combining lesion and white matter masks, we calculated the normal appearing white matter mask for each subject. From diffusion-weighted images both fractional anisotropy and mean diffusivity maps were calculated, and results and were combined with the normal appearing mask previously obtained to extract microstructural metrics of the normal appearing white matter. b Feature selection. Clinical and neuroradiological, were selected together with binary classes (stable patients = 0, patients with disease progression = 1) and a random feature and used to describe the sample of patients with multiple sclerosis. c Machine learning classifier. After having checked features for co-linearity, a random forest classifier was applied 1000 times feature built on both clinical and radiological features, clinical features alone, radiological features alone. Out-of-Bag test was used to avoid overfitting and performances were evaluated via the confusion matrix of the surviving classifiers. DWI diffusion-weighted images, GM gray matter, WM white matter, FA fractional anisotropy, MD mean diffusivity

Sample selection

Study protocols were approved by the ethical committee of Policlinico Umberto I/Sapienza University (Rome, Italy, Site 1) and Ethics Committee for Biomedical Activities “Carlo Romano” of Federico II University (Naples, Italy, Site 2) as appropriate. All subjects provided written informed consent.

We analyzed clinical and MRI data of subjects with MS collected by two centers: the Human Neuroscience Department of Sapienza University (Site 1) and the MS center of the Federico II University (Site 2).

From both sites, clinical evaluations were performed between 2010 and 2018, while all MRI acquisitions were performed between 2010 and 2015. MS patients were included according to the following selection criteria: diagnosis of MS according to the Mc Donald’s criteria [20, 21]; age between 18 and 70 years; clinical assessment and MRI examination not more than one month apart; clinical follow-up available after a minimum of 2 years from the MRI examination. On both visits, patients were examined by expert neurologists assessing the clinical status via the Expanded Disability Status Scale (EDSS). At the follow-up examination, disability progression was defined as 1.5-point increase for patients with a baseline EDSS of 0, 1 point for scores from 1.0 to 5.0, and 0.5 points for scores equal or higher than 5.5 [22].

Finally, at both sites patients underwent a MRI scan with a 3 T system (Verio, Siemens equipped with a 12-channelhead coil at Site 1; Trio, Siemens, equipped with a 8-channel head coil at Site 2), including tridimensional (3D) T1-weighted (-w); T2-w, either dual echo or FLAIR; diffusion-w images. Detailed information on acquisition parameters are reported in Supplementary Materials.

MRI data analysis

T2-weighted lesion load (T2LL) was calculated independently by three expert operators, with five to eleven years of experience, who identified the hyperintense WM lesions, using a semi-automatic technique (Jim, Xinapse System, Leicester, UK; http://www.xinapse.com). In the two sites, a lesion mask common to all patients was built including brain areas presenting lesions in at least the 10% of the sample [23].

Gray (GM) and white (WM) matter volumes were calculated via SIENAX, implemented in FSL environment (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/), while thalamic volumes (calculated as the sum of right and left) were obtained via FSL FIRST. Cerebellar lobules, summed to obtain total cerebellar volume, were identified and calculated using the Spatially Unbiased Infratentorial Toolbox (SUIT, http://www.diedrichsenlab.org/imaging/suit.htm), implemented in SPM12. Lastly, GM, WM, thalamic and cerebellar volumes were normalized to the scaling factor (obtained in SIENAX) to account for head size.

Diffusion maps were generated using the DTI_fit model, part of the FSL Diffusion Toolbox (http://www.fmrib.ox.ac.uk/fsl/fdt). Combining the results of FA, or MD, for each patient with WM mask and the image of the common lesions, we obtained individual maps of the FA, or MD, for NAWM, that were further averaged to obtain a single value for FA-NAWM and MD-NAWM [23].

Statistical analyses

Statistical analyses were performed with Matlab R2018b (https://it.mathworks.com/). Group comparison between clinical and radiological features of the two sites was performed using a χ-square to test sex, MS phenotype and disability progression, while a Mann–Whitney non parametric test was used for the remaining variables (age, disease duration, EDSS at baseline, time between visits, T2LL, GM, WM, thalamic and cerebellar volumes, as well as FA- and MD-NAWM values).

To avoid data redundancy and reduce variance in classifier performances, we investigated features’ co-linearity of clinical (age, sex, phenotype, disease duration, EDSS at baseline, time between visits and therapy at baseline) and neuroradiological (T2LL, GM, WM, thalamic and cerebellar volumes, FA- and MD-NAWM) features using a partial correlation analysis. Specifically, we performed partial correlation between each pair of features controlling for all the remaining features. For all analyses, a significance threshold was set for p < 0.01 without multiple comparison correction that allowed us to be conservative in the removal of confounding effects.

ML classifiers

ML classifiers were performed and analyzed with Python, via the browser version of Jupyter Notebook application (https://jupyter.org/). We used a supervised ML technique and identified two binary classes: patients who were stable (negative output, 0) and patients who worsened in disability (positive output, 1) at the follow-up visit. We built a matrix whose rows represented patients and columns represented features. For each patient, features included clinical and neuroradiological data, MR acquisition site and class. A last column consisting of a randomly generated variable was added to the matrix. We applied a random forest algorithm for 1000 times and evaluated the confusion matrices of resulting classifiers.

In each of the 1000 classifiers, we used features calculated on samples from both sites to evaluate classifier performances. Patients from both sites were randomly selected to obtain the same numerosity in both classes (50% stable and 50% with disease progression patients) to avoid bias due to class numerosity. To train the ML classifier we randomly selected 80% of the sample with the same numerosity of stable and with disease progression patients and left the other 20% for validation. To avoid overfitting, we calculated each classifier accuracy both on the validation set and in a subsample of the training set and considered in the following analysis only classifiers showing differences in accuracies smaller than 0.02, i.e. Out-of-Bag test. The 0.02 threshold was selected being reasonably low to allow considering the results of the Out-of-Bag test and the training to be consistent, since there is no typical value reported in the literature, at the best of our knowledge.

Classifier performances were evaluated via accuracy, area under the true positive versus false positive rates curve (AUC), sensitivity (true positive rate) and specificity (true negative rate). Within each classifier including both clinical and neuroradiological features, permutation importance was calculated for each feature and for the random variable: only parameters whose importance was higher than the random variable’s importance, were considered actually relevant in the classifier.

Results

Sample selection

Average and standard deviation of clinical and neuroradiological features for the entire sample built from the two sites, as well as for both Site 1 and Site 2 separately, are reported in Table 1.

Table 1 Clinical and neuroradiological features

In the entire sample, participants were 39.66 ± 10.23 years old, range [19.50–70.30], with 104 females and 59 males. Of these 163 subjects, at baseline 122 patients had a relapsing–remitting form of MS, while the remaining 41 patients had a progressive form. Patients showed a disease duration of 9.90 ± 8.06 years, range [0.00–37.00], with a time between visits that was 3.93 ± 0.95 years, range [2,3,4,5,6]. At the follow-up examination, disease progression was observed in 58 over 163 patients (35.6% of the sample), whose EDSS distributions at baseline and at follow-up were respectively 3.5 [0.0–7.0] and 4.5 [1.5–7.0], while in the remaining 105 patients the EDSS remained stable at 3.0 [0.0–7.5].

Since patients who remained stable at follow-up were about 2/3 of both samples, and we aimed at balancing size of classes, at each of the 1000 performed models we picked all the patients who experienced disability progression from both sites’ sample and also randomly picked an equal number of patients who remained stable, reaching a final number of 72 participants from Site 1 and 44 participants from Site 2.

Feature selection

Due to differences in demographic and clinical characteristics of participants between the two sites, we performed correlation analysis in the two samples separately. We found a number of features, which were inter-correlated, in both Site 1 and Site 2 samples (Tables 2, 3). After removing inter-correlated features (disease duration, MD-NAWM values, GM and WM volumes), ML classifiers included age, sex, disease phenotype, EDSS score and therapy at baseline, time between visits as clinical features, while T2LL, thalamic and cerebellar volumes and FA-NAWM were included as neuroradiological features.

Table 2 Correlation matrix, Site 1 sample
Table 3 Correlation matrix, Site 2 sample

ML classifiers

Out of 1000 classifiers built on both clinical and neuroradiological features, 162 classifiers had a difference between the accuracy calculated in the validation set and in the subsample of the training set smaller than 0.02. Accuracy, AUC, sensitivity and specificity on these 162 classifiers are displayed in Fig. 2. The classifier with the highest resulting metrics proved to have an accuracy = 0.79, an AUC = 0.81, with a sensitivity and a specificity of 0.90 and 0.71, respectively (Table 4).

Fig. 2
figure 2

Metrics of classifier built on clinical and radiological features. Histogram of accuracy (ACC) and area under the true positive versus true negative rate curve (AUC), sensitivity and specificity obtained from the 150 classifiers, surviving the Out-of-Bag test, performed on the sample features. On the y-axis number of classifiers is displayed

Table 4 Metrics

Then we tested separately classifiers built on neuroradiological features alone (T2LL, FA-NAWM, thalamic and cerebellar volume) and clinical features alone (age, sex, disease phenotype, EDSS score and therapy at baseline, time between visits). For the latter separated analyses, we followed the same procedure implemented to analyze neuroradiological and clinical features together, i.e. balancing the numerosity of class 0 and class 1, adding a random feature and validating for overfitting via the Out-of-Bag test. We found that classifiers built on neuroradiological features were more accurate and sensitive than those built on clinical features or on mixed clinical-neuroradiological features, showing better accuracy, AUC, sensitivity and specificity values (Table 4; Fig. 3).

Fig. 3
figure 3

Metrics of classifier built on radiological/clinical features. Histograms of accuracy (ACC) and area under the true positive versus true negative rate curve (AUC), sensitivity and specificity, surviving the Out-of-Bag test obtained from the 309 classifiers built on neuroradiological features (top) and the 128 classifiers built on clinical features (bottom). On the y-axis number of classifiers is displayed

To have a better depiction of features’ role in predicting disability progression, we further investigated the importance of each feature within each classifier built on both clinical and radiological features. T2LL was recognized as important feature in predicting disability progression in all the performed classifiers. EDSS and therapy at baseline were important in the 55% and 72% of the classifiers. Thalamic volume and FA-NAWM were important in 39% and 29% of the classifiers respectively. All the other radiological and clinical features were found to be more important than the random feature in a negligible number of classifiers (less than 10%, see Fig. 4).

Fig. 4
figure 4

Important features frequency. Figure showing, for each feature, the percentage of time (i.e. classifiers) more important than the random feature in predicting disability progression in our MS group. EDSSbase expanded disability status scale at baseline, TimeFU time between first visit and follow-up, CrbV cerebellar gray matter volume, GMV gray matter volume, FA-NAWM fractional anisotropy of normal appearing white matter

Discussion

We showed that MRI measures alone may predict disability progression with the highest accuracy, even in respect to the combination of neuroradiological and clinical features. We are prone to explain this last result as outsourcing from the risk of overfitting. Indeed, the risk of overfitting was higher in the classifiers including both clinical and neuroradiological features due to the numbers of features and subjects and allowed fewer classifiers to survive in respect to the original amount.

Even if we expect that combining both clinical and MRI data may predict clinical outcome at best [24], the role of neuroradiological features seems to be of primary relevance suggesting a more informative independent charge in respect to clinical features alone.

To strengthen the generalizability of the models obtained by the ML classifiers [25], we included in the study features characterizing subjects recruited in two sites, visited by four different clinicians and whose MRI were acquired with two different MR systems. As well, the observed subject samples covered wide ranges of age and disease severity. Therefore, our results may be generalizable and not site-specific, despite recruitment and acquisition protocols from the two sites may have produced differences in feature estimation. Further, we approached our research question with a supervised method, i.e. implementing classification algorithms, and it prevented us to perform less sophisticated methods like principal component analysis.

As Lew et al. [26] showed, ML algorithms may predict 2-year disability worsening in subjects with progressive disease. Our results extend the previous data, confirming the utility of a ML approach also for subjects in an earlier stage of the disease or with a lower level of disability. This is particularly relevant in this population of MS patients when clinicians have the opportunity to choose among several pharmacological treatments with different mechanism of actions and efficacy outcomes. Indeed, to identify disease features to drive neurologist in their choices is still a matter of debate across literature, given the absence of definitive shared prognostic factors of treatment choose and response [27].

Although we reached 79% of diagnostic accuracy, our results showed that our model was unable to provide the excellence [28]. Different reasons could explain this finding, ranging from a relatively small sample size to the not-linear relation between structural damage and disability, as it is quantified by the EDSS [29]. Nevertheless, it has to be considered that a sample size of 116 subjects may be enough to model disability progression from clinical and radiological features, as previously found [23, 30, 31]. Furthermore, our analysis included only parameters of structural brain damage, while the association between clinical findings and radiological extent of involvement is generally poor and known, given the clinico-radiological paradox in MS [32].

With regards to MRI data, our results indirectly confirm the importance of T2LL in predicting disease worsening, a variable known to influence both the disease course [24] and its progression [14, 33]. In particular, following the first episode of the disease, T2LL is helpful to identify those patients at risk of developing an aggressive form of MS [34]. This underlines how our result can help physicians in selecting those patients that can benefit from an early and more efficacy therapy. A recent study by Bakshi and colleagues found that baseline brain parenchymal fraction, but not T2LL, may be a good predictor of 5-year disability worsening [35]. A possible explanation to this discrepancy could be researched in the different clinical endpoint. In particular, authors considered longitudinal difference in the EDSS score as a disability marker, while we considered disability progression as defined in Rìo et al. [22]. Further, our results confirm the prognostic role of thalamus, whose measurement is reliable and comparable among centers, in MS [36]. Indeed, thalamic involvement in MS has been widely investigated using different approaches, with thalamic atrophy and function associated with poorer motor and cognitive performances [37, 38]. A recent and extensive work in 1417 subjects with MS identified key regions with early atrophy such thalamus consistently across MS phenotypes, exploring the sequence of GM atrophy and revealing the potentiality of therapeutic strategy adjustment based on the staging of patients [10, 39]. Finally, FA-NAWM values also proved to be a significant contributor to clinical progression in MS, in line with the established knowledge about the association between axonal loss and clinical deficit in MS [40], as well as that aberrant diffusivity measures in the NAWM may provide information in the relapsing remitting form of MS [41].

Lastly, in our analysis, both disability and therapy at baseline emerged as clinical features relevant to disability progression. In particular, therapy response is individual and depends on many factors, including age, phenotype and disability at treatment start [42], and is known to influence both disease course [43] and disability accumulation [44].

Limitations

As mentioned before, a limitation of our study relies in the relatively small sample size, which might have affected the performance of ML classifiers in the prediction of MS disability, limiting their diagnostic accuracy. The small sample size also prevented us from using an independent test set to report final results [25]. Further, we used EDSS as index of disability. Even if depending mainly on walking ability, EDSS is widely used as disability measure [45]. More in general, the major limitation of this study has to be researched in its retrospective nature. For this reason, future prospective studies, properly designed to evaluate with additional techniques (e.g. fMRI, MTR, qMRI) other aspects of brain involvement, as well as other CNS structures (e.g. spinal cord) are strongly warranted, to verify our model and possibly increase its diagnostic accuracy.

Conclusion

ML classifiers built on clinical and neuroradiological features may predict disability progression in subjects with MS at individual level with accuracy almost reaching a value of 80%. Among the MRI variables, T2LL and thalamic volume were the most important features in the prediction of disability progression, while disability and therapy at baseline were the most relevant features among the clinical variables. The implementation of classifiers based on neuroradiological and clinical features may aid clinicians to predict if a single subject may be prone to disability progression and consequently to tailor the treatment.