Introduction

Multiple sclerosis (MS) is an immune-mediated disease of the central nervous system with several subtypes. The most common subtype is relapsing-remitting multiple sclerosis (RRMS) [1]. Patients with RRMS present with acute or subacute symptoms (relapses) followed by periods of complete or incomplete recovery (remissions) [2]. Effective treatment of patients with RRMS can prevent disease progression and associated severe consequences, like spasticity, fatigue, cognitive dysfunction, depression, bladder dysfunction, bowel dysfunction, sexual dysfunction, pain, and death [3].

Relapses have been commonly used as a primary efficacy endpoint in phase III randomized clinical trials leading to market approval of RRMS therapies, although the strength of the association between relapses and disease progression (an outcome of highest interest to patients) is still debated [4, 5, 6]. Prognosis for relapses in individuals with RRMS could support individualized decisions and disease management. A prognostic model for relapses may also be helpful for the efficient selection of patients in future randomized clinical trials and, therefore, for the reduction of type II errors in these trials [7]. In addition, such a model could support individualized decisions on initiation or switch of disease-modifying treatment (DMT). To our knowledge, no widely accepted prognostic model for MS has been used in clinical practice yet.

A recent systematic review of prediction models in RRMS [8] identified only three prognostic models (i.e. models that focus on predicting the outcome rather than the treatment response) with relapses as the outcome of interest [7, 9, 10]. However, all three studies had methodological shortcomings. Only one small study, with 127 patients, used a cohort of patients, which is considered the best source of prognostic information [8, 10]. All three studies used complete-case analysis, excluding cases with missing data, without justifying the assumptions underlying this approach; given the potential non-random distribution of missing data, their results might be biased [11]. In addition, none of them internally validated their model or presented calibration or discrimination measures; hence, they might be at risk of misspecification [12]. None of them used shrinkage to avoid overfitting [13]. Finally, none of the studies evaluated the clinical benefit of the model, an essential step that quantifies whether and to what extent a prognostic model is potentially useful for decision-making in clinical practice. Similar limitations exist in other published prognostic models, which commonly have serious deficiencies in the statistical methods, are based on small datasets, handle missing data inappropriately, and lack validation [14].

In this work, we aim to fill this gap in prognostic models of relapses for RRMS patients. We present the development, internal validation, and evaluation of the clinical benefit of a prognostic model for relapses in individuals with RRMS, using real-world data from the Swiss Multiple Sclerosis Cohort (SMSC) [15]. The cohort comprises patients diagnosed with RRMS who are followed annually or bi-annually in major MS centres with full standardized neurological examinations, MRIs, and laboratory investigations [15]. Our prognostic model is designed for a patient who, within the Swiss health care system and standard MS treatment protocols, would like to estimate their probability of having at least one relapse within the next 2 years.

Data and methods

In “Data description”, we describe the data available for the model development. We followed seven steps (described in detail in “Steps in building the prognostic model”) to build and evaluate the prognostic model: (1) selection of prognostic factors via a review of the literature, (2) development of a generalized linear mixed-effects model in a Bayesian framework, (3) examination of sample size efficiency, (4) shrinkage of the coefficients, (5) handling of missing data using multiple imputation, (6) internal validation of the model, and (7) evaluation of the potential clinical benefit of the developed model. For the development and the validation of our prognostic model we followed the TRIPOD statement [16]; the TRIPOD checklist is presented in Appendix Table 3.

Data description

We analysed observational data on patients diagnosed with relapsing-remitting multiple sclerosis (RRMS) provided by the Swiss Multiple Sclerosis Cohort (SMSC) study [15], which has been recruiting patients since June 2012. The SMSC is a prospective multicentre cohort study performed across seven Swiss centres. Every patient included in the cohort is followed up every 6 or 12 months, and the occurrence of relapses, disability progression, DMT initiation or interruption, adverse events, and concomitant medications are recorded at each visit. Brain MRI and serum samples are also collected at each visit. The strength of the SMSC is the high quality of the data collected, including MRI scans and body-fluid samples, in a large group of patients. In addition, several internal controls and validation procedures are performed to ensure data quality.

We included patients with at least 2 years of follow-up. The drop-out rate in the entire SMSC cohort was 15.8%. Drop-out was primarily associated with change of address and health care provided by a physician not associated with SMSC. Therefore, we assume that patients dropping out of the cohort before completing 2 years were not more likely to have relapsed than those remaining in the cohort, and hence the risk of attrition bias is low. The dataset includes 935 patients, and each patient has one, two, or three 2-year follow-up cycles. At the end of each 2-year cycle, we measured relapse occurrence as a dichotomous outcome. At the beginning of each cycle, several patient characteristics are measured and we considered them as baseline characteristics for this specific cycle. In total, we included 1752 cycles from the 935 study participants. Patients could be prescribed several potential DMTs during their follow-up period, i.e. a patient during a 2-year follow-up cycle could either take no DMT or one of the available DMTs. We used the treatment status only at baseline of each 2-year cycle to define the dichotomous prognostic factor “currently on treatment” or not.

We transformed some of the continuous variables to better approximate normal distributions and merged categories with very low frequencies in categorical variables. Table 1 presents summary statistics of some important baseline characteristics using all cycles (n = 1752), while in Appendix Table 4, we present the outcome of interest (frequency of relapse within 2 years), as well as several baseline characteristics separately for patients that were included in 1 cycle, patients that were included in 2 cycles, and patients that were included in 3 cycles.

Table 1 Summary statistics of some important baseline characteristics using all 1752 2-year cycles coming from 935 unique patients in SMSC

Notation

Let \( Y_{ij} \) denote the dichotomous outcome for individual \( i \), \( i = 1, 2, \dots, n \), at the \( j \)th 2-year follow-up cycle out of \( c_i \) cycles, and let \( PF_{ijk} \) denote the \( k \)th prognostic factor, \( k = 1, \dots, n_p \). An individual develops the outcome (\( Y_{ij} = 1 \)) or not (\( Y_{ij} = 0 \)) according to the probability \( p_{ij} \).

Steps in building the prognostic model

Step 1—Selection of prognostic factors

Developing a model using a set of predictors informed by prior knowledge (either in the form of expert opinion or variables previously identified in other prognostic studies) has conceptual and computational advantages [17, 18, 19]. Hence, in addition to the information obtained from the three prognostic models included in the recent systematic review discussed in the introduction [7, 9, 10], we aimed to increase the relevant information by searching for prediction models or research works aiming to identify subgroups of patients in RRMS. We searched PubMed (https://pubmed.ncbi.nlm.nih.gov) using the string ((((predict*[Title/Abstract] OR prognos*[Title/Abstract])) AND Relapsing Remitting Multiple Sclerosis[Title/Abstract]) AND relaps*[Title/Abstract]) AND model[Title/Abstract]. We then decided to build a model with all prognostic factors included in at least two of the previously published models.

Step 2—Logistic mixed-effects model

We developed a logistic mixed-effects model in a Bayesian framework:

Model 1

$$ {Y}_{ij}\sim \mathrm{Bernoulli}\left({p}_{ij}\right) $$
$$ \mathrm{logit}\left({p}_{ij}\right)={\beta}_0+{u}_{0i}+\sum \limits_{k=1}^{n_p}\left({\beta}_k+{u}_{ki}\right)\times {PF}_{ijk} $$

We used a fixed-effect intercept (β0), fixed-effect slopes (βk), an individual-level random-effects intercept (u0i), and individual-level random-effects slopes (uki) to account for information about the same patient from different cycles.

We define \( \boldsymbol{u}=\left(\begin{array}{c}{\boldsymbol{u}}_{0}\\ {}{\boldsymbol{u}}_{k}\end{array}\right) \) to be the (np + 1) × n matrix of all random parameters, and we assume it is normally distributed, u ~ N(0, Du), with mean zero and the (np + 1) × (np + 1) variance-covariance matrix

$$ {D}_u=\left[\begin{array}{cccc}{\sigma}^2 & \rho {\sigma}^2 & \dots & \rho {\sigma}^2\\ {}\rho {\sigma}^2 & {\sigma}^2 & \dots & \rho {\sigma}^2\\ {}\vdots & \vdots & \ddots & \vdots \\ {}\rho {\sigma}^2 & \rho {\sigma}^2 & \dots & {\sigma}^2\end{array}\right] $$

This structure assumes that the variances of the impact of the variables on multiple observations for the same individual are equal (σ2) and that the covariances between the effects of the variables are equal too (ρ × σ2).
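This compound-symmetry structure is easy to construct explicitly. The sketch below is a Python illustration only (the analysis itself was done in R), and the values of \( n_p \), σ2, and ρ are hypothetical:

```python
import numpy as np

def compound_symmetry(n_p, sigma2, rho):
    """Variance-covariance matrix D_u of Model 1: a (n_p+1) x (n_p+1)
    matrix with a common variance sigma^2 on the diagonal and a common
    covariance rho * sigma^2 everywhere off the diagonal."""
    D = np.full((n_p + 1, n_p + 1), rho * sigma2)
    np.fill_diagonal(D, sigma2)
    return D

# hypothetical values: 3 prognostic factors, sigma^2 = 0.01, rho = 0.49
D_u = compound_symmetry(3, 0.01, 0.49)
```

A single variance and a single correlation keep the number of variance parameters fixed regardless of how many prognostic factors enter the model.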

Step 3—Examination of sample size efficiency

We examined whether the available sample size was sufficient for the development of a prognostic model [16]. We calculated the events per variable (EPV), accounting for both fixed and random effects and for categorical variables [20]. We also used the method by Riley et al. to calculate the efficient sample size for the development of a logistic regression model, using the R package pmsampsize [21]. We set Nagelkerke’s R2 = 0.15 (Cox-Snell’s adjusted R2 = 0.09) and the desired shrinkage equal to 0.9, as recommended [21].
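The EPV computation itself is a one-liner; using the figures reported in the Results (302 events, 22 model degrees of freedom), a Python sketch (not the original R code) gives the quoted value:

```python
# events per variable (EPV): outcome events divided by the number of
# model parameters; both figures are taken from the Results section
events = 302   # relapses observed across all 2-year cycles
df = 22        # degrees of freedom of the full model
epv = events / df  # approximately 13.7
```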

Step 4—Shrinkage of the coefficients

The estimated effects of the covariates need some form of penalization to avoid extreme predictions [13, 22]. In a Bayesian setting, recommended shrinkage methods use a prior on the regression coefficients [23]. For logistic regression, a Laplace prior distribution for the regression coefficients is recommended [24] (i.e. double exponential, also called Bayesian LASSO)

$$ \pi \left({\beta}_1,{\beta}_2,\dots, {\beta}_{np}\right)={\prod}_{k=1}^{np}\frac{\lambda }{2}{e}^{-\lambda \mid {\beta}_k\mid }, $$

where λ is the shrinkage parameter. A Laplace prior allows small coefficients to shrink towards 0 faster, while it applies smaller shrinkage to large coefficients [25].
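The Laplace log-prior can be evaluated directly from the density above. As a Python illustration (λ and the coefficient values below are arbitrary):

```python
import math

def laplace_log_prior(betas, lam):
    """Log-density of independent Laplace (double-exponential) priors:
    sum over k of log(lam/2) - lam * |beta_k| (the Bayesian LASSO prior)."""
    return sum(math.log(lam / 2.0) - lam * abs(b) for b in betas)

# the log-prior decreases linearly in |beta|: mass is concentrated at 0,
# pulling small coefficients towards 0 while penalizing large ones less
# severely than, e.g., a Gaussian (ridge) prior would
lp_small = laplace_log_prior([0.1], 2.0)
lp_large = laplace_log_prior([1.0], 2.0)
```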

Step 5—Multiple imputations for missing data

In the case of missing values in the covariates, we assumed that these are missing at random (MAR), meaning that, given the observed data, the occurrence of missing values is independent of the actual missing values. Appropriate multiple imputation models provide valid and efficient estimates if data are MAR. As our substantive model is hierarchical, we used multilevel joint-modelling multiple imputation via the mitml R package [26].

First, we checked for variables not included in the substantive model that could predict the missing values (i.e. auxiliary variables). Then, we built the imputation model, using both fixed-effect and individual-level random effects intercept and slopes as in our substantive (Model 1), where the dependent variables are the variables that include missing values for imputation, and the independent variables are all complete variables included in the substantive model and the identified auxiliary variables.

We generated 10 imputed datasets using the jomoImpute R function and applied the Bayesian model (Model 1) to each of them. We checked convergence of the imputations using the plot function in the mitml R package. Finally, we obtained the pooled estimates of the regression coefficients, \( \hat{\beta_0} \) and \( \hat{\beta_k} \), using Rubin’s rules [27] (testEstimates R function), passing as arguments two matrices containing, respectively, the point estimates and their variances from each imputed dataset.
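Rubin's rules combine the per-imputation estimates and variances as sketched below. The numbers are hypothetical, and the actual pooling used the testEstimates R function rather than this Python illustration:

```python
def rubin_pool(estimates, variances):
    """Rubin's rules for m imputed datasets: the pooled estimate is the
    mean of the per-imputation estimates; the total variance is the mean
    within-imputation variance plus (1 + 1/m) times the between-imputation
    variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                  # pooled estimate
    u_bar = sum(variances) / m                                  # within-imputation
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)      # between-imputation
    return q_bar, u_bar + (1 + 1 / m) * b

# hypothetical coefficient pooled over m = 3 imputations
beta_hat, total_var = rubin_pool([0.50, 0.54, 0.46], [0.01, 0.01, 0.01])
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations.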

Step 6—Internal validation

First, we assessed the calibration of the developed model via a calibration plot with a loess smoother, showing the agreement between the estimated probabilities of the outcome and the observed outcome proportions (val.prob.ci.2 R function). We used bootstrap internal validation to correct for optimism in the calibration slope and in discrimination, measured via the AUC [13]. For each of the 10 imputed datasets, we created 500 bootstrap samples, and in each of them we: (1) constructed a generalized linear model with the pre-specified predictors, using the glm R function, denoted Model*; (2) calculated the bootstrap performance as the apparent performance of Model* on that bootstrap sample; (3) applied Model* to the corresponding imputed dataset to determine the test performance; and (4) calculated the optimism as the difference between bootstrap performance and test performance. Then, we averaged the optimism across the 500 bootstrap samples and used Rubin’s rules to summarize the optimism for the AUC and the calibration slope across the 10 imputed datasets. We calculated the optimism-corrected AUC and calibration slope of our prognostic model by subtracting the optimism estimate from the apparent performance.
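Once the performance measure is defined, the optimism correction is simple bookkeeping. The Python sketch below uses a pairwise c-statistic in place of the val.prob.ci.2 and glm R functions used in the paper, and the performance numbers in the usage line are invented:

```python
def c_statistic(y, p):
    """AUC as a concordance probability: the fraction of (event, non-event)
    pairs in which the event received the higher predicted probability,
    counting ties as 1/2."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def optimism_corrected(apparent, bootstrap_perf, test_perf):
    """Subtract the mean optimism (bootstrap minus test performance,
    averaged over bootstrap samples) from the apparent performance."""
    m = len(bootstrap_perf)
    optimism = sum(b - t for b, t in zip(bootstrap_perf, test_perf)) / m
    return apparent - optimism
```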

Ideally, we would have fitted the Bayesian logistic mixed-effects model exactly as in the original development. However, this would require about 15,000 h of computation: the Bayesian model would need to be fitted for 500 bootstrap samples in each of the 10 imputed datasets (i.e. 5000 times), and each fit takes about 3 h. Hence, the bootstrap internal validation we performed yields a rough optimism estimate that ignores the dependence between observations from the same individual.

We used self-programmed R routines to validate the model via bootstrapping.

Clinical benefit of the developed model

Decision curve analysis is a widely used method to evaluate the clinical consequences of a prognostic model. It aims to overcome some weaknesses of the traditional measures (i.e. discrimination and calibration), which are not informative about the clinical value of the prognostic model [28]. Briefly, decision curve analysis calculates a clinical “net benefit” for a prognostic model and compares it with the default strategies of treating all or none of the patients. Net benefit (NB) is calculated across a range of threshold probabilities, where the threshold is defined as the minimum probability of the outcome at which a decision will be made.

In more detail, information about the risk of relapsing within the next 2 years might help patients re-consider whether their current treatment should continue to follow the established standards of care in Switzerland. If the probability of relapsing is considered too high, RRMS patients might be interested in a more radical stance towards the management of their condition: discussing more active disease-modifying drugs (which might also carry a high risk of serious adverse events) with their treating doctors, exploring the possibility of stem cell transplantation, etc. Let us call this the “more active approach”. If the probability of relapsing is higher than a threshold a%, then a patient will take the “more active approach” to the management of their condition; otherwise, they will continue “as per standard care”.

We examined the net benefit of our final model, via the estimated probabilities it provides, by using decision curve analysis and plotting the NB of the developed prognostic model (dca R function) across a range of threshold probabilities a%, where

NBdecision based on the model = \( \left( True\ positive\%\right)-\left( False\ positive\%\right)\times \frac{a\%}{1-a\%} \) .

We compared the results with those from the two default strategies: “as per standard care for all” and “more active approach for all”. The NB of “as per standard care for all” equals zero over the whole range of threshold probabilities, as no one is treated and there are therefore no true or false positives. “More active approach for all” does not imply that the threshold probability a% is set to 0; its NB is calculated over the whole range of threshold probabilities using the formula:

\( {NB}_{more\ active\ approach\ for\ all}=(prevalence)-\left(1- prevalence\right)\times \frac{a\%}{1-a\%} \)

Under these two strategies, the more active treatment options will be discussed and considered by all patients (“more active approach for all”) or by none (“as per standard care for all”). A decision based on a prognostic model is only clinically useful at threshold a% if it has a higher NB than both “more active approach for all” and “as per standard care for all”. If a prognostic model has a lower NB than either default strategy, the model is considered clinically harmful, as one of the default strategies leads to better decisions [28, 29, 30, 31, 32].
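The two NB formulas can be written out directly. As an illustrative Python sketch (the paper used the dca R function, and the counts and prevalence below are hypothetical):

```python
def nb_model(true_pos, false_pos, n, a):
    """NB of decisions based on the model at threshold a (a proportion in
    (0, 1)): the true-positive fraction minus the false-positive fraction
    weighted by the odds a / (1 - a)."""
    return true_pos / n - (false_pos / n) * a / (1 - a)

def nb_more_active_for_all(prevalence, a):
    """NB of the 'more active approach for all' default strategy."""
    return prevalence - (1 - prevalence) * a / (1 - a)

# at a threshold equal to the prevalence, treating everyone has zero NB;
# above it, the model must beat a negative 'treat all' curve and the
# zero line of 'as per standard care for all'
nb_all = nb_more_active_for_all(0.20, 0.20)
nb_mod = nb_model(30, 50, 500, 0.20)
```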

We made the analysis code available in a GitHub repository: https://github.com/htx-r/Reproduce-results-from-papers/tree/master/PrognosticModelRRMS

Results

For the model development, we used 1752 observations coming from 2-year repeated cycles of 935 patients who experienced 302 relapses.

First, we took into account the three prognostic models included in the recent systematic review [7, 9, 10] that predict relapse (not treatment response) in patients with RRMS. Our search in PubMed identified 87 research articles. After reading the abstracts, we ended up with seven models that predicted either relapses or treatment response to relapses. Three of them were already included in the recent systematic review, as they predicted relapses. Hence, we identified three additional models that predict treatment response to relapses [33, 34, 35] and one research work aiming to identify subgroups of RRMS patients who are more responsive to treatments [36].

Figure 1 shows which prognostic factors were selected and which pre-existing prognostic models were included [7, 9, 33, 34, 35, 36]. We included none of the prognostic factors from Liguori et al.’s model [10], as none of the factors they used (i.e. MRI predictors) appeared in any other of the available models. We briefly summarize these models in Section 1 of the Appendix file, and some important characteristics of these models are shown in Appendix Table 5.

Fig. 1
figure 1

Venn diagram of the prognostic factors included at least two times in pre-existing models and included in our prognostic model. The names with an asterisk refer to the first author of each prognostic model or prognostic factor research [7, 9, 10, 33, 34, 35, 36]. EDSS, Expanded Disability Status Scale; Gd, gadolinium

The prognostic factors included in our model are presented in Table 2 with their pooled estimates \( \hat{\beta_k} \), ORs, and the corresponding 95% credible intervals (CrIs). We have also developed a web application in which the personalized probability of relapsing within 2 years is calculated automatically; it is available as an R Shiny app at https://cinema.ispm.unibe.ch/shinies/rrms/. The variance σ2 was estimated at 0.0001 and the covariances ρ × σ2 at 0.00005; hence, the random intercept and all random slopes were estimated close to 0. For convenience and speed of estimation, predictions were made using only the fixed-effects estimates. In the Supplementary file, Appendix Table 6, we present the estimated coefficients in each of the ten imputed datasets.
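Since predictions use only the fixed effects, computing an individual probability reduces to an inverse logit of the linear predictor. A Python sketch with hypothetical coefficients (the real values sit in Table 2 and behind the Shiny app):

```python
import math

def predict_risk(beta0, betas, x):
    """Probability of at least one relapse within 2 years from the pooled
    fixed-effect estimates only (the random effects were estimated ~0):
    inverse logit of beta0 + sum_k beta_k * x_k."""
    lp = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-lp))

# hypothetical intercept and two centred covariate values
risk = predict_risk(-1.5, [0.3, -0.2], [1.0, 0.5])
```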

Table 2 Pooled estimates of the regression coefficients \( \hat{\beta_k} \), ORs and the 95% CrIs for each of the parameters in the model (centred at the mean), using Rubin’s rules. The estimated σ (standard deviation of the impact of the variables on multiple observations for the same individual) is 0.01. The estimated correlation ρ between the effects of the variables is 0.49. The pooled optimism-corrected AUC is 0.65 and the pooled optimism-corrected calibration slope is 0.91. Disease duration was transformed to log(disease duration+10), and months since last relapse was transformed to log(months since last relapse+10)

The full model had 22 degrees of freedom (for 10 predictors with random intercept and slopes) and the events per variable (EPV) was 13.7. The efficient sample size was calculated as 2084 (to avoid optimism in the regression coefficients), 687 (for agreement between apparent and adjusted model performance), and 220 (for a precise estimate of the overall risk in the population) [21]. Our available sample size suggests that there might be optimism in our regression coefficients; however, this should have been addressed by the shrinkage we performed.

In Fig. 2, we show the distributions of the calculated probability of relapsing for individuals by relapse status. The overlap between the distributions is large, as also reflected in the optimism-corrected AUC (Table 2). The overall mean probability of relapsing is 19.1%; for patients who relapsed, the corresponding mean is 23.4%, whereas for patients who did not relapse it is 18.0%. Figure 3 shows the calibration plot, with some apparent performance measures and their 95% confidence intervals (CIs), of the developed prognostic model, representing the agreement between the estimated probabilities and the observed proportion of relapse within 2 years.

Fig. 2
figure 2

The distribution of probability of relapsing within the next 2 years by relapse status at the end of 2-year follow-up cycles. The dashed lines indicate the mean of estimated probability/risk for cycles that ended up with relapse (purple) and for those without relapse (yellow)

Fig. 3
figure 3

Calibration plot (N = 1752) of the developed prognostic model with loess smoother. The distribution of the estimated probabilities is shown at the bottom of the graph, by relapse status within 2 years (i.e. events and non-events). The horizontal axis represents the expected probability of relapsing within the next 2 years and the vertical axis represents the observed proportion of relapse. The apparent performance measures (c-statistic and c-slope) with their corresponding 95% CIs are also shown in the graph

In Fig. 4, the exploration of the net benefit of our prognostic model is presented [29, 30, 31, 32]. The vertical axis corresponds to the NB and the horizontal axis to the preferences, presented as threshold probabilities. The NB weighs the benefit of identifying, and consequently correctly treating, individuals who relapsed against the harm (e.g. side effects) of wrongly assigning patients to the “more active approach” due to false-positive results. Threshold probabilities reflect how decision makers value the risk of relapsing for a given patient, a valuation often shaped by a discussion between the decision maker and the patient. The dashed line, corresponding to decisions based on the developed prognostic model, has the highest NB compared with the default strategies for threshold probabilities between 15 and 30%. Nearly half of the patients (46.5%) in our dataset have calculated probabilities in this range in at least one follow-up cycle. Hence, for patients who consider relapse occurrence 3.3 to 6.6 times worse (\( \frac{1}{a\%} \)) than the risks, costs, and inconvenience of the “more active approach”, the prognostic model can lead to better decisions than the default strategies.

Fig. 4
figure 4

Decision curve analysis showing the net benefit of the prognostic model per cycle. The horizontal axis is the threshold estimated probability of relapsing within 2 years, a%, and the vertical axis is the net benefit. The plot compares the clinical benefit of three approaches: “as per standard care for all” approach, “more active care for all” approach, and “decision based on the prognostic model” approach (see definitions in “Clinical benefit of the developed model”). For a given threshold probability, the approach with the highest net benefit is considered the most clinically useful model. The decision based on the prognostic model approach provides the highest net benefit for threshold probabilities ranging from 15 to 30%

Discussion

We developed a prognostic model that predicts relapse within 2 years for individuals diagnosed with RRMS, using observational data from the SMSC [15], a prospective multicentre cohort study, to inform clinical decisions. Prognostication is essential for the disease management of RRMS patients, yet no widely accepted prognostic model for MS is used in clinical practice so far. A recent systematic review of prognostic models for RRMS [8] reports that most of the models, regardless of the outcome of interest, lack statistical quality in the development steps (introducing potential bias), did not perform internal validation, did not report important performance measures such as calibration and discrimination, and did not assess the clinical impact of the models. More specifically, only three studies examined relapses as the outcome of interest, and none of them satisfied the criteria above. Our model aims to fill this gap by satisfying all of the above criteria, to enhance the available information for predicting relapses and to inform decision-making.

Given that only a manageable number of characteristics is needed to establish the risk score, doctors and patients can enter them in our online tool (https://cinema.ispm.unibe.ch/shinies/rrms/), estimate the probability of relapsing within the next 2 years, and take treatment decisions based on the patient’s risk score. The tool shows the potential of the proposed approach; however, it may not yet be ready for use in clinical practice, as decision-making tools need external validation in an independent cohort of patients.

We included eight prognostic factors (all measured at baseline, where the risk was also estimated): age, disease duration, EDSS, number of gadolinium-enhanced lesions, number of previous relapses in the prior 2 years, months since last relapse, treatment naïve, gender, and “currently on treatment”. The EPV of our model is 13.7, and the sample size is large enough and exceeds that of all three pre-existing prognostic models. The optimism-corrected AUC of our model is 0.65, indicating relatively modest discriminative ability. However, in the literature, only Stühler et al. reported the AUC of their model, which was also 0.65. In our previous work [37], the optimism-corrected AUC using the LASSO model, with many candidate predictors, was 0.60, whereas that of the pre-specified model was 0.62. This could indicate that, in general, relapses are associated with unknown factors. The prognostic model we developed seems potentially useful, preferred over the “treat all” or “treat none” approaches for thresholds between 15 and 30%.

The applicability of our model is limited by several factors. First, the risk of relapsing is not the only outcome that patients will consider when making decisions; long-term disability status would also determine their choice [4], and there is an ongoing debate about whether the relapse rate is associated with long-term disability [5, 6, 7, 38]. This could be a further line of future research, and prognostic models of good statistical quality for long-term disability still need to be developed. In addition, the sample size of the SMSC is relatively small compared with other observational studies, although the study is of high quality. Furthermore, the bootstrap internal validation we performed ignores the dependence between observations from the same individual: in each of the 10 imputed datasets and the 500 bootstrap samples, we fitted a frequentist logistic model, whereas ideally we would have fitted the Bayesian logistic mixed-effects model exactly as in the original development. In addition, for reasons of model parsimony, our model assumes that the variances of the impact of the variables on multiple observations for the same individual are equal and that the covariances between the effects of the variables are equal too; this assumption might be relaxed by, e.g., assuming covariate-specific correlations. Finally, our model was not validated externally, something essential for decision-making tools. In the near future, independent researchers, as recommended by Collins et al. [39], should validate our model externally before it is ready for clinical use.

Conclusions

The prognostic model we developed offers several advantages over previously published prognostic models in RRMS. We performed multiple imputation for the missing data to avoid the potential bias they induce [11], we used shrinkage of the coefficients to avoid overfitting [13], and we internally validated our model, presenting calibration and discrimination measures, an essential step in prognosis research [13]. Importantly, we assessed the net benefit of our prognostic model, which helps to quantify its potential clinical impact. Our web application, once externally validated, could be used by patients and doctors to calculate the individualized risk of relapsing within the next 2 years and to inform decision-making.