Key Summary Points

Why carry out this study?

Multiple prediction models for non-response to methotrexate (MTX) have been proposed previously, all with an area under the curve (AUC) between 0.65 and 0.80, but not all of them have been validated.

The purpose of the current study was to externally validate a previously developed prediction model for insufficient response to MTX and to enhance the model’s applicability in clinical practice.

What has been learned from the study?

The prediction model was externally validated with an AUC of 0.75 (95% CI 0.64–0.85), enhanced for clinical applicability (AUC = 0.75, 95% CI 0.69–0.81) and successfully integrated in an online tool “Evidencio”, which can assist clinicians and patients in shared decision-making.

Patients with high risk scores for insufficient response to MTX according to our model integrated in Evidencio can immediately intensify MTX treatment with biologic disease-modifying antirheumatic drugs/conventional synthetic disease-modifying antirheumatic drugs (bDMARDs/csDMARDs) as proposed in the American College of Rheumatology/European League Against Rheumatism (ACR/EULAR) 2019 updated guidelines for RA treatment [1], while sufficient responders are spared expensive step-up treatment.

Disease activity in these patients can be tightly controlled during the window of opportunity, resulting in better long-term responses and avoiding unnecessary adverse events of MTX.

Introduction

Methotrexate (MTX) is the first-line therapy in rheumatoid arthritis (RA) [1]. Although efficacious in a large proportion of patients, MTX is poorly effective in approximately 30% of early RA patients [2, 3]. Patients on MTX who do not show improvement at 3 months (insufficient responders) or do not reach the treatment target of low disease activity/remission at 6 months (non-responders) are switched to biologic disease-modifying antirheumatic drug (bDMARD) therapies or novel targeted synthetic DMARD (tsDMARD) therapies, e.g., Janus kinase (JAK) inhibitors [1, 4, 5], with or without concomitant MTX treatment. To ensure that only patients unresponsive to MTX receive early (additional) treatment with b/tsDMARDs and those responsive to MTX are spared costly biologics or synthetic drugs, we and others have constructed models to predict MTX (non)-response [6,7,8,9]. Our prognostic multivariable prediction model for insufficient response, defined as a disease activity score 28 (DAS28) > 3.2 at 3 months of MTX therapy, was constructed in the treatment in the Rotterdam Early Arthritis Cohort (tREACH) and included clinical predictors (DAS28 and Health Assessment Questionnaire [HAQ]), life-style predictors (smoking and body mass index [BMI]) and laboratory parameters involved in MTX metabolism (erythrocyte folate and single-nucleotide polymorphisms [SNPs]) [9]. This model classified 80% of patients correctly (area under the curve [AUC] of the receiver operating characteristic [ROC]: 0.80 [95% CI 0.73–0.86]) and was externally validated in the MTX-Rotterdam cohort, showing a similar prognostic performance (AUC 0.80 [95% CI 0.69–0.91]) even though BMI and smoking predictors were absent from this validation cohort [9].
The aim of the current study was to validate the complete prediction model, including BMI and smoking status predictors, in an external early RA cohort (U-Act-Early) from a different geographic region and to enhance the model’s applicability in clinical practice [10].

Methods

The methodology of this study followed transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines [11].

Patients

The external validation cohort consisted of 91 patients from the U-Act-Early cohort, a multicenter, double-blind, placebo-controlled strategy trial, registered at ClinicalTrials.gov (number: NCT01034137) [10]. DMARD- and glucocorticoid (GC)-naïve early RA patients were eligible for inclusion once classified as RA patients according to the 1987 American College of Rheumatology (ACR) [12] (n = 7, 8%) or the 2010 ACR/EULAR classification criteria (n = 84, 92%) [13], and had a disease duration < 12 months and active disease at baseline (disease activity score 28; DAS28 ≥ 2.6). Patients were randomly assigned to a treatment strategy with tocilizumab (TCZ) + placebo, MTX + placebo or their combination (TCZ + MTX) and treated to the target of sustained remission (i.e., a DAS28 < 2.6 and swollen joint count of ≤ 4 joints of the 28 joints assessed, during ≥ 24 weeks). All 91 patients included were derived from the initial MTX + placebo strategy arm. The starting MTX dose was 10 mg/week orally and was increased stepwise by 5 mg every 4 weeks up to 30 mg/week until remission or the maximum tolerable dose. During the trial, GC use was not permitted. The tREACH (n = 285) cohort was described earlier [9]. Importantly, in the tREACH, the optimal MTX dose of 25 mg/week was reached within 3 weeks (combined with other conventional synthetic (cs) DMARDs and/or GCs) and therapy was targeted to low disease activity (DAS28 ≤ 3.2) at 3 months. If this failed, step-up treatment with additional csDMARDs (sulfasalazine and/or hydroxychloroquine) or bDMARDs (i.e., TNF-alpha inhibitor) was initiated. In both cohorts, folic acid (10 mg/week) was prescribed during MTX treatment. This study was approved by the medical ethics committee of the University Medical Center Utrecht (ML22497) and the medical ethics committee of Erasmus Medical Center (MEC-2006-252).
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Written informed consent was obtained for all patients.

Outcome and Clinical Predictors

The primary outcome was insufficient MTX response 3 months after treatment start, defined as DAS28 > 3.2, where DAS28 was based on the erythrocyte sedimentation rate (DAS28-ESR). All predictors were dichotomized prior to analyses. Cut-off values were defined previously in the tREACH cohort as: DAS28 > 5.1, HAQ > 0.6, erythrocyte folate < 750 nmol/l, current smoking, BMI > 25 kg/m2, ABCB1 rs1045642 (GG/GA vs. AA) genotype, and ABCC3 rs4793665 (TC/CC vs. TT) genotype [9]. As erythrocyte-folate levels were slightly higher in U-Act-Early, new cut-off points were examined for erythrocyte folate (deciles) and BMI (> 30 kg/m2) in U-Act-Early and tested for improvement of the model. The secondary outcome measure was non-response to MTX 6 months after treatment start, defined as DAS28-ESR > 3.2.
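The dichotomization step can be illustrated with a minimal Python sketch. The `Patient` class, the field names and the 0/1 coding direction are assumptions made for illustration only; the cut-off values are those listed above, and the original analysis was performed in R.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical container for baseline values (names are illustrative)."""
    das28: float            # baseline DAS28-ESR
    haq: float              # baseline HAQ score
    folate_nmol_l: float    # erythrocyte folate in nmol/l
    current_smoker: bool
    bmi: float              # kg/m2
    abcb1_rs1045642: str    # genotype, e.g. "GA"
    abcc3_rs4793665: str    # genotype, e.g. "TT"


def dichotomize(p: Patient) -> dict:
    """Map raw baseline values onto 0/1 predictors using the stated cut-offs.

    Assumption: 1 marks the category named first in the text
    (GG/GA for ABCB1, TC/CC for ABCC3); the original coding may differ.
    """
    return {
        "das28_gt_5_1": int(p.das28 > 5.1),
        "haq_gt_0_6": int(p.haq > 0.6),
        "folate_lt_750": int(p.folate_nmol_l < 750),
        "current_smoking": int(p.current_smoker),
        "bmi_gt_25": int(p.bmi > 25),
        "abcb1_gg_ga": int(p.abcb1_rs1045642.upper() in {"GG", "GA", "AG"}),
        "abcc3_tc_cc": int(p.abcc3_rs4793665.upper() in {"TC", "CT", "CC"}),
    }
```

Each predictor thus enters the regression as a single binary indicator, which is what allows the odds ratios in the tables to be read as "above versus below the cut-off".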

Erythrocyte Folate and Genetic Variants

In U-Act-Early, erythrocyte folate and genetic variants included in the original prediction model [9] were determined from EDTA whole blood samples stored at − 80 °C, as described elsewhere [14, 15]. DNA was obtained from whole blood using a MagNAPure Compact (Roche Life Science, Almere, The Netherlands) and genotypes were determined for ABCB1 rs1045642 and ABCC3 rs4793665 using real-time PCR with Taqman, as described previously [15]. Samples were tested for deviation from the Hardy–Weinberg equilibrium (HWE). SNPs were determined in the same lab and according to the same protocols as the tREACH study [9]. Importantly, there were methodological differences in the measurement of folate between the tREACH and U-Act-Early cohorts. The Elecsys® Folate III assay (Ref 7027290190; Roche Diagnostics) has been re-standardized since 2017 in accordance with the WHO International Standard NIBSC Code 03/178. This resulted in 10% lower erythrocyte folate levels (U-Act-Early) compared to those quantified using the previous assay (tREACH). Furthermore, serum folate levels, required for folate correction in whole blood, were not available in U-Act-Early. As serum folate levels take up only a small part of the total folate concentrations, whole blood folate levels were corrected for the average serum folate concentration in the tREACH (25 nmol/l).

Statistical Analysis

Clinical and laboratory parameters for the tREACH (derivation) and U-Act-Early (validation) cohorts were compared. The difference in DAS28 at 3 months (compared to baseline) was expressed as a mean with standard deviation (± SD) and assessed using a paired-sample t test. Differences between cohort variables were tested using an independent two-group t test, if the assumptions of normal distribution (visual inspection) and equal variances (tested using Levene’s test) were met. If these assumptions were not met, the non-parametric Mann–Whitney U test was performed. Differences in proportions were tested using a two-sample proportion test. Due to missing informed consent at the start of this validation study, eight subjects from the tREACH (derivation) cohort were excluded from analyses in this study. The model as described previously was therefore first re-analyzed on the tREACH data excluding these eight subjects, resulting in negligible differences in effect sizes compared to the previous study [9]. Next, to validate the prediction model in the external validation cohort (U-Act-Early), the predictors DAS28 > 5.1, HAQ > 0.6, erythrocyte folate < 750 nmol/l, current smoking, BMI > 25 kg/m2, ABCB1 rs1045642 genotype and ABCC3 rs4793665 genotype were entered into a multivariable logistic regression and the probability for insufficient response was calculated using the pROC package in R according to the following formula: \(P_{\text{insufficient response}}\,(\%) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}} \times 100\), where β0 represents the constant and β1, β2, and βn represent the regression coefficients for each of the predictors x1, x2, and xn. Subsequently, an ROC curve with AUC was constructed using the predicted probabilities and compared with results in the tREACH.
The previously developed prediction model in tREACH (derivation) was considered successfully validated in U-Act-Early (validation) if the area under the curve (AUC) of the receiver operating characteristic (ROC) was not significantly lower than in the tREACH. Goodness of fit between the predicted probabilities and observed values was tested using the Hosmer–Lemeshow test, where p > 0.05 indicated that a model fit the data well. All analyses were performed in RStudio using R version 3.5.3 (2019-03-11). P values < 0.05 were considered statistically significant. Subjects with missing data (N = 104) were excluded from analyses (complete-case analysis).
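As a minimal illustration of the two computations described above, the following Python sketch evaluates the logistic formula for one patient and derives an AUC from predicted probabilities. It is not the R/pROC code used in the study; the function names are hypothetical and any coefficients passed in are placeholders, not the fitted values from the tables.

```python
import math


def predicted_probability(intercept, coefs, x):
    """Probability of insufficient response in percent.

    Implements e^z / (1 + e^z) * 100 with z = b0 + b1*x1 + ... + bn*xn,
    written in the algebraically equivalent form 100 / (1 + e^-z).
    """
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 100.0 / (1.0 + math.exp(-z))


def auc(probabilities, outcomes):
    """Area under the ROC curve via pairwise comparison.

    Equals the share of (insufficient responder, responder) pairs in which
    the insufficient responder received the higher predicted probability;
    tied pairs count as one half.
    """
    cases = [p for p, y in zip(probabilities, outcomes) if y == 1]
    controls = [p for p, y in zip(probabilities, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

For example, `predicted_probability(-1.0, [0.8, 0.5], [1, 0])` evaluates the formula for a patient with the first of two binary predictors present, using placeholder coefficients. The pairwise definition used in `auc` yields the same value as the area under the empirical ROC curve.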

To enhance the model’s clinical applicability and thus facilitate its clinical implementation, we applied the prediction model to the combined dataset of tREACH and U-Act-Early (N = 264) using the step-up approach. In this larger combined cohort, with more power, we could examine whether the model could be simplified to fewer predictors and test possible two-way interactions. Statistically significant interactions (p < 0.05) were added to the model. To simplify the model, we assessed the individual contribution of variables to the predictive power of the model by sequential addition of predictors. Model fits were compared using the likelihood ratio test. The probability of insufficient response was calculated for each patient, and the corresponding specificity, positive predictive value (PPV), sensitivity, and negative predictive value (NPV) were determined using the “pROC” package in R. The final prediction model was uploaded onto the online platform ‘Evidencio’, providing a tool for clinicians to decide whether to start MTX combination therapy.
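How sensitivity, specificity, PPV and NPV follow from a chosen probability cut-off can be sketched in a few lines of Python. This is an illustrative re-implementation, not the pROC routine used in the study; the function name is hypothetical, and treating a probability equal to the cut-off as a positive prediction is an assumption.

```python
def classification_metrics(probabilities, outcomes, cutoff):
    """Sensitivity, specificity, PPV and NPV at one probability cut-off.

    A patient is predicted to be an insufficient responder when the
    predicted probability is >= cutoff (assumed convention).
    outcomes: 1 = observed insufficient response, 0 = sufficient response.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, outcomes):
        predicted_positive = p >= cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Raising the cut-off generally trades sensitivity for specificity, which is why the choice of cut-off depends on the clinical goal.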

Results

Cohort Comparisons

In U-Act-Early (validation cohort), mean DAS28 decreased from 5.0 (± 1.1) to 3.6 (± 1.6) during the first 3 months (p < 0.001). Mean DAS28 in tREACH (derivation cohort) was 5.0 (± 1.1), which decreased to a mean DAS28 of 3.1 (± 1.2, p < 0.001). Baseline DAS28 in U-Act-Early was comparable to that of tREACH (p = 0.613; Table 1). In U-Act-Early, 58 patients (64%) were categorized as insufficient MTX responders (i.e., DAS28 > 3.2 at 3 months) compared to 114 (43%) in tREACH (p = 0.006). In U-Act-Early, 39 (44%) patients were classified as MTX non-responders (i.e., DAS28 > 3.2 at 6 months), which was not significantly different from 38% in tREACH (Table 1). Additionally, U-Act-Early consisted of significantly more rheumatoid factor positive patients (81%) compared to tREACH (65%) (p = 0.007; Table 1), whereas no significant differences were found for anti-citrullinated protein antibody (ACPA) positivity (p = 0.214; Table 1). Despite the lower folate levels due to (international) re-standardization of the method, the erythrocyte-folate levels were still significantly higher in U-Act-Early compared to tREACH (p = 0.006; Table 1) and genotype GG/GA for ABCB1 was significantly more frequent in tREACH (p = 0.016; Table 1), while genotypes for ABCC3 were similar between cohorts. Importantly, besides MTX, co-medication was prescribed in tREACH (derivation) but not in U-Act-Early (validation; Table 1).

Table 1 Descriptives of the derivation (tREACH) and external validation (U-Act-Early) cohorts

Validation of Prediction Model at 3 Months

The model combining all predictors accomplished an AUC of 0.75 (95% CI 0.64–0.85) (Fig. 1) in U-Act-Early (validation), which means that 75% of the patients could be classified correctly. The strongest predictor for insufficient response to MTX in U-Act-Early was baseline DAS28 > 5.1 (p = 0.008; Table 2). Odds ratios (ORs) for the predictors DAS28, HAQ, erythrocyte folate, BMI and smoking in U-Act-Early (validation) were in the same direction as in the tREACH (derivation; Table 2). The OR of the ABCC3 SNP was however in the opposite direction in U-Act-Early (OR = 0.6, 95% CI 0.23–1.79) compared to tREACH (OR = 3.1, 95% CI 1.39–6.94).

Fig. 1

ROC curve for the prediction of insufficient response (DAS28 > 3.2) to MTX after 3 months of treatment. Area under the curve (AUC) is reported as follows: AUC (95% confidence interval). Predictors were: baseline DAS28 > 5.1, baseline HAQ > 0.6, ABCB1 genotype, ABCC3 genotype, baseline erythrocyte folate, BMI > 25 kg/m2 and current smoking

Table 2 Validation of multivariable logistic regression models for insufficient response to MTX (DAS28 > 3.2) at 3 months of treatment in an external validation cohort (U-Act-Early)

Similar results were found for the prediction of non-response at 6 months. An ROC curve was constructed with an AUC of 0.71 (95% CI 0.60–0.82; Supplementary Figure S1) in U-Act-Early (validation), which is comparable to the predictive value of the tREACH model (derivation) at 6 months (AUC 0.75, 95% CI 0.67–0.83; Supplementary Figure S1).

Enhancement of the Model’s Clinical Applicability

Next, the U-Act-Early and tREACH cohort were combined to increase power and enhance the model’s clinical applicability. In this combined cohort, the ORs for all predictors were greater than 1 and all predictors, except for the SNPs were significant (Supplementary Table S1). The combined model reached an AUC of 0.74 (95% CI 0.68–0.80; Supplementary Table S1 and Fig. 1) at 3 months.

Additionally, in this combined set, we investigated whether all predictors were required to reach 74% predictive power or whether the model could be further simplified. To do so, we analyzed changes in AUC upon sequential addition of predictors to the model. We started with the most readily available clinical predictors DAS28 > 5.1 and HAQ > 0.6, which generated an ROC with an AUC of 0.67 (95% CI 0.61–0.74; Table 3).

Table 3 Logistic model building in combined datasets: U-Act-Early + tREACH

Upon addition of smoking to the model, the AUC significantly increased (p = 0.01) to 0.70 (95% CI 0.64–0.76), followed by BMI, upon which the AUC further improved to 0.72 (95% CI 0.66–0.78, p = 0.02). Upon addition of erythrocyte folate to the model the AUC reached 0.73 (95% CI 0.67–0.79, p = 0.02). Addition of ABCB1 and ABCC3 genotypes did not significantly improve the model (AUC = 0.74, 95% CI 0.68–0.80, p = 0.12; Table 3). Hence, the model could be simplified to a model where SNP genotypes were excluded resulting in a model with predictive power of 73%.
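The comparison of nested models can be illustrated with a short Python sketch of the likelihood ratio test named in the Methods. Whether each reported p value stems from exactly this test is an assumption; the function names are hypothetical, and the closed-form chi-square tail is implemented only for 1 or 2 added parameters (one predictor, or the two SNPs together).

```python
import math


def log_likelihood(probabilities, outcomes):
    """Bernoulli log-likelihood of observed 0/1 outcomes.

    probabilities are fitted probabilities on the 0-1 scale (not percent).
    """
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(probabilities, outcomes)
    )


def likelihood_ratio_test(ll_reduced, ll_full, df):
    """Likelihood ratio statistic and p value for nested logistic models.

    statistic = 2 * (ll_full - ll_reduced), referred to a chi-square
    distribution with df = number of added parameters.
    """
    statistic = max(2.0 * (ll_full - ll_reduced), 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch supports df = 1 or df = 2 only")
    return statistic, p_value
```

In use, `ll_reduced` would come from the model before a predictor is added and `ll_full` from the model after, both fitted on the same complete cases.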

To fine-tune the model, all two-way interaction terms between predictors were tested. An interaction term between HAQ and BMI (OR = 3.68, 95% CI 1.07–13.14) significantly contributed to the model. This means that a BMI > 25 kg/m2 was associated with worse disease activity when HAQ values were > 0.6. Furthermore, an interaction term between HAQ and erythrocyte folate (OR = 0.23, 95% CI 0.06–0.86) also significantly contributed to the model, indicating that low erythrocyte folate concentrations (< 750 nmol/l) significantly predicted insufficient response when HAQ values were ≤ 0.6. Hence, interaction terms for HAQ and BMI and for HAQ and erythrocyte folate were added to the model. Upon addition of these interaction terms, the AUC of the final model, shown in Table 4, increased to 0.75 (95% CI 0.69–0.81). As mentioned in the Methods section, we generated new cut-off values for erythrocyte folate and BMI in the U-Act-Early cohort which, when included, did not result in higher AUCs.
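To make the role of the interaction terms explicit, the linear predictor of a model with the five dichotomized predictors and the two interactions can be written as follows. The numbering of the coefficients is an assumption made for illustration and need not match the ordering in Table 4.

\[
\operatorname{logit}(P) = \beta_0 + \beta_1 x_{\text{DAS28}} + \beta_2 x_{\text{HAQ}} + \beta_3 x_{\text{smoking}} + \beta_4 x_{\text{BMI}} + \beta_5 x_{\text{folate}} + \beta_6\, x_{\text{HAQ}}\, x_{\text{BMI}} + \beta_7\, x_{\text{HAQ}}\, x_{\text{folate}}
\]

Under this parameterization, the odds ratio for BMI > 25 kg/m2 is \(e^{\beta_4}\) when HAQ ≤ 0.6 and \(e^{\beta_4 + \beta_6}\) when HAQ > 0.6, so the reported interaction OR of 3.68 corresponds to \(e^{\beta_6}\), the factor by which the BMI effect is multiplied in patients with HAQ > 0.6. Likewise, the odds ratio for low erythrocyte folate is \(e^{\beta_5}\) when HAQ ≤ 0.6 and \(e^{\beta_5 + \beta_7}\) when HAQ > 0.6, with \(e^{\beta_7} = 0.23\) indicating that the folate effect is attenuated in patients with HAQ > 0.6.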

Table 4 Final prediction model enhanced for clinical implementation

Model Translation to the Clinic

In order to apply the model in clinical practice, the prediction model was integrated into an online platform “Evidencio” [16]. Using this tool, clinicians can easily enter DAS28, HAQ, erythrocyte folate, smoking, and BMI for individual patients using sliding scales and buttons. The model then automatically calculates and presents a probability of insufficient response to MTX for this specific patient. Specificity, positive predictive value (PPV), sensitivity, and negative predictive values (NPV) were calculated for different cut-off values of these probabilities for insufficient response and are presented in Supplementary File S1.

Our online model can be found in Evidencio using the following link: https://www.evidencio.com/models/show/2191. A patient-specific report presenting the results and model interpretation can be downloaded from Evidencio (Supplementary File S1). An example is shown in Fig. 2, where a patient with DAS28 score = 4.9, HAQ = 0.4, BMI = 22 kg/m2, erythrocyte-folate = 720 nmol/l, and current smoking status = yes, has a probability of insufficient response of 71.7%, with corresponding specificity of 86% and PPV of 75%.

Fig. 2

Example of online platform Evidencio for the implementation of the prediction model. Values for each individual patient can be filled out using the buttons and slides. Corresponding probability for insufficient response is automatically calculated using the prediction model

Discussion

We externally validated our previously developed prediction model for insufficient response to MTX therapy at 3 and 6 months after treatment initiation in early RA patients including all predictors DAS28 > 5.1, HAQ > 0.6, ABCB1 rs1045642 genotype, ABCC3 rs4793665 genotype, erythrocyte folate < 750 nmol/l, current smoking and BMI > 25 kg/m2. To enhance clinical applicability and facilitate implementation, the validated model was applied in the combined derivation and validation cohort. This model, without ABCB1 rs1045642 genotype and ABCC3 rs4793665 genotype, had an AUC of 0.75, meaning that it classified 75% of the insufficient responders correctly. Currently, according to the EULAR treatment guidelines, MTX is the first-line therapy in RA. Treatment is only up-scaled after 3 to 6 months of insufficient response to MTX, despite evidence supporting a ‘window of opportunity’ for targeted treatment [1, 17]. This window of opportunity is a limited period between diagnosis and RA progression in which the disease could still be modified, radiographic damage/functional disability could be limited, and progression could be slowed down upon early control of disease activity, for which sufficient treatment is required [17]. Our prediction model could assist in identification of insufficient responders at diagnosis: for those with high probability of insufficient response to MTX, additional biologics or JAK inhibitors could be prescribed, while for patients with low probabilities of insufficient response these expensive treatments could be spared. This distinction at diagnosis could save precious time for insufficient responders, allowing earlier control of disease activity resulting in better long-term outcomes.

We externally validated the model for the first time in its entirety as, besides the clinical and laboratory predictors, the lifestyle predictors (BMI and smoking) were also examined in the U-Act-Early cohort (as opposed to the initial validation in the MTX-Rotterdam cohort, which lacked the life-style predictors) [9]. The strongest predictor was high disease activity at baseline (DAS28 > 5.1) confirming previous findings [7, 18]. Due to differences in treatment intensities (i.e., MTX dose and co-medication) between the derivation and the validation cohort, we investigated whether the model was applicable at 6 months despite step-up treatments after the 3-month mark. Indeed, 71% of the non-responders to MTX were classified correctly, which was similar to the 75% in the tREACH derivation cohort at 6 months. In the combined cohort, all predictors except for ABCB1 and ABCC3 genotypes significantly contributed to the predictive power of the model. Addition of ABCB1 and ABCC3 genotypes to the model showed only minimal improvement, resulting in an absolute change in AUC of 0.01, which was not statistically significant. A meta-analysis on the relationship between ABCB1 genotype and response to MTX in 2014 RA patients showed an association between this genotype and response to MTX, yet our patient group was too small to validate this result [9]. Another recent GWAS study did not show a relationship between ABCB1 or ABCC3 and treatment response [19]. Since the differences in predictive power were minor and the effect of ABCC3 genotype pointed in opposite directions in the two cohorts, possibly indicating a spurious finding, we excluded both genotypes from the model.

In agreement with our study, increased BMI (obese > 30 kg/m2) was previously found to be associated with insufficient response to MTX in RA patients [20, 21]. It is postulated that the effect of BMI on non-response to MTX could be due to the release of proinflammatory adipokines [e.g., leptin, interleukin-6, and/or tumor-necrosis factor alpha (TNF-α)] from the adipose tissue [20]. Also, the predictive power of smoking was in accordance with previous studies [6, 7, 20, 22].

So far, several prediction models for MTX non-response have been proposed, which resulted in an AUC between 0.65 and 0.85 [6,7,8, 22, 23]. Different outcome measures at different time points and combination therapies complicate comparison between models and their validation. However, the best-performing models all included clinical parameters and laboratory parameters, which is in line with our findings [6, 8]. We also showed that clinical predictors (DAS28 > 5.1 and HAQ > 0.6) alone classified fewer insufficient responders correctly (67%) compared to the model combining clinical, life-style (BMI and smoking) and laboratory predictors (erythrocyte folate), which classified 73% of patients correctly. Most clinical predictors and life-style predictors are easy to assess. Erythrocyte folate may not be available in every laboratory; however, the assay is relatively easy to perform [24].

Strengths of this study are that both derivation and validation studies were prospectively designed and that patients in the external validation cohort were included from different districts in the Netherlands. A limitation is that the size of the external validation cohort was limited; however, the numbers of cases in the internal (tREACH) and external (U-Act-Early) cohorts were similar. In addition, the model was validated in an MTX monotherapy group, while it was designed in a combination (GC and csDMARD) therapy group. Commonly, however, MTX is co-prescribed with a short course of glucocorticoids (prednisone) as MTX’s optimal effect ensues after 8–12 weeks [1]. Despite differences in co-medication between the cohorts, the prediction model had similar predictive value and the ORs for predictors were in the same direction in both cohorts, indicating that co-medication did not affect the prediction of response to MTX. Another limitation is that smoking status was assessed using questionnaires, possibly biasing the results as smoking behavior could be underreported or underestimated. In future studies, cotinine, the degradation product of nicotine, could be quantified as an objective measure for smoking status, which can easily be determined in serum [25].

Furthermore, we showed that the online platform Evidencio provides an easy tool for implementation of the prediction model in clinical practice. Evidencio is freely available so that the data can be uploaded to automatically validate the model in specific cohorts. In addition, using the Evidencio platform, clinicians can directly use the model in their practice. When a new patient is diagnosed with RA, the patient’s information on DAS28, HAQ, erythrocyte-folate, BMI, and smoking status can be provided to Evidencio. Subsequently, a probability of insufficient response to MTX, with corresponding specificity, positive predictive value (PPV), sensitivity, and negative predictive value (NPV), is provided by the tool and may help clinicians and patients in shared decision-making on step-up treatment with bDMARDs or tsDMARDs. The choice of a cut-off depends on the clinical goal. Taking into consideration the “window of opportunity” [17] for optimal treatment, we consider it crucial to adequately treat insufficient MTX responders with additional bDMARDs/tsDMARDs. Therefore, our goal for this prediction model was to identify as many insufficient responders as possible, while at the same time attempting to restrict the use of bDMARDs/tsDMARDs to those patients who really need them, and hence to avoid misclassification of sufficient responders. Considering this, a cut-off probability of 70% (of insufficient response) could be chosen. At this cut-off, 75% of patients classified as insufficient responders are actual insufficient responders (PPV) and could be treated with additional bDMARDs/tsDMARDs. Additionally, at this cut-off, 86% of all sufficient responders would be correctly classified as such (specificity) and could be spared additional treatment.

The importance of erythrocyte-folate for the predictive power of the model implies that this model is specific in predicting insufficient response to MTX, as MTX is structurally similar to folate. Hence, low erythrocyte folate levels are possibly a surrogate measure for poor MTX absorption, transportation, and MTX accumulation in the cell, as described previously [14]. However, it is possible that a certain proportion of insufficient responders to MTX are difficult-to-treat RA patients who are poorly responsive to various b/ts DMARDs [26, 27]. So far, we cannot identify difficult-to-treat RA patients in advance. Furthermore, as recently argued, treatment strategies could be more important than specific drugs, implying that these patients could still benefit from quicker and more aggressive treatment to reach a certain treatment target when earlier identified as insufficient responders [28, 29].

Conclusions

We successfully externally validated our previously published prognostic prediction model of insufficient response to MTX, which correctly classified 75% of insufficient responders at 3 months and 71% of non-responders at 6 months of treatment. The model can be used in clinical practice to identify insufficient responders to MTX with the goal of treating them with additional biologics or JAK inhibitors as early as possible to reduce disease activity and limit joint damage. Application of the tool in a clinical trial is warranted.