Key Summary Points

Why carry out this study?

In rheumatoid arthritis, despite availability of targeted treatments, only a minority of patients achieve sustained remission.

Little evidence exists to direct choice of biologic disease-modifying antirheumatic drugs in individual patients.

Our goal was to identify a “rule” based on clinically feasible biomarkers to predict response to sarilumab and discriminate between responses to sarilumab versus adalimumab, using clinical trial data and machine learning.

What was learned from the study?

The presence of anti-cyclic citrullinated peptide antibodies, combined with C-reactive protein > 12.3 mg/l, emerged as a biomarker “rule” that could potentially predict response to sarilumab.

This finding needs to be confirmed in real-world studies.

Digital Features

This article is published with digital features, including a video abstract to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.14512056.

Introduction

Treatment guidelines for rheumatoid arthritis (RA) recognize the importance of attaining clinical improvement within 3 months and remission or low disease activity within 6 months of treatment initiation [1, 2]. Biologic disease-modifying anti-rheumatic drugs (bDMARDs) are recommended in the presence of poor prognostic factors or if response to initial treatment is inadequate. The choice of bDMARD is often based on physician experience, patient preference, and cost, and is complicated by a variety of available agents [3, 4]. Nevertheless, there is a remarkably similar plateau in responder rates for patients achieving 20% (ACR20), 50% (ACR50), and 70% (ACR70) response based on American College of Rheumatology criteria, irrespective of the bDMARD or targeted synthetic DMARD (tsDMARD) studied [5,6,7].

Given the importance of rapid response to treatment in prevention of irreversible joint damage and improved symptom control, a personalized approach to treatment selection would be preferred over a prolonged and iterative trial-and-error process [8]. However, only a few biomarkers have been identified as candidates for treatment optimization in RA, with at best modest associations with treatment response, and with inconsistent applicability in current clinical practice. For example, the presence of rheumatoid factor or of autoantibodies to cyclic citrullinated peptide (CCP) may predict response to rituximab, and genetic factors such as the HLA-DRB1 shared epitope may predict response to tumor necrosis factor inhibitors (TNFi) and tocilizumab, an inhibitor of the interleukin-6 receptor (IL-6R) [3, 9]. It is possible that treatment response would be better predicted by combinations of biomarkers and clinical characteristics [3, 8, 10,11,12,13]. However, the multitude of potential clinical and biomarker-based predictors, in combination with their thresholds and the inherent constraints on clinical availability, poses a significant conceptual and computational challenge.

Artificial intelligence techniques such as machine learning are increasingly being used to identify individuals at risk for disease, predict outcome, and optimize treatments [4, 14, 15]. In machine learning, computers apply hypothesis-free algorithms that enable development of data-based mathematical models [4]. To develop machine learning models, a randomly selected subset of data, such as that obtained from patients in clinical trials, is used to select, from a predefined set of parameters (e.g., clinical or blood biomarkers), those factors that are associated with a predefined outcome (e.g., ACR20). Once the parameters are set so that the error in predicting the outcome is minimized, those parameter values (i.e., the “rule”) are validated using the remaining data or new external data sources [4]. This approach allows the identification of hidden patterns and rules in large datasets while reducing the risk of overfitting and removing the need to correctly specify hypotheses a priori [4]. Machine learning has been applied to electronic health records to prognosticate RA disease activity [14, 16] and to define disease phenotypes in RA [17]. However, to the best of our knowledge, it has not yet yielded a robust, clinically feasible rule that would predict treatment response to biologic therapies in patients with RA.
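The idea of setting a parameter so that the prediction error is minimized can be illustrated with a single-variable threshold search. The following is a minimal sketch only: the function name `best_threshold` and the toy marker values and outcomes are hypothetical and are not trial data, and the exhaustive search shown here is a generic illustration rather than the procedure implemented in GUIDE.

```python
from typing import List, Tuple


def best_threshold(values: List[float], outcomes: List[bool]) -> Tuple[float, int]:
    """Find the cutpoint c that minimizes misclassifications of the
    rule 'predict response if value > c' (hypothetical helper)."""
    candidates = sorted(set(values))
    best_cut, best_err = candidates[0], len(values) + 1
    for cut in candidates:
        # Count patients whose predicted status disagrees with the outcome.
        errors = sum((v > cut) != y for v, y in zip(values, outcomes))
        if errors < best_err:  # keep the first (lowest) cutpoint on ties
            best_cut, best_err = cut, errors
    return best_cut, best_err


# Toy, made-up baseline marker values and responder flags (NOT trial data).
toy_marker = [4.0, 6.5, 9.0, 11.0, 13.0, 15.5, 20.0, 31.0]
toy_response = [False, False, True, False, True, True, True, True]

cut, err = best_threshold(toy_marker, toy_response)
print(f"selected cutpoint: > {cut}, training errors: {err}")
```

A cutpoint chosen this way is fitted to the training subset, which is why the text stresses validating it on the remaining data or on external data.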

Sarilumab is a human monoclonal antibody to IL-6R approved for the treatment of moderate-to-severe RA [18, 19]. As with other bDMARDs, the characteristics of patients most likely to benefit from sarilumab treatment remain poorly understood. In this post hoc analysis, we used machine learning to identify a simple and clinically feasible rule that could predict favorable response to sarilumab, and in one trial, an incremental response compared with adalimumab.

Methods

Data Sources

This post hoc analysis used patient-level data from four phase 3 sarilumab trials: MOBILITY (sarilumab versus placebo in patients with inadequate response to methotrexate; NCT01061736) [20], MONARCH (sarilumab versus adalimumab as monotherapy in patients with inadequate response or intolerant to methotrexate; NCT02332590) [21], TARGET (sarilumab versus placebo in patients with inadequate response or intolerant to TNFi; NCT01709578) [22], and ASCERTAIN (comparative safety of sarilumab and tocilizumab; NCT01768572) [23]. Patients in all four trials were ≥ 18 years old and met the ACR 1987 revised classification criteria (MOBILITY) or the ACR 2010/European League Against Rheumatism (EULAR) classification criteria (MONARCH, TARGET, ASCERTAIN) for active RA at baseline [20,21,22,23]. Patients had CRP levels ≥ 6 mg/l in MOBILITY, ≥ 8 mg/l in MONARCH and TARGET, and ≥ 4 mg/l in ASCERTAIN [20,21,22,23]. At baseline in each trial, at least 65% of patients were seropositive for rheumatoid factor and at least 75% were positive for anti-CCP autoantibodies [20,21,22,23]. MOBILITY data were used to train and validate the model (data were randomly split into training and validation sets by the algorithm; henceforth referred to as cross-validation), which was later tested in the full data sets of all four trials.

Variables and Outcome Used for Model Development

Eighteen categorical and 24 continuous baseline variables, including demographics, blood protein biomarkers, and clinical scores from the MOBILITY trial were identified as potential predictors of clinical response (Table 1).

Table 1 Parameters used in the GUIDE algorithm

The clinical endpoint of ACR20 at week 24 was used for the model training, once the machine learning methodology had been chosen (see below).

Choice of the Machine Learning Methodology

Because our goal was to identify a simple, clinically feasible rule with good predictive performance, we initially tested three decision tree methods: rpart (CRAN library; https://cran.r-project.org/web/packages/rpart/index.html), C5.0 (https://www.rulequest.com), and Generalized, Unbiased, Interaction Detection and Estimation (GUIDE; version 27.9 for macOS Mojave 10.14; http://pages.stat.wisc.edu/~loh/guide.html) [24], using ACR20, ACR50, ACR70, or the 28-joint disease activity score using CRP (DAS28-CRP) as clinical outcomes. Of those, rpart and C5.0 did not provide generalizable approaches for patient stratification. GUIDE provided unbiased variable selection and cross-validation, as well as a final tree that yielded a rule that fit our predefined criteria. The default options of GUIDE were used, including tenfold cross-validation.
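For readers unfamiliar with tenfold cross-validation, the partitioning step can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper name `k_fold_indices` and the seed are hypothetical, the sample size of 163 is the training subset size reported in this article, and GUIDE performs its own internal splitting, which may differ in detail.

```python
import random
from typing import List


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly partition the indices 0..n-1 into k folds whose sizes
    differ by at most one (hypothetical helper)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold.
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(163, k=10)

for held_out in folds:
    # Each fold is held out once; the model is fitted on the other nine.
    train = [j for fold in folds if fold is not held_out for j in fold]
    assert len(train) + len(held_out) == 163

print([len(fold) for fold in folds])
```

Each patient is held out exactly once, so the cross-validated error is always estimated on data that were not used to fit the corresponding tree.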

Training of the Decision Tree Model

To maximize the information density for model training, a data subset from patients in the MOBILITY trial who had data for all selected biomarkers (n = 163; Table 1) was entered into GUIDE to train the decision tree model where ACR20 ultimately yielded the only cross-validated model. Patient data from the sarilumab 200 mg (n = 63) and 150 mg (n = 100) treatment groups in MOBILITY were pooled. The resulting model was manually reduced by two decision-nodes to achieve greater clinical applicability while maintaining most of its predictive power, and is henceforth referred to as the final model (see “Results”).

Validation of the Model

The final model for predicting sarilumab response was validated using the full dataset for MOBILITY, which included the training subset (Nnew = 1034 plus Ntraining = 163; sarilumab 150 mg or 200 mg, n = 799; placebo, n = 398) and datasets from MONARCH (N = 369; sarilumab 200 mg, n = 184; adalimumab 40 mg, n = 185), TARGET (N = 546; sarilumab 150 mg or 200 mg, n = 365; placebo, n = 181), and ASCERTAIN (N = 202; sarilumab 150 mg or 200 mg, n = 100; tocilizumab 4 mg/kg or 8 mg/kg, n = 102). For each study, the applicable endpoints (Supplementary Material, Table S1) were assessed at baseline and week 24 and placebo-corrected where possible. In addition, we developed a clinical application scenario based on the rule created by the GUIDE algorithm and the MONARCH data [21].

Statistical Analysis

Data analyses were performed using R v3.5.2 and Microsoft Excel 2010.

Compliance with Ethics Guidelines

This article is based on previously conducted studies and does not contain any new studies with human participants or animals performed by any of the authors.

Results

Choice of the Machine Learning Methodology

Retrospectively, we initiated a sensitivity analysis of our methodology by evaluating 14 additional machine learning methods available through PyCaret (https://pycaret.org; Supplementary Material, Table S2), including rule-based (decision trees), regression-based (logistic regression, quadratic discriminant analysis, linear discriminant analysis, ridge classifier), instance-based (k-neighbors classifier), naïve Bayesian, ensemble-based (extra trees classifier, random forest, gradient boosting, light gradient boosting, CatBoost, and Ada Boost), and support vector machines. The PyCaret results confirmed that the decision tree classifier model [25] GUIDE had the best precision, while other performance criteria were comparable. The resulting variable ranking of the PyCaret results (only the four best results are shown) frequently contained the GUIDE-derived variables, which were CRP level and anti-CCP status, but also suggested alternative variables and variable combinations (Supplementary Material, Figure S1).

Development of the Model and Identification of a Predictive Rule

The training and cross-validation dataset included 163 patients who received sarilumab in the MOBILITY trial and had data available for all relevant variables (Table 1).

A robust predictive GUIDE model for patient stratification was obtained using ACR20 as the measure of patient response. With ACR50 and ACR70, GUIDE could not generate cross-validated trees, but often included anti-CCP and CRP either as the first decision variable or within the first two branches of the decision tree (data not shown). In the resulting decision tree (Fig. 1A), the model contained, in the following order: anti-CCP, metabolite of type I collagen (C1M), CRP, and a weighted combination of soluble glycoprotein 130 (sgp130) and the erosion score. However, C1M and the weighted combination of sgp130 and erosion score were manually excluded from the resulting model. C1M had only a minor impact on the model performance and the weighted combination of sgp130 and erosion score was only available in MOBILITY. In addition, the exclusion of these variables allowed for greater clinical applicability of the final model.

Fig. 1 Schematic of the resulting GUIDE decision tree classification approach model (A) and the reduced final model (B). Anti-CCP anti-cyclic citrullinated peptide, C1M metabolite of type I collagen, CRP C-reactive protein, sgp130 soluble glycoprotein 130

Among the 42 candidate variables, the combined presence of anti-CCP (yes/no) and CRP > 12.3 mg/l (chosen by GUIDE) was identified as a predictor (i.e., the rule) of better treatment outcomes with sarilumab. Of the 163 MOBILITY patients included in the training and cross-validation dataset, 84 had a positive status for both anti-CCP and CRP > 12.3 mg/l. The ACR20 response rate in this rule-positive group was 81% (68/84), which was higher than the rate observed in patients without anti-CCP antibodies (27% [4/15]) or in those who had anti-CCP antibodies but their CRP was ≤ 12.3 mg/l (59% [38/64]; Fig. 1B).
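The final rule and the response rates quoted above can be reproduced with a few lines of code. This is a minimal sketch: the function and constant names are hypothetical, the counts are those reported in the preceding paragraph for the 163-patient training subset, and the strict inequality follows the stated cutpoint of CRP > 12.3 mg/l.

```python
CRP_CUTPOINT_MG_L = 12.3  # cutpoint selected by GUIDE, as reported in the text


def is_rule_positive(anti_ccp_positive: bool, crp_mg_l: float) -> bool:
    """Rule-positive = anti-CCP positive AND CRP strictly above the cutpoint."""
    return anti_ccp_positive and crp_mg_l > CRP_CUTPOINT_MG_L


# (ACR20 responders, patients) per terminal node, taken from the text.
node_counts = {
    "anti-CCP positive, CRP > 12.3 mg/l": (68, 84),
    "anti-CCP negative": (4, 15),
    "anti-CCP positive, CRP <= 12.3 mg/l": (38, 64),
}

# Response rate per node, rounded to whole percentages.
acr20_rate_pct = {
    node: round(100 * responders / patients)
    for node, (responders, patients) in node_counts.items()
}
total_patients = sum(patients for _, patients in node_counts.values())

for node, pct in acr20_rate_pct.items():
    print(f"{node}: {pct}%")
print("total patients:", total_patients)
```

The three terminal nodes together account for all 163 patients in the training and cross-validation dataset.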

Results from the model were similar for both sarilumab doses (150 mg and 200 mg; not shown), so the analysis presented here is focused on the recommended sarilumab dose of 200 mg.

Baseline Characteristics of Rule-Positive Patients

Patients who were anti-CCP-positive and had CRP > 12.3 mg/l, i.e., rule-positive patients, accounted for 34–51% of patients in the sarilumab groups across the four trials. On average, rule-positive patients had more severe disease and more baseline factors suggesting poor prognosis than rule-negative patients (MOBILITY: Table 2; MONARCH, TARGET, ASCERTAIN: Supplementary Material, Tables S3, S4, and S5).

Table 2 Baseline characteristics of rule-positive and -negative patients in MOBILITY

Model Validation and Application of the Rule to Predict Clinical Response

At week 24, response rates in rule-positive sarilumab-treated patients from MOBILITY were superior to those in rule-negative patients, across all endpoints assessed. In the placebo group, responder rates for rule-negative patients were more favorable for CDAI remission, DAS28-CRP remission, and DAS28-CRP LDA, compared with rule-positive patients, whereas response rates for other outcomes were similar (Supplementary Material, Fig. S2A). Across all outcomes, placebo-adjusted response rates in rule-positive patients were higher by approximately 5–15%, compared with rule-negative patients (Fig. 2A).

Fig. 2 Response rates in rule-positive and rule-negative sarilumab-treated patients. The patient stratification rule was the combined presence of anti-CCP and CRP > 12.3 mg/l. ACR20 ACR 20%, ACR50 ACR 50%, ACR70 ACR 70%, CDAI Clinical Disease Activity Index, DAS28-CRP 28-joint Disease Activity Score using C-reactive protein, DAS28-ESR DAS28 using erythrocyte sedimentation rate, HAQ-DI Health Assessment Questionnaire-Disability Index, LDA low disease activity, MCID minimal clinically important difference, REM remission

In MONARCH, rule-positive patients also had a consistently stronger response to sarilumab than their rule-negative counterparts, which was not observed in adalimumab-treated patients (Fig. 2B, C). Among rule-positive individuals, the ACR70 response rate in those receiving sarilumab was almost five times the rate observed in patients receiving adalimumab (34 vs. 7%), whereas adalimumab treatment resulted in a higher ACR70 response rate among rule-negative patients compared to the overall, unstratified population (16 vs. 12%). Based on MONARCH data, sarilumab treatment of 100 patients (35 of whom were rule-positive) without considering their rule status would result in the ACR70 response in 23 individuals. However, a nearly identical ACR70 response rate (22 individuals) would be achieved if only the 35 rule-positive patients received sarilumab and the remaining 65 rule-negative patients received adalimumab. Adalimumab treatment of 100 patients would result in ACR70 response in 12 individuals (Supplementary Material, Figure S3).
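The arithmetic behind this scenario can be checked directly. The sketch below uses only the figures quoted in the paragraph above; the variable and function names are hypothetical, and the calculation assumes that the stratum-specific ACR70 rates apply unchanged to a notional cohort of 100 patients.

```python
# Figures quoted in the text for MONARCH (ACR70 response).
N_PATIENTS = 100
N_RULE_POSITIVE = 35
ACR70_SARILUMAB_RULE_POSITIVE = 0.34
ACR70_ADALIMUMAB_RULE_NEGATIVE = 0.16
RESPONDERS_SARILUMAB_UNSTRATIFIED = 23   # sarilumab for all 100 patients
RESPONDERS_ADALIMUMAB_UNSTRATIFIED = 12  # adalimumab for all 100 patients


def expected_responders(n_patients: int, response_rate: float) -> float:
    """Expected number of responders in a group of n_patients."""
    return n_patients * response_rate


n_rule_negative = N_PATIENTS - N_RULE_POSITIVE

# Stratified strategy: sarilumab if rule-positive, adalimumab otherwise.
stratified = (
    expected_responders(N_RULE_POSITIVE, ACR70_SARILUMAB_RULE_POSITIVE)
    + expected_responders(n_rule_negative, ACR70_ADALIMUMAB_RULE_NEGATIVE)
)
stratified_rounded = round(stratified)

print("stratified strategy:", stratified_rounded, "responders per 100")
print("sarilumab for all:", RESPONDERS_SARILUMAB_UNSTRATIFIED)
print("adalimumab for all:", RESPONDERS_ADALIMUMAB_UNSTRATIFIED)
```

Under these assumptions the stratified strategy yields about 22 responders per 100 patients, in line with the figure given in the text.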

In TARGET, the only trial that enrolled patients with an inadequate response to TNFi treatment, findings were mixed (Fig. 2D). In both the placebo and sarilumab groups, the response rates in rule-positive patients for ACR scores, as well as the DAS28-CRP LDA and remission, were lower than those observed in rule-negative patients (Supplementary Material, Figure S2B).

In ASCERTAIN, sarilumab-treated rule-positive patients had more favorable 24-week response rates compared with rule-negative patients for all endpoints assessed, with a magnitude of difference ranging between 15 and 30% (Fig. 2E). In tocilizumab-treated patients, the rule applied to most clinical endpoints (Supplementary Material, Figure S4).

Overall, across MOBILITY, TARGET, and ASCERTAIN, rule-positive patients had higher odds of achieving a clinical response at week 24 than rule-negative patients, with some exceptions observed in TARGET (Fig. 3). The changes from baseline in continuous variables for each trial are shown in Table 3. Across all studies (including TARGET), we calculated a mean improvement in CDAI of 3.4 (± 2.8 [standard error]) and in DAS28-CRP of 0.8 ± 0.4 for rule-positive patients treated with sarilumab compared with rule-negative patients. Conversely, the treatment of rule-negative patients with adalimumab in MONARCH resulted in a mean CDAI improvement of 3.1 ± 3.0 compared with rule-positive patients (Table 3). Mean change in DAS28-CRP was similar between adalimumab-treated rule-positive and rule-negative patients in MONARCH (Table 3), although response rates for DAS28-CRP remission and LDA showed improvement for rule-negative compared with rule-positive adalimumab-treated patients (Fig. 2B).
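For orientation, an odds ratio of rule-positive versus rule-negative patients can be computed from a 2 × 2 table as sketched below. The counts used here are the unadjusted ACR20 counts from the 163-patient MOBILITY training subset reported earlier in the Results, chosen only because they are stated in the text; they are not the placebo-adjusted values plotted in Fig. 3. The function name is hypothetical, and the Woolf (log-based) confidence interval is a standard textbook choice rather than the method used in this analysis.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(
    resp_pos: int, nonresp_pos: int, resp_neg: int, nonresp_neg: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Odds ratio and Woolf 95% CI for rule-positive vs rule-negative."""
    odds_ratio = (resp_pos * nonresp_neg) / (nonresp_pos * resp_neg)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / resp_pos + 1 / nonresp_pos + 1 / resp_neg + 1 / nonresp_neg
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Training subset: 68/84 rule-positive responders;
# rule-negative responders = 4/15 + 38/64 = 42/79.
or_acr20, ci_low, ci_high = odds_ratio_with_ci(
    resp_pos=68, nonresp_pos=84 - 68, resp_neg=4 + 38, nonresp_neg=79 - 42
)
print(f"OR = {or_acr20:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Because these counts come from the data used to derive the rule, the resulting estimate is optimistic and is shown only to make the calculation concrete.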

Fig. 3 Odds ratios of achieving clinical response at week 24 in placebo- (MOBILITY, TARGET) or active-controlled studies (ASCERTAIN): rule-positive versus rule-negative patients. The patient stratification rule was the combined presence of anti-CCP and CRP > 12.3 mg/l. Data presented for MOBILITY and TARGET are placebo-adjusted. ACR20 ACR 20%, ACR50 ACR 50%, ACR70 ACR 70%, DAS28-CRP 28-joint Disease Activity Score using C-reactive protein, HAQ-DI Health Assessment Questionnaire-Disability Index, LDA low disease activity, MCID minimal clinically important difference

Table 3 Mean ± SE change from baseline to week 24 for applicable endpoints in rule-positive and rule-negative patients in MOBILITY, MONARCH, TARGET, and ASCERTAIN

Discussion

In this study, we used machine learning to identify a combination of baseline patient characteristics to predict treatment response to sarilumab and adalimumab. The method found that the presence of anti-CCP antibody and a CRP level above a selected cutpoint of 12.3 mg/l were predictive of a better response to sarilumab and, in the one trial where adalimumab data were also available, of an incrementally larger response to sarilumab than to adalimumab. This approach could facilitate choice of treatment in patients with RA.

Our algorithm identified a simple, clinically applicable rule after considering the large number of combinatorial possibilities among 42 variables and their values or thresholds. Our study therefore demonstrates the potential of machine learning as a tool for systematic, fast, and deep analysis of data that can yield rules applicable in clinical practice.

Previously, anti-CCP has been identified as a predictor of response to rituximab and abatacept, and high CRP has been identified as a predictor of response to TNF inhibitors and tocilizumab [26,27,28,29]. Our study confirms the predictive potential of the combined presence of these two parameters with data from four independent studies. Of note, biologically plausible parameters that have been identified as predictors of response to sarilumab, such as IL-6 concentration [30], were not included in the rule. This is not unexpected: in patients with RA, IL-6 and CRP levels are highly correlated [31, 32] and machine learning algorithms, which approach data in a non-biased fashion, are set to prefer one of the correlated parameters based on its ability to maximize the predefined outcome (in our case, ACR20). CRP was probably selected because IL-6 varies more between individuals [33] and has a more variable diurnal profile than CRP [34]. The more stable levels of CRP would make it a preferred choice as predictor of response, especially if there was a single biomarker measurement by visit.

It can be argued that the use of a composite endpoint such as ACR response, which includes the acute phase reactants CRP or ESR, may have influenced the algorithm to select CRP > 12.3 mg/l as one of the components of the rule for an IL-6R inhibitor. However, the rule also predicted CRP-independent endpoints, such as the CDAI and HAQ-DI. ACR scores are based on relative changes and, therefore, unaffected by a potential selection bias. For the other scores associated with low disease activity and remission (e.g., DAS28-CRP remission and LDA), where fixed, relatively low CRP thresholds are required, a selection of high CRP baseline values rather increases the necessary treatment response to achieve these thresholds.

Among the decision tree methods we considered, the GUIDE algorithm was the only one that provided a simple, clinically feasible rule. It also showed the highest precision and competitive accuracy, compared with other methods assessed, as well as a higher transparency and better interpretability, albeit with lower recall.
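The trade-off between precision and recall can be made concrete by treating rule-positive status as a prediction of ACR20 response. The sketch below uses the counts reported in the Results for the 163-patient training subset; it illustrates how the metrics are defined and does not reproduce the PyCaret comparison in Supplementary Table S2. Variable names are hypothetical.

```python
# Confusion matrix for "rule-positive predicts ACR20 response",
# built from the training-subset counts reported in the Results.
true_positive = 68                   # rule-positive responders (68/84)
false_positive = 84 - 68             # rule-positive non-responders
false_negative = 4 + 38              # rule-negative responders (4/15 + 38/64)
true_negative = (15 - 4) + (64 - 38) # rule-negative non-responders

total = true_positive + false_positive + false_negative + true_negative

# Precision: share of rule-positive patients who responded.
precision = true_positive / (true_positive + false_positive)
# Recall: share of all responders who were rule-positive.
recall = true_positive / (true_positive + false_negative)
# Accuracy: share of patients classified correctly.
accuracy = (true_positive + true_negative) / total

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"accuracy  = {accuracy:.2f}")
```

On these counts the rule is right about most patients it flags but misses a sizeable share of responders, which is the pattern of high precision and lower recall described above.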

Since the algorithm was selecting responders regardless of treatment during the model training, patients treated with placebo (MOBILITY and TARGET) and adalimumab (MONARCH) were important controls during the testing phase. We found that rule-positive patients had higher levels of baseline factors associated with poor prognosis and a reduced response to adalimumab treatment or placebo, compared with rule-negative patients. In MOBILITY, the predictive power of the rule was greater for the placebo-adjusted than for the non-adjusted response. The absolute disease state such as CDAI or DAS28-CRP remission/LDA had a relatively low prevalence in these data, and placebo adjusting increases that prevalence, thus improving the performance of the rule. In addition, in settings with active instead of placebo control (e.g., MONARCH trial), our data suggest that the choice of treatment can be improved significantly using this methodology. We explored a clinical scenario and based on the results of the MONARCH trial, sarilumab was clearly favored in rule-positive patients, whereas rule-negative patients could be treated with either adalimumab or sarilumab, based on other priorities (e.g., patient preferences, erosion score, cost).

As noted in the Results section, the rule applied less consistently to patients from TARGET, who had poor tolerance for, or an inadequate response to, TNF inhibitors. Patients with RA who have failed treatment with one drug class are generally less likely to respond to subsequent treatments [26], which may account for some of the inconsistent rule applicability observed in our analysis. In addition, the overall percentages of TARGET patients achieving remission or low disease activity endpoints were particularly low, making it difficult to demonstrate differences between rule-positive and rule-negative patients in these disease scores.

The less consistent validation in TARGET data suggests that the generalizability of the rule is limited and that it may not apply to patients who had an inadequate response to TNFi. Also, since all data in the training and validation phases came from randomized, controlled clinical trials, which used stringent enrollment criteria, the rule may not apply to a real-world population of patients with RA in the same way. For example, all patients in these trials had to have elevated CRP, and the selection of this variable by the machine learning algorithm, as well as the exact CRP cutoff value chosen, may have been influenced by the cutoffs required by the trials’ inclusion criteria. In addition, radiographic endpoints were only available in the MOBILITY trial, and in the absence of further validation data we excluded this important assessment from the model training. Including radiographic scores in the rule might further increase the accuracy of patient stratification, albeit at the expense of simplicity. Finally, the number of patients in the training set was relatively small. Using a larger set may have resulted in an even more robust rule.

Conclusions

This study used a machine learning approach to derive a simple rule for identifying patients with RA who have an increased chance of achieving clinical response to sarilumab, based on laboratory parameters that are readily available in routine practice. In addition, such patients had a lower likelihood of response to placebo and adalimumab, which suggests that the rule can be used for treatment optimization. Real-world validation of this rule, and replication in clinical trial datasets of other therapies, is merited.