Background

Many rheumatoid arthritis (RA) patients who are treated with biological disease-modifying anti-rheumatic drugs (bDMARDs) achieve long periods of low disease activity or remission [1]. However, bDMARDs may also lead to adverse events, call for self-injections or hospital visits, and are expensive [2,3,4]. Thus, tapering bDMARDs to the lowest effective dose is of great clinical interest and may support the sustainability of the healthcare system as a whole.

The guidelines of the European League against Rheumatism (EULAR) on the management of RA advise to consider tapering in patients that are in persistent remission [5]. In addition, numerous clinical trials and reviews provide supportive evidence to also consider tapering in patients with stable low disease activity (LDA) [6, 7]. This is in line with routine clinical practice, as maintaining a satisfactory low level of disease activity with a reduced medication dose is also of value.

The most successful and cost-effective strategy for tapering appears to be “disease activity-guided dose optimization” (DGDO) [8,9,10]. This means the dose is gradually tapered (usually by increasing the administration interval), until either disease activity flares or the bDMARD is discontinued. Two randomized trials have demonstrated that, using this strategy, 63–80% of patients can taper or even stop their bDMARD [8, 9]. No important difference was observed in the proportion of patients with LDA or remission after 18 months between DGDO and usual care.

However, since DGDO is a “trial and error” approach, flares occur frequently during the tapering process. In the case of a flare, the previously effective dose needs to be reinstated or additional therapy is necessary. Although these short-lived flares do not seem to relevantly affect radiographic progression or long-term disease activity, there is conflicting evidence regarding functional outcome and impact on quality of life [9, 11]. Therefore, it would be beneficial to predict whether, and to which extent, a bDMARD can be tapered in a particular patient without a flare occurring.

Several predictors for successful dose reduction or discontinuation of bDMARDs have been explored [12, 13]. However, these studies only included “baseline predictors” from before the start of the tapering process, and the strength of the evidence for these predictors is limited. Furthermore, “successful tapering” is often defined as reaching a lower bDMARD dose at some time point after the start of tapering, regardless of whether a flare occurred during the tapering process.

Therefore, this study aims to predict the likelihood of a flare occurring during bDMARD tapering at each consecutive dose reduction step. Such a dynamic prediction may be used to optimize the DGDO strategy for bDMARDs for an individual patient, as the decision for a further tapering step can be based on the predicted risk of a flare. This could minimize the number of flares during tapering, while retaining most of the bDMARD dose reduction. To facilitate future implementation of this approach in routine practice, we decided to exclusively use information easily obtainable in regular care.

Methods

Data extraction and preparation

EHR data for model development

For the development of the prediction model, electronic health record (EHR) data of two rheumatology clinics in the Netherlands were extracted for the period 2012–2019 and 2013–2019 respectively: the University Medical Center Utrecht (UMCU; an academic hospital) and Reumazorg Zuid West Nederland (RZWN; a non-academic treatment center for rheumatic diseases). In both centers, bDMARD tapering is regularly performed, but not yet standard practice. Data were extracted for all RA patients (based on ICD-10 codes) starting a bDMARD and reaching a Disease Activity Score assessing 28 joints (DAS28) < 3.2, i.e., LDA, after at least 24 weeks of treatment. The following bDMARDs were included: infliximab, adalimumab, etanercept, golimumab, certolizumab, tocilizumab, sarilumab, and abatacept. We selected patients with at least the following information available: bDMARD type and dose, seropositivity, disease duration, and ≥ 2 DAS28 measurements per year available. In addition, we aimed to extract the following data: age, gender, body mass index, concurrent and previous DMARD and glucocorticoid use, smoking status, and erosive disease.

To handle missing individual DAS28 components, we used all validated DAS28 formulae by calculating the mean of the 3- and 4-variable DAS28 formulae using ESR (erythrocyte sedimentation rate) as well as CRP (C-reactive protein) [14]. We allowed a 4-week time window between components. Flares were defined using a validated criterion: an increase in DAS28 > 1.2 compared to the previous visit, or an increase of 0.6 with a resulting DAS28 > 3.2 [15]. In addition, an “increase in bDMARD dose” was also considered a flare, to also capture flares if insufficient information was present to calculate the DAS28.

All data was extracted according to current ethical and privacy regulations in the specific hospitals. The Medical Research Ethics Committee Utrecht waived the need for informed consent, as the development data was already collected in routine care and was pseudonymized before analysis.

DRESS data for external validation

For external validation, we extracted data from the DRESS trial [9]. In this trial, RA patients with stable LDA or remission using adalimumab or etanercept were randomized to either DGDO (n = 121) or routine care (n = 59) and followed for 18 months. The study was performed between 2011 and 2014 in two Dutch clinics (Sint Maartenskliniek Nijmegen and Woerden). The DGDO group tapered the bDMARD in three steps by increasing the administration interval every 3 months, followed by discontinuation after 6 months as long as the patient did not flare. In case of flare, the last effective dose was reinstated, and no further dose reduction attempts were undertaken. If nevertheless flares persisted, the bDMARD dose was increased to the full dose and thereafter treatment was at the rheumatologists’ discretion. In DRESS, flares were defined by the DAS28-CRP increase from baseline values.

The DRESS study (Dose REduction Strategy of Subcutaneous TNF inhibitors) was approved by the local ethics committee (Committee on Research Involving Human Subjects region Arnhem-Nijmegen), and informed consent was signed by all included patients [9].

Model development

We developed a dynamic model to repeatedly predict the risk of a flare occurring in the next 3 months. This corresponds to a routine outpatient visit interval [13]. The model was developed using joint latent class mixed modeling, which combines a linear mixed effects- and a time-to-event model (R-package lcmm). Details of joint latent class models have been described elsewhere [16, 17].

First, in the linear mixed effects part of the model, the course (“trajectories”) of the DAS28 values over time are modeled for each patient. This is done by categorizing these trajectories into a number of subgroups: latent classes. The general form of these trajectories is defined using polynomials for the time variable. We explored models with a random intercept using 1–3 latent classes and 1st to 3rd order polynomials for the time variable, using a random slope for time. The best fitting model was selecting based on the lowest Bayesian Information Criterion (BIC) [18]. Based on the final model, each individual patient has its own predicted DAS28 trajectory.

Next, these DAS28-trajectories are used as variables in the time-to-event part of the model. The time-to-event part of the model also incorporates other variables. We explored all variables as mentioned above in “EHR data for model development” and selected those that had sufficient data to be extracted from the EHR. The time-to-event model was developed stepwise starting with a full model, excluding variables one by one to arrive at a final model. The decision to exclude a variable was based on clinical rationale, data availability, and improvement in model fit in cross-validation, defined by the decrease in the BIC. In short, to make individual predictions, an estimation is made about the individuals trajectory of the DAS28 over time. This trajectory is then combined with additional variables to calculate the probability of a flare occurring in the next 3 months.

We adhered to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline [19].

Model validation

We assessed the accuracy of 3-monthly flare predictions in the development data with 5-fold cross-validation, using all visits at which a DAS28 was available. The area under the curve of the receiver operating characteristic (AUC-ROC) was calculated over all time-points. Other performance indicators were assessed based on an optimal cutoff as defined by Youden’s Index in the development data [15]. This index is a summary measure for sensitivity and specificity.

External validation was performed by assessing the accuracy of flare predictions in data from the DRESS trial [9]. The AUC-ROC and other performance indicators were calculated using the optimal cutoff points as determined in the development data and in DRESS data, both defined by Youden’s Index [18].

Simulation of prediction-aided treatment

To evaluate the clinical utility of the flare predictions, we assessed the model’s potential impact on the number of flares and on the bDMARD dose used over 18 months. We simulated a new tapering strategy where the model’s predictions were used as a decision aid in the DGDO arm of the DRESS trial. At every 3-monthly visit, the predicted risk of a flare was taken into account when deciding to continue or to stop tapering. The predicted risk of flare was categorized into a high predicted risk (above or equal to the optimal cutoff point), or a low predicted risk (below the cutoff point). The simulation was based on the following assumptions:

  1. 1.

    If a flare occurred in the DRESS trial before the model predicted a high risk of flare, this flare also occurs in the simulation. The bDMARD dose is the same as in the trial. Thus, there is no impact of the predictions is observed in this case.

  2. 2.

    If the model predicted a high risk of flare in simulation and no flare had occurred in the DRESS trial thus far, the bDMARD is not tapered further (kept at a constant dose). No flares occur in simulation during the remaining follow-up, except for the scenario described in 4.

  3. 3.

    If a patient had completely discontinued the bDMARD in the DRESS trial when the model predicted a high predicted risk of flare, the bDMARD dose in simulation is increased to and kept at 50% of the full registered dose. This corresponds to the last tapering step. No flares occur during the remaining follow-up, except for the scenario described in 4.

  4. 4.

    If in the DRESS trial a flare occurred after the model predicted a high risk of flare and the bDMARD dose in DRESS was equal to or higher than the bDMARD dose in simulation, that flare also occurs in the simulation. The bDMARD dose is equal to the DRESS trial during the remaining follow-up.

The number of flares occurring, the proportion of patients experiencing at least one flare, and the proportion of the full registered dose were calculated. These were then compared between the simulation and the DGDO arm of the DRESS trial over 18 months. Confidence intervals (CI) were calculated using 1000-fold bootstrapping. As there is no obvious optimum in the trade-off between the reduction in the number of flares and the increase in bDMARD dose, we evaluated the clinical impact of prediction-aided treatment for several cutoffs around the optimal cutoffs as defined by Youden’s Index [18].

Results

Patient characteristics

Of the total number of 5226 RA patients in the EHR data, there were 757 bDMARD courses in which LDA was recorded after at least 24 weeks of usage (Fig. 1). In 279 bDMARD courses of 255 patients, sufficient data was available for model development (see the “Methods” section). Data for smoking, erosive disease, concurrent and previous DMARDs, and glucocorticoids were of insufficient quality (> 50% missing data) and/or could not be (easily) extracted from the EHR. The median follow-up time of the included bDMARD courses was 21 months, and the mean bDMARD dose was 76.7% of the full dose. Table 1 displays general patient characteristics of the development data and the data from the DRESS trial used for external validation. Significant differences between the populations were observed for age, DAS28 at baseline, the number of DAS28-measurements, flare rate, and bDMARD dose, among others.

Fig. 1
figure 1

Selection of bDMARD courses from EHR data for model development. a Low disease activity was defined as a DAS28 (ESR or CRP) ≤ 3.2. b Based on the availability of at least two DAS28 measurements per year, bDMARD type and dose, disease duration, and seropositivity. bDMARD, biological disease-modifying antirheumatic drug; EHR, electronic health record; RA, rheumatoid arthritis

Table 1 Patient characteristics in data for model development and external validation

Model development

The variables that were retained in the final prediction model and the corresponding hazard ratios are displayed in Table 2. The final model identified two latent DAS28-trajectories, defined by a linear and a quadratic time coefficient. Figure 2 shows the mean of these two trajectories (left), together with their respective time to flare (right). The course of disease activity in the class 2 DAS28-trajectory shows an increase in disease activity over time and a shorter time to flare, compared to the class 1 DAS28-trajectory. Variables that significantly increased the likelihood of a flare were seropositivity, bDMARD dose < 50% and an increase in tender joint count at baseline (compared to previous visit).

Table 2 Variables of the final flare prediction model
Fig. 2
figure 2

Mean DAS28-trajectories of identified latent classes and their relation to the occurrence of a flare. A The mean course of the disease activity score (DAS28) over time in patients assigned to one of the two “latent trajectory classes.” In class 1 (n = 182), a stable low disease activity is observed, whereas patients in class 2 (n = 97) display an increasing disease activity over time. B The probability of remaining free from flares over time for patients assigned to one the “latent trajectory classes” for disease activity, as displayed on the left. Patients in class 2 display a shorter time to flare as compared to patients in class 1

As the DAS28-trajectories represent the course of disease activity over time, this is a time-dependent variable. By default, only one continuous time-dependent variable can be included in a joint latent class model. In order to also add bDMARD dose as a second time-dependent variable, it was dichotomized in < or ≥ 50% of the full registered dose.

Model validation

Predictive performance in cross-validation and external validation is summarized in Table 3. In cross-validation, the model achieved an AUC-ROC of 0.76 (CI 0.69–0.83). The optimal cutoff in the development data was at a predicted probability of flare of 14.3% within the next 3 months. For external validation, sixteen patients were excluded from DRESS because of missing predictor information. Furthermore, as the variable “increase in SJC/TJC” was not available in the DRESS data, these were set to 0 (i.e. no increase) for the validation, with the rationale that patients that met the DRESS inclusion criteria had a stable low level of disease activity at baseline. Supplementary Figure S1 shows the AUC-ROC of the model in external validation (0.68 (CI 0.62–0.73), see Supplementary Figure S2 for the calibration plot). The optimal cutoff point in DRESS data was found to be at a predicted chance of flare of 31.5% within the next 3 months.

Table 3 Predictive performance in cross validation and external validation

Because the model cannot truly function as a “joint” model at baseline, since no longitudinal information is yet available, we also explored the performance when removing baseline predictions. This indeed improved the AUC in external validation to 0.71 (CI 0.64-0.77, Supplementary Figure S3 and Supplementary Table S1).

Simulation of prediction-aided treatment

We assessed the potential clinical impact of the model on the number of flares and the amount of bDMARD dose reduction, when used as a decision aid within a DGDO strategy. The clinical impact of prediction-aided treatment in simulation was evaluated for cutoffs from 15–45% in steps of 10% (Supplementary Table S2), and results were discussed to determine the optimal cutoff for clinical practice. A risk cutoff of 35% was deemed optimal, as this significantly reduced the number of flares per patient over 18 months from 1.21 (0.99–1.43) to 0.75 (0.54–0.96), while retaining most of the bDMARD dose reduction (64% vs 54% of full registered dose used). See Table 4. When using this optimal cutoff of 35%, only 1.0 flare occurred for each full dose that was tapered in the simulation of prediction-aided treatment, versus 2.0 flares in the DRESS DGDO arm. Furthermore, in the DRESS routine care arm, each prevented flare (compared with DRESS DGDO) came at a cost of 51% of a full bDMARD dose over 18 months, while this was only 22% in the simulated prediction-aided group. As the AUC-ROC improved when the predictions at baseline were not taken into account, we explored the simulation of prediction-aided treatment when removing the baseline predictions. However, the simulation results were hardly influenced by this (Supplementary Table S3).

Table 4 Flares and bDMARD dose in simulation of prediction-aided treatment

Discussion

The goal of this study was to develop and validate a flare prediction model to reduce the number of flares during bDMARD tapering, exclusively using data that can easily be obtained in routine care. Our simulation results show that the addition of our flare prediction model to a DGDO tapering strategy is both superior to routine care and to DGDO alone, when considering the ratio between the number of flares and amount of bDMARD dose reduction. To our knowledge, this is the first study not only developing a dynamic flare prediction model, but also performing an external validation and subsequent simulation of clinical impact in the context of bDMARD tapering.

As tapering bDMARDs is of great clinical interest, other studies have also investigated predictors in the context of tapering. Several studies and systematic reviews have investigated the predictive value of biomarkers, serum drug levels, or PET-scans during bDMARD tapering [12, 20,21,22]. However, none of these studies showed a clear predictive value of these markers. In addition, the study by Verhoef et al. showed that for a biomarker to be cost-effective during bDMARD tapering, it must be inexpensive and have high sensitivity and specificity [23]. If future studies do show a predictive value of (bio)markers during tapering, these can be included in the prediction model. The added predictive value of such markers and their cost-effectiveness should then be assessed. An important advantage of the current model is that it only includes variables that are routinely collected in RA clinical practice, thereby enhancing feasibility and cost-effectiveness.

A recent review [13] focused on predictors for successful discontinuation, rather than tapering, of bDMARDs. Similar to the current study, they found seropositivity, LDA, disease duration, and CRP/ESR to be possible predictors of value. In addition, they mention physical functioning and ultrasound measures as possible predictors. However, the studies included in this review were often small and too heterogeneous to compare in meta-analysis. Furthermore, only fixed baseline variables were included, rather than performing dynamic predictions using information over time.

Two studies have incorporated such dynamic variables to predict RA disease activity over time [24, 25]. The study by Norgeot et al. [24] found the Clinical Disease Activity Index (CDAI), CRP/ESR, glucocorticoid use, and other DMARD use to be important predictors. However, this study is not performed in the specific context of tapering bDMARDs. The model developed by Vodenčarević et al. [25] does focus specifically on bDMARD tapering. However, this model is developed and validated on the clinical trial data of 41 patients only and may therefore be difficult to extrapolate to routine care. Both of these dynamic prediction models were developed using machine learning techniques. We have previously also explored the potential of a machine learning model similar to Vodenčarević et al. [26]. However, we chose to pursue the joint latent class model as the performance was similar, and the joint latent class model is more transparent regarding the DAS28-trajectories used and the effects of covariates in the model (i.e., providing hazard ratios).

A major unique strength of this study is that the model’s performance is assessed in external validation. There were several significant differences between the patient populations from routine care used for developing the model and the DRESS pragmatic trial data for external validation regarding baseline characteristics, disease activity, and bDMARD treatment. However, despite these differences the model retained an adequate performance in the external validation, indicating that these differences do not invalidate the model. Another strength is that the clinical impact is evaluated in simulation. In this simulation, successful tapering was not only defined by reaching a lower bDMARD dose, but also by the number of flares during tapering. Furthermore, our model was developed using easily obtainable parameters from routine care EHR data, rather than, e.g., clinical trial data or specific biomarkers [27].

The AUC in cross-validation and external validation (0.76 and 0.68, respectively) may be interpreted as only a moderate performance. However, the AUC may not be the most suitable measure to assess the model’s clinical utility. The added value in clinical practice is determined by the effects of prediction-aided treatment on the rate of flares and the amount of bDMARD dose reduction, when compared to the available alternatives. The currently existing alternatives are either continuing the bDMARD at full dose or tapering until a flare occurs in a trial-and-error approach. Our simulation results show that prediction-aided treatment is superior to both these alternatives regarding the ratio between the number of flares and the amount of bDMARD dose reduction. Therefore, prediction-aided treatment may present the best available bDMARD tapering strategy. This is currently being investigated in the PATIO randomized controlled clinical trial (Dutch Trial Register number NL9798).

Interestingly, the AUC of the prediction model improved in external validation from 0.68 to 0.71 when baseline predictions were removed. This is likely because the model can only function as a “joint” model when longitudinal information is available. This effect on AUC was also observed in the development data, but due to the relative overrepresentation of baseline visits in the DRESS data compared to the development data, this was less pronounced. As the removal of baseline predictions had almost no effect on the simulation of clinical impact, we chose to retain these predictions. Including disease activity measures prior to the start of tapering could potentially improve the performance of our model, as this would ensure that longitudinal information is available at baseline.

A challenge in this study was the limited data quality regarding the frequency of DAS28 measurements in the development data. This might also have contributed to the different flare rates and resulting discrepancy between the optimal cutoff points in the development data and external validation data from the DRESS trial. When implementing a prediction-aided bDMARD tapering strategy in clinical practice or clinical studies, a treat-to-target (T2T) strategy with regular (e.g., 3 monthly) DAS28 measurements should be used, in line with EULAR recommendations [5]. As the DAS28 measurement frequency in the DRESS trial best reflects these recommendations, the optimal cutoff point found in simulation (i.e. 35%) is likely the most suitable for implementation of the model in clinical practice.

Besides the DAS28 measurements, several other parameters were also difficult to extract as structured data from the EHR, such as smoking, concurrent csDMARDs, and erosiveness of disease. We explored imputation to increase the amount of these data points, but this did not improve the model’s performance in cross-validation. Improved registration of these parameters and the optimization of free text mining techniques could allow for future inclusion of these parameters in model development and possibly a better performance. Importantly, the results from external validation are not biased by missing data, since the DRESS data had a standard measurement frequency and very few data missing on disease activity. Therefore, we think our simulation should be an accurate representation of the potential clinical impact of using the models predictions as an decision aid added to a DGDO strategy.

Since prediction-aided treatment could reduce the number of flares during bDMARD tapering, patients and physicians may be more willing to start tapering with such a prediction model than without [28]. Furthermore, our prediction model can be used as an add-on to DGDO, retains most of the bDMARD reduction as attained by DGDO, and is a low cost intervention. Therefore, the model might prove to be an even more cost-effective strategy than DGDO alone [10]. The clinical implementation may be relatively straightforward, as it uses only predictors usually available in the EHR.

Conclusions

In conclusion, we developed and validated a dynamic prediction model to predict the risk of a flare occurring within 3 months during a bDMARD tapering strategy. In simulation, we showed that a prediction-aided treatment strategy has the potential to significantly reduce the number of flares, while maintaining most of the bDMARD dose reduction. As this simulation is inevitably based on certain assumptions, we are currently investigating the clinical impact of prediction aided treatment in the PATIO randomized controlled trial. The current study and the PATIO-trial provide the next step towards the successful implementation of personalized medicine using clinical decision support systems.