Introduction

Rheumatoid arthritis (RA) is a complex disease with a broad spectrum of manifestations that requires early intensive therapy to avoid joint destruction and physical disability. To measure the effect of therapy in daily practice and in clinical trials, many variables are recorded, and different composite indices have been proposed to measure the remaining disease activity or the response to treatment. These variables include patient self-reported questionnaires, physician's scores (including different joint scores) and serum markers of systemic inflammation.

Infliximab, in combination with methotrexate (MTX), is a highly effective therapy for a majority of RA patients. After an induction scheme at weeks 0, 2 and 6, the indicated dose of this therapy is 3 mg/kg every 8 weeks, although the ATTRACT trial suggested that a higher dose of 10 mg/kg every 8 weeks or a shorter infusion interval may add benefit [1–3].

The present study is based on an expanded-access program in which patients suffering from active refractory RA were treated with intravenous infusions of infliximab (3 mg/kg + MTX) at weeks 0, 2, 6 and every 8 weeks thereafter. At week 22, patients not optimally responding to treatment could receive a dose increase of 100 mg (1 vial) per infusion from week 30 onwards [4]. The effect of dose escalation for the patients of this cohort has been discussed previously [4]. The decision to increase the dose was based on the treating rheumatologist's clinical judgment and can be considered as a measure of insufficient response to infliximab. It might be questioned which variables can be measured to best evaluate the effect of therapy and remaining disease activity in daily practice (and in clinical trials). The aim of the present analyses was to evaluate whether the decision to increase the dose could be reflected by using single variables or composite indices, alone or together in a model. We also wanted to evaluate whether this decision was mainly based on differences over time or on momentary disease activity.

Methods

Study population

A total of 511 patients suffering from active refractory RA [5] were treated with intravenous infusions of infliximab (3 mg/kg) at weeks 0, 2, 6 and every 8 weeks thereafter in combination with MTX (a minimal dose of 15 mg/week was recommended). Between week 0 and week 22, 37 patients dropped out for the following reasons: 16 patients stopped due to side effects (four infusion reactions, five infections, one malignancy, one pancytopenia, five disease-related complications), 12 patients withdrew consent and 9 patients stopped for protocol violation. Of the remaining 474 patients, 102 (22%), who were not optimally responding to treatment according to the treating rheumatologist's opinion, received a dose increase of 100 mg (1 vial) per infusion from week 30 on. Throughout the first 22 weeks, dosage of MTX, steroids and non-steroidal anti-inflammatory drugs remained unchanged.

Evaluated variables

When designing the model, we took the following single variables into account at weeks 0, 6, 14 and 22: 28 and 66/68 swollen/tender joint counts, erythrocyte sedimentation rate (ESR; mm/h), C-reactive protein (CRP; mg/l), Health Assessment Questionnaire (HAQ; 0–3), physician's global assessment of disease activity (visual analogue scale (VAS); 0–100 mm), patient's global assessment of disease activity (VAS 0–100 mm), patient's assessment of pain (VAS 0–100 mm), patient's assessment of fatigue (VAS 0–100 mm) and all subscales of the SF-36 questionnaire (0–100 points) [6]. DAS28 (Disease Activity Score including a 28-joint count) [7] and other composite scores such as simplified disease activity index (SDAI), clinical disease activity index (CDAI) [8, 9] and the alternative DAS28 scores [10, 11] (Table 1) were calculated after data collection so that the treating rheumatologist was unaware of the exact values of those composite scores. Also, differences over time and the DAS28 response (no, moderate or good) and the ACR (American College of Rheumatology) response (no/20/50) were computed [12, 13].

Table 1 Formulae to calculate the different DAS and SDAI scores
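The composite indices named above can be sketched in a few lines of Python. The coefficients below are the commonly published formulae for these indices and are an assumption here, since the contents of Table 1 are not reproduced in this text; the function and parameter names are illustrative only.

```python
import math

# Assumption: commonly published coefficients, not copied from Table 1.
# tjc28 / sjc28: 28 tender / swollen joint counts
# esr in mm/h, crp in mg/l, gh: patient global VAS in mm (0-100)

def _joint_terms(tjc28, sjc28):
    # Square-root transformed joint counts with their usual weights.
    return 0.56 * math.sqrt(tjc28) + 0.28 * math.sqrt(sjc28)

def das28(tjc28, sjc28, esr, gh):
    """DAS28 with four variables (ESR-based)."""
    return _joint_terms(tjc28, sjc28) + 0.70 * math.log(esr) + 0.014 * gh

def das28_3(tjc28, sjc28, esr):
    """DAS28 with three variables (no patient global VAS)."""
    return (_joint_terms(tjc28, sjc28) + 0.70 * math.log(esr)) * 1.08 + 0.16

def das28_crp(tjc28, sjc28, crp, gh):
    """DAS28 using CRP instead of ESR (four variables)."""
    return _joint_terms(tjc28, sjc28) + 0.36 * math.log(crp + 1) + 0.014 * gh + 0.96

def das28_crp_3(tjc28, sjc28, crp):
    """DAS28 using CRP, three variables."""
    return (_joint_terms(tjc28, sjc28) + 0.36 * math.log(crp + 1)) * 1.10 + 1.15

def sdai(tjc28, sjc28, pt_global_cm, phys_global_cm, crp_mg_dl):
    """SDAI: plain sum; VAS scores in cm (0-10), CRP in mg/dl."""
    return tjc28 + sjc28 + pt_global_cm + phys_global_cm + crp_mg_dl

def cdai(tjc28, sjc28, pt_global_cm, phys_global_cm):
    """CDAI: the SDAI without the CRP term."""
    return tjc28 + sjc28 + pt_global_cm + phys_global_cm

# Made-up patient: 4 tender and 4 swollen joints, ESR 20 mm/h, VAS 50 mm.
print(round(das28(4, 4, 20, 50), 2))
```

Note the unit differences between the indices: the DAS28 variants take the VAS in mm and CRP in mg/l, whereas the SDAI and CDAI take the VAS in cm and CRP in mg/dl.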

Statistics

We opted to use only statistical methods that are available in a classical statistical package (SPSS 12.0; SPSS, Inc, Chicago, IL, USA) or could be computed manually. When needed, the continuous variables were normalized (by taking the square root of the joint counts and the natural logarithm of CRP and ESR). Robustness of the discriminant analyses and logistic regressions was confirmed by the use of a random train and test set. Missing values were handled by pairwise complete case analysis. This means that a case with no missing values for a group of variables is included in the analysis of that group of variables. The case may have missing values for variables used in other analyses. Confidence intervals (95% CI) for sensitivity or specificity were calculated based on the method proposed by Harper [14]. The areas under the curves (AUCs) of receiver operating characteristic (ROC) curves were calculated. A higher AUC indicates that a single variable has better discriminative characteristics. A statistical test to compare AUCs of two variables tested on the same population has been described by Hanley [15]. Continuous and categorical variables were compared by adapting the cut-off of the continuous variables to the same specificity level as the categorical variable so that sensitivities could be evaluated and compared [16]. The selection and comparison of variables by curve analysis was performed since this method gives a valid ranking of variables and does not (in contrast to ranking methods based on p values) depend on the number of subjects available for that specific variable [17]. 
In order to find the true maximal model and to avoid getting stuck at a local maximum, we used different strategies for the construction of the final model: binary logistic regressions and discriminant analyses were performed with the default options of SPSS 12.0, and stepwise construction of models was performed by conditional forward and backward elimination for logistic regression and by Wilks' lambda for discriminant analysis, using the strategy described by Hosmer and Lemeshow [18].
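The normalization and the AUC calculation described above can be illustrated with a minimal sketch. This is not the SPSS procedure used in the study; it assumes the standard rank-based definition of the AUC (the probability that a randomly chosen patient with a dose increase scores higher than a randomly chosen patient without one, ties counting half). All names and data values are hypothetical.

```python
import math

def normalize_joint_count(count):
    # Square-root transformation, as described for the joint counts.
    return math.sqrt(count)

def normalize_marker(value):
    # Natural-log transformation, as described for CRP and ESR.
    # Requires value > 0; how zero values were handled is not stated.
    return math.log(value)

def roc_auc(scores_pos, scores_neg):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive case has the higher score; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up scores: patients with (pos) and without (neg) a dose increase.
pos = [5.1, 4.8, 6.0, 3.9]
neg = [3.0, 4.0, 2.5, 4.8, 3.2]
print(roc_auc(pos, neg))  # 0.875
```

Because the square root and the logarithm are monotonic, normalizing a single variable does not change its AUC; the transformation matters for the discriminant analyses and regressions, not for the ranking by ROC curve.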

Ethics

All patients signed informed consent. This study was approved by the local ethics committees.

Results

Ranking of continuous variables

In order to select the most important variables that correlate with the decision to give a dose increase at week 22, we calculated the AUC of ROC curve analysis for all continuous variables and ranked them based on this AUC [17]. Since crossing over of ROC curves may affect the diagnostic properties of a variable without changing the AUC, we also ranked the variables based on sensitivity levels by adapting the cut-off to a given preset specificity level of 95% [16].
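Adapting the cut-off to a preset specificity can be sketched as follows. The sketch assumes that a score above the cut-off counts as a positive test and picks the lowest observed cut-off that reaches the target specificity; the study's exact procedure may differ, and the function names and data are hypothetical.

```python
def specificity(scores_neg, cutoff):
    # Fraction of patients without a dose increase at or below the cut-off.
    return sum(s <= cutoff for s in scores_neg) / len(scores_neg)

def sensitivity(scores_pos, cutoff):
    # Fraction of patients with a dose increase above the cut-off.
    return sum(s > cutoff for s in scores_pos) / len(scores_pos)

def cutoff_for_specificity(scores_neg, target):
    """Lowest observed value whose specificity is at least `target`.
    The maximum of scores_neg always gives specificity 1.0, so a
    cut-off is found for any target <= 1."""
    for c in sorted(set(scores_neg)):
        if specificity(scores_neg, c) >= target:
            return c
    raise ValueError("target specificity must be <= 1")

# Made-up data: 20 patients without and 4 with a dose increase.
neg = [i * 0.25 for i in range(8, 28)]   # 2.0, 2.25, ..., 6.75
pos = [4.0, 5.0, 6.6, 7.0]

cut = cutoff_for_specificity(neg, 0.95)
print(cut, sensitivity(pos, cut))  # 6.5 0.5
```

Ranking variables by the sensitivity obtained this way compares them at the same operating point, which is what makes it robust to ROC curves that cross.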

Both ranking methods showed that the DAS28 score at week 22 had the highest ability to discriminate the physician's decision to give a dose increase. Table 2 displays the 10 most important variables ranked by AUC of ROC curve analysis and by the sensitivity at the 95% specificity level. Using the method described by Hanley [15], we found that there was a significant difference in AUC between the two highest-ranked parameters: DAS28 at week 22 and the 28 tender joint count at week 22 (AUC = 0.840 versus 0.797, p = 0.02). Additionally, most variables were ranked in such a way that each variable was represented first by its measure at week 22 before it was represented by a measure at another week.
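A comparison of two AUCs measured on the same patients can be sketched with the Hanley and McNeil approximations: a standard error for each AUC and a z statistic that accounts for the correlation r between the two curves. In the published method r is derived from the data via a lookup table; here it is simply a parameter, and the value 0.6 below is a made-up assumption, so the sketch is not expected to reproduce the reported p value. The group sizes 102 and 372 follow from the study population, but missing values may have reduced the numbers actually analysed.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley and McNeil)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

def z_for_correlated_aucs(auc1, auc2, se1, se2, r):
    """z statistic for two AUCs from the same cases.
    r is the correlation between the two AUC estimates (assumed here)."""
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)

def two_sided_p(z):
    # Two-sided p value from the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))

se_das28 = hanley_mcneil_se(0.840, 102, 372)
se_tjc28 = hanley_mcneil_se(0.797, 102, 372)
z = z_for_correlated_aucs(0.840, 0.797, se_das28, se_tjc28, r=0.6)  # r is hypothetical
print(se_das28, se_tjc28, z, two_sided_p(z))
```

The higher the correlation between the two variables, the smaller the standard error of the difference, which is why a paired comparison can detect a modest AUC difference that an unpaired test would miss.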

Table 2 Variables with the highest ranking based on ROC curve AUC and sensitivities at 95% specificity

Evaluation of the response scores

To evaluate categorical scores, we adapted the cut-off of the variable with the highest ranking (DAS28 at week 22) to the specificity of the categorical score and compared the sensitivities [16]. For the decision to give a dose increase, ACR response not reaching the ACR20 criterion ('no ACR response') had a sensitivity of 69.6% (95% CI: 65.2–74.0) and a specificity of 64.2% (95% CI: 59.6–68.8). When we adapted the cut-off of the DAS28 at week 22 to a specificity of 64.2% (DAS28 = 4.01), we obtained a sensitivity of 80.0% (95% CI: 75.2–84.7). 'No DAS28 response' had a sensitivity of 46.7% (95% CI: 40.8–52.6) and a specificity of 83.3% (95% CI: 78.9–87.7). When we adapted the cut-off of the DAS28 to a specificity of 83.3% (DAS28 = 4.77), we obtained a sensitivity of 67.5% (95% CI: 61.9–73.1). Similar results were obtained when looking at the ACR50 and the good DAS28 response criterion (Table 3).

Table 3 Sensitivity and specificity of the response scores compared with DAS28 set at equal specificity

Additionally, we fitted a logistic regression model with the decision to give a dose increase as a dependent variable and DAS28 at week 22, DAS28 response and ACR response as categorical covariates. These analyses retained DAS28 at week 22 as the only significant covariate in the model (data not shown).

Effects of change of scores over time on the physician's decision

To evaluate the effect of differences over time, we plotted the means of the most important normalized continuous variables over time (Fig. 1). The plot of the variable with the highest ranking (DAS28) shows that patients who get a dose increase have a (significantly) higher disease activity at baseline and, after an initial decrease of disease activity, regain disease activity from week 6 on. To evaluate this, we calculated differences in DAS28 scores between baseline and week 22 (delta DAS28 0–22), and between week 6 and week 22 (delta DAS28 6–22). Indeed, patients who get a dose increase regain some disease activity between week 6 and week 22 (mean delta DAS28 6–22: -0.4 without versus +0.4 with a dose increase, p < 0.001), which is reflected in a smaller decrease of disease activity between baseline and week 22 (mean delta DAS28 0–22: -2 without versus -1 with a dose increase, p < 0.001). However, the AUC of the ROC curve of delta DAS28 0–22 was 0.725 (95% CI: 0.659–0.790) and the AUC for delta DAS28 6–22 was 0.672 (95% CI: 0.590–0.754), both much lower than the AUC of the momentary DAS28 at week 22 (0.840). Additionally, when we fitted a logistic regression model with the decision to give a dose increase as a dependent variable and DAS28 at week 22, delta DAS28 0–22 and delta DAS28 6–22 as covariates, only DAS28 at week 22 was a significant variable in the model.

Figure 1

Plot of the mean scores over time. Act, activity; ESR, erythrocyte sedimentation rate; HAQ, Health Assessment Questionnaire; SJC, swollen joint count; TJC, tender joint count; pt, patient; Phys, physician; SQRT, variable normalized by taking the square root; ln, variable normalized by taking the natural logarithm; VAS, visual analogue scale.

Similar analyses were performed for the other variables. The AUC of the differences between weeks 0–22, weeks 6–22 and weeks 14–22 of the other variables were all less than 0.700 (data not shown). These analyses indicate that, although the differences in disease activity over time are statistically significant, those differences over time are not important enough to incorporate in a model to discriminate the physician's decision.

Building a model to discriminate the physician's decision to give a dose increase

The first three analyses (ranking of continuous variables, evaluation of the response scores and effects of change of scores over time on the physician's decision) allowed us to narrow the selection of variables for the model by eliminating variables that are already incorporated into the DAS28 (or are highly related to them, such as CRP and the 68 tender and 66 swollen joint counts) and taking into account only those variables at week 22. This resulted in the following list: DAS28, HAQ, physician global VAS, patient pain VAS, patient fatigue VAS and the scores of the SF-36 questionnaire at week 22. We screened those variables using forward and backward elimination in a logistic regression model and by the stepwise Wilks' lambda method. The probability scores of the logistic regression and the discriminant scores thus obtained were compared using ROC curve analysis. The model with the highest AUC was a model from discriminant analysis with the following variables (and standardized canonical discriminant function coefficients): DAS28 week 22 (0.863), physician global VAS (0.796), patient pain VAS (0.735), and physical functioning (-0.227). The discriminant score of this model had an AUC of 0.870 (95% CI: 0.828–0.912) with a sensitivity at the 95% specificity level of 45.5% (95% CI: 38.7–50.3).

Evaluation of the discriminant score of the variables of DAS28

To validate the score and coefficients of the DAS28, we calculated a discriminant function using the (normalized) variables of the DAS28 score: 28 tender and swollen joint count, ESR and patient global VAS. After rescaling, we obtained the following discriminant coefficients: 0.52 for 28 tender joint count (28TJC), 0.28 for 28 swollen joint count (28SJC), 0.56 for ESR and 0.025 for patient disease activity. This discriminant score had an AUC of 0.844 (95% CI: 0.797–0.891) and a sensitivity at the 95% specificity level of 43.8% (95% CI: 38.1–49.2), which is comparable to the values for the DAS28 at week 22. The Pearson's correlation coefficient between this discriminant score and the DAS28 was 0.986 (Fig. 2). We also performed logistic regression with similar results (data not shown).

Figure 2

Validation of the DAS28 score and coefficients (see text). ESR, erythrocyte sedimentation rate; VAS, visual analogue scale.

Comparison with the other DAS scores and SDAI/CDAI

Since different alternative methods are available to calculate the DAS scores (Table 1), we additionally evaluated the properties of those alternative scores. We also evaluated the SDAI and CDAI [8, 9] after normalization by taking the square root. The Pearson's correlation coefficient of those alternative scores with the DAS28 at week 22 was 0.982 for the DAS28-3, 0.952 for the DAS28-CRP, 0.928 for the DAS28-CRP-3, 0.914 for the SDAI and 0.893 for the CDAI. The AUC and sensitivity at the 95% specificity level are shown in Table 1 and indicate that all those alternative scores perform similarly to or slightly worse than the original DAS28.

Detailed ROC curve analysis of the DAS28

We plotted the ROC curve of the DAS28 in Fig. 3 and listed sensitivities and specificities in Table 4. Predictive values and the accuracies of classification as a function of the different DAS28 cut-offs are also shown in Table 4. Below a cut-off of 3.2, we found a high predictive value for continuing the current dose as a measure of good response. The maximal accuracy of 84% was found at a cut-off of 5.5.

Figure 3

ROC curve analysis of the DAS28 at week 22 (plotting the 1-specificity versus the sensitivity). Also the accuracy, PPV and NPV are plotted. PPV, positive predictive value (predictive value to give a dose increase as a measure of insufficient response); NPV, negative predictive value (predictive value to continue on the current dose as a measure of good response).

Table 4 Performance at different cut-offs of DAS28 at week 22 for dose increase
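The quantities listed in Table 4 can all be derived from the four cells of a two-by-two table at a given cut-off. The following minimal sketch assumes that a DAS28 above the cut-off counts as a prediction of a dose increase; the function name and the data are hypothetical and do not come from the study.

```python
def metrics_at_cutoff(scores_increase, scores_continue, cutoff):
    """Sensitivity, specificity, PPV, NPV and accuracy at one cut-off.
    scores_increase: scores of patients who received a dose increase.
    scores_continue: scores of patients who continued the current dose."""
    tp = sum(s > cutoff for s in scores_increase)   # correctly flagged
    fn = len(scores_increase) - tp                  # missed dose increases
    fp = sum(s > cutoff for s in scores_continue)   # wrongly flagged
    tn = len(scores_continue) - fp                  # correctly not flagged

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Made-up DAS28 values at week 22.
increase = [5.8, 6.1, 4.9, 5.6]
cont = [2.9, 3.4, 5.7, 4.2, 3.0, 2.6]
for cutoff in (3.2, 5.5):
    print(cutoff, metrics_at_cutoff(increase, cont, cutoff))
```

Scanning a range of cut-offs in this way shows the trade-off described in the text: a low cut-off favours the negative predictive value, whereas the cut-off with the highest accuracy lies higher.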

Discussion

The aim of the present analyses was to evaluate which single or composite variables, combined in a model, could discriminate the treating rheumatologist's decision to give a dose increase of infliximab to RA patients not optimally responding to an indicated dose of 3 mg/kg infliximab every 8 weeks. Since different variables at different time points were available, we started by ranking the continuous variables based on the AUC of ROC curves and sensitivities at the 95% specificity level. This strategy has previously been proposed for microarray data [17]. The calculation of sensitivities at the 95% specificity level is important in order not to overlook variables with a relatively small AUC but a high specificity [16]. Both methods ranked the DAS28 at week 22 as the variable that best discriminates the decision to give a dose increase. In a second and third analysis, we looked at whether response scores and differences in disease activity over time could give additional information to discriminate the rheumatologist's decision. Those analyses indicated that variables that include differences over time seem to be less important than the momentary remaining disease activity at week 22 for discriminating the rheumatologist's decision.

After the prior selection of variables, based on the findings of the previous steps, we built the final model to discriminate the rheumatologist's decision, which was only slightly better than the DAS28. We think that the small gain in discriminative properties in comparison with the DAS28 is not enough to justify the increased complexity of this model. Moreover, in contrast to the DAS28, this model included the physician's global assessment of disease activity (VAS), which is investigator-dependent and has the drawback that it cannot be scored by a study nurse. All four analyses together indicated that the DAS28 is an important variable for evaluating insufficient response to infliximab therapy (especially in daily practice) and that this variable can only slightly be improved by adding supplemental variables.

DAS was developed in the early 1990s [19, 20] and was later transformed into the DAS28 [7], in an era when therapy with biologicals was not yet available. In those initial studies, patients were scored by the same two independent nurses and the decision to change disease-modifying antirheumatic drug (DMARD) therapy during a follow-up period of up to 3 years was considered as a measure of insufficient response [20]. The present study is a multi-center study where patients were scored by the treating physician and the decision to give a dose increase of infliximab could happen only at one time point. This difference in study design and therapy may explain why in the present study the AUC of DAS28 is smaller than in other studies (AUC = 0.840 versus 0.933) [21]. It is remarkable that, despite those differences in study design, we could calculate a discriminant function (in the fifth analysis) that correlated so well with the DAS28 by using the 28SJC, 28TJC, ESR and patient disease activity VAS as independent variables and the physician's decision as a grouping variable. Not only the discriminant scores, but also the coefficients of this discriminant function were quite similar to the coefficients of the DAS28, indicating the robustness of the scores and coefficients of the DAS28 score.

In another, final analysis, we evaluated the alternative DAS scores and the square-root-transformed SDAI and CDAI. All those alternative scores have a slightly worse AUC than the original DAS28, but seem good enough to be useful when some variables are not available. We think the use of the DAS28 is feasible and time-effective using a preprogrammed calculator, spreadsheet or web-based calculator [11]. The unique characteristics of the DAS score make it a useful measure in many applications. DAS28 as a continuous variable is a sensitive tool for measuring response to treatment in randomized controlled trials and facilitates the use of more complex statistical methods that can handle repeated measures over different time points [22–24].

Other studies demonstrated that a low DAS is an important prognostic factor of persistent remission and that DAS correlates with radiological progression [25, 26]. DAS may also be a useful parameter in daily clinical practice as a treatment goal and to evaluate the actual disease activity (which cannot be assessed by the categorical response scores) [27–31]. Our finding that the physician's decision to give a dose increase can best be modeled by a combination of measurements of remaining/momentary disease activity, represented by the DAS28, does not reduce the value of response scores such as the ACR response or DAS response scores. Indeed, those scores are important for measuring differences over time as a measure of global treatment effects in clinical trials [12, 13] but, as demonstrated by the present study, are not useful for evaluating the momentary disease activity in a single patient, which is important in daily practice. The continuous properties of the DAS28 score provide the additional opportunity for a cut-off, which can be chosen as a function of the purpose. Interestingly, we found a high predictive value for continuing the current dose as a measure of good response below a cut-off of 3.2. It is noteworthy that a DAS score of 3.2 is an important threshold for a good DAS response according to the EULAR criteria [12]. In contrast, for classification purposes, a higher cut-off (5.5) is more appropriate since this level displayed the highest accuracy. One should be aware that the displayed predictive values and accuracies may be highly influenced by the prevalence of insufficient response, reflected by the need for a dose increase, which was 21.5% in the present study. A lower a priori chance of the need for a dose increase may increase the accuracy of DAS (given the fixed cut-off of 5.5) and vice versa. Indeed, at a cut-off with a high specificity, the accuracy will increase when the a priori chance decreases (applying formula c given in the legend to Table 4).
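The dependence of accuracy and predictive values on the a priori chance can be made explicit with the standard identities that express them in terms of sensitivity, specificity and prevalence. Whether these match formula c in the legend to Table 4 exactly is an assumption, since that legend is not reproduced here. The sensitivity and specificity used below are made-up illustrative values, not results from Table 4; only the prevalence of 21.5% comes from the study.

```python
def accuracy(sens, spec, prev):
    # Weighted average of sensitivity and specificity.
    return sens * prev + spec * (1.0 - prev)

def ppv(sens, spec, prev):
    # Bayes' rule; undefined if no test result can be positive.
    true_pos = sens * prev
    false_pos = (1.0 - spec) * (1.0 - prev)
    return true_pos / (true_pos + false_pos)

def npv(sens, spec, prev):
    # Bayes' rule; undefined if no test result can be negative.
    true_neg = spec * (1.0 - prev)
    false_neg = (1.0 - sens) * prev
    return true_neg / (true_neg + false_neg)

# Hypothetical operating point with high specificity.
sens, spec = 0.40, 0.95
for prev in (0.10, 0.215, 0.40):
    print(prev,
          round(accuracy(sens, spec, prev), 3),
          round(ppv(sens, spec, prev), 3),
          round(npv(sens, spec, prev), 3))
```

Because accuracy is a prevalence-weighted average, it moves toward the specificity as the prevalence falls; at a cut-off where specificity exceeds sensitivity, a lower a priori chance therefore raises the accuracy while lowering the positive predictive value.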

Conclusion

The results of the present analyses indicate that the momentary DAS28 as a continuous composite index correlates best with the decision to give a dose increase of infliximab, which is a measure of insufficient response. The discriminative characteristics of the DAS could be slightly improved by the use of supplemental variables, although this results in the disadvantage of a more complex model and calculations. This study also demonstrates the robustness of the scores and coefficients of the DAS28 in a cohort of RA patients under infliximab therapy and therefore validates the DAS28 as a measure of disease activity in patients under treatment with biologicals.