Background

Systemic sclerosis–associated interstitial lung disease (SSc-ILD) is common, is associated with a poor prognosis, and is the leading cause of death in people with SSc [1]. The pathogenesis of SSc-ILD involves a complex interplay of vascular injury, inflammation, and fibrosis (reviewed in [2–6]). The most common pathological finding in lung biopsies of patients with SSc-ILD (approximately 78 % of patients) is nonspecific interstitial pneumonia [7]. Usual interstitial pneumonia, the pathological finding in idiopathic pulmonary fibrosis (IPF), as well as other patterns are present in approximately 10–15 % of patients with SSc-ILD [7]. However, open lung biopsy is usually not performed for SSc-ILD, and high-resolution computed tomography (HRCT) has become the gold standard for diagnosis and classification of ILD [8, 9]. In addition to diagnosis of ILD, moderate to severe fibrosis or total lung involvement (TLI) by SSc-ILD visualized on the baseline HRCT scan is an independent predictor of response to cyclophosphamide (CYC) therapy [10], poor survival [11], and future decline in percentage of predicted forced vital capacity (FVC% predicted) [12].

Forced vital capacity (FVC) has traditionally served as the primary endpoint in SSc-ILD clinical trials, including recent ones, because it is readily available, has low measurement error (if done using standardized methodology), and is sensitive to change with treatment. However, treatment with CYC had only modest effects on FVC in the Scleroderma Lung Study I (SLS I) [13, 14] and the Fibrosing Alveolitis in Scleroderma Trial (FAST) [15]. Therefore, there is increased interest in enriching trial cohorts for patients with rapid disease progression, both to identify patients at high risk of ILD progression and to allow early intervention [16].

Different HRCT staging systems have been developed to quantify the extent of lung involvement: semiquantification by visual assessment or quantification using computer-assisted methodology. Kazerooni et al. developed a semiquantitative measure to assess ground-glass opacity (GGO), reticulations with architectural distortion and traction bronchiectasis (“fibrosis”), and honeycomb cysts (HCs) [17], and the overall score correlated well with assessment of ILD in pathological specimens [17]. A modified Kazerooni visual scoring system was used for SLS I, a placebo-controlled trial of oral CYC in patients with symptomatic SSc-ILD in which the extent of reticulations (“fibrosis”), GGO, and HCs was scored semiquantitatively. In addition, a novel algorithm was developed to quantify the presence and extent of both fibrosis and total ILD (sum of scores for fibrosis, GGO, and HCs) using the computer-aided diagnosis (CAD) technology in three area-equivalent zones (upper, middle, and lower) as well as in the whole lung (WL) in SLS I [14, 18, 19]. CAD is based upon measurement of the density or texture features of each pixel and assignment of a score for the amount of abnormal lung tissue present. Quantitative assessment of total extent of interstitial lung disease (QILD) and of lung fibrosis (QLF) correlates well with the visual scoring systems and provides an objective determination of treatment efficacy in patients with SSc-ILD [20] (Additional file 1: Figure S1). Separately, Goh and Wells developed and validated a visual semiquantitative staging system for TLI (i.e., fibrosis, GGO, and HCs) in an observational cohort of patients with SSc-ILD [11].

We used individual patient data from SLS I for the current analysis. SLS I was a multicenter, double-blind, randomized controlled trial (RCT) conducted to evaluate the effectiveness and safety of oral CYC administered for 1 year in patients with symptomatic SSc-ILD who had evidence of ILD. SLS I was the first RCT to demonstrate the effectiveness of CYC on FVC, relative to placebo, at the end of the 1-year treatment period [14]. Although the physiological benefits of CYC compared with placebo were modest (2.53 % and 4.09 % improvements in FVC% predicted and total lung capacity, respectively, at 12 months; p < 0.03), these results were supported by parallel findings of improvement in patient-reported outcomes (health-related quality of life, cough, and dyspnea) [21, 22], as well as greater stability of fibrosis visualized by HRCT ([23–25] and summarized in [16]) and skin thickness scores. In addition, follow-up HRCT scans obtained at 12 months revealed that the change in extent of fibrosis from baseline was significantly worse in the placebo group than in the CYC treatment group [25]. Our objective was to compare, in a post hoc analysis, the performance of different HRCT staging systems in predicting change in FVC and diffusing capacity for carbon monoxide (DLCO) in SLS I over a 1-year period. Specifically, we sought to determine (1) whether an HRCT staging system can enrich for subjects in the placebo group who are most likely to decline and (2) the effects of HRCT staging system on the expected changes in FVC and DLCO in the CYC group, which may inform the design of future trials. We chose a 1-year period on the basis of expert consensus that SSc-ILD trials should last at least 1 year [16].

Methods

Patient population

SLS I consisted of 158 participants randomized to receive either oral CYC or a matching placebo for 1 year, followed by an additional year of observation off treatment, as previously published [14]. Briefly, inclusion criteria included age ≥18 years, duration of disease ≤7 years from onset of the first non-Raynaud’s symptom of SSc, FVC 40–85 % of predicted, DLCO ≥40 % predicted (or 30–39 % predicted in the absence of clinical evidence of pulmonary hypertension), and evidence of any GGO and/or positive bronchoalveolar lavage (≥3 % neutrophils and/or ≥2 % eosinophils). All subjects provided written informed consent, and the study was approved by the medical institutional review board at each clinical center. Please see the Acknowledgments section for the list of centers that participated in the study.

Baseline measurements

Baseline measurements included full pulmonary function tests (PFTs), including spirometry, lung volume (by body plethysmography), and DLCO. PFTs were read centrally for quality assurance. In addition to ascertainment of disease duration and presence of limited or diffuse cutaneous SSc, patient-centered measures (including dyspnea and quality-of-life indices) were obtained. HRCT scans were obtained at baseline with the patient in prone position and at maximal inspiration. The images were acquired from scanners with at least four multidetectors according to a standardized protocol. Nonvolumetric computed tomographic scans of –2-mm slice thickness acquired at 10-mm increments were acquired contiguously. More details are available elsewhere [24].

HRCT staging systems

In SLS I, HRCT scans were scored by two independent radiologists who used a Likert scale (0 = absent, 1 = 1–25 %, 2 = 26–50 %, 3 = 51–75 %, and 4 = 76–100 %) for extent of four categories of parenchymal abnormality (pure GGO, lung fibrosis, HCs, and emphysema) [14]. The scoring was performed for each of the three zones (upper, extending from apex to aortic arch; middle, from aortic arch to inferior pulmonary veins; and lower, from inferior pulmonary veins to base) in each lung, as well as for the WL. The visual maximum fibrosis score (MaxFib) was individually reviewed by two thoracic radiologists who scored fibrosis in the zone of maximal involvement (ZM). Discordant interpretations were reviewed with a third radiologist to achieve consensus in a face-to-face meeting; no average scores were calculated. Quantitative maximum extent of fibrosis was also determined by CAD in the ZM.
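The Likert bands described above can be written as a simple mapping from percentage of a zone involved to a 0–4 score. The following is an illustrative sketch only: in SLS I the scores were assigned visually by radiologists, not computed, and the function name and the handling of fractional percentages (rounded up into the next band) are our assumptions.

```python
import math


def likert_extent_score(percent_involved: float) -> int:
    """Map percentage of a lung zone involved (0-100) to the 0-4 Likert score.

    Bands follow the text: 0 = absent, 1 = 1-25 %, 2 = 26-50 %,
    3 = 51-75 %, 4 = 76-100 %.
    Assumption: any non-zero fractional value is rounded up into the next band.
    """
    if not 0 <= percent_involved <= 100:
        raise ValueError("percent_involved must be between 0 and 100")
    if percent_involved == 0:
        return 0  # abnormality absent
    # Each band spans 25 percentage points; ceil places 25 in band 1, 26 in band 2.
    return math.ceil(percent_involved / 25)
```

Under this mapping, the visual MaxFib stratification used later (0–25 % vs. 26–100 %) corresponds to a score of at most 1 vs. a score of 2 or more.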

Goh and Wells system

Goh and Wells developed a prognostic algorithm for patients with SSc-ILD, integrating PFTs and HRCT [11]. TLI was assessed in an observational cohort of patients with SSc-ILD. HRCT images were scored by two independent radiologists at five levels: (1) origin of great vessels, (2) main carina, (3) pulmonary venous confluence, (4) halfway between the third and fifth sections, and (5) immediately above the right hemidiaphragm. HRCT variables were total disease extent that incorporated the extent of a reticular pattern, GGO, and HCs. The extent of ILD was estimated as a percentage of total volume to the nearest 5 % in each of the five sections. Global extent of disease determined by HRCT was calculated as the mean extent score in the five scored sections. This was stratified as <20 % vs. >20 % TLI (termed Goh and Wells unadjusted stratification). For indeterminate cases (extent of TLI 10–30 % because there may be measurement error by visual read), an FVC threshold of 70 % is an adequate prognostic substitute. On the basis of these observations, Goh and Wells staged global extent of disease as limited or minimal disease (minimal disease determined by HRCT or, in indeterminate cases, FVC ≥70 %) or extensive disease (severe disease determined by HRCT or, in indeterminate cases, FVC <70 %) [11]. This aspect has been termed Goh and Wells adjusted stratification.
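The two stratifications described above can be summarized in a short sketch. It is illustrative rather than the published implementation: the function names are hypothetical, and because the text gives the cutoff as <20 % vs. >20 %, the treatment of an extent of exactly 20 % (classed here as limited) and of the inclusive 10–30 % indeterminate range are our assumptions.

```python
from statistics import mean


def global_extent(section_extents):
    """Mean disease extent (%) across the five scored HRCT sections."""
    if len(section_extents) != 5:
        raise ValueError("expected extent estimates for exactly five sections")
    return mean(section_extents)


def unadjusted_stage(extent_pct, cutoff=20.0):
    """Goh and Wells unadjusted stratification on HRCT extent alone.

    Assumption: an extent of exactly 20 % is classed as limited.
    """
    return "extensive" if extent_pct > cutoff else "limited"


def adjusted_stage(extent_pct, fvc_pct_predicted,
                   low=10.0, high=30.0, fvc_cutoff=70.0):
    """Goh and Wells adjusted stratification.

    Clear-cut HRCT extent decides the stage; in the indeterminate range
    (assumed inclusive, 10-30 %), FVC % predicted is used instead.
    """
    if extent_pct < low:
        return "limited"
    if extent_pct > high:
        return "extensive"
    return "limited" if fvc_pct_predicted >= fvc_cutoff else "extensive"
```

For example, a patient with a mean extent of 15 % is indeterminate on HRCT alone and would be staged as limited with an FVC of 75 % of predicted but as extensive with an FVC of 65 % of predicted.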

Computer-aided diagnosis

To obviate interreader variation and to standardize data across multiple sites, a CAD system was developed using the SLS I data [18]. The HRCT scans from SLS I were reconstructed with sharp or manufacturer-recommended overenhancing filters. The CAD system segmented each lung of each patient into three area-equivalent zones (upper, middle, and lower). After semiautomated lung segmentation, the images were entered into a quantitative image workstation to produce separate quantitative scores for reticulations (fibrosis), GGO, and HCs automatically [19]. The QILD score was the sum of all abnormally classified scores, including scores for fibrosis, GGO, and HCs. HRCT QLF scores were determined using the percentage of counts in which the classified abnormal pattern comprised reticular opacity with architectural distortion. QILD and QLF scores were summed for both the WL, including both lungs, and the ZM.

A 25 % threshold for the QILD and QLF in the ZM was agreed a priori to be consistent with a visual MaxFib cutoff of 25 % (which assesses area of maximum fibrosis). In addition, we agreed to a 20 % cutoff for QILD and QLF in the WL to be consistent with a Goh and Wells threshold of 20 % for TLI.
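As a minimal sketch of how the composite score and the a priori cutoffs fit together, the following code sums the component scores into QILD and dichotomizes a score by region. The function names and the "mild"/"extensive" labels are hypothetical, and the handling of a score exactly at the cutoff (classed as mild) is an assumption; the actual CAD scoring pipeline is described in [18, 19].

```python
# A priori cutoffs from the text: 25 % in the zone of maximal involvement (ZM),
# 20 % in the whole lung (WL).
CUTOFFS = {"ZM": 25.0, "WL": 20.0}


def qild_score(qlf_pct, ggo_pct, hc_pct):
    """QILD as the sum of the fibrosis (QLF), GGO, and honeycomb cyst scores."""
    return qlf_pct + ggo_pct + hc_pct


def classify_extent(score_pct, region):
    """Dichotomize a QILD or QLF score for the given region ("ZM" or "WL").

    Assumption: a score exactly at the cutoff is classed as mild.
    """
    if region not in CUTOFFS:
        raise ValueError("region must be 'ZM' or 'WL'")
    return "extensive" if score_pct > CUTOFFS[region] else "mild"
```

With hypothetical component scores of 10 %, 12 %, and 1 %, QILD is 23 %, which would be classed as extensive by the WL cutoff but mild by the ZM cutoff.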

Statistical analysis

We analyzed HRCT data in two different ways. First, we assessed the strength of associations between each staging system and FVC or DLCO percentage of predicted value at baseline, as well as the change in FVC and DLCO percentage of predicted value from baseline to 12 months using Pearson correlation coefficients. A coefficient ≥0.40 was considered to be a moderate association [26]. Second, we stratified the staging system (e.g., 0–25 % vs. 26–100 % for visual MaxFib) on the basis of published data to assess if this could enrich for subjects in the placebo group who would decline over a 1-year period. Two-sample t tests were used to assess statistical significance for absolute and relative change (percentage of predicted change from baseline) in FVC and DLCO over a 1-year period. Fisher’s exact test or a χ2 test was used to compare categorical variables among categories within the different staging systems. All tests were two-sided, and a p value <0.05 was considered statistically significant. All analyses were performed using SAS 9.3 software (SAS Institute, Cary, NC, USA).
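For readers who wish to reproduce this type of analysis, the quantities involved can be sketched in a few lines. This is not the SAS code used in the study: the function names are hypothetical, the interpretation of relative change as change expressed as a percentage of the baseline value is our reading of the text, and the pooled-variance form of the two-sample t test is an assumption (the text does not state whether equal variances were assumed). A p value would additionally require the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def absolute_change(baseline, followup):
    """Absolute change in a % predicted value from baseline to 12 months."""
    return followup - baseline


def relative_change(baseline, followup):
    """Change from baseline expressed as a percentage of the baseline value."""
    return 100.0 * (followup - baseline) / baseline


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic and degrees of freedom (pooled variance assumed)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(group_a) - mean(group_b)) / se, df
```

In this sketch, `group_a` and `group_b` would hold the 12-month changes in FVC or DLCO for patients below and above a staging cutoff, respectively.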

Results

Of the 158 patients, 93 (48 patients in the placebo group and 45 patients in the CYC group) had FVC data available at baseline and 12 months as well as good-quality baseline HRCT scans. These patients were included in the analysis. There were no significant differences in baseline characteristics between the two treatment groups (Table 1). These patients’ mean age was 47 years, and their mean disease duration was 3.2 years. Their mean (standard deviation [SD]) FVC was 67.7 % (11.9) of predicted, and their mean (SD) DLCO was 46.3 % (12.7) of predicted.

Table 1 Baseline patient characteristics, stratified by disease duration

First, we assessed the relationships between the scoring systems as continuous variables and change in FVC and DLCO. In the placebo group, when absolute changes in FVC and DLCO were considered, correlations with the staging systems were largely negative, but none were significant (Table 2). Conversely, these correlations were positive in the CYC group, suggesting that CYC treatment had a positive impact on FVC and DLCO in patients with a greater degree of ILD (assessed using different staging systems).

Table 2 Correlation coefficients between the staging systems vs. the PFT parameters (FVC and DLCO at baseline and after 12 months of treatment)

Next, we assessed whether various scoring systems, when dichotomized into mild vs. extensive disease, could predict change in PFTs. The absolute decline in FVC based on different staging systems is shown in Table 3 and Fig. 1. In the placebo group, regardless of the staging system used, there was a decline in FVC in patients with more extensive disease. For example, for MaxFib, the mean (SD) percentage decline in FVC was −6.2 (12.5) with >25 % involvement and 0.1 (9.0) (mild improvement) with <25 % involvement (p = 0.01). When we used the Goh and Wells unadjusted stratification for extent of ILD involvement, the mean (SD) declines in FVC were −1.6 (10.2) with <20 % lung involvement and −5.5 (8.0) with >20 % involvement. Similar trends were observed in the CAD staging system. When we used QILD for WL involvement, patients with >20 % involvement had a decline of −4.9 (9.5) vs. improvement of 0.3 (6.8) for <20 % involvement. Similar trends were also seen when change from baseline was expressed as relative decline in FVC percentage of predicted value (Additional file 2: Table S1).

Table 3 Absolute decline in FVC percentage of predicted value (compared with baseline) over 12 months
Fig. 1

Absolute changes in percentage of predicted forced vital capacity (FVC%) determined using different high-resolution computed tomography (HRCT) staging systems. Data are shown as box plots. Each box represents the interquartile range (IQR), indicating the first (25th percentile) and third (75th percentile) quartiles. Lines inside the boxes represent the medians. Whiskers represent 1.5 times the upper and lower IQRs. Circles indicate outliers. The p value is based on a two-sample t test. a Visual semiquantitative fibrosis score. b Goh and Wells unadjusted stratification. c Quantitative assessment of total extent of interstitial lung disease (QILD) in whole lung. d Quantitative percentage with fibrosis (QLF) in zone of maximal involvement

In the CYC arm, nonsignificant changes in FVC were noted across the various staging systems, most notably in patients with greater involvement visualized by HRCT (Table 3 and Additional file 2: Table S1). In the MaxFib group, a small improvement in FVC was seen in patients with >25 % involvement (1.2 [6.6]; p = 0.04). Although not statistically significant, similar changes were seen in the Goh and Wells unadjusted stratification, the Goh and Wells adjusted stratification, and the CAD staging systems.

The absolute decline in DLCO in relation to the choice of staging systems is shown in Table 4. In the placebo group, a statistically significant difference in DLCO was observed in QILD-WL staging by CAD (<20 %, 4.8 [8.6]; >20 %, −4.3 [10.6]; p = 0.01). Although statistical significance was not reached in the other staging systems, the mean changes in DLCO showed a trend toward a difference in patients with greater HRCT involvement in the placebo group. A larger variability was noted in the CYC arm, with a statistically significant effect on Goh and Wells adjusted stratification (minimal −1.3 [7.9] vs. extensive −6.9 [6.9]; p = 0.04) (Table 4).

Table 4 Absolute decline in DLCO from baseline over 12 months

Discussion

As somewhat effective therapies for other manifestations of SSc (e.g., renal, pulmonary arterial hypertension, and articular) have emerged [27, 28], the morbidity and mortality of ILD have become increasingly apparent [29–31]. Traditionally, the severity of SSc-ILD is defined by the degree of ventilatory restriction in conjunction with the magnitude of diffusion impairment. These physiological measures are indirect and highly variable surrogates for the extent of structural disease abnormality [32]. In contrast, the extent of ILD visualized on HRCT images (fibrosis, GGO, and/or HCs) is a more direct and precise indicator of the severity of the underlying pathological process [10–12, 20] and is associated with mortality [3].

With increasing interest in optimizing the design of clinical trials for evaluation of interventions for SSc-ILD, it is important to reliably identify cohorts of patients with a higher risk of disease progression and a greater likelihood of a favorable response to disease-modifying therapy. This process of cohort enrichment consists of the selective enrollment of these patients in treatment studies, which reduces the number of patients required to demonstrate a treatment effect and increases the average amplitude of such a benefit [16]. Our group previously published post hoc multivariate regression analyses [10] using SLS I data and identified fibrosis at baseline determined by HRCT, the modified Rodnan skin thickness score (MRSS), and the Mahler Baseline Dyspnea Index as independent correlates of treatment response to CYC. When patients were stratified on the basis of whether 50 % or more of any lung zone was involved by reticular infiltrates in the ZM as determined by HRCT, as assessed by visual scoring, and/or whether patients exhibited an MRSS ≥23 (0–51 scale), a subgroup of patients emerged in whom there was an average CYC treatment effect of 9.81 % at 18 months (p < 0.001). Conversely, there was no treatment effect (−0.58 % difference) in patients with less severe HRCT findings and a lower MRSS at baseline.

The present study represents another step toward defining cohort enrichment for clinical trials. We compared three different staging systems used to quantify the extent of ILD on HRCT: the visual MaxFib score, the Goh and Wells criteria, and the CAD quantitative scores for fibrosis (QLF) and TLI (QILD). In the placebo group, patients categorized as having moderate to extensive ILD on the basis of any of the three staging systems had a larger absolute decline in FVC (Table 3). Although MaxFib had the highest statistical significance for the placebo group, the differences in absolute changes in FVC between different staging systems were small, and all systems showed similar trends toward decline in FVC with greater HRCT involvement. In the CYC arm, there was stabilization in FVC in patients with extensive disease visualized by HRCT across all the staging systems. Interestingly, higher HRCT-assessed involvement was associated with stabilization of FVC in the CYC group, compared with an average decline in the group with less HRCT involvement. Both MaxFib and QILD-ZM showed statistically significant changes. This is consistent with the correlation coefficient data, in which change in FVC correlated positively with the different staging systems in the CYC group (Table 2). We also included detailed analysis for the CYC group based on preliminary data from SLS II, a double-blind study in patients with SSc and symptomatic ILD that compared oral mycophenolate mofetil for 2 years with oral CYC for 1 year followed by placebo during the second year [33]. Interestingly, it appears that background CYC therapy negates the enrichment strategy using HRCT staging systems. The change in FVC was positive (suggesting stabilization and/or improvement) in patients with more severe HRCT lung involvement with all staging systems. This analysis can inform trial design in future studies in which researchers consider background immunosuppressive therapies.

FVC was used as the primary outcome measure in the SLS I and FAST studies. The treatment with CYC had only a modest effect on FVC in the SLS I [13, 14] and FAST trials, and clinicians have debated the meaning of these results in clinical care [34, 35]. On the basis of a recent viewpoint published by the U.S. Food and Drug Administration on FVC in IPF [36], we explored whether cohort enrichment in the SLS I population could have provided a more clinically meaningful change in FVC compared with the entire sample. Although not established for SSc-ILD, a change of 2–6 % is considered a minimally important change in IPF [37]. Using different staging systems, we found that patients with extensive lung involvement determined by HRCT had clinically meaningful declines.

Using data from SLS I and II, we recently showed that DLCO is the single best correlate of the extent of lung involvement determined by HRCT [32], a finding supported by the correlation coefficients between different staging systems and baseline DLCO (Table 2). However, DLCO has high measurement error and lacks specificity (as it is influenced by both ILD and pulmonary vascular disease [16, 38]), and none of the staging systems were correlated with the change in DLCO over 1 year, highlighting that DLCO is a poor outcome measure in ILD trials.

Our analysis may have a significant impact on clinical trial design. This information can be used to enrich the population recruited into future ILD trials, calculate sample size, and judge the feasibility of the trial. For example, using the visual MaxFib score, 73 % of patients who participated in SLS I would qualify for an enriched protocol. Although this analysis does not provide guidance regarding which staging system to incorporate, recent post hoc analyses from SLS I suggest that the CAD system is more sensitive to change than a visual scoring system [19, 23, 25, 32]. Therefore, if HRCT is planned as an outcome measure (in addition to an enrichment criterion), then CAD is the preferred system [19, 23, 24], depending on its availability. Also, Goh and Wells criteria are applied only to the baseline HRCT and have not been evaluated in a longitudinal fashion. However, they have the advantage that they can easily be incorporated into observational studies [39]. Conversely, the CAD system is not universally available, which may limit its feasibility.

A major strength of our study is its comparison of three staging systems that have been published for grading the extent of SSc-ILD and have been shown to be feasible for use in a clinical trial. In addition, we validated the Goh and Wells criteria in SLS I.

Our study is not without limitations. The analysis is a post hoc analysis and is limited to participants enrolled in a clinical trial with specific entry criteria, thereby limiting the generalizability of the findings. The number of subjects in the study is low, and further validation is needed in another cohort to confirm the results. Use of the staging system in other cohorts (including clinical trials and observational cohorts) should be carefully assessed before the findings are generalized.

Conclusions

The extent of HRCT-quantified ILD is a predictor of decline in FVC over a 1-year period and is independent of the staging system used to classify extent of disease. The choice of the staging system in a clinical trial depends on feasibility and available expertise but should be validated before incorporating it in future studies.

Data-sharing statement

Anonymized data from SLS I are available to investigators by application to the SLS I Executive Committee (DPT: dtashkin@mednet.ucla.edu).