Introduction

Conventional chest radiography is currently the most commonly used method of monitoring lung structure in cystic fibrosis (CF) patients, and annual radiographs are recommended [1]. Several chest radiograph scoring systems to quantify CF-related lung abnormalities have been described in the literature [2–8]. These scoring systems are widely used, both clinically and in (drug) research [9]. Scores correlate significantly with lung function test results and various other clinical parameters [8, 10, 11]. Chest radiographs are abnormal in 85% of children at the age of five years [12], and scores seem to worsen more rapidly over time than spirometry [12–14]. Studies have also shown significant treatment effects, as measured by chest radiograph scores, that were not reflected by spirometric measurements [15]. High-resolution CT (HRCT) is known to be more sensitive than chest radiography in detecting structural lung abnormalities in CF, and early abnormalities can be seen on HRCT in a substantial number of patients with normal radiographs [16–20]. However, because of the higher radiation dose and costs of HRCT, and because the benefit of the more accurate structural information from HRCT has not been quantified, chest radiographs are still the most commonly used method for structural lung disease assessment in CF in most centres. Large sets of radiographs are available to address research questions (retrospectively).

Scores are potentially associated with substantial observer variation, and scoring can be more time-consuming than writing a clinical report. Our ultimate aim is to develop an automated version of the modified Chrispin-Norman score [2, 6]. In 1974, Chrispin and Norman [2] described a structured method of semi-quantifying the morphological features that are commonly seen in CF patients. For dose-saving purposes Benden et al. [6] described a modified method for frontal chest radiographs only. The reproducibility between observers of the (modified) Chrispin-Norman scoring system as described in the literature is good, with an intraclass correlation coefficient of 0.91 [6], but the reproducibility of the individual scoring system items has not yet been described. In addition, the illustrations available for the (modified) Chrispin-Norman scoring system are limited. The aim of the present study was to determine the reproducibility of the modified Chrispin-Norman score, including its components, and to test whether a consensus meeting or reference illustrations could improve observer agreement.

Materials and methods

Study population

Chest radiographs were obtained from all 238 children in our CF clinic at the time of annual review. A paediatric pulmonologist (observer 3) selected two sets of 25 children with CF who had a range of disease severity and were able to undergo spirometry. For each patient the most recent annual chest radiograph was used for this study. The retrospective study was approved by the institutional review board and informed consent was waived.

Chest radiograph scoring

The modified Chrispin-Norman score is presented in Table 1. Briefly, the modified Chrispin-Norman score uses the frontal radiograph to quantify structural lung disease in CF. The radiograph is divided into four quadrants, and for each quadrant the severity of bronchial line shadows, ring shadows, mottled shadows and large soft shadows is scored on a scale from 0 to 2. Bronchial line shadows are thought to represent bronchiectasis and bronchial wall thickening, ring shadows may also represent bronchiectasis, mottled shadows are most likely to be mucus plugs in small and large airways, and large soft shadows represent larger lung consolidations with or without loss of lung volume. In addition to these abnormalities, the degree of overinflation is scored on a scale from 0 to 2 by assessing the level of the diaphragm, the degree of hyperlucency and the shape of the thoracic cage.

Table 1 Modified Chrispin-Norman score
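
As an illustration only, the scoring scheme summarised above can be sketched in a few lines of Python. Table 1 is not reproduced here, so the sketch makes two assumptions that should be checked against the table: that the total is a plain unweighted sum of all item scores, and that overinflation consists of three sub-items (diaphragm level, hyperlucency and thoracic shape) that are each scored from 0 to 2, as in Fig. 2. All function and key names are hypothetical.

```python
# Minimal sketch of a modified Chrispin-Norman total score.
# ASSUMPTIONS (not confirmed by the text): the total is an unweighted sum,
# and overinflation has three sub-items scored 0-2 each.

QUADRANT_ITEMS = ("bronchial_line", "ring", "mottled", "large_soft")
OVERINFLATION_ITEMS = ("diaphragm_level", "hyperlucency", "thoracic_shape")


def _check(value):
    """Accept only the three grades used by the scoring system."""
    if value not in (0, 1, 2):
        raise ValueError(f"item scores must be 0, 1 or 2, got {value!r}")
    return value


def modified_chrispin_norman(quadrants, overinflation):
    """Sum the item scores of four quadrants plus the overinflation sub-items.

    quadrants: list of four dicts keyed by QUADRANT_ITEMS
    overinflation: dict keyed by OVERINFLATION_ITEMS
    """
    if len(quadrants) != 4:
        raise ValueError("expected exactly four quadrants")
    total = 0
    for quadrant in quadrants:
        total += sum(_check(quadrant[item]) for item in QUADRANT_ITEMS)
    total += sum(_check(overinflation[item]) for item in OVERINFLATION_ITEMS)
    return total


# Hypothetical example: clear lung fields, overinflation graded as in the
# middle panel of Fig. 2 (diaphragm 1, hyperlucency 2, rib cage shape 2).
clear_quadrant = {item: 0 for item in QUADRANT_ITEMS}
example_total = modified_chrispin_norman(
    [clear_quadrant] * 4,
    {"diaphragm_level": 1, "hyperlucency": 2, "thoracic_shape": 2},
)
```

Under these assumptions the maximum possible total would be 4 × 4 × 2 + 3 × 2 = 38.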

Three observers who interpret chest radiographs from CF patients on a regular basis were involved in the study (one radiologist with a special interest in chest imaging, one radiology resident with an interest in chest imaging and one paediatric pulmonology fellow with previous research experience in chest radiograph scoring in CF). There were two datasets of 25 radiographs and three scoring rounds. For round 1 the observers were provided with the relevant literature on the modified Chrispin-Norman score [2, 6, 21] and scored the first dataset of 25 chest radiographs blinded to clinical characteristics except age and gender. Before round 2 a consensus meeting was organised, after which the second dataset of 25 chest radiographs was scored by the three observers, again independently.

During the consensus meeting the observers discussed several chest radiographs with discrepant scores from round 1. The scoring system items were discussed one by one in consecutive order. For overinflation it was decided to use the level of the posterior rib at the right hemi-diaphragm. For thoracic shape and the degree of hyperlucency no consensus definition was reached, and the observers found these abnormalities difficult to define. For ring shadows, bronchial line shadows and mottled shadows the differences were mainly between the grades 'present but not marked' and 'marked'. Differences between observers with regard to this cut-off were discussed interactively, but no specific definition was developed. For large soft shadows it was agreed that they should be scored when the heart borders or diaphragms were not visible because of lung consolidation.

After the second round a set of reference illustrations was developed for each scoring system item (Figs. 1, 2, 3, 4, 5 and 6) by relating HRCT findings to chest radiograph findings for individuals in our cohort who had undergone HRCT and chest radiography within one month of each other. These HRCTs were obtained for clinical indications; no additional HRCTs were made as part of this study. None of the radiographs used for the reference illustrations was part of dataset 2. The reference images were not developed by the observers, although observer 1 checked the quality of the reference illustrations before they were used in the study. For round 3, the second dataset was scored again by the observers with the help of the reference images. The interval between rounds 1 and 2 and between rounds 2 and 3 was at least one month.

Fig. 1

Normal chest radiograph quadrants. The figure illustrates normal quadrants as used in the modified Chrispin-Norman score; the corresponding normal high-resolution CT images are not shown

Fig. 2

Illustration of assessment of overinflation in cystic fibrosis with the modified Chrispin-Norman score. The upper chest radiograph shows a normal inflation level: the middle of the right hemi-diaphragm is above the upper edge of the posterior 10th rib, there is no hyperlucency and the rib cage has a normal shape. The middle panel shows the right mid-diaphragm below the upper edge of the posterior 10th rib but above the upper edge of the posterior 11th rib (score 1); the lung field was subjectively judged to be hyperlucent (score 2) and the shape of the rib cage abnormal (score 2). We did not obtain consensus on a clear definition of hyperlucent lung and rib cage shape; subjectively, the ribs were elevated and the posterior ribs too horizontal. The same applies to the lower radiograph, which shows the right mid-diaphragm at the upper edge of the posterior 11th rib (score 2); hyperlucency is not easy to evaluate because of the various shadows projecting on the image, and the rib cage was judged to be abnormal (score 2)

Fig. 3

Illustration of bronchial line shadows in cystic fibrosis. Correlation of chest radiograph findings with corresponding high-resolution CT images. Upper two panels illustrate subtle and limited (score 1) bronchial line shadows (arrows) and the lower two panels illustrate obvious and extensive (score 2) bronchial line shadows (arrows). This corresponds to bronchial wall thickening (arrowheads) and/or bronchiectasis (arrows) on HRCT

Fig. 4

Illustration of ring shadows in cystic fibrosis. Correlation of chest radiograph findings with corresponding high-resolution CT images. Upper panel illustrates subtle and limited (score 1) ring shadows (arrow) and the lower panel illustrates obvious and extensive (score 2) ring shadows (arrows). This corresponds to bronchiectasis (arrows) on HRCT

Fig. 5

Illustration of mottled shadows in cystic fibrosis. Correlation of chest radiograph findings with corresponding high-resolution CT images. Upper panel illustrates limited (score 1) mottled shadows (arrows) and the lower panel illustrates extensive (score 2) mottled shadows (arrows). Most mottles are smaller than 0.5 cm as originally described by Chrispin and Norman [2], although some small mottles are superimposed and projected as slightly larger mottles

Fig. 6

Illustration of large soft shadows in cystic fibrosis. Correlation of chest radiograph findings with corresponding high-resolution CT images. Upper panel illustrates limited (score 1) large soft shadows adjacent to the minor fissure (arrows) and the lower panel illustrates extensive (score 2) large soft shadows (arrows). Also note that the left heart border is no longer visible

Lung function testing

Spirometric measurements were available for all 25 children in each set of chest radiographs. Spirometry, like the chest radiograph, was obtained during a clinically stable period at the annual check-up. Measurements included forced vital capacity (FVC), forced expiratory volume in one second (FEV1) and mid-expiratory flow at 50% and 75% of VC (MEF50 and MEF75). Measurements were expressed as a percentage of predicted values using the reference values of Zapletal et al. [22], and the FEV1 to FVC ratio was calculated and expressed as a percentage.
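
The two derived quantities mentioned above are simple ratios, shown here as a minimal Python sketch. The predicted value would come from the reference equations of Zapletal et al. [22], which are not reproduced in this text; the function names and the numbers in the example are hypothetical and serve only to illustrate the arithmetic.

```python
# Minimal sketch: expressing spirometry as percent predicted and as a ratio.
# The predicted value must be supplied by the caller (reference equations
# are not included here); all example numbers are made up.

def percent_predicted(measured, predicted):
    """Return a measurement as a percentage of its predicted value."""
    if predicted <= 0:
        raise ValueError("predicted value must be positive")
    return 100.0 * measured / predicted


def fev1_fvc_percent(fev1, fvc):
    """Return the FEV1 to FVC ratio expressed as a percentage."""
    if fvc <= 0:
        raise ValueError("FVC must be positive")
    return 100.0 * fev1 / fvc


# Hypothetical child: measured FEV1 1.5 L, predicted FEV1 2.0 L, FVC 2.0 L.
fev1_pct = percent_predicted(1.5, 2.0)
ratio_pct = fev1_fvc_percent(1.5, 2.0)
```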

Data analysis

Reproducibility between and within observers was assessed visually in scatter plots with a line of identity, by using an intraclass correlation coefficient for the modified Chrispin-Norman score, and by using weighted kappa values for the individual items of the score. The intraclass correlation coefficient takes into account the distance of the observers' scores from the line of identity. An intraclass correlation coefficient between 0.6 and 0.8 represents moderate agreement and values above 0.8 represent good agreement. Kappa values of <0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80 and 0.81–1 are generally considered to represent poor, fair, moderate, good and very good agreement, respectively. Correlation between the modified Chrispin-Norman score and lung function was assessed by using the Spearman correlation coefficient. SPSS 15.0 (SPSS Inc., Chicago, IL, USA) and MedCalc 11.2 (MedCalc Software, Mariakerke, Belgium) were used for data analysis. Data are presented as mean ± standard deviation (range) unless indicated otherwise.
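
For readers who wish to reproduce this type of item-level analysis, a weighted kappa for two observers scoring on the 0 to 2 scale can be sketched as follows. The weighting scheme is not stated in the text; linear weights are assumed here, and quadratic weights would give different values. The handling of values that fall exactly on a band boundary is also an assumption. The function names and the example scores are hypothetical and are not study data.

```python
# Minimal sketch of a linearly weighted kappa for two observers.
# ASSUMPTION: linear weights; the text does not state which scheme was used.

def weighted_kappa(rater_a, rater_b, categories=(0, 1, 2)):
    """Linearly weighted Cohen's kappa for two equal-length score lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (score by A, score by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement: observed versus expected by chance.
    observed_dis = 0.0
    expected_dis = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_dis += weight * observed[i][j]
            expected_dis += weight * row[i] * col[j]
    if expected_dis == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - observed_dis / expected_dis


def agreement_label(kappa):
    """Map kappa to the bands used in the text (upper bounds inclusive)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical item scores for six quadrants by two observers.
observer_1 = [0, 0, 1, 1, 2, 2]
observer_2 = [0, 1, 1, 2, 2, 2]
kappa = weighted_kappa(observer_1, observer_2)
label = agreement_label(kappa)
```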

Results

Study population

Age characteristics and lung function test results of the two groups are presented in Table 2. The children ranged in age from 5.3 to 18.6 years and in FEV1 from 33% to 124% predicted. Structural abnormalities were slightly more severe and more common in the first dataset (p = 0.04).

Table 2 Characteristics of the study populations

Reproducibility between and within observers of chest radiograph scoring

For all three rounds the kappa values for the scoring system items and the intraclass correlation coefficients for the modified Chrispin-Norman score are provided in Table 3. The within-observer agreement for dataset 2 (round 2 versus round 3) is also presented in Table 3. Before the consensus meeting (round 1) the agreement between observers 1 and 3 and between observers 2 and 3 ranged from poor to fair. After the consensus meeting (round 2) the levels of agreement between observers 1 and 3 and between observers 2 and 3 improved, especially for mottled shadows and large soft shadows (fair to good levels of agreement) and for the modified Chrispin-Norman score, but not for overinflation, ring shadows and bronchial line shadows. Overall, the agreement between observers 1 and 2 was slightly lower in round 2 than in round 1, which might be related to the milder structural lung disease in the second dataset, although the lower scores may also be a result of the consensus meeting. In round 3 the agreement between the observers improved from poor-fair to moderate-good for overinflation. Agreement between observers 1 and 2 and between observers 1 and 3 also improved for mottled shadows and large soft shadows. Within-observer agreement was better than between-observer agreement.

Table 3 Agreement between and within observers of the modified Chrispin-Norman score
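
An intraclass correlation coefficient of the kind reported in Table 3 can be sketched as follows. The exact ICC model is not stated in the text. Because the coefficient is described as taking the distance to the line of identity into account, the sketch assumes a two-way, absolute-agreement, single-measure ICC, which is an assumption rather than the confirmed method. The function name and the example scores are hypothetical.

```python
# Minimal sketch of a two-way, absolute-agreement, single-measure ICC.
# ASSUMPTION: this ICC form is chosen for illustration; the text does not
# specify which model was used.

def icc_absolute_agreement(scores):
    """scores: one row per radiograph, one column per observer."""
    n = len(scores)
    k = len(scores[0]) if scores else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 subjects and 2 observers, no gaps")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between radiographs
    ms_cols = ss_cols / (k - 1)               # between observers
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_error) / denominator


# Hypothetical total scores: observer 2 scores every radiograph 2 points
# higher than observer 1. The ranking is identical, but the points lie off
# the line of identity, so absolute agreement is below 1.
offset_scores = [[1, 3], [5, 7], [9, 11]]
icc_offset = icc_absolute_agreement(offset_scores)
icc_identical = icc_absolute_agreement([[1, 1], [5, 5], [9, 9]])
```

In this example a constant offset between two observers lowers the coefficient even though the two observers rank the radiographs identically, which is the behaviour implied by a coefficient that penalises distance from the line of identity.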

Correlation with lung function

The correlation between the observers’ modified Chrispin-Norman scores and lung function is presented in Table 4. For round 1 the modified Chrispin-Norman score significantly correlated with the lung function test results for observers 1 and 2, but not for observer 3. In rounds 2 and 3 the modified Chrispin-Norman score for all three observers showed a significant correlation with most lung function tests and the correlation further improved in round 3 when the reference images were used.

Table 4 Correlation between modified Chrispin-Norman score and spirometry
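
The Spearman correlation coefficient used for Table 4 is the Pearson correlation of the rank-transformed data. A minimal sketch with average ranks for ties is given below; the function names and the example values are hypothetical and are not study data.

```python
# Minimal sketch of a Spearman rank correlation with tie handling.
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with >= 2 values")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: higher radiograph scores paired with lower FEV1.
radiograph_scores = [2, 5, 9, 14]
fev1_percent_predicted = [110, 95, 80, 60]
rho = spearman_rho(radiograph_scores, fev1_percent_predicted)
```

A negative coefficient is expected here because a higher radiograph score indicates more severe structural disease, whereas a higher percent predicted FEV1 indicates better lung function.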

Discussion

We demonstrated that when experienced observers use the modified Chrispin-Norman score on the basis of the published descriptions and illustrations [2, 6, 21], between-observer agreement can differ substantially and can even be poor to fair, which is contrary to most previous studies. After a consensus meeting we were able to improve the between-observer agreement to more acceptable levels for the overall score and for the items mottled shadows and large soft shadows. When reference illustrations were used, between-observer agreement improved for overinflation and correlations between the modified Chrispin-Norman score and lung function improved. Our results indicate that differences between observers might easily occur in a routine clinical setting or in research studies. In contrast to previous studies, we also provide insight into the individual scoring system items. Our data suggest that for bronchial line shadows and ring shadows it is difficult to achieve good agreement among observers.

Several options exist to obtain more consistent chest radiograph interpretation results. Within a specific study a simple consensus meeting can improve observer agreement. In our study overall agreement for the modified Chrispin-Norman score improved after the consensus meeting, which is explained by the items mottled shadows and large soft shadows. The consensus meeting did not improve agreement for bronchial line shadows and ring shadows. The definition agreed for overinflation apparently did not improve between-observer agreement for this item either. We have no good explanation for the differences in improvement between the individual scoring system items, as each item was discussed in the consensus meeting. A second step is the development of a set of 'reference' images, as has been done, for example, with the International Labour Office classification of radiographs of pneumoconiosis [23]. Because published illustrations for the (modified) Chrispin-Norman score are limited, we developed a set of these images from children for whom a high-resolution CT was available. With the help of these images, agreement for the modified Chrispin-Norman score did not improve further, although correlations with lung function did improve. The assessment of overinflation in particular also improved.

A further step would be computer-aided or fully automated scoring of chest radiographs for cystic fibrosis-related disease. Such tools have been developed for other radiological investigations [24, 25], but to our knowledge they do not exist for chest radiographs in cystic fibrosis. We believe that such a tool might help in obtaining more consistent chest radiograph assessments in a clinical setting. Such a quantification tool might also be helpful for analysing large sets of chest radiographs or large databases (retrospectively), both for clinical and research purposes. We have started the process of developing a fully automated method, based on our previous experience in other diseases [26–28]. Although several chest radiograph (CXR) scoring systems have been developed for CF, we chose to use the modified Chrispin-Norman score for two reasons. First, the Chrispin-Norman score has been modified for the frontal radiograph and our software is currently being developed for frontal radiographs. Second, a previous study compared six CXR scores [8] and found the modified Chrispin-Norman score to have good reproducibility and correlation with lung function. We believe that although HRCT and magnetic resonance imaging might gain an important clinical role in several centres, many centres all over the world will continue to rely on plain chest radiography for monitoring structural lung disease in CF for many years to come.

Our study has several limitations. First, we included only three observers, whereas more observers would have been preferable. However, all observers routinely interpret these images and we expect that the observer disagreements would not have been different if more observers had been included in the study. Second, we studied two datasets, of which the second may have had slightly milder chest radiograph abnormalities, although the lower scores were possibly a result of the consensus meeting. It is more difficult to obtain good observer agreement when the range of disease is smaller; therefore, the improvement in agreement in round 2 is real but might have been larger if the second dataset had not had slightly milder structural disease. Third, we were unable to develop a clear definition for the overinflation sub-items chest wall shape and lung fields, although the reference images helped to improve the between-observer agreement for this item. Fourth, we used pulmonary function tests as a reference standard, whereas HRCT would have been a better comparator for structural lung disease; however, we do not obtain HRCT scans routinely in our clinic.

Conclusion

Between-observer agreement for the modified Chrispin-Norman score ranged from poor to good before a consensus meeting was organised and reference images were developed, and it improved thereafter to moderate-good levels. Observer differences can easily occur, as illustrated in this study, and our reference illustrations of the scoring system might improve agreement between observers and between centres. In the future a fully automated version of the modified Chrispin-Norman score could be useful to obviate observer variation clinically and in research studies.