Background

Chest radiography is a very common investigation in children with lower respiratory infection, and knowledge of the diagnostic accuracy of radiograph interpretation is therefore important when clinical decisions are based on the findings. Inter- and intra-observer agreement in the interpretation of radiographs is a necessary component of diagnostic accuracy. Agreement alone is, however, not sufficient: the key element of diagnostic accuracy is the concordance of the radiological interpretation with the presence or absence of pneumonia. Unfortunately, a suitable reference standard for pneumonia (such as histological or gross anatomical findings) against which to compare radiographic findings is seldom available. Diagnostic accuracy therefore needs to be examined indirectly, including by assessing observer agreement.

Observer variation in chest radiograph interpretation in acute lower respiratory infections in children has not been systematically reviewed.

The purpose of this study was to quantify the agreement between and within observers in the detection of radiographic features associated with acute lower respiratory infections in children. A secondary objective was to assess the quality of the design and reporting of studies on this topic, whether or not they met the methodological inclusion criteria for the review.

Methods

Inclusion criteria

Studies meeting the following criteria were included in the systematic review:

1. An assessment of observer variation in interpretation of radiographic features of lower respiratory infection, or of the radiographic diagnosis of pneumonia.

2. Studies of children aged 15 years or younger, or studies from which data on children aged 15 years or younger could be extracted. Studies of infants in neonatal nurseries were excluded.

3. Data presented that enabled the assessment of agreement between observers.

4. Independent reading of radiographs by two or more observers.

5. Studies of a clinical population with a spectrum of disease in which radiographic assessment is likely to be used (as opposed to separate groups of normal children and those known to have the condition of interest).

Literature search

Studies were identified by a computerized search of MEDLINE from 1966 to 1999 using the following search terms: observer variation, or intraobserver (text word), or interobserver (text word); and radiography, thoracic, or radiography, or bronchiolitis/ra, or pneumonia, viral/ra, or pneumonia, bacterial/ra, or respiratory tract infections/ra. The search was limited to human studies of children up to the age of 18 years. The author reviewed the titles and abstracts of the identified articles in English or with English abstracts (and the full text of those judged to be potentially eligible). A similar search was performed of HealthSTAR, a former on-line database of published health services research, and of the HSRPROJ (Health Services Research Projects in Progress) database. Reference lists of articles retrieved from the above searches were examined. Authors of studies of agreement between independent observers on chest radiograph findings in acute lower respiratory infections in children were contacted with an inquiry about the existence of additional studies, published or unpublished.

Data collection and analysis

Potentially relevant studies identified in the above search were evaluated for inclusion by the author. The characteristics of study design and reporting listed in Table 1 were recorded for all studies of observer variation in the interpretation of radiographic features of lower respiratory infection in children aged 15 years or younger (except infants in neonatal nurseries). The validity criteria were those for which empirical evidence exists of their importance in avoiding bias when diagnostic tests are compared with reference standards, and which were relevant to studies of observer agreement. The applicability criteria selected were those featured by at least two of five sources of such recommendations. No weighting was applied to the criteria, except that the two most frequently recommended validity criteria (recommended by at least four of five sources) were used as the methodological inclusion criteria [15]. No attempt was made to derive a quality score.

Table 1 Characteristics of study design and reporting

In studies meeting all the inclusion criteria for the review, the author extracted the following additional information: number and characteristics of the observers and children studied, and measures of agreement. When no measures of agreement were reported, data were extracted from the reports and kappa statistics were calculated using the method described by Fleiss [6]. Kappa is a measure of the degree of agreement between observers, over and above that expected by chance. If agreement is complete, kappa = 1; if there is only chance concordance, kappa = 0.
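
For illustration, the calculation underlying such a kappa statistic can be sketched as follows. This is a minimal sketch assuming a simple two-observer, two-category (feature present/absent) table with hypothetical counts; it is not the data or the exact Fleiss procedure used in the review.

```python
# Illustrative sketch only: Cohen's kappa for two observers rating the same
# radiographs as "feature present" or "feature absent". The counts below are
# hypothetical and not taken from any of the included studies.

def cohen_kappa(table):
    """table[i][j] = number of films rated category i by observer A and j by observer B."""
    n = sum(sum(row) for row in table)
    categories = range(len(table))
    # Observed agreement: proportion of films on the diagonal of the table.
    p_o = sum(table[i][i] for i in categories) / n
    # Chance-expected agreement from each observer's marginal proportions.
    row_margins = [sum(table[i]) / n for i in categories]
    col_margins = [sum(table[i][j] for i in categories) / n for j in categories]
    p_e = sum(row_margins[i] * col_margins[i] for i in categories)
    # Kappa: agreement achieved beyond chance, as a fraction of the maximum possible.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 table: rows = observer A (present, absent), columns = observer B.
table = [[40, 5],
         [10, 45]]
print(round(cohen_kappa(table), 2))  # 0.7 with these made-up counts
```

With these hypothetical counts the observed agreement is 85% and the chance-expected agreement is 50%, so kappa = (0.85 − 0.50)/(1 − 0.50) = 0.70.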

Results

A review profile is shown in Figure 1. For a list of rejected studies, with reasons for rejection, see Additional file 1: Rejected studies. Ten studies of observer variation in the interpretation of radiographic features of lower respiratory infection in children aged 15 years or younger were identified [7–16]. Contact was established with five of the nine authors with whom contact was attempted. No additional studies were included in the systematic review as a result of this contact.

Figure 1 Review profile

The characteristics of the study design and reporting of the 10 studies of observer interpretation of radiographic features of lower respiratory infection in children are summarized in Table 1. Seven of the studies satisfied four or more of the seven design and reporting criteria. Four studies described criteria for the radiological signs. Six of the studies satisfied the inclusion criteria for the systematic review [8–10, 12, 14, 15]. Of the remaining four studies, three were excluded because a clinical spectrum of patients had not been used [7, 13, 16] and one because observers were not independent [11]. The characteristics of included studies are shown in Table 2.

Table 2 Characteristics of included studies

A kappa statistic was calculated from data extracted from one report [14], and confidence intervals were calculated for three studies in which they were not reported but for which sufficient data were available [8, 9, 14]. A summary of kappa statistics is shown in Table 3. Inter-observer agreement varied with the radiographic feature examined. Kappas for individual radiographic features were around 0.80, and were lower for composite assessments such as the presence of pneumonia (0.47), radiographic normality (0.61) and bacterial vs. viral etiology (0.27–0.38). Findings were similar in the two instances in which more than one study examined the same radiographic feature (hyperinflation/air trapping and peribronchial/bronchial wall thickening). When reported, kappa statistics for intra-observer agreement were 0.10–0.20 higher than those for inter-observer agreement.
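
As a rough sketch of how such confidence intervals can be obtained, the following uses a commonly quoted large-sample approximation to the standard error of kappa. This is illustrative only: it is a simplification of the asymptotic variance given by Fleiss, and the input values are hypothetical rather than taken from the three studies.

```python
import math

# Approximate 95% confidence interval for kappa using the simple large-sample
# standard error  SE(kappa) ~ sqrt( p_o * (1 - p_o) / ( n * (1 - p_e)^2 ) ).
# Illustrative only; p_o, p_e and n below are hypothetical values.

def kappa_with_ci(p_o, p_e, n, z=1.96):
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, (kappa - z * se, kappa + z * se)

kappa, (lower, upper) = kappa_with_ci(p_o=0.85, p_e=0.50, n=100)
print(f"kappa = {kappa:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# kappa = 0.70, 95% CI 0.56 to 0.84 for these hypothetical inputs
```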

Table 3 Observer agreement: kappa statistics (95% confidence intervals)

Discussion

The quality of the methods and reporting of the studies was not consistently high. Only six of the 10 studies satisfied the inclusion criteria for the review. The absence of any of the validity criteria used in this study (independent reading of radiographs, the use of a clinical population with an appropriate spectrum of disease, and description of the study population and of the criteria for a test result) has been found empirically to lead, on average, to overestimation of test accuracy when a test is compared with a reference standard [1]. A similar effect may apply to the estimation of inter-observer agreement: two observers may agree with each other more often when each is aware of the other's assessment, and radiographs drawn from separate populations of normal children and children known to be affected will exclude many of the equivocal radiographs found in a usual clinical population, thereby possibly falsely increasing agreement. Only four of the ten studies described criteria for the radiological signs, with potential negative implications for both the validity and the applicability of the remaining studies.

The data from the included studies suggest a pattern of kappas in the region of 0.80 for individual radiographic features and 0.30–0.60 for composite assessments of features. A kappa of 0.80 (i.e. agreement reaching 80% of the maximum achievable beyond chance) is regarded as "good" or "very good", and 0.30–0.60 as "fair" to "moderate" [17]. The small number of studies in this review, however, means that the detection and interpretation of such patterns remains speculative. Only two radiographic features were examined by more than one study. There is thus insufficient information to comment on heterogeneity of observer variation in different clinical settings.
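
As a worked illustration of this interpretation (the 50% chance-agreement figure is hypothetical, not drawn from the included studies), a kappa of 0.80 arises when observers agree on 90% of radiographs but 50% agreement would be expected by chance alone:

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.90 - 0.50}{1 - 0.50} = 0.80$$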

The range of kappas overall is similar to that found by other authors for a range of radiographic diagnoses [7]. However, "good" and "very good" agreement does not necessarily imply high validity (closeness to the truth). Observer agreement is necessary for validity, but observers may agree and nevertheless both be wrong.

Conclusions

Little information was identified on inter-observer agreement in the assessment of radiographic features of lower respiratory tract infections in children. Where available, agreement varied from "fair" to "very good" according to the feature assessed. Insufficient information was identified to assess heterogeneity of agreement in different clinical settings.

Aspects of the quality of methods and reporting that need attention in future studies are independent assessment of radiographs, the study of a usual clinical population of patients and description of that population, description of the criteria for radiographic features, assessment of intra-observer variation and reporting of confidence intervals around estimates of agreement. Specific description of criteria for radiographic features is particularly important, not only because of its association with study validity but also to enable comparison between studies and application in clinical practice.