Introduction

In the industrialised world, demand for radiology resources is rising as ever more images are produced, which has led to a relative scarcity of radiologists. With limited resources, it is important to question and evaluate work routines in order to provide settings for high-quality output and high cost-effectiveness, while keeping medical standards high and avoiding costly lawsuits. One way to increase the quality of radiology reports may be double reading of studies between peers, i.e. two radiology specialists of similar and appropriate experience reading the same study.

Most radiologists hold a very firm view on the concept of double reading, either for or against. Arguments for are that it reduces errors and increases quality in radiology. Arguments against are that it does not increase quality significantly and that it consumes time and resources. Despite these firm beliefs, there is comparatively scant evidence supporting either view, and both systems are widely practised [1]. In some radiology departments or department sections, it is accepted that no systematic double reading is performed between specialists of similar expertise, or between specialists above a certain level of expertise. In other departments, such double reading between peers is mandatory. A survey among Norwegian radiologists reported a double reading rate of 33% of all studies [1], which is consistent with a previous Norwegian survey [2].

The concept of observer variation in radiology was introduced in the late 1940s when tuberculosis screening with mass chest radiography was evaluated [3, 4]. In a comparison between four different image types (35-mm film, 4 × 10-inch stereophotofluorogram, 14 × 17-inch paper negative, 14 × 17-inch film), it was discovered that the observer variation was greater than the variation between image types [3]. The authors recommended that “In mass survey work … all films be read independently by at least two interpreters”. Double reading in mammography and other types of radiologic screening is, however, not the purpose of the current study, since the approach of the observer in screening work is different from that in clinical work. In screening, the focus leans towards finding true positives and avoiding false negatives, whereas in clinical work false-positive and true-negative findings are also important. Nor is the purpose of the current study to evaluate double reading in a learning situation, such as the double reading of residents’ reports by specialists in radiology. In such cases, the report and findings of a resident are checked by a more experienced colleague. This has an educational purpose and serves to improve the final report, with the ultimate aim of better healthcare and a better patient outcome. The value of such double reading is hardly debatable.

Double reading can be broadly divided into three categories: (1) both primary and secondary reading by radiologists of the same degree of sub-specialisation, in consensus, or serially with or without knowledge of the contents of the first report; (2) secondary reading by a radiologist of a higher level of sub-specialisation; (3) double reading of resident reports [5].

The concept of double reading is at times confusing and can apply to several practices.

In screening, the concept of double reading implies that if both readers are negative, the combined report is negative. If one or both readers are positive, the report is positive (i.e. the “Or” rule or “Believe the positive”). In dual reading, the two readers reach a consensus over the differing reports [6].

Some studies use arbitration: with conflicting findings, a third reader considers each specific disagreement and decides whether the reported finding is present or not. Similar to this is pseudo-arbitration: with conflicting findings, the independent and blinded report of a third reader casts the deciding “vote” in each dispute between the original readers. In contrast to the “true arbitration” model, the third reader is not aware of the specific disagreement(s) [7]. These concepts are summarised in Table 1.

Table 1 Various applications of single and double reading
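As an illustration only, the combination rules described above can be written out for a single binary finding (present or absent). This is a minimal sketch, not a procedure taken from any of the cited studies: the function names are hypothetical, and the consensus discussion and the arbiter's decision are represented by caller-supplied functions because their outcome cannot be derived from the two original reports.

```python
from typing import Callable


def or_rule(reader_a: bool, reader_b: bool) -> bool:
    """Screening double reading ("Believe the positive"):
    the combined report is positive if either reader is positive."""
    return reader_a or reader_b


def dual_reading(reader_a: bool, reader_b: bool,
                 consensus: Callable[[], bool]) -> bool:
    """Dual reading: differing reports are settled by a consensus
    discussion between the two original readers (hypothetical callback)."""
    if reader_a == reader_b:
        return reader_a
    return consensus()


def arbitration(reader_a: bool, reader_b: bool,
                arbiter: Callable[[bool, bool], bool]) -> bool:
    """True arbitration: a third reader is shown the specific
    disagreement and decides whether the finding is present."""
    if reader_a == reader_b:
        return reader_a
    return arbiter(reader_a, reader_b)


def pseudo_arbitration(reader_a: bool, reader_b: bool,
                       reader_c: bool) -> bool:
    """Pseudo-arbitration: an independent, blinded third report casts
    the deciding vote; the third reader never sees the disagreement."""
    if reader_a == reader_b:
        return reader_a
    return reader_c


# One reader calls the finding, the other does not.
print(or_rule(True, False))                    # True
print(pseudo_arbitration(True, False, False))  # False
```

For a binary finding, pseudo-arbitration is equivalent to a majority vote among three independent reports, whereas true arbitration lets the third reader weigh the specific point of disagreement.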

Considering the paucity of evidence either for or against double reading among peers in clinical practice, the purpose of the current study was to gather, through a systematic review of the available literature, evidence for or against double reading of imaging studies by peers and to assess its potential value. A secondary aim was to evaluate double reading with the secondary reading being performed by a sub-specialist.

Materials and methods

The study was registered in PROSPERO International prospective register of systematic reviews, CRD42017059013.

The inclusion criterion in the literature search was: studies calculating the rate of misses and overcalls with the aim of establishing the added value of double reading by human observers. The exclusion criteria were: (1) articles dealing solely with mammography; (2) articles dealing solely with screening; (3) articles dealing solely with double reading of residents; (4) articles not dealing with double reading; (5) reviews, editorials, comments, abstracts or case reports; (6) articles without abstract; (7) articles not written in English, German, French or the Nordic languages; (8) duplicate publications of the same data.

Literature search

A literature search was performed on 26 January 2017 in PubMed/MEDLINE and Scopus. The search expressions were a combination of “radiography, computed tomography (CT), magnetic resonance imaging (MRI) and double reading/reporting/interpretation” (Appendix 1).

Both authors read all titles and abstracts independently. All articles that at least one reviewer considered worth including were chosen for reading of the full text. After independent reading of the full text, articles fulfilling the inclusion criteria were selected. Disagreements were resolved by consensus. The material was stratified into two groups depending on whether the double reading was performed by a colleague of similar or higher sub-specialty.

Results

The literature search resulted in 1,610 hits. Another eight articles were added after manual perusal of the reference lists. Of these, 165 articles were chosen for reading of the full text. Forty-six of these fulfilled the inclusion criteria, met none of the exclusion criteria, and were selected for final analysis. The study flow diagram is shown in Fig. 1. Study characteristics and results are shown in Table 2. Excluded articles are shown in Appendix 2.

Fig. 1 Study flow diagram

Table 2 Study characteristics and results

When perusing the material, it was found that the data were insufficient for a meta-analysis. Instead, a narrative summary was made. In the results, two distinct groups of studies appeared: studies reporting double reading by peers of similar competence level and studies reporting the second reading performed by a sub-specialist, often at a referral hospital.

Double reading by peers of similar degree of sub-specialisation

Fifteen articles evaluated double reading in CT.

  • In trauma CT, three papers found initial discordant readings of 26–37% [13,14,15]. However, in one of these articles patient care was changed in only 2.3% by a non-blinded second reader [13]. Eurin et al. [16] reported a high rate of missed injuries initially, predominantly minor and musculoskeletal injuries.

  • In abdominal CT, a discrepancy rate of 17% resulted in 3% treatment change when reviewed by a non-blinded second reader [12]. Five articles evaluated sensitivity and specificity. In CT of ovarian cancer and CT colonography, there was a non-significant trend towards higher sensitivity in double reading [18, 19], but double reading increased the false-positive rate [20].

  • In chest CT for pulmonary nodules, double reading increased sensitivity [8, 22, 23], but computer-aided diagnosis (CAD) was even more beneficial [8, 22]. Another article found clinically important changes in 9% of cases [24].

Eight articles evaluated double reading in radiography.

  • Two articles found negligible improvement by double reading in small-bowel and large-bowel barium studies; one study even reported increased false positives with double reading [27, 28].

  • In chest radiography, Hessel et al. [7] combined independent readings by eight radiologists. Using a third independent interpretation to resolve disagreements between pairs of readers (pseudo-arbitration) was the most effective method overall, reducing errors by 37%, increasing correct interpretations by 18%, and adding 19% to the cost of an error-free interpretation.

  • Quekel et al. [6] reported that double or dual reading increased sensitivity, at the same time reducing specificity.

  • Two articles quoted 3–9% disagreement between observers in general radiography [30, 31].

Mixed modalities.

  • Siegle et al. [33] evaluated general radiology in six departments, and found a mean rate of disagreement of 4.4%.

  • In another large study, 11,222 cases (3.3% of the total production) underwent randomised peer review using a consensus-oriented group review with a rate of discordance (“report should change”) of 2.7% [37].

  • Babiarz and Yousem [35] found 2% disagreement when 1,000 neuroradiology cases were double read by another neuroradiologist, all working in the same institution.

  • In breast MRI, double reading increased sensitivity from 80 to 91%, while reducing specificity from 88 to 81% [34].

  • Agrawal et al. [36] performed parallel dual reporting in teleradiology emergency radiology which resulted in 3.8% disagreements. The authors suggested that abdominal CT and head/spine MRI were the most common error sources and that a focused double reading of error-prone case types may be considered for optimum utilisation of resources.

Second reading by a sub-specialist

  • Six articles reported on abdominal imaging, five of these for distinct conditions, usually malignancy. The discrepancy rates for these varied from about 12% up to 50% [5, 38, 39, 41, 42].

  • Bell and Patel [40] reported on 1,303 cases of body CT with the primary report from non-sub-specialised radiologists and found a higher frequency of clinically relevant discrepancies in the 742 cases that were double read by radiologists with a higher degree of sub-specialisation.

  • In chest radiography, a statistically significantly higher rate of seemingly obvious misdiagnoses was found for non-chest speciality radiologists [43], while a thoracic radiologist had higher sensitivity and reported fewer indeterminate nodules in chest CT for colorectal cancer [44].

  • In neuroradiology, two articles demonstrated the benefit from sub-specialist second opinion [46, 47], while two did not [45, 48].

  • In paediatric radiology, Eakins et al. [49] found a high rate of discrepancies in neuroimaging and body studies, while discrepancies were much rarer in extremity radiography [50]. In abdominal trauma CT, 12 new injuries were found in 98 patients [51].

Discussion

This systematic review found a wide range of significant discrepancy rates, from 0.4 to 22%, with minor discrepancies being much more common. Most of this variability is probably due to study setting. Double reading generally increased sensitivity at the cost of decreased specificity. One area where double reading seems to be important is in trauma CT, which is not surprising considering the large number of images and often stressful conditions under which the primary reading is performed. Thoracic and abdominal CT were also associated with more discrepancies than head and spine CT [54]. Higher rates of discrepancy can be expected in cases with a high probability of disease with complicated imaging findings [5].
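The direction of this trade-off follows from the “Or” rule itself, which can be shown with a short calculation. The sketch below assumes two readers whose errors are independent given the true disease status. That is an idealisation not tested in the included studies, since readers in practice tend to miss the same subtle lesions. The function name and the input values (80% sensitivity, 90% specificity for each reader) are illustrative assumptions, not data from any cited article.

```python
def or_rule_accuracy(se_a: float, sp_a: float,
                     se_b: float, sp_b: float) -> tuple[float, float]:
    """Sensitivity and specificity of the "Or" rule for two readers,
    ASSUMING their errors are independent given true disease status."""
    # A diseased case is missed only if both readers miss it.
    sensitivity = 1 - (1 - se_a) * (1 - se_b)
    # A healthy case stays negative only if both readers call it negative.
    specificity = sp_a * sp_b
    return sensitivity, specificity


# Hypothetical readers: each 80% sensitive and 90% specific.
se, sp = or_rule_accuracy(0.80, 0.90, 0.80, 0.90)
print(f"{se:.2f} {sp:.2f}")  # 0.96 0.81
```

Under this assumption sensitivity can only rise and specificity can only fall. Correlated reader errors would shrink both effects, so the independent case is best read as an upper bound on the gain in sensitivity.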

More surprising was the fact that double reading by a sub-specialist almost invariably changed the initial reports to a high degree, although the second reader was also the reference standard for the study, which might have introduced bias. This leads to the conclusion that it might be more efficient to strive for sub-specialised readers than to implement double reading. It might also be more cost-efficient considering the fact that in one study, double reading of one-third of all studies consumed an estimated 20–25% of all working hours in the institutions concerned [1]. In modern digital radiology it is easy to send images to another hospital, and it should thus be possible to include even small radiology departments in a large virtual department where all radiologists can be sub-specialised. However, even a sub-specialised reader is subject to the same basic reading errors and this needs further study comparing outcomes from various reading strategies.
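The order of magnitude of the workload estimate can be checked with a back-of-the-envelope model. The sketch below rests on simplifying assumptions that are not taken from the survey [1]: only reading time is counted, every primary read costs one unit of time, and a second read costs a fixed multiple of that unit. The function and parameter names are hypothetical.

```python
def second_read_share(fraction_double_read: float,
                      relative_cost: float = 1.0) -> float:
    """Share of total reading time spent on second reads.

    fraction_double_read: proportion of studies that get a second read.
    relative_cost: time for a second read relative to a primary read
                   (ASSUMPTION; 1.0 means a second read takes as long).
    """
    extra = fraction_double_read * relative_cost
    return extra / (1.0 + extra)


# One-third of studies double read, second read as long as the first.
print(f"{second_read_share(1 / 3, 1.0):.0%}")   # 25%
# Same fraction, second read taking three-quarters of the time.
print(f"{second_read_share(1 / 3, 0.75):.0%}")  # 20%
```

These figures fall in the reported 20–25% range, but the survey estimate refers to all working hours rather than reading time alone, so the agreement should be taken as a plausibility check only.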

The primary goal of the current study was to evaluate double reading in a clinically relevant context, i.e. where the second reader double-reads the case in a non-blinded context before the report is finalised. Only two studies used a method approaching this [12, 13]. Reinterpretation of body CT in another hospital was beneficial [12] but double reading of abdominal and pelvic trauma CT resulted in only 2.3% changes in patient care [13].

One method for peer review of radiology reports is error scoring such as is practiced in the RadPeer program [55]. This differs from clinical double reading in that it does not confer direct benefit for the patient at hand. The use of old reports can also be seen as a form of second reading [56].

Double reading has been evaluated in a recent systematic review which dedicated much space to mammography screening [57]. This review suggested further attention to other common examinations and implementation of double reading as an effective error-reducing technique. This should be coupled with studies on its cost-effectiveness. The literature search in the current study resulted in some additional articles and a slightly different conclusion, which is not surprising considering the wide variety of studies included. In a systematic review on CT diagnosis, a major discrepancy rate of 2.4% was found, even lower when the secondary reader was non-blinded [54]. There is also a Cochrane review on audit and feedback which borders on the subject in the current study, even though no radiology-specific articles were included [58]. Errors and discrepancies in radiology have been covered in a recent review article [59].

Observer variation analysis is now customary when evaluating imaging modalities or procedures, or when starting studies on larger image materials [60,61,62], and it is well known that observer variation can be small or large between observers, due to differences in experience and variations in image quality or ease of detection and characterisation of a lesion.

A quality assessment of the individual evaluated articles was not performed in the current study. Such an assessment was judged unlikely to yield meaningful results, owing to the wide variability in subject matter and methods.

A limitation of the study is the widely varying definitions of what constitutes a clinically important discrepancy, which make a meaningful meta-analysis impossible. In studies with a sub-specialised second reader there is a risk that the discrepancy rate is inflated, since the second reader decides what should be included in the report.

In conclusion, the systematic review found, in general, rather low discrepancy rates when double-reading radiological studies. The benefit of double reading must be balanced against the considerable number of working hours a systematic double reading scheme requires. A more profitable scheme might be to use systematic double reading for selected, high-risk examination types. A second conclusion is that there seems to be a value in sub-specialisation for increased report quality. A consistent implementation of this would have far-reaching organisational effects.