Discordant and false-negative interpretations at digital breast tomosynthesis in the prospective Oslo Tomosynthesis Screening Trial (OTST) using independent double reading

Objectives To analyze discordant and false-negative interpretations at double reading digital breast tomosynthesis (DBT) versus digital mammography (DM), including reading times, in the Oslo Tomosynthesis Screening Trial (OTST), and to reclassify these in a retrospective reader study as missed, minimal sign, or true-negatives.
Methods The prospective OTST comparing double reading DBT vs. DM had a paired design with four parallel arms: DM, DM + computer-aided detection, DBT + DM, and DBT + synthetic mammography. Eight radiologists interpreted images in batches using a 5-point scale. Reading time was automatically recorded. A retrospective reader study including four radiologists classified screen-detected cancers with at least one false-negative score and screening examinations of interval cancers as negative, non-specific minimal sign, significant minimal sign, or missed; the two latter groups were defined as “actionable.” Statistics included chi-square, Fisher’s exact, McNemar’s, and Mann–Whitney U tests.
Results Discordant rate (cancer missed by one reader) for screen-detected cancers was overall comparable (DBT, 31% [71/227] vs. DM, 30% [52/175], p = .81), significantly lower at DBT for spiculated cancers (DBT, 19% [20/106] vs. DM, 36% [38/106], p = .003), but high (57% [28/49], p = .001) for DBT-only detected spiculated cancers. Reading time and sensitivity varied among readers. False-negative DBT-only detected spiculated cancers had shorter reading time than true-negatives in 46% (13/28). Retrospective evaluation classified the following DBT exams as “actionable”: three missed by both readers, 95% (39/41) of discordant cancers detected by both modes, all 30 discordant DBT-only cancers, and 25% (13/51) of interval cancers.
Conclusions Discordant rate was overall comparable for DBT and DM, significantly lower at DBT for spiculated cancers, but high for DBT-only detected spiculated lesions. Most false-negative screen-detected DBT exams were classified as “actionable.”
Clinical relevance statement Retrospective evaluation of false-negative interpretations from the Oslo Tomosynthesis Screening Trial shows that most discordant and several interval cancers could have been detected at screening. This underlines the potential for modern AI-based reading aids and triage, as high-volume screening is a demanding task.
Key Points
• Digital breast tomosynthesis (DBT) screening is more sensitive and has higher specificity compared to digital mammography screening, but high-volume DBT screening is a demanding task which can result in a high discordance rate among readers.
• Independent double reading DBT screening had an overall discordance rate comparable to that of digital mammography, lower for spiculated masses seen on both modalities, and higher for small spiculated cancers seen only on DBT.
• Almost all discordant digital breast tomosynthesis-detected cancers (72 of 74) and 25% (13 of 51) of the interval cancers in the Oslo Tomosynthesis Screening Trial were retrospectively classified as actionable and could have been detected by the readers.


Introduction
Digital breast tomosynthesis (DBT, 3D) has emerged as a new screening technique since it has the potential to resolve limitations of conventional digital mammography (DM, 2D). DBT showed reduced recall rates, especially in retrospective US studies, and increased cancer detection in prospective European trials using double reading [1]. Meta-analysis has shown little evidence of a difference between DBT and DM in interval cancer rate [2], but a recent trial reported a reduced rate in DBT screening [3].
Missed cancers are caused by detection (perception) or interpretation (classification) error [4], both representing a challenge in high-volume screening using batch reading. Studies have shown improved interreader reliability using DBT compared to DM, with increased confidence for architectural distortion, a commonly missed abnormality at screening [5]. DM studies reported that about 50% of interval cancers are visible on prior screening mammograms, of which 30% are "minimal sign" lesions and 20% false-negatives [6]. Population-based 2D screening studies found 23% of cancers detected by only one of two readers [7]. One prospective DBT study reported a significant decrease in discordant recalls for cancers, suggesting that the usefulness of double reading is reduced using DBT [8]. However, the challenges of perception and interpretation errors are potentially greater in DBT than in DM screening due to more images, complex hanging protocols, and long interpretation times with reader fatigue.
There is a lack of knowledge from double reading DBT versus DM screening regarding discordant (cancer missed by one reader) and false-negative (cancer missed by both readers) interpretations. The aim of our study was to compare discordant and false-negative interpretations, including reading times, in high-volume screening using batch reading.

Materials and methods
The Oslo Tomosynthesis Screening Trial (OTST) was approved by the regional ethical committee (clinicaltrials.gov, NCT01248546). Hologic sponsored the study by providing equipment and financial support for additional image evaluation. The authors had full control of all data. OTST results have been reported [9][10][11][12][13][14]. This article presents unpublished data.

Study participants
The prospective OTST invited women aged 50-69 to two-view mammography. During the study period (November 22, 2010, to December 19, 2012), 59,009 women were invited and 34,740 (58.9%) attended. Women were asked to participate and undergo DBT in addition to DM if radiographers and imaging systems were available. Women with pacemakers, disabled women unable to stand, and women with implants were excluded. A total of 24,301 women were included in the OTST.

Imaging procedures
Examinations were performed using Hologic Dimensions systems using standard screening exposure control ("auto filter"). Craniocaudal and mediolateral oblique views of each breast were obtained using combo mode (same compression DM and DBT).

Training and image evaluation
Eight radiologists (P.S., E.E.) participated in the OTST, and received training in DBT interpretation using fixed trial hanging protocols 2 weeks prior to the OTST. Training included 100 screening exams enriched with cancers. The radiologists had 2-31 years' experience in screening mammography.
Each screening exam was independently interpreted in parallel by four different radiologists in batch mode (usually 60 to 80 women per day) using four dedicated workstations, one for each trial arm: (1) DM; (2) DM plus computer-aided detection (CAD) (ImageChecker 9.3, Hologic); (3) DBT plus DM; and (4) DBT plus synthetic mammography (SM). Balanced assignment of radiologists with respect to arm was difficult in daily practice. After closing their session, radiologists had no access to other readers' ratings. Each radiologist rated exams per breast using the 5-point scale for probability of cancer implemented in the Norwegian program: 1 = negative or definitely benign; 2 = probably benign; 3 = indeterminate; 4 = probably malignant; and 5 = malignant. Scores 2-5 are positive. We defined "discordant miss" as a screening-detected cancer given a highly suspicious score (4 or 5) by one radiologist and missed (score 1) by the second reader. Double reading was considered positive if either of the constituent arms (DM: arm 1 or 2, DBT: arm 3 or 4) was positive.
Mammographic findings (circumscribed mass, spiculated mass, architectural distortion, calcifications ± density) were specified for each positive score. Spiculated mass and architectural distortion are merged in our analyses as spiculated cancers. Scores were recorded directly into the national screening database and locked after each session. Interpretation time was recorded automatically.
Examinations with at least one score of 2 or greater were discussed at a consensus-based meeting (minimum two readers participating) with all data available, with a decision to dismiss or invite for diagnostic work-up. The consensus meeting was free in its decision, but in general all exams with score 4 or 5 were recalled. Short-term follow-up was not given. Breast density (BI-RADS 4th edition) was given at the consensus meeting (5th edition not used at the beginning of the OTST). Default hanging protocols were preset for all arms, including 4 steps for both DM arms and 8 steps for both DBT arms. All readers used manual scrolling and rarely included slabs.
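The scoring rules above can be written as a small decision function. The following is a minimal sketch, not the trial's software: the names `PairOutcome`, `classify_pair`, and `is_discordant_miss` are hypothetical, and the functions are meant to be applied only to the two reader scores of a confirmed cancer within one reading mode.

```python
from enum import Enum

POSITIVE_THRESHOLD = 2  # scores 2-5 are positive on the 5-point scale


class PairOutcome(Enum):
    CONCORDANT = "concordant"          # both readers positive
    DISCORDANT = "discordant"          # exactly one reader positive
    FALSE_NEGATIVE = "false-negative"  # both readers negative


def is_positive(score: int) -> bool:
    """Return True for scores 2-5; reject values outside the 1-5 scale."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be 1-5, got {score}")
    return score >= POSITIVE_THRESHOLD


def classify_pair(score_a: int, score_b: int) -> PairOutcome:
    """Classify the two independent reader scores for one cancer."""
    n_positive = int(is_positive(score_a)) + int(is_positive(score_b))
    if n_positive == 2:
        return PairOutcome.CONCORDANT
    if n_positive == 1:
        return PairOutcome.DISCORDANT
    return PairOutcome.FALSE_NEGATIVE


def is_discordant_miss(score_a: int, score_b: int) -> bool:
    """Highly suspicious score (4 or 5) by one reader, score 1 by the other."""
    is_positive(score_a), is_positive(score_b)  # validate the range only
    return max(score_a, score_b) >= 4 and min(score_a, score_b) == 1
```

Under these rules, a cancer scored 4 and 1 is both discordant and a "discordant miss," whereas a cancer scored 3 and 1 is discordant only.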

Histopathology and definitions
Cancers (n = 230) were confirmed through pathology. Follow-up was 24 months from screening. Cases were verified as negative by querying the national cancer registry. Interval cancer was defined as malignancy after negative screening before the next scheduled examination.

Reclassification of screening examinations
Screening examinations (DM and DBT) of all screening-detected cancers with at least one false-negative score (n = 130) and all interval cancers (n = 51) were mixed with normal/benign cases (n = 59). In a retrospective reader study carried out more than 6 years after the OTST (in March 2019), four radiologists (S.Y., T.L., E.E. [participated in the OTST], S.B.) independently reviewed these exams (blinded to trial interpretations). If a suspicious lesion was found, the reader had to specify localization and mammographic findings and give a malignancy score using the same 5-point rating scale. Readers first analyzed DM and gave a conclusion before the DBT examination was reviewed. A final consensus session, including all four readers and with all information available, classified findings as follows: negative (cancer in retrospect not visible); non-specific minimal sign (cancer visible but subtle features); significant minimal sign (findings suspicious of cancer); and false-negative ("missed cancer"). The two latter groups were defined as "actionable," i.e., cancer should have been detected at screening.
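The four-group consensus classification and its merging into "actionable" can be expressed compactly. This is an illustrative sketch only; the enum and function names are hypothetical and do not come from the trial.

```python
from enum import Enum
from typing import Iterable


class Retrospective(Enum):
    NEGATIVE = "negative"                            # not visible in retrospect
    NONSPECIFIC_MINIMAL_SIGN = "non-specific minimal sign"
    SIGNIFICANT_MINIMAL_SIGN = "significant minimal sign"
    MISSED = "missed"                                # false-negative


# The two latter groups are merged as "actionable".
ACTIONABLE = frozenset(
    {Retrospective.SIGNIFICANT_MINIMAL_SIGN, Retrospective.MISSED}
)


def is_actionable(classification: Retrospective) -> bool:
    """True if the cancer should have been detected at screening."""
    return classification in ACTIONABLE


def actionable_share(classifications: Iterable[Retrospective]) -> float:
    """Fraction of reviewed exams classified as actionable."""
    items = list(classifications)
    if not items:
        raise ValueError("no classifications given")
    return sum(is_actionable(c) for c in items) / len(items)
```

Applied to the 51 interval cancers, 13 actionable exams give a share of 13/51, which rounds to 25%.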

Statistical analyses
Comparison of unpaired ratios was performed (B.H.Ø.) using the chi-squared test, or Fisher's exact test if expected counts were low. If the ratios were paired, comparison was carried out using McNemar's test. Comparison of reading times was performed using the Mann-Whitney U test. A p-value ≤ .05 was considered statistically significant. The analysis was performed using Stata 17 (StataCorp).
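For readers who wish to see how an unpaired comparison of two proportions works, the following is a minimal sketch of a chi-squared test on a 2 × 2 table using only the Python standard library. It is not the Stata procedure used in the trial; the function name `chi2_2x2` and the choice of Yates' continuity correction are assumptions, so the result should not be expected to match the reported p-values exactly.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int, yates: bool = True):
    """Chi-squared test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With yates=True, Yates' continuity
    correction is applied (an assumption, not stated in the article).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / denom
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Overall discordant rate: DBT 71 of 227 vs. DM 52 of 175.
stat, p = chi2_2x2(71, 227 - 71, 52, 175 - 52)
print(f"chi2 = {stat:.3f}, p = {p:.2f}")
```

With these counts the difference is far from the .05 threshold, consistent with the comparable overall discordant rates reported for DBT and DM.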

Results

Women included
Among 34,740 attending women, the following were excluded: 8824 women who underwent DM only, six with non-cancer malignancies, two with palpable cancer, one with local recurrence, all second exams of 1603 women attending twice, and three women scheduled for recall who did not return for work-up. Hence, 24,301 women represent the study population (Fig. 1).

Screening-detected cancers
A total of 20,507 (84.4%) of 24,301 women had a negative score in all four arms and 3794 (15.6%) a positive score in at least one arm, of which 2856 (75.3%) were dismissed and 938 recalled (all arms recall 3.9%) at consensus. Screening-detected cancers were diagnosed in 230 women, of whom four had bilateral cancer (Fig. 1). DBT double reading detected 227 cancers (arm C 194, arm D 189, cancer detection rate 9.3 per 1000 exams) and DM double reading detected 175 cancers (arm A 146, arm B 152, cancer detection rate 7.2 per 1000 exams) (relative increase 18.5%; McNemar, p < 0.001). Three cancers were detected on DM only, 172 on DBT and DM, and 55 on DBT only (Fig. 2A).
For 49 DBT-only detected spiculated cancers, mean breast density was comparable for the concordant group (n = 21; mean score, 2.62) and the discordant group (n = 28; mean score, 2.57). Impact of breast density on performance in the OTST has been published [13].
Most missed screening-detected cancers were invasive ductal carcinomas with or without ductal carcinoma in situ (Table 1). DBT detected significantly more invasive lobular cancers than DM (28/30 vs. 19/30, McNemar's, p = 0.02). Three cancers missed at DBT but detected at DM included two invasive lobular cancers and one tubular carcinoma. Spiculated lesions and cancers with calcifications were dominant among discordant cancers (Fig. 3). No cancer presenting with calcifications was missed at double reading DBT, but the discordant rate was 33% (20/61).
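The detection rates and the paired comparison in this paragraph can be checked with a few lines of arithmetic. The sketch below uses the counts given above (227 and 175 cancers among 24,301 women; 55 cancers detected on DBT only and 3 on DM only). The helper names are hypothetical, and the asymptotic McNemar statistic without continuity correction is an assumption about the variant used.

```python
import math


def per_1000(events: int, exams: int) -> float:
    """Rate per 1000 screening examinations."""
    return 1000 * events / exams


def mcnemar(b: int, c: int):
    """Asymptotic McNemar test on the two discordant-pair counts.

    b and c are the numbers of cases positive in one mode only.
    Returns (statistic, p_value) using chi-squared with 1 df.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs")
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


exams = 24_301
print(f"DBT: {per_1000(227, exams):.1f} per 1000")    # 9.3
print(f"DM:  {per_1000(175, exams):.1f} per 1000")    # 7.2
print(f"Recall: {100 * 938 / exams:.1f}%")            # 3.9

stat, p = mcnemar(55, 3)  # DBT-only vs. DM-only detected cancers
print(f"McNemar chi2 = {stat:.1f}, p < 0.001: {p < 0.001}")
```

Only the discordant pairs (cancers found by one mode but not the other) enter McNemar's test; the 172 cancers detected by both modes do not affect the statistic.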

Reading time
Median reading time for true-negatives (2 years follow-up) given score "1" by all readers was 25 and 28 s for DM and DM + CAD, and 62 and 58 s for DBT + DM and DBT + SM, respectively. There was great variation for true-negatives and false-negatives among readers (Table 3). Reading times for true-positives were longer for both DM and DBT: 88 and 106 s for DM and DM + CAD, and 151 and 146 s for DBT + DM and DBT + SM, respectively (Fig. 4A). False-negative reading times were longer than true-negative times: 48 versus 32 s for DM (arms A and B) and 78 versus 76 s for DBT (arms C and D) (Fig. 4A). Proportions of false-negative scores with shorter interpretation time than for true-negatives were comparable for both reading modes, 20% (10/50) for DM (arms A and B) and 27% (11/41) for DBT (arms C and D) (chi-squared, p = 0.54). For DBT-only detected spiculated cancers, median reading times for true-positives and false-negatives were 224 vs. 58.5 s (p < 0.001, Mann-Whitney U test) for arm C and 208 vs. 69.5 s (p < 0.001) for arm D (Fig. 4A). Individual pairs of discordant DBT-only detected spiculated cancers revealed large differences in reading times, with 46% (13/28) of tumors classified "actionable" having false-negative reading times shorter than the median reading time for true-negatives (Fig. 4B).
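The comparison of individual false-negative reading times against the true-negative median can be sketched as follows. The median values per arm are those reported above; the function names and the per-arm counts passed to `pooled_share` (7 of 12 in arm C and 6 of 16 in arm D, as given in the caption of Fig. 4B) are used here purely for illustration.

```python
# Median true-negative reading time (seconds) per arm, as reported above.
TN_MEDIAN_SECONDS = {"A": 25, "B": 28, "C": 62, "D": 58}


def faster_than_tn_median(arm: str, reading_time_s: float) -> bool:
    """True if a read was shorter than the arm's true-negative median."""
    return reading_time_s < TN_MEDIAN_SECONDS[arm]


def pooled_share(counts: dict) -> float:
    """Pool (n_faster, n_total) counts across arms into one fraction."""
    faster = sum(n_faster for n_faster, _ in counts.values())
    total = sum(n_total for _, n_total in counts.values())
    if total == 0:
        raise ValueError("no reads given")
    return faster / total


# Median false-negative times for DBT-only detected spiculated cancers.
print(faster_than_tn_median("C", 58.5))  # True: 58.5 s < 62 s
print(faster_than_tn_median("D", 69.5))  # False: 69.5 s > 58 s

share = pooled_share({"C": (7, 12), "D": (6, 16)})
print(f"{100 * share:.0f}%")  # 46%
```

A short false-negative read is more suggestive of a perception error, whereas a long one suggests that the lesion was seen and dismissed; the threshold used here is only the descriptive cut-off applied in Fig. 4B.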

Discussion
There is a lack of knowledge from prospective trials regarding false-negative interpretations at double reading DBT versus double reading DM. The OTST found that DBT double reading detected significantly more cancers than DM, but the overall discordant rate (cancer missed by one reader) for screening-detected cancers was comparable (DBT 31% [71/227] vs. DM 30% [52/175], p = 0.81). Discordant rate for spiculated cancers detected by both modes was significantly lower for DBT (DBT 19% [20/106] vs. DM 36% [38/106], p = .003) but was high (57% [28/49]) for DBT-only detected lesions. Rate of "discordant miss" (highly suspicious score by one reader and false-negative by the other) was more than twice as high for DBT. Retrospective classification of DBT screening exams classified 95% (39/41) of discordant cancers detected by both modes, 100% (30/30) of discordant DBT-only detected cancers, and 25% (13/51) of interval cancers as "actionable," with a potential reduction of DBT interval cancer rate by 25%. Reading time for "actionable" discordant DBT-only detected spiculated cancers revealed large differences between true-positives and false-negatives, with nearly half of the false-negatives having reading times shorter than true-negatives.
Our results neither confirm decreased DBT interobserver variability previously reported [5] nor reduced usefulness of double reading with DBT [8]. The DBT discordant rate was high (31% [48/155]) for spiculated cancers. DBT may demonstrate small underlying masses in cancers presenting as architectural distortion on DM [15], and higher conspicuity and visibility of desmoplastic lesions on DBT have improved detection and reader confidence [16,17]. Nevertheless, small spiculated cancers are occasionally seen on only one DBT view and a few slices [15,18], causing such lesions to represent a perception and interpretation challenge [19]. The high discordant rate in our study for DBT-only detected spiculated cancers (57% [28/49]) agrees with this experience. There have been no previous studies presenting discordant interpretations in DBT screening. Our 2D discordant rate is comparable with results reported in 2D population-based screening using a similar rating scale [20]. Regarding the high discordant rate for cancers with calcifications using DBT, we suggest this is caused by interpretation rather than perception error.
Missed screening cancer might be caused by perception (detection) or cognitive (interpretation) error. Perceptual errors, considerably more common than interpretation errors, occur when an abnormality is determined to be present in retrospect but was not detected prospectively. Reasons for these missed cancers include poor lesion conspicuity, radiologist fatigue, and workplace distraction or interruptions [4]. The most commonly missed and misinterpreted lesions include benign-appearing masses, one-view findings, developing asymmetries, subtle calcifications, and architectural distortion [21]. Architectural distortion is one of the most frequently missed signs of breast cancer, and DBT may demonstrate suspicious lesions that are occult to DM [22]. A missed cancer at DBT seems to be related to interpretative error regarding clearly visible lesions, a problem that may be reduced with increased experience [23]. Several of the commonly missed lesions listed above might be correctly diagnosed using supplemental imaging (including fine-focus magnification views or ultrasound) or even at short-term follow-up. It is, however, important to keep in mind that the decision at our consensus meetings is purely binary (case recalled or dismissed) without including any report or description, and short-term follow-up for indeterminate findings (corresponding to BI-RADS category 3) is never used.

Table 4
Retrospective classification of discordant and false-negative cancers
Discordant (Disc) = missed by one and false-negative (FN) = missed by both readers at double reading screening examinations among the 230 screening-detected cancers and the 51 interval cancers. DM (2D), double reading digital mammography; DBT (digital breast tomosynthesis), double reading DBT (3D). Retrospective classification included four groups, and the two latter (significant minimal sign and overlooked) are merged as "actionable" (i.e., these cancers should have been detected at screening). TP = cancer detected at independent double reading, i.e., by one or both readers

Interval cancers and long-term outcomes after DBT screening have been a hot topic. Our retrospective analysis classifying 13 of 51 prior DBT screening exams as "actionable," with a potential to reduce interval cancer rate by 25%, is of interest. To the best of our knowledge, only the Malmø trial [3] has reported a reduction of interval cancer rate in DBT screening. A recently published large retrospective US study found no significant difference in the rates of screening-detected advanced cancers or interval cancers [24]. The prospective Italian RET trial reported that DBT screening in younger women (age 45-49) and women with dense breasts having higher cancer detection at baseline was followed by a lower incidence of interval cancers [25].
Our median reading times for true-negative DBT exams are comparable with the 56-77 s reported in other screening studies [26][27][28], although shorter times have been reported [29]. Longer reading time for true-positives is expected, but longer times for false-negatives are more interesting because readers might have seen the suspicious lesion but, after some consideration, dismissed the finding. A median (or mean) value does not reflect the complexity of missed cancers, and we observed large variation in reading times among radiologists. Discrepancy for pairs of discordant DBT-only detected spiculated cancers may indicate that some false-negatives with long reading times represent interpretation errors whereas short reading times represent perception errors due to fast reading. Rush and fatigue might cause readers to rely too much on poor 2D cancer conspicuity with incomplete analysis of DBT. We noticed that poor cancer visibility on DM was associated with discordant rates at DBT, but our numbers are insufficient for final conclusions.
Screening studies comparing DBT versus DM have so far paid little attention to the image interpretation process itself. Suboptimal reading environment, heavy workload using batch reading, and incomplete use of the complex hanging protocols with many DBT images may all cause cancers to be missed. An increased number of false-negatives throughout a batch and reduced reading time for later image positions within a batch were reported in DM screening [30]. High-volume DBT screening using batch reading requires greater cognitive resources than DM and might exacerbate the association between fatigue and reader performance [31]. An experimental study found that readers were beginning to show signs of visual fatigue after 20 DBT cases [32]. Small cancers are often difficult to identify, and one study reported a higher proportion of cancers detected on only one view at DBT (10.5%) than at DM screening (4.7%) [33]. Small spiculated cancers seen on only a few slices require a systematic use of hanging protocols of both views in order not to miss cancers even in women with fatty breasts (Fig. 3). Simplified hanging protocols using slabs only may reduce reading time but have a negative impact on sensitivity [34]. On the other hand, DBT screening strategies using artificial intelligence (AI)-based systems have the potential to improve cancer detection and reduce reading time and workload, and could allow for more cost-effective breast cancer screening with DBT [35].
Our study has several limitations. First, it was conducted at a single institution with equipment from a single vendor. Second, radiologists often carried out image evaluation during overtime, and fatigue and rush may have contributed to false-negatives. Third, the study used no short-term follow-up, which might have influenced decision-making. Fourth, the two DBT arms were not identical, using DM in one and SM in the other arm in combination with DBT, but studies have shown comparable diagnostic accuracy using these two reading modes [11,36].
In conclusion, overall discordant rate was comparable for DBT and DM, but significantly lower for DBT in spiculated cancers detected at both modes. Retrospective analysis of screening exams at baseline showed that DBT screening had a potential to reduce the interval cancer rate. DBT-only detected spiculated lesions revealed a high discordant rate. False-negatives remain a major challenge in DBT screening. Most false-negative or discordant DBT exams were retrospectively classified as "actionable." High-volume DBT screening using batch reading is a demanding task, and future studies should consider how implementation of artificial intelligence-based computer-aided detection and simplified hanging protocols could contribute to reduced workload and improved accuracy.

Fig. 2 A, B Flowchart comparing results of double reading digital mammography (DM or 2D) versus double reading digital breast tomosynthesis (DBT or 3D). A For all 230 screening-detected cancers (SDC). B For 158 SDC presenting as spiculated cancer (spiculated mass or architectural distortion). True-positive: both readers (concordant) or one reader (discordant) had true-positive score. False-negative: both readers had negative score

Fig. 3 Screening images of a 67-year-old woman with a cancer in the left breast detected by digital breast tomosynthesis (DBT) only in arm C (DBT + digital mammography). A Craniocaudal (CC) and (D) mediolateral oblique (MLO) images show a small nonspecific density posteriorly (box) among several others in the breast. Zoomed DM images (B, E) show a nonconclusive small mass but zoomed DBT images (C, F) demonstrate a small spiculated mass highly suggestive of cancer. Histology: invasive ductal carcinoma 7 mm, grade 3, no axillary lymph node metastases

Fig. 4 A Boxplot showing reading time with interquartile range for true-negative (TN), true-positive (TP), and false-negative (FN) interpretations in the four arms (arm A: digital mammography DM; arm B: DM + CAD; arm C: digital breast tomosynthesis DBT + DM; arm D: DBT + synthetic mammography SM), and for TP and FN interpretations of DBT-only detected spiculated cancers. Median reading time for TN interpretation in all four arms (n = 20,140): arm A = 25 s; arm B = 28 s; arm C = 62 s; arm D = 58 s. The number of TP/FN in the four arms: arm A 146/84; arm B 152/78; arm C 194/36; and arm D 189/41, respectively. Numbers of TP/FN for DBT-only detected spiculated cancers (n = 49) are as follows: arm C 37/12 and arm D 33/16. Outliers are indicated by points above the quartile range. B Bar graph showing reading time (seconds) for the 28 discordant pairs of DBT-only detected spiculated masses and architectural distortions, including 12 false-negatives in arm C and 16 in arm D. Seven (7/12 or 58%) of FN in arm C and six (6/16 or 38%) of FN in arm D had shorter reading times than the median reading time for TN interpretation (indicated by horizontal lines)

Fig. 5 Screening images of a 67-year-old woman with interval cancer 6 months later. Digital mammography (DM) and digital breast tomosynthesis (DBT) screening examinations of the right breast are presented. Readers in all four arms gave a normal score. A-C Craniocaudal (CC) and (D-F) mediolateral oblique (MLO) views. Zoomed DM images show a suspicious finding on CC view (B) but normal findings on MLO view (E) (circle). Zoomed DBT images (C, F) demonstrate a spiculated mass consistent with cancer on both views (circle). Screening examination was retrospectively classified as non-specific minimal sign at DM and as missed cancer at DBT. Histology: multifocal invasive ductal carcinoma (foci 11, 5, and 5 mm), grade 2, and two axillary lymph node metastases

Table 2
Screening examinations and screening-detected cancers (SDC) interpreted by each reader and each arm in the Oslo Tomosynthesis Screening Trial (OTST)

Table 3
Median interpretation time (seconds) for true-negative and for false-negative interpretations by each reader and for each arm
TN = true-negative score in all four arms (n = 20,140). DM, digital mammography; CAD, computer-aided detection; DBT, digital breast tomosynthesis; SM, synthetic mammography
*One radiologist (excluded from analysis) read only one session (arm B) which included one cancer, thus 77/229 cancers in arm B