Introduction

Studies have shown that breast cancer screening with digital breast tomosynthesis (DBT) in combination with standard digital mammograms (DM) or synthetic 2D mammograms (SM) is associated with higher rates of screen-detected cancer compared to standard DM [1,2,3,4]. The effect of DBT on interval cancer rates is still unclear due to the small number of cases included in studies [5]. Only one study has reported a reduction in interval cancers among those screened with DBT versus DM [6]. In 2021, a meta-analysis of pooled data from prospective European trials and observational US studies showed that screening with DBT resulted in an increase in screen-detected cancers in Europe, while the recall rates decreased to a larger extent in the USA [5].

The Tomosynthesis trial in Bergen (To-Be 1) was a randomized controlled trial (RCT) comparing DBT + SM and DM in BreastScreen Norway, 2016–2017 [7]. The trial did not reveal a statistically significantly higher cancer detection rate for DBT + SM [7]. In the follow-up study, To-Be 2, a prospective cohort study, all women were offered screening with DBT + SM. A higher cancer detection rate was observed in To-Be 2 than in To-Be 1 [8]. It is unclear why there was no difference in cancer detection for DBT + SM compared to DM in To-Be 1 [7, 9].

Retrospective blinded reviews have shown varying rates of false negatives in mammography screening studies due to different study designs and definitions of false-negative examinations [10,11,12,13,14,15,16,17]. Previous informed consensus-based reviews have classified 13–33% of the screen-detected and interval cancers from mammography studies as false negative [14,15,16, 18]. To shed light on possible reasons for nonsignificant differences between DBT + SM and DM screening in To-Be 1, we invited five expert breast radiologists not involved in the To-Be trial to perform a retrospective review of DBT + SM screening examinations resulting in interval cancers detected in To-Be 1 and screen-detected cancers in To-Be 2. The objective was to determine whether interval cancers in To-Be 1 and screen-detected cancers in To-Be 2 were due to interpretive error in the To-Be 1 trial.

Material and methods

The To-Be trials were approved by the Regional Committee for Medical and Health Research Ethics in Norway (no. 2015/424) and registered at ClinicalTrials.gov (NCT02835625 and NCT03669926). All women participating in the To-Be trials signed an informed consent form.

The two prospective trials were performed in Bergen, as a part of BreastScreen Norway, an organized population-based screening program, administered by the Cancer Registry of Norway [7, 19].

The trials are described in detail elsewhere [7, 8]. Briefly, To-Be 1 was a randomized controlled trial (RCT) comparing screening outcomes of DBT + SM with those of standard DM. To-Be 1 recruited women in 2016–2017 and randomly assigned them to screening with either two-view DBT + SM or DM [7]. In the following 2 years, 2018–2019, To-Be 2 was performed, in which all enrolled women were screened with DBT + SM [8]. All screening examinations were independently read by two radiologists using an interpretation score ranging from 1 to 5 for each breast. A score of 1 indicated a negative result, 2 probably benign, 3 intermediate suspicion, 4 probably malignant, and 5 high suspicion of malignancy. All cases with a score of 2 or higher by one or both radiologists were discussed at consensus to determine whether the woman should be recalled. The examinations were performed with GE Healthcare units: SenoClaire in To-Be 1 and Senographe Pristina in To-Be 2.

Blinded review

The blinded individual review included 239 DBT + SM screening examinations from women screened in To-Be 1 with a false-positive screening result (n = 39), interval cancer (n = 19), consecutive round screen-detected cancer (n = 91), and consecutive round negative screen (n = 90) (Fig. 1).

Fig. 1 Study design for the blinded individual and informed consensus-based retrospective review of mammograms performed in To-Be 1

The five external radiologists, not involved in the trials, registered mammographic density (BI-RADS a–d) [20] and described lesions by features (mass, spiculated mass, calcification, asymmetry, architectural distortion, and density with calcification), location in the breast (breast quadrant), visibility on DBT and on SM, and view (craniocaudal, CC, and mediolateral oblique, MLO). The reviewers also recorded their conclusion as a malignancy score of 1–5 for DBT and for SM separately, for each breast. They were not asked to classify the cases as false negative, minimal sign, or true negative.

The definition of false negative was retrospectively created by the authors and included cases with a score of 2 or higher by two or more radiologists (Fig. 2).
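The rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the input layout are hypothetical, and it assumes that each radiologist's readings for one examination have already been reduced to a single highest score, whereas the review recorded scores per breast and per modality.

```python
from typing import Sequence


def is_false_negative_blinded(
    scores: Sequence[int], min_score: int = 2, min_readers: int = 2
) -> bool:
    """Return True if at least `min_readers` radiologists scored `min_score` or higher.

    `scores` holds one interpretation score (1-5) per radiologist
    (assumed here to be the highest score that reader gave the examination).
    """
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("interpretation scores must be between 1 and 5")
    readers_at_or_above = sum(s >= min_score for s in scores)
    return readers_at_or_above >= min_readers


# Hypothetical readings from five reviewers for two examinations.
print(is_false_negative_blinded([1, 2, 1, 3, 1]))  # True: two readers scored >= 2
print(is_false_negative_blinded([1, 1, 1, 4, 1]))  # False: only one reader scored >= 2
```

Changing `min_score` or `min_readers` gives the alternative thresholds (for example, a score of 3 or higher by three or more radiologists) that are tabulated alongside this definition.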

Fig. 2 The definition of false negative cases in the blinded individual review and false negative, minimal sign significant, minimal sign non-specific, and true negative cases in the informed consensus-based review for screening examinations prior to interval cancer and consecutive round screen-detected cancer

Informed review

After the blinded review, the five radiologists performed the informed consensus-based review (Additional file 1: Appendix B). The review included 110 negative screening examinations of women diagnosed with interval or consecutive round screen-detected cancer (Fig. 1). Prior screening mammograms, diagnostic images, and histopathological findings for all 110 cases were available for the radiologists.

Data recorded at the informed review included mammographic features and conclusion for DBT + SM (false negative, minimal sign significant, minimal sign non-specific, and true negative). If all five radiologists scored 1 for both breasts in the blinded review, the examination was considered true negative, and the images were not reviewed (n = 53). If one or more radiologists scored 2 or higher in the blinded review, the examination was discussed (n = 57), and it was jointly decided if the cancer was false negative, minimal sign significant, minimal sign non-specific, or true negative. False negative cases were defined as examinations with obvious findings at the cancer site [12, 21] (Fig. 2).
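The triage step that decided which examinations were discussed at the informed review can be sketched as follows. The function name, the reviewer labels, and the data layout are hypothetical and chosen only for illustration; the final category for a discussed case was a joint human decision and is not modelled here.

```python
from typing import Dict, Tuple


def triage_for_informed_review(readings: Dict[str, Tuple[int, int]]) -> str:
    """Decide whether an examination is discussed at the informed review.

    `readings` maps a reviewer label to that reviewer's blinded scores
    as (left breast, right breast), each on the 1-5 scale.
    """
    all_negative = all(left == 1 and right == 1 for left, right in readings.values())
    if all_negative:
        # All five reviewers scored 1 for both breasts: images not reviewed again.
        return "true negative (not reviewed)"
    # One or more reviewers scored 2 or higher: category decided jointly.
    return "discuss at consensus"


# Hypothetical blinded scores for two examinations.
exam_a = {"R1": (1, 1), "R2": (1, 1), "R3": (1, 1), "R4": (1, 1), "R5": (1, 1)}
exam_b = {"R1": (1, 1), "R2": (1, 2), "R3": (1, 1), "R4": (1, 1), "R5": (1, 1)}
print(triage_for_informed_review(exam_a))  # true negative (not reviewed)
print(triage_for_informed_review(exam_b))  # discuss at consensus
```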

Reviewers’ characteristics and variables of interest

The reviews were performed by five breast radiologists with the following years of experience in screen reading of DM/reading of DBT: AG 16/9, KL 12/12, RM 10/8, TH 13/10, and SRH 11/11. All images were free from clinical annotations. Data about histopathologic tumor characteristics, including tumor diameter (mm), histologic grade (1–3 by Nottingham scale), lymph node status (positive or negative), estrogen, progesterone, and human epidermal growth factor receptor 2 (HER2) status [22], and immunohistochemical subtypes (luminal A, luminal B HER2−, luminal B HER2+, HER2+, and triple negative) [22], were extracted from the cancer registry.

Statistical analyses

The number and proportion of screening examinations scored 2 or higher in the blinded review were presented for examinations prior to interval cancers and consecutive round screen-detected cancer, by radiologist. Results for examinations with a negative and false positive screening result were shown in the Additional file 1: Table C1. Numbers and proportions of examinations prior to interval or consecutive round screen-detected cancer scored 2 or higher in the blinded review by one or more, two or more, and three or more radiologists were presented for DBT + SM, DBT, and SM. The same results were shown for negative and false positive cases in the Additional file 1: Table C2. The proportions were compared using a chi-square test.

Numbers and proportions of screening examinations prior to interval cancer and consecutive round screen-detected cancer were presented for scores of 2 and 3 or higher by one, two, three, or more radiologists. The same results were shown for true negative and false positive examinations in the Additional file 1: Table C3. The number and proportion of false negatives for screening examinations prior to interval cancer and examinations resulting in consecutive round screen-detected cancer were presented.

Numbers and proportions of false negative, minimal sign significant, minimal sign non-specific, and true negative cases among screening examinations prior to interval and consecutive round screen-detected cancer were shown as assigned in the informed review. These examinations are presented by mammographic density and histopathologic tumor characteristics. A two-sided p-value of < 0.05 was considered statistically significant.

Results

Blinded individual review

The blinded individual review included 239 DBT + SM screening examinations from To-Be 1 (Fig. 1).

There was substantial variation in interpretation scores across radiologists (Table 1). For example, for screening examinations prior to interval cancer, Radiologist 4 assigned a score of 2, 3, 4, or 5 once (5.3%), while Radiologist 5 assigned those scores to four cases (21.1%). For consecutive round screen-detected cancer, Radiologist 4 scored 2, 3, 4, or 5 in 16 cases (17.6%), while Radiologist 1 assigned those scores to 45 (49.5%) cases.

Table 1 Number and proportion of screening examinations with a score of 2, 3, 4, and 5 by radiologists in the blinded individual review of 19 screening examinations prior to interval cancer and 91 screening examinations resulting in consecutive round screen-detected cancer

For the negative examinations, the number of cases assigned a score of 2 ranged from 10 (11.1%) to 24 (26.7%) across radiologists (Additional file 1: Table C1). For screening examinations with a false positive result, the number of cases ranged from 10 (25.6%) to 26 (66.7%).

For screening examinations resulting in consecutive round screen-detected cancer (n = 91), the proportion scored 4 or 5 by one or more radiologists was 30.8% for DBT alone versus 8.8% for SM alone (p < 0.001) (Table 2). The proportion scored 3 by two or more radiologists was 16.6% for DBT alone versus 5.5% for SM alone (p = 0.02). Results for negative and false positive screening examinations are shown in Additional file 1: Table C2.
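As a rough check of the first comparison, the sketch below runs a Pearson chi-square test on a 2 × 2 table using only the Python standard library. The counts are assumptions back-calculated from the reported percentages (28 of 91 for 30.8% and 8 of 91 for 8.8%), and the test is taken to be an unpaired chi-square without continuity correction; the exact variant used in the study is not stated, so the result is indicative only.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 df, no continuity correction) for [[a, b], [c, d]].

    Returns the test statistic and the two-sided p-value.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Assumed counts: examinations scored 4 or 5 by one or more radiologists.
n_exams = 91
dbt_positive = 28  # back-calculated from 30.8% of 91
sm_positive = 8    # back-calculated from 8.8% of 91

stat, p_value = chi_square_2x2(
    dbt_positive, n_exams - dbt_positive,
    sm_positive, n_exams - sm_positive,
)
print(f"chi-square = {stat:.2f}, p = {p_value:.5f}")
print("significant at 0.05:", p_value < 0.05)
```

Under these assumptions the p-value falls below 0.001, in line with the reported result. The sketch treats the DBT and SM readings as independent groups, as an unpaired chi-square test does.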

Table 2 Number and proportion of 19 screening examinations prior to interval cancer and 91 screening examinations resulting in consecutive round screen-detected cancer, scored 2, 3, and 4 or 5 by one or more, two or more, and three or more of the five radiologists for digital breast tomosynthesis (DBT) + synthetic 2D images (SM), DBT alone, and SM alone in the blinded individual review

Screening examinations considered false negative after the blinded review (assigned with a score of 2, 3, 4, or 5 by two, three, four, or five radiologists) included 10.5% (2/19) of examinations prior to interval cancer and 42.9% (39/91) of examinations resulting in consecutive round screen-detected cancer. A score of 2, 3, 4, or 5 by two, three, four, or five radiologists was also assigned to 47.8% (43/90) of negative examinations and 89.7% (35/39) of examinations with a false positive result (Table 3 and Additional file 1: Table C3).

Table 3 Number and proportion of screening examinations prior to interval cancer and examinations resulting in consecutive round screen-detected cancer for a score of 2 or higher and a score of 3 or higher by one or more radiologists, two or more radiologists, and three or more radiologists

Informed consensus-based review

The informed review included a total of 110 DBT + SM screening examinations prior to diagnosis of interval cancer (n = 19) and consecutive round screen-detected cancer (n = 91).

A total of 5.3% (1/19) of the interval cancers were considered false negative, 10.5% (2/19) minimal sign significant, 10.5% (2/19) minimal sign non-specific, and 73.7% (14/19) true negative (Table 4). For screening examinations resulting in consecutive round screen-detected cancer, 18.7% (17/91) were assigned as false negative, 15.4% (14/91) as minimal sign significant, 15.4% (14/91) as minimal sign non-specific, and 50.5% (46/91) as true negative. In total, 16.4% (18/110) of the screening examinations, interval plus consecutive round screen-detected cancer, were assigned as false negative.

Table 4 Number and percentage of false negative, minimal sign significant, minimal sign non-specific, and true cancers on screening examinations prior to interval cancer (n = 19) and resulting in consecutive round screen-detected cancer (n = 91) based on the informed consensus-based review

Among examinations resulting in consecutive round screen-detected cancer and assigned as false negative, 76% (13/17) had BI-RADS density c, 35% (6/17) of the lesions were architectural distortions, 88% (15/17) were invasive cancers, 67% (10/15) had tumor diameter < 21 mm, 67% (10/15) had histologic grade 1 or 2, and 60% (9/15) had luminal A immunohistochemical subtype (Table 5).

Table 5 Distribution of screening examinations prior to interval cancer after To-Be 1 (n = 19) and resulting in consecutive round screen-detected cancer in To-Be 2 (n = 91) classified into false negative, minimal sign significant, minimal sign non-specific, and true negative in the informed consensus-based review, by mammographic density and histopathologic tumor characteristics

Discussion

As far as we are aware, no prior studies have performed blinded and informed review of prior DBT images to classify cancer cases as false or true negative. According to our definitions, 10.5% (2/19) of the screening examinations prior to interval cancer and 42.9% (39/91) of the screening examinations resulting in consecutive round screen-detected cancer from the To-Be 1 trial were scored ≥ 2 by two or more radiologists and classified as false negative after the blinded individual review. The same criterion was met by 47.8% (43/90) of the negative and 89.7% (35/39) of the false positive examinations. The informed consensus-based review by five experienced breast radiologists not involved in the To-Be trials classified 5.3% (1/19) of the screening examinations prior to interval cancer and 18.7% (17/91) of screening examinations resulting in consecutive round screen-detected cancer as false negatives.

As expected, the malignant lesions were more frequently visible on DBT compared with SM, specifically for examinations resulting in consecutive screen-detected cancer scored 4 or 5 by one or more radiologists in the blinded review [23, 24]. Previous studies have shown that small-detail detectability could be reduced for SM compared to DBT, specifically for detection of the desmoplastic processes associated with spiculated masses and architectural distortions visible solely on one or few DBT planes, but not on SM [25, 26]. However, these findings could not corroborate any assumptions on increased breast cancer detection for DBT versus DM or SM in To-Be 1. Moreover, the proportion of negative examinations and examinations with a false positive screening result scored ≥ 3 by one or more radiologists was significantly higher for DBT compared to SM, implying that DBT may be associated with an increased rate of false positives [27, 28].

The prognostically favorable histopathologic tumor characteristics of the consecutive round screen-detected cancers classified as false negative in the informed review suggest that earlier detection of these cancers would be of limited clinical value. Moreover, only 1 of the 19 screening examinations prior to interval cancer was classified as false negative in the informed review, indicating fast-growing tumors among the interval cancers [12, 29,30,31,32].

Previous DBT interpretive studies have not assessed the rates of false negative cases [33,34,35]. Informed review studies performed on DM and screen-film mammography (SFM) have reported a false negative rate varying from 12 to 36%, including DM studies from BreastScreen Norway showing 19–34% of false negatives for examinations prior to interval cancers and 20–22% of false negatives for examinations resulting in consecutive round screen-detected cancers [10,11,12,13,14,15,16,17,18]. The lower percentage of false negatives for examinations prior to interval cancers in our blinded and informed review compared to the results of prior studies using DM and SFM may be due to the low number of study cases. Furthermore, the percentage of false negatives has been reported to be influenced by the comparability and similarity between the study setting and a normal screening setting [12].

Strengths and limitations

Strengths of our design included the reviews being conducted by five external breast radiologists, not involved in the To-Be trials, reducing the risk of bias associated with the interpretation. The blinded review set included negative and false positive examinations simulating a normal screening setting. Finally, our data were from a population-based breast cancer screening program with a high completeness of histologically verified breast cancer cases.

However, the different review procedures used in the blinded individual and informed consensus-based reviews limited the possibility to compare or combine the numbers and proportions of false negatives. There were large variations in results for scores ≥ 2, ≥ 3, etc. by different combinations of radiologists, including high percentages of scores ≥ 2 and ≥ 3 for negative screening examinations, which restricted our definition of false negatives in the blinded review. The blinded review was, however, important for the external radiologists in terms of familiarization with the dataset and showed large individual differences in interpretation, which most likely arose from different classification systems in other countries and/or image quality for DBT systems. Nevertheless, BreastScreen Norway uses independent double reading with consensus, with recall assigned solely by a consensus of at least two radiologists. We assume that our definition of false negatives in the blinded review led to somewhat overestimated results but was analogous to a concordant choice of a score of 2 or higher by at least two radiologists [12]. The presence of two minimal sign categories in the informed review might have led to a lower number of false negatives than the use of one category only. The percentage of false negatives identified by the external reviewers in consensus could be a result of an experimental effect of reduced specificity inherent to the informed review methodology [17].

Furthermore, as image quality is an important aspect of breast cancer detection [36, 37], the external radiologists questioned the technical image quality in To-Be 1, which was performed with GE SenoClaire. The image quality may therefore represent one possible reason for the low rate of false negative examinations in the informed review, and also for the lack of the expected increase in cancer detection for DBT. The radiologists stated that substantially better quality was observed for GE Pristina; however, no objective measurements of image quality were collected, and we were therefore not able to draw any conclusions. Moreover, no postprocessing or reconstruction of the study images was performed. The external radiologists' experience with DBT could have been gained on equipment from vendors other than the one used in the study, which might have affected their ability to detect suspicious lesions. The large differences in percentages of examinations resulting in cancer, as well as false positive and negative examinations, scored ≥ 2 by the external readers might underline the uncertainty in the obtained results. It is also possible that To-Be 2 diagnosed more screen-detected cancers due to the use of DBT + SM as the only screening technique for all women, which might have led to an overestimation of the pool of potential false negative cases among screen-detected cancers in our study, for both the DBT and DM arms.

Finally, this review did not include images from the DM-arm. However, blinded and informed reviews of DM and SFM have been performed, with results comparable to those of this DBT review [10,11,12,13,14, 17, 38, 39]. Results from a mixed blinded individual review of interval cancers from SFM performed in 2005 showed that 20% (46 of 231) of cases were false negative [13], while another blinded review of screening mammograms obtained 2 and 4 years prior to screen-detected cancer showed that 31% (32 of 103) of cases were false negative [38]. These numbers can be used to estimate the number of false negatives in a hypothetical blinded review of the DM-arm of the To-Be 1 trial. In the DM-arm, the original number of screen-detected cancers was 87 [7], and the number of interval cancers was 29, while the number of subsequent screen-detected cancers following the DM-arm was 101 [8]. Therefore, the number of false negatives resulting in interval cancers in the DM-arm could have been about 6 (29 × 20/100) and the number of false negative screen-detected cancers about 31 (101 × 31/100) in the hypothetical blinded review. This would have resulted in 37 extra cases (false negatives) of screen-detected cancers in the DM-arm, accounting for 124 (87 + 37) in total, and a rate of 0.86%. When including the results from our blinded review on the number of false negatives for the DBT + SM-arm (41 cases), a total of 136 (95 + 41) in 14,380 women for DBT + SM corresponds to a detection rate of 0.95%, versus 0.86% in the DM-arm (124 cases in 14,369 women), and a p-value of 0.42.

Results from two informed consensus-based reviews of DM examinations from 2004 to 2016 in BreastScreen Norway, including 24% of false negatives for interval cancer (246 of 1010) and 22% of false negatives for consecutive round screen-detected breast cancer (266 of 1225) [11, 12], can also be used to estimate the number of potential false negatives in the DM-arm of the To-Be 1 trial in an informed review. The number of false negative cases resulting in interval cancers in the DM-arm would have been about 7 (29 × 24/100) and the number of false negative screen-detected cancers about 22 (101 × 22/100). This would have resulted in 29 extra cases (false negatives) of screen-detected cancers in the DM-arm, accounting for 116 (87 + 29) in total and a rate of 0.81%. When including the results from our informed review on the number of false negatives for the DBT + SM-arm (18 cases), a total of 113 (95 + 18) in 14,380 women for DBT + SM corresponds to a rate of screen-detected cancer of 0.79%, versus 0.81% in the DM-arm (116 cases in 14,369 women, p = 0.79) [7]. Under these assumptions, the DBT + SM versus DM cancer detection rate would not differ statistically significantly. However, these findings should be interpreted with caution. The number of false negatives in the DM-arm may have been overestimated, as DBT + SM was used to detect breast cancer in the follow-up of the DM-arm and the rates of false negatives in the DM-arm were calculated based on data from the 1990s to 2016 from different countries and programs [13, 14, 38].
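The arithmetic behind the two extrapolations above (blinded and informed) can be retraced with the short sketch below. The function and variable names are hypothetical; all input numbers are taken from the text, and rounding the estimated false negatives to whole cases is an assumption that matches the "about 6" and "about 31" style of the estimates. No significance test is included.

```python
from typing import Tuple


def extrapolated_detection(
    detected: int,
    interval: int,
    subsequent: int,
    fn_interval_pct: float,
    fn_subsequent_pct: float,
    screened: int,
) -> Tuple[int, float]:
    """Add estimated false negatives to the detected cancers and return
    (total cases, detection rate in percent)."""
    extra_interval = round(interval * fn_interval_pct / 100)
    extra_subsequent = round(subsequent * fn_subsequent_pct / 100)
    total = detected + extra_interval + extra_subsequent
    return total, 100 * total / screened


# DM-arm: 87 screen-detected, 29 interval, 101 subsequent screen-detected, 14,369 women.
dm_blinded = extrapolated_detection(87, 29, 101, 20, 31, 14369)
dm_informed = extrapolated_detection(87, 29, 101, 24, 22, 14369)

# DBT + SM-arm: 95 screen-detected plus false negatives from this review, 14,380 women.
dbt_blinded = (95 + 41, 100 * (95 + 41) / 14380)
dbt_informed = (95 + 18, 100 * (95 + 18) / 14380)

for label, (cases, rate) in [
    ("DM, blinded scenario", dm_blinded),
    ("DBT + SM, blinded scenario", dbt_blinded),
    ("DM, informed scenario", dm_informed),
    ("DBT + SM, informed scenario", dbt_informed),
]:
    print(f"{label}: {cases} cases, {rate:.2f}%")
```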

In conclusion, this study examined potential false negative interval and consecutive round screen-detected cancers in the To-Be 1 trial and demonstrated that the percentages determined by both individual and consensus expert reviews were consistent with prior DM review studies. The results of this review indicate that the reasons for the nonsignificant difference in cancer detection between DBT + SM and DM in the To-Be 1 trial are complex and that the difference was not caused by interpretive error alone.