Introduction

Digital breast tomosynthesis (DBT) has been shown to have a higher sensitivity for breast cancer detection than the current standard two-view digital mammography (DM) with mediolateral oblique (MLO) and craniocaudal (CC) views [1,2,3]; however, because it is a more complex examination with multiple slices and a longer reading time, DBT is a more resource-intensive process [4]. Double reading is practised in many screening programmes, especially in Europe, further amplifying the workload [5]. Moreover, the radiation dose with DBT is generally higher than with DM [3]. One-view wide-angle DBT (MLO) has a lower radiation dose than standard two-view DM, yet offers higher sensitivity and performance similar to two-view DBT combined with two-view DM [1]. One-view wide-angle DBT screening reduces the interval cancer rate, which is often used as a surrogate measure for breast cancer mortality [6]. Many studies also suggest the use of two-view DBT, which avoids the risk of losing information only visible in the CC view and can result in an even higher sensitivity, although at a radiation dose that can exceed that of DM [2, 3, 7, 8]. However, a slightly higher dose might be acceptable if it meant a gain in sensitivity. Apart from radiation dose, the major remaining obstacle to implementing full DBT screening is the increased workload caused by longer reading times. In previous studies, the reading time for two-view DBT was 38–76% longer than for DM [7,8,9,10,11].

The use of artificial intelligence (AI) for the interpretation of DM examinations has shown promising results, both as a decision support tool for a reading radiologist [12, 13] and as a stand-alone reader [14,15,16,17]. AI as a stand-alone reader has been proposed to identify normal cases that could safely be excluded from radiologist reading [14, 15], to replace the second reader [16, 17], and to identify high-risk cases for more thorough assessment [15, 16]. While there are several studies of AI for DM, studies of AI for DBT are relatively scarce; at the time of writing, only a few had used screening material, including two focusing on DBT reading workload reduction [18,19,20,21]. Several reader studies on the use of AI as decision support for DBT with cancer-enriched datasets have shown a reduction in reading time per examination with maintained or increased accuracy [22,23,24,25,26,27]. An AI model for predicting future short-term cancer risk from DBT has also been developed [28].

If AI could ease the burden of reading DBT, this could open up possibilities for the broad introduction of DBT in population-based screening programmes. The overall aim of this study was to retrospectively assess the cancer detection performance of a commercially available AI system on single-view, wide-angle DBT screening material, with paired two-view DM screening as a reference. More specifically, we investigated the hypothesis that AI can make DBT screening more efficient in terms of reading workload by identifying normal cases that can be excluded from human reading, by replacing the second reader, or by replacing both readers. These approaches were compared with DM and DBT screening with double reading, and the characteristics of detected and missed cancers were studied.

Methods

Study population

This retrospective study used data from the prospective population-based Malmö Breast Tomosynthesis Screening Trial (MBTST; ClinicalTrials.gov number NCT01091545) [1]. A total of 14,848 women were examined with both one-view wide-angle DBT (MLO) and two-view DM (MLO + CC), with separate double reading and recall decisions after consensus discussion. The readers (seven radiologists with 2–41 years of experience in breast radiology) used a five-point cancer probability scale (1: no abnormalities, 2: benign findings, 3: non-specific findings with low cancer probability, 4: cancer-suspicious findings, 5: highly cancer-suspicious findings), hereafter referred to as the “radiologist score”. Consensus discussions were held for all cases given a score ≥ 3 by at least one reader or marked for discussion by at least one reader. Decisions from each reader were extracted from the radiology information system. Reading time data were unavailable. All examinations were performed with a Mammomat Inspiration unit (Siemens Healthineers). Cancers diagnosed during follow-up until the next screening round (18 or 24 months, depending on age) were included as interval cancers. The study was approved by the Local Ethics Committee at Lund University (official records number: 2009/770).
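As a minimal illustration of the consensus-trigger rule described above, the sketch below encodes it in Python. This is purely illustrative; the trial did not use such code, and the class and function names are hypothetical.

```python
# Minimal sketch (hypothetical, for illustration only) of the consensus-trigger rule:
# an examination goes to consensus discussion if at least one reader assigns a
# cancer-probability score >= 3 or explicitly marks the case for discussion.
from dataclasses import dataclass

@dataclass
class Reading:
    score: int             # radiologist score, 1-5
    flagged: bool = False  # explicitly marked for discussion

def needs_consensus(readings: list[Reading]) -> bool:
    """Return True if the double-read examination should go to consensus discussion."""
    return any(r.score >= 3 or r.flagged for r in readings)

# Example: reader 1 sees benign findings only, reader 2 gives a score of 4.
print(needs_consensus([Reading(score=2), Reading(score=4)]))  # True
```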

Some examinations had to be excluded from the present study, as detailed in Figure 1, which also shows the number of consensus discussions, recalls, and screening-detected cancers for the DM and DBT reading arms. In total, 14,772 women were included, of whom 157 were diagnosed with cancer (135 screening-detected cancers and 22 interval cancers).

Fig. 1

Chart of the study population, including exclusions, recalls, screening-detected cancers, and interval cancers. ^a Including two cancers. ^b Not supported by the AI system. ^c Interval cancers detected during a follow-up period of 1.5 or 2 years, depending on age

Artificial intelligence system

The DBT examinations were analysed separately from DM using the commercially available DM and DBT AI system Transpara v1.7.0 (ScreenPoint Medical) [12, 14, 19, 26, 27, 29]. The system classifies each examination with a score between 0 and 9.99. The system was not trained on any data from the studied population. In this study, the AI system was used retrospectively as a stand-alone reader. In clinical use, the AI system automatically analyses all screening examinations and can present the results integrated into the PACS (picture archiving and communication system) user interface.

Evaluation of the AI system

The performance of the AI system on DBT was compared with radiologist double reading and single reading of DBT. To reduce the DBT screen-reading workload to the same or a lower level than with DM screening, several strategies for using AI were evaluated, as illustrated in Figure 2.

A. AI gatekeeper, where the AI system is used to exclude normal examinations from human reading, while the remaining examinations are double read. Different AI score thresholds were evaluated, with special focus on a threshold at which 50% of the examinations would be excluded, giving an unchanged screen-reading workload compared to DM screening (under the conservative assumption that DBT reading takes twice as long as DM reading [7,8,9,10,11]).

B. Single reading + AI, where the AI system replaces the second reader, also halving the total number of readings. To facilitate comparison, the AI system was used as a stand-alone reader with a high-specificity threshold selected to give the same number of consensus discussions as with DBT double reading.

C. AI alone, where the examinations are analysed only by the AI system. An examination was classified as sent to consensus discussion if its AI score exceeded a threshold selected to give the same number of discussions as with DBT double reading (a schematic sketch of this threshold selection follows the list).
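To make the three workflows concrete, the sketch below shows how the corresponding AI score thresholds could be derived from examination-level AI scores. It is an illustration under stated assumptions, not the study's analysis code; the array values and the target number of discussions are placeholders.

```python
# Illustrative threshold selection for workflows A-C (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
ai_scores = rng.uniform(0, 10, size=14_772)   # placeholder examination-level AI scores
n_discussions_dbt_double = 1_200              # placeholder: discussions with DBT double reading

# A. AI gatekeeper: exclude the ~50% of examinations with the lowest AI scores;
#    the remaining examinations are double read as usual.
gatekeeper_threshold = np.quantile(ai_scores, 0.50)
double_read = ai_scores > gatekeeper_threshold

# B and C. High-specificity threshold: flag examinations for consensus discussion so
#    that their number matches the number of discussions with double reading.
flag_quantile = 1 - n_discussions_dbt_double / ai_scores.size
discussion_threshold = np.quantile(ai_scores, flag_quantile)
ai_flags_discussion = ai_scores > discussion_threshold

print(f"Double read: {double_read.sum()}, AI-flagged for discussion: {ai_flags_discussion.sum()}")
```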

Fig. 2

Chart of different ways of implementing AI in DBT screening workflows, together with workflows using human reading of DBT without AI for comparison. The AI score threshold for high sensitivity was > 3. The AI score threshold for high specificity was selected separately for single reading + AI and AI alone in order to achieve about the same number of consensus discussions as DM double reading. *In cases where no actual consensus discussion was available, a surrogate consensus discussion was used by assuming recall when the AI score was among the highest 2%

Results from actual consensus discussions prompted by DBT double reading were used when available. However, no consensus discussions had taken place for cases where only the AI prompted a discussion, so surrogate consensus discussions were needed for some cases in the single reader + AI and AI alone workflows. Since these cases had not been flagged for discussion by any of the readers, the recall rate at such consensus discussions would likely have been lower than the total recall rates in the DM and DBT double reading arms (2.5% and 3.6%, respectively) [1]. As an approximation, cases with an AI score among the highest 2% in the population were assumed to be recalled.
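A minimal sketch of this surrogate consensus rule follows, assuming a hypothetical array of examination-level AI scores; it illustrates the 2% approximation and is not the study's code.

```python
# Surrogate consensus decision: a case is assumed to be recalled if its AI score
# is among the highest 2% in the study population (hypothetical input array).
import numpy as np

def surrogate_recall(ai_scores: np.ndarray, top_fraction: float = 0.02) -> np.ndarray:
    """Boolean mask: True where a surrogate consensus discussion is assumed to recall."""
    cutoff = np.quantile(ai_scores, 1.0 - top_fraction)
    return ai_scores > cutoff
```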

For each workflow, the numbers of human readings, detected cancers, missed cancers, and false positives were determined and compared with single and double reading of DBT, respectively. Ground truth was defined using all screening-detected cancers (DM + DBT) and interval cancers. Some interval cancers might be detectable at screening, and by including interval cancers in the ground truth, the AI system is allowed to detect cancers undetected by radiologists. The number and characteristics of missed cancers were studied. The results were also compared with the current standard of double-read DM screening.
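The per-workflow bookkeeping described above can be summarised as in the sketch below, with hypothetical boolean arrays for recall decisions and ground truth; it is schematic, not the study's code.

```python
# Count detected cancers, missed cancers, and false-positive recalls for one workflow,
# given `recalled` (the workflow recalls the case) and `cancer` (ground truth:
# DM + DBT screening-detected or interval cancer). Inputs are hypothetical.
import numpy as np

def workflow_counts(recalled: np.ndarray, cancer: np.ndarray) -> dict:
    return {
        "detected_cancers": int(np.sum(recalled & cancer)),
        "missed_cancers": int(np.sum(~recalled & cancer)),
        "false_positive_recalls": int(np.sum(recalled & ~cancer)),
    }
```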

Cancer characteristics

The numbers of detected and missed cancers with the AI gatekeeper, single reader + AI, and AI alone workflows were calculated for subgroups defined by breast density (BI-RADS, 4th edition), histological type, histological grade (for invasive cancers), nuclear grade (for in situ cancers), tumour size, presence of lymph node metastases, and radiographic appearance. The corresponding results from double reading of DM or DBT are provided for comparison.

Statistical analyses

The distributions of AI scores for all examinations and for the cancer cases were analysed with descriptive statistics, and 95% confidence intervals for proportions were calculated with the Clopper-Pearson method. The cancer detection performance of the AI system was analysed with receiver operating characteristic (ROC) curves, with ground truth based on DBT + DM screening results combined with interval cancers. Additional analyses were performed with ground truth defined by DBT screening results and by DBT + DM screening results, respectively. The corresponding AUCs were calculated, and bootstrapping with 1,000 replicates was applied to obtain 95% confidence intervals. Differences in the number of discussions, recalls, and sensitivity between workflows were tested with the exact McNemar test using R 4.0.5 (R Foundation for Statistical Computing). All other statistical analyses were performed in MATLAB 2020a (The MathWorks).
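For reference, the sketch below shows Python equivalents of the main statistical procedures named above (Clopper-Pearson intervals, bootstrap confidence intervals for the AUC, and the exact McNemar test). The original analyses were run in R and MATLAB, so this is only an illustrative translation with hypothetical inputs.

```python
# Illustrative Python equivalents of the statistical analyses (hypothetical data).
import numpy as np
from scipy.stats import beta
from sklearn.metrics import roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a proportion k/n."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

def bootstrap_auc(cancer: np.ndarray, scores: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """AUC with a percentile bootstrap confidence interval (1,000 replicates)."""
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(cancer, scores)
    replicates = []
    n = len(cancer)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if cancer[idx].min() == cancer[idx].max():  # resample must contain both classes
            continue
        replicates.append(roc_auc_score(cancer[idx], scores[idx]))
    return auc, np.percentile(replicates, [2.5, 97.5])

# Exact McNemar's test on paired detection outcomes of two workflows
# (2x2 table of detected/missed by workflow 1 vs workflow 2; placeholder counts).
table = [[110, 11], [3, 33]]
print(mcnemar(table, exact=True).pvalue)
```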

Results

The distributions of AI scores in the whole study population and among the cancer cases are presented in Figure 3. Of all cancers, 88 received an AI score of 10; 85 of these were screening detected. The ROC curves for cancer detection by the AI system with different ground truths are presented in Figure 4, with the corresponding operating points for double reading and single reading of DBT shown for comparison. While the operating points for double reading lie clearly above the corresponding ROC curves, the operating points for single reading lie very close to them. The AUC for AI cancer detection was 0.92 (95% CI [0.88; 0.94]) when DBT screening was used as the ground truth.

Fig. 3

Distribution of AI scores in the whole population and cancer cases

Fig. 4

Receiver operating characteristic (ROC) analysis for cancer detection using the AI system with different definitions of ground truth (GT). Ground truth defined by the DBT screening results gives a direct comparison between AI and radiologists in DBT screening, but by definition the radiologists cannot be outperformed, as they define the ground truth. Ground truths defined by DBT + DM screening results, and by DBT + DM screening results combined with interval cancers, are therefore also evaluated, theoretically allowing the AI system to perform better than radiologists reading DBT. Operating points for single and double reading of DBT are provided for reference. Single-reader sensitivity is defined using recalls after consensus discussion, and single-reader specificity is defined using marks for discussion. IC, interval cancer

For the AI gatekeeper approach, an exclusion threshold of 3.0 can be considered suitable, as this would result in 7,312 (49.5%) examinations being excluded from human reading (Table 1). The excluded examinations included nine cancers (6%), seven of which were screening detected; these would be missed with the AI gatekeeper approach. The number of false-positive recalls would be reduced by 111 (27%).

Table 1 Cumulative number of examinations, cancers, and false positives depending on DBT AI score threshold

The numbers of screening procedures required (readings, consensus discussions, and recalled women) for the different workflows are shown in Table 2, together with the resulting numbers of detected cancers, missed cancers, and false-positive recalls. The proportions of cases sent to discussion and recalled, the screening workload, and the sensitivity of the different workflows are illustrated in Figure 5, including comparisons with DM and DBT double reading.

Table 2 DBT screening procedures and number of detected cancers, depending on workflow
Fig. 5

Proportion of examinations sent to consensus discussion or recalled, relative screen-reading workload, and cancer detection sensitivity (ground truth defined by DM + DBT screening-detected cancers and interval cancers) for the different workflows. AI workflows are compared with DBT double reading and DM double reading by subtraction. Discussions and recalls are on the left scale (black), while sensitivity is on the right scale (blue). Whiskers show 95% confidence intervals calculated by normal approximation. The hourglass symbol illustrates the screen-reading workload, where one hourglass equals the workload of the current standard method, DM double reading; the reading time of a DBT examination is assumed to be about double that of a DM examination. Differences in the proportions of examinations sent to discussion and recalled are given as percentage points (pp), with percentage changes in brackets. p values were calculated with the exact McNemar test. All calculations were performed with high precision, and rounding was applied only to presented values, which may cause small deviations from expected values in the comparisons

The AI gatekeeper approach, which focuses radiologist reading time on the high-risk cases by double reading those with AI scores of 4–10, would require 14,920 DBT readings and detect 121 cancers (Table 2, Fig. 5), corresponding to 95% [0.90; 0.98] of the cancers detected with DBT double reading, a small but significant difference (p = 0.031). Compared to DM screening, 27% more cancers would be detected, a significant difference (p < 0.001). The number of false-positive recalls with this approach is only 8% higher than with DM double reading and 27% lower than with full DBT double reading. The positive predictive value is 0.29 (121/415).
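For clarity, the positive predictive value quoted above is the number of detected cancers divided by the total number of recalls (detected cancers plus false-positive recalls):

```latex
\[
\mathrm{PPV} = \frac{\text{detected cancers}}{\text{all recalls}} = \frac{121}{415} \approx 0.29
\]
```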

Single reading + AI, with the AI score threshold for consensus discussion set to obtain the same number of discussions as DBT double reading, would result in the detection of 120 cancers (Table 2, Fig. 5), corresponding to 94% [0.89; 0.98] of the cancers detected with DBT double reading, a small but significant difference (p = 0.016). Compared to DM screening, 26% more cancers would be detected (p < 0.001). The number of false-positive recalls would increase by 62%. The positive predictive value is 0.21 (120/561).

AI alone, with an AI score threshold selected to achieve the same number of consensus discussions as DBT double reading (Table 2, Fig. 5), would detect significantly fewer cancers than DBT double reading (22% fewer, p < 0.001) but about the same number as DM screening with double reading (p = 0.689), while the number of false-positive recalls would be substantially reduced.

Figure 6 shows an example of a cancer that was detected with DBT and AI but not with DM double reading, and a cancer that was detected with DBT double reading but missed with DBT and AI. The characteristics of all cancer cases detected with the different methods are presented in Table 3. The numbers are small, but the cancers detected with the AI gatekeeper and with single reader + AI show no apparent differences from those detected with DBT double reading, apart from a slightly lower proportion of in situ cancers. While DBT screening read by AI alone detects about the same number of cancers as DM double reading, a larger proportion are invasive cancers.

Fig. 6

Examples of detected and missed cancers when using DBT and AI. a–c Example of a cancer detected with AI on DBT and with DBT double reading, but not with DM double reading: DBT MLO (a), with the cancer identified by the AI system (red circle); DM MLO (b) and DM CC (c). The AI system gave a score of 10. The cancer was a 26 mm invasive lobular carcinoma. d–f Example of a cancer missed with AI on DBT and with DM double reading, but detected with DBT double reading: DBT MLO (d), with the cancer lesion (yellow circle) not identified as suspicious by the AI system; DM MLO (e) and DM CC (f), where the readers did not detect the subtle lesion, probably due to partly overlapping tissue in the MLO projection (red circle), although it is clearly visible in the CC projection (blue circle). The AI system gave an examination score of 2 (yellow circle), meaning that this examination would be discarded without human reading in the AI gatekeeper workflow. The cancer was a 10 mm tubular carcinoma

Table 3 Examination AI risk score and detection for different cancer types and characteristics

Discussion

We retrospectively evaluated an AI system for cancer detection in DBT examinations and studied different ways of implementing this system in a screening workflow to reduce the DBT reading workload, with the aim of exploring the field and building a basis for future prospective studies. An AI gatekeeper approach, where the AI system excludes low-risk cases and instead focuses radiologists’ time on double-reading high-risk cases, was shown to detect almost all the cancers detected with DBT double reading while reducing the number of false-positive recalls. With a single reader + AI, where AI replaces the second reader, slightly fewer cancers would be detected, and under the assumptions used in this retrospective study, an increase in false-positive recalls was found. The AI-alone approach to DBT could reach the level of sensitivity of DM screening without human readers.

AI as a stand-alone reader

As a stand-alone reader, the AI system performs on a par with single reading but is still inferior to double reading of DBT. In the MBTST, DBT double reading detected significantly more cancers than DM double reading [1]. In contrast, the AI system showed only a small difference in AUC, 0.90 for DBT compared with 0.88 for DM in the same population [30]. This might be a sign that AI for DBT cannot yet utilise all the additional information in DBT. The availability of DBT training data is still limited, but in the future, more training data might improve performance, and new applications may emerge.

Approaches for workload reduction and clinical implications

Both the AI gatekeeper and the single reader + AI approaches could almost halve the number of DBT readings, which, if DBT reading time is assumed to be about 75% longer than DM reading time, would reduce the total reading workload even compared with baseline DM screening. With the AI gatekeeper, the number of false-positive recalls would be almost the same as with DM screening. This could increase the cost-effectiveness of DBT screening enough to enable its introduction in population-based screening programmes, allowing more cancers to be detected without large increases in screening expenses. The recall rate of single reader + AI would probably be lower in a prospective setting, where the reader would use AI interactively and real consensus discussions would select the cases for recall, and would then probably not differ considerably from that of the AI gatekeeper. However, the transition to DBT screening would require further education of breast radiologists followed by a learning phase. The reading workload could be reduced even further if AI alone were used on one-view DBT and still performed on a par with two-view DM double reading, at a slightly lower radiation dose; AI alone could therefore be an option where reading resources are very limited. No clear differences in detection among cancers with different characteristics were shown in this study, but the possibility that such differences might appear in a larger population cannot be excluded.
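As a rough check of the workload claim above, under the stated assumption that one DBT reading takes about 1.75 times as long as one DM reading, halving the number of readings relative to DM double reading gives:

```latex
% Relative screen-reading workload vs. DM double reading (assumptions as stated above)
\[
\text{relative workload} \approx 0.5 \times 1.75 \approx 0.88 < 1
\]
```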

Changing the workflow might affect the readers, which could limit generalisability to the prospective situation. In an AI gatekeeper workflow, the composition of the screening population seen by the radiologists would change; since the workflow would otherwise remain unchanged, with double reading and consensus discussions, the effects on reading performance ought to be limited. In single reader + AI, knowledge of being the only reader could increase discussion or recall rates. The surrogate consensus discussions are based only on AI results, whereas real consensus discussions would combine AI findings, radiologist interpretation, and screening history, and would probably result in a different selection of cases for recall.

Previous studies have proposed the use of AI as a decision support tool for a single-reading radiologist [23, 24, 31], which might increase single-reader cancer detection but comes with the risk of increasing false-positive recalls for areas identified by AI. Heavily relying on AI findings might reduce sensitivity for cancers not detected by AI.

Implementing AI in the workflow in a way that replaces a human reader can be a big step both legally and psychologically. A single reader + AI approach would probably gain acceptance more easily and be easier to implement, since all examinations would still be read by a human. Of women taking part in breast cancer screening, 59% reported that they would trust a computer-only reading, while 84% would trust AI combined with a human reader [32]. A study of mammography AI preferences among primary care providers reported an equal inclination to recommend both approaches [33]. However, as AI becomes more common in various applications, it will probably become more accepted and trusted.

Considering the different aspects of the different screening methods, we believe that single reader + AI could be the easiest way to implement AI in DBT screening, as this would avoid many of the legal and psychological obstacles of implementing AI as a sole reader. This advantage probably outweighs the advantage of the slightly better performance seen with the AI gatekeeper.

Comparison with previous studies

This study corroborates a previous study in which AI-triaged two-view DBT screening, excluding about 70% of the examinations from human reading, showed a 25% increase in sensitivity compared with DM double reading [19]. However, the comparison with our study is complicated by differences in study design and in the characteristics of the screening programme and population. The AUC in our study is somewhat lower than in two previous studies of stand-alone AI on DBT screening material [18, 20] and slightly higher than in one previous study [21].

Limitations

This study has some limitations, including being a single-centre study with images acquired on a single vendor's DM/DBT unit and analysed with a single AI system. Reading time was not measured but had to be estimated from the literature; therefore, precise effects on the total reading workload cannot be determined. One-view DBT was used, while screening with two-view DBT could detect more cancers and has been more thoroughly studied [2, 3, 7]; the use of AI with two-view DBT might give different results and should also be studied. While including interval cancers in the ground truth allows the AI system to find cancers undetected by radiologists, also including cancers detected at the following screening round would further increase this possibility. Furthermore, this study is retrospective, meaning that it cannot analyse how the use of AI affects readers, and the use of AI as decision support was not studied. The consensus discussions did not include AI results, which could potentially have affected decisions. The surrogate consensus discussions were based only on AI scores, without any assessment by radiologists.

Conclusions

AI can reduce the reading workload of DBT screening either by excluding low-risk cases from reading or by replacing the second reader. This could provide most of the benefits of DBT screening without increasing the total reading workload and thus advance the implementation of DBT in screening. Considering the legal and psychological obstacles to having examinations not read by a human, replacing the second reader with AI might be the most feasible strategy. Alternatively, AI could replace both human readers in DBT screening, with the same sensitivity as DM double reading and fewer false-positive recalls. Prospective studies are needed to obtain more realistic consensus discussion results and to investigate how AI affects reader behaviour.