Introduction

Desirable innovations in mammographic screening include reducing false positive recalls while maintaining sensitivity, as well as minimising interval cancers. An AI-based innovation could have the added advantage of reducing the workload of radiologists, who are in short supply [1, 2].

Evaluation of AI in screening mammography is fraught with difficulty for a number of reasons. A recent systematic review [3] indicated that only 5 of 36 AI systems performed better than a single radiologist reader, and that none exceeded the performance of a dual-reader system with arbitration. However, few studies have data on interval cancers following the index screening round or on cancers detected at the subsequent screening round. Assessment of AI has also frequently occurred in a “research environment” using highly enriched data sets, as opposed to a “work as usual” environment using an unselected population in which the prevalence of true positives is typically < 1.2% [4].

The aims of our study were to investigate the performance of an established AI product in women undergoing their prevalent full-field digital screening mammogram during one calendar year at one BreastScreen Australia (BSA) service, including lesions identified during the prevalent screen, interval cancers, and cancers detected at the next screening round (verified 3-year follow-up). The sensitivity of the BSA program for prevalent round cases is 85% (age-adjusted) [5]. The potential impact of incorporating AI into practice was also investigated.

Materials and methods

This was a retrospective analysis of consecutive prevalent round screening mammograms conducted in 2017 at the Monash BreastScreen service in the state of Victoria.

Participants

The BSA program is a national population-based program, commenced in 1991, that sets, and annually reviews, performance standards for individual screening services and is responsible for the oversight of service audits every 4 years. Women in the target age group (50–74 years), identified from the electoral roll, are routinely invited for a screening mammogram every two years. Although women aged 40–49 years are not invited, they are eligible for screening. For prevalent screens, BSA has a performance benchmark for recall in the target age group of less than 10% [6].

The study participants were women attending for their prevalent mammogram through Monash BreastScreen, a BSA-accredited service in metropolitan Melbourne, screening up to 60,000 women annually. The service operates through eight separate screening clinics and the images are transmitted to a central facility where they are read by members of a team of 16 radiologists. Monash BreastScreen serves a major city and inner regional population. As false positive screens are recognised as a bigger issue for prevalent than incident screening rounds [5], this study focussed on first-round attendances.

Standard practice

All two-view digital screening mammograms are read independently by two breast specialist radiologists (readers) and each decides to either clear or recall the case. If recalled, the radiologist scores the case as 3 (equivocal), 4 (suspicious), or 5 (malignant) [7]. Discordant results between the two readers are arbitrated by a third, highly experienced reader. Women with a score of ≥ 3 after arbitration are recalled to a single assessment clinic and those cleared return to routine screening.
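For illustration, the recall logic described above can be summarised as a short decision sketch (a minimal Python rendering under our own naming assumptions; it is not part of any BreastScreen system):

    def recall_decision(reader1_score, reader2_score, arbitrated_score=None):
        """Two-reader recall logic with third-reader arbitration.

        Scores: None = cleared; 3 (equivocal), 4 (suspicious), or
        5 (malignant) = recall. arbitrated_score is the third reader's
        score, used only when the first two readers disagree.
        """
        r1_recall = reader1_score is not None
        r2_recall = reader2_score is not None
        if r1_recall == r2_recall:
            # Concordant reads: both clear (return to routine screening)
            # or both recall (attend the assessment clinic).
            return r1_recall
        # Discordant reads are arbitrated; a score of >= 3 triggers recall.
        return arbitrated_score is not None and arbitrated_score >= 3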

Pathology specimens from the assessment clinic are analysed by Monash Health expert breast pathologists (National Association of Testing Authorities accredited). These results are correlated with imaging at a weekly clinical multi-disciplinary meeting. The final histopathology outcomes from subsequent breast surgery are entered into the BreastScreen Victoria (BSV) database.

One of the key measures of the performance of a screening program is false negatives. In the case of screening mammography, women who are identified with interval cancers prior to the next screening round, or who are diagnosed with cancer at the next screening round, warrant assessment as potential false negative cases from the index screening round. Not all such cases are false negatives, as some cancers develop de novo with no signs present during the index round, so all such cases require careful review. In our study, interval cancers were defined as invasive breast cancers (IBCs) detected in the 24 months after a negative screening episode [4]. These cancers are detected outside the screening program and reported to the Victorian Cancer Registry (VCR) under legislative guidelines. The VCR collates and subsequently reports these data back to BreastScreen Victoria [4]. All women with a 2017 prevalent screen who were diagnosed with IBC at their first incident round 2 years later were also reviewed.

For the purposes of this study, all data for the interval cancers and cancers diagnosed in the subsequent round were reviewed with the relevant histopathology data provided by BSV.

Consensus review was undertaken by three experienced radiologists (J.E., J.W., and A.L.), with a combined experience of over 60 years in arbitration and interval cancer review, on full-definition workstations in a two-stage manner as described in other studies [8]. An initial review was done blinded to both the Transpara score and the computer-aided detection (CAD) markings. Once consensus had been reached in relation to any abnormalities on the 2017 images, a second review was undertaken with the AI data included to confirm that the expert panel and the AI program were identifying the same mammographic feature. On the basis of this review, if there were no mammographic signs present in the 2017 round, the cases were called true negatives (true intervals); if there were recognisable signs of a relevant abnormality in 2017, they were called false negatives; and if there were subtle signs present in 2017, they were classified as having “minimal signs.”
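The three outcome categories from this consensus review reduce to a simple mapping from the blinded finding on the index images (an illustrative Python sketch using the paper's terminology; the names are ours, not from any review software):

    from enum import Enum

    class IndexRoundFinding(Enum):
        NO_SIGNS = "none"              # nothing visible on the 2017 images
        SUBTLE_SIGNS = "subtle"        # signs visible only in retrospect
        RECOGNISABLE_SIGNS = "clear"   # a relevant abnormality was present

    # Mapping from blinded consensus finding to the study's outcome category
    REVIEW_OUTCOME = {
        IndexRoundFinding.NO_SIGNS: "true negative (true interval)",
        IndexRoundFinding.SUBTLE_SIGNS: "minimal signs",
        IndexRoundFinding.RECOGNISABLE_SIGNS: "false negative",
    }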

“Ground truth” data set

This data set is considered to be “ground truth” as it includes all prevalent screening outcome data as well as verified 3-year follow-up. Women identified with IBC outside of the BreastScreen program more than 24 months after their 2017 screen were classified as “lapsed attenders” and were not included with the women diagnosed in 2019.

AI product

Digital mammograms were obtained using Siemens Mammomat Inspiration DR, Sectra DR, and Hologic Dimensions units. The AI software used was Transpara (version 1.7.0), a commercially available product that obtained international regulatory approval in 2018 (US Food and Drug Administration, European Co-operation on Accreditation standard, and the Australian Therapeutic Goods Administration). ScreenPoint Medical BV provided the Transpara AI software, which was integrated into the BSV service Sectra IDS7 PACS platform. Transpara 1.7.0 has been trained on over 1 million mammograms from established data sets. The mammograms used in this study were not used in any part of the algorithm training. The software analyses image information only, with no input of patient demographics such as age or BC risk factors, nor does the current version have the capacity to compare studies with prior exams. Image analysis uses deep learning convolutional neural networks to detect calcifications [9] and soft tissue lesions [10] and provides an overall score ranging from 1 to 10 that indicates the risk of BC (10 representing the highest chance of malignancy) [11]. The scores of 1–10 represent deciles, so that ~10% of women are assigned a score in each of the 10 categories. The current analysis is based solely on the overall score of 1–10 and is not a lesion-specific analysis.
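As an illustration of decile scoring, a continuous exam-level output can be binned so that roughly 10% of women fall in each of the 10 categories. The following is a hedged Python sketch only: the thresholds are hypothetical population quantiles, not Transpara's proprietary calibration.

    import numpy as np

    rng = np.random.default_rng(0)
    raw_outputs = rng.random(10_000)  # stand-in for continuous exam-level outputs
    # Nine cut points at the 10th, 20th, ..., 90th percentiles
    thresholds = np.quantile(raw_outputs, np.arange(0.1, 1.0, 0.1))

    def decile_category(raw_output: float) -> int:
        """Map a continuous output to a 1-10 score, ~10% of exams per bin."""
        return int(np.searchsorted(thresholds, raw_output)) + 1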

Ethics approval

Ethics approval for this project was granted by the Monash Health Human Research Ethics Committee Low Risk Panel (Monash Health Ref: RES-20-0000-166L-61177). All women who are screened through BreastScreen sign a form giving permission for their anonymised test results to be used for research purposes.

Data analysis

The performance of the radiologists and the AI system was assessed in terms of all lesions (IBC and ductal carcinoma in situ (DCIS)) identified in 2017; IBCs identified in 2017; interval BCs (false negatives and minimal signs, excluding true negatives); and IBCs identified in the subsequent screening round (minimal signs, excluding true negatives). Data are presented as frequencies and percentages.

The parameters reported include sensitivity, specificity, and positive and negative predictive values, presented as percentages with confidence intervals. Proportions were compared using the chi-squared test or Fisher’s exact test, and the associated p values are provided.
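These metrics follow the standard 2 × 2 definitions. The sketch below (Python, using scipy and statsmodels) shows one way they can be computed, with Wilson intervals as a common choice of confidence interval; the cell counts are placeholders for illustration only, not the published values, which appear in Tables 2 and 4.

    from scipy.stats import fisher_exact
    from statsmodels.stats.proportion import proportion_confint

    # Illustrative 2x2 cells (true positives, false negatives,
    # false positives, true negatives); the study's actual cells
    # are given in Tables 2 and 4.
    tp, fn, fp, tn = 50, 5, 700, 7000

    def rate_with_ci(k, n):
        """Proportion with a 95% Wilson confidence interval."""
        lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
        return k / n, (lo, hi)

    sensitivity = rate_with_ci(tp, tp + fn)  # tp / (tp + fn)
    specificity = rate_with_ci(tn, tn + fp)  # tn / (tn + fp)
    ppv = rate_with_ci(tp, tp + fp)          # tp / (tp + fp)
    npv = rate_with_ci(tn, tn + fn)          # tn / (tn + fn)

    # Comparing detected/missed proportions between two readers or systems
    # (second row also illustrative):
    _, p_value = fisher_exact([[tp, fn], [53, 2]])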

This study was exploratory in nature. The sample size was pragmatic, determined by the feasibility of having the AI program score the prevalent screens from a single calendar year and of obtaining follow-up data for both interval cancers and cancers identified in the subsequent screening round.

Results

Participants

In 2017, 53,584 women attended Monash BreastScreen, and a subset of 7829 mammograms were prevalent screens. Women having their prevalent screen ranged in age from 40 to 85+ years. The proportion in the target age group of 50–74 years was 68.4%.

Of the 7829 prevalent screens, 296 could not be given a Transpara score for a range of technical reasons, so 7533 (96.2%) women were included in this analysis (Fig. 1). There were two cases of DCIS, one case of IBC, and two interval IBCs within the group of women who did not have a Transpara score.

Fig. 1

Flow chart showing the number of women with prevalent screens, with Transpara scores available, recalled for further evaluation and diagnosed in the prevalent round

A total of 728 of the 7533 women (9.66%) were recalled for further assessment, with 54 diagnosed with IBC and 13 with DCIS on the basis of mammographic abnormalities. Three women within the recalled group, whose cancers were not the recalled abnormality and were diagnosed using ultrasound, were excluded from the analysis as their cancers were mammographically occult (Fig. 1).

A total of 798 women (798/7533, 10.6%) received a Transpara score of 10.

Of the 728 women recalled by the radiologists, 36.4% had a score of 10 (Table 1). The overlap between the group of women recalled by the radiologists and the women with a score of 10, along with the lesions diagnosed in 2017, is shown in Fig. 2. There were 265 women who were both recalled and had a score of 10.

Table 1 Number, percentage, and cumulative percentage of women recalled to the assessment clinic with Transpara scores 1–10 in the 2017 prevalent round
Fig. 2

Women recalled to the clinic, women with a Transpara score of 10, and women in both of these groups

Sixty-three of the 67 lesions diagnosed in the 2017 screening round (51 IBCs and 12 DCIS) were within the 265 women who were both recalled by the radiologists and scored a 10. The other four lesions (3 IBC and 1 DCIS) were within the group recalled by the radiologists but did not score a 10 (Table 2 and Fig. 3). This represents a sensitivity of 63/67 (94%) for a score of 10. Two of the four cases not classified as high risk by AI are shown in Fig. 4.
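Spelled out, the quoted figure is simply the detected fraction of the 67 diagnosed lesions:

\[
\text{Sensitivity}_{\text{AI, score 10}} = \frac{63}{63 + 4} = \frac{63}{67} \approx 0.94\ (94\%).
\]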

Table 2 The distribution of Transpara scores in women recalled, diagnosed with IBC or DCIS, false positives and women not recalled, in the 2017 round of screening (number and % by column)
Fig. 3

The distribution of women diagnosed with IBC, DCIS, or identified as false positives in the prevalent round of screening (by percentage)

Fig. 4

Examples of cancers missed by the AI system. A A 15-mm invasive ductal cancer (grade 2) with a soft density appearance (arrows) in both views, lacking distinct margins; AI risk score of 5/10; detected by one of the two radiologists. MLO, mediolateral oblique view; CC, craniocaudal view; and ultrasound detail. B A 29-mm invasive ductal cancer (grade 3) (US1) masked within the dense asymmetric fibroglandular tissue in the right upper outer quadrant (circle), with significant axillary lymphadenopathy (arrow and US2). Both radiologists recalled this case; the AI algorithm failed to recognise the context of the lymph node and provided a low-risk score of 7. MLO, mediolateral oblique view; CC, craniocaudal view; US, ultrasound details

From this 2017 cohort, there were 12 interval IBCs diagnosed and 16 IBCs diagnosed in the incident round of 2019. There were no cases of interval DCIS, or of DCIS in 2019, in women cleared by the radiologists in 2017. A further three women were identified with IBC outside of the BreastScreen program after 24 months and were classified as “lapsed attenders.” One of these three women had a score of 9 on her 2017 screen.

Of the 12 interval cancers, 5 were considered false negatives, and all 5 of these intervals scored a 10 in 2017 (images from 2 of the 5 are shown in Fig. 5). Four of the 12 intervals were classified as “minimal signs” in 2017 and of these, 2 scored a 10 and 2 scored 1–9 (Table 3). Four of the 16 IBCs diagnosed in 2019 were considered to have had minimal signs in 2017 and one of these four scored a 10 in 2017.

Fig. 5

Examples of false negative interval cancers detected by AI. A Left CC detail image of a 10-mm soft tissue mass in a postero-medial location (author’s arrow; red circle, CAD marking), seen in one view only and scored 10/10 by AI. Adjacent microcalcifications (white diamond CAD markings) were also attributed a high score. Clinical presentation 11 months later with a 25-mm invasive ductal cancer (grade 2) on histopathology. CC, craniocaudal view. B In the upper left MLO, the red circle CAD marking, in one view only, identifies an asymmetric density with associated architectural distortion of the adjacent glandular tissue (author’s arrows), scored 10/10 by AI. Clinical presentation 7 months later with a 70-mm invasive lobular cancer (grade 2) on histopathology. MLO, mediolateral oblique view; CC, craniocaudal view

Table 3 Interval invasive breast cancers and invasive breast cancers diagnosed in 2019 in relation to the Transpara score in 2017

The performance of the radiologists and the AI program in relation to the 2017 diagnoses, interval cancers (excluding true negatives), and 2019 cancers (excluding those with no signs in 2017) is shown in Table 4. The AI program missed 4 lesions detected by the radiologists in 2017, but it identified as high risk some IBCs that later presented as either interval cancers or IBCs in the 2019 screening round. Despite this, across all the comparisons (IBC and DCIS in 2017; IBC only in 2017; IBC in 2017 plus interval false negatives and interval minimal signs; IBC in 2017 plus interval false negatives, interval minimal signs, and 2019 minimal signs), the differences between the radiologists and the AI program in sensitivity, specificity, and positive and negative predictive value were small or had wide confidence intervals. The details of the invasive cancers detected by the radiologists and/or scored 10 by AI in the 2017 prevalent round are provided in Table 5.

Table 4 Performance of the radiologists and Transpara in relation to lesions identified in 2017, then combined with selected interval IBCs and IBCs identified in 2019
Table 5 Details of 2017 prevalent invasive cancers and interval cancers (false negative and minimal signs)

Discussion

Our study is an independent (not industry-led) assessment of the performance of an established AI program in the prevalent round of screening in a “work as usual” accredited screening mammography program where the likelihood of a true positive lesion is up to 1.2% [4], and where data on interval cancers and next round cancers were available (ground truth) [12]. Our focus on a prevalent screening round was deliberate as recall rates are consistently higher in prevalent than incident rounds [5] and the high negative predictive value for AI demonstrated in this study has the potential to reduce unnecessary recalls (false positives) in this group (Table 2 and Fig. 3).

A Transpara score of 10 identified 63/67 cases of IBC or DCIS in the 2017 screening round. A score of 10 also identified some interval IBCs and IBCs found at the subsequent screening round, although the difference in sensitivity between the AI score and the radiologists was not statistically significant. A large review of interval cancers [13] noted the problem of small invasive tumours “masked” by dense fibroglandular tissue or with “minimal signs.” Typically, 20–25% of intervals were classified as false negatives, where observable mammographic features were missed by radiologists. In our study, 5/12 intervals were considered false negatives at “blinded” expert review, and all five were scored 10 (highest risk) by Transpara and marked by the AI algorithm (Fig. 5A and B). Our study confirms the role of AI in the minimisation of “false negatives” [14, 15]. It was notable that two of the four women who were missed by AI in the prevalent round had signs that were not missed by a radiologist [16], demonstrating the need for all images to be read by at least one radiologist [12].

We consider that the task of integrating AI into screening mammography, and of upskilling radiologists to work in this setting, is better started from a position supporting “human in the loop” collaboration [17]. Unlike some recent authors, we would not advocate for a scheme where some images are analysed only by AI and not read by a radiologist [14, 16, 18]. We envisage a system similar to those of McKinney [15] and Raya-Povedano [19]. An iterative review of reader performance will be required to avoid increasing the recall rate from the “cancer-enriched” groups [1]. A review of AI errors is also important [12], and radiologists need to understand the psychology of how AI affects their reporting [12, 18, 20]. Senior clinician oversight will be pivotal as protocols evolve, which must achieve acceptable clinical standards both for the organisations responsible for the governance of breast screening and for the women being screened [12, 20,21,22]. The incorporation of AI into the reading of screening mammograms has been shown to reduce radiologist workload in some settings [23], although this is not universally the case [22]. Resources saved by no longer having all images read by two independent radiologists could be invested in optimising case review and arbitration.

A strength of this study is that the mammograms analysed represent a consecutive series in one calendar year from one screening service, and the outcomes include IBCs and DCIS diagnosed in the prevalent round as well as verified 3-year follow-up. Limitations include that a small proportion of mammograms could not be scored by the AI program for technical reasons, which would need to be addressed if AI were to be introduced into routine practice, and that the study was limited to prevalent round screens from a single calendar year.

Conclusion

Our study has shown that the AI program evaluated has a sensitivity similar to that of expert radiologists in the prevalent round, could reduce interval cancers (false negatives), and has a high negative predictive value for scores of 1–9, demonstrating its potential role in reducing false positive recalls.