The aim of this retrospective study was to assess the potential of using AI to reduce interval cancers in mammography screening. We found that AI could potentially aid radiologists in detecting up to 19.3% of the interval cancers at screening that in addition showed at least minimal signs of malignancy. Since interval cancers in general are more aggressive than screen-detected cancers, the clinical benefit could be considerable. In this cohort, 8% of the women had interval cancer with grave outcome, of which 23% were correctly located and classified as high risk by AI. Since the shortest follow-up period was 3 years, the number of interval cancers with grave outcome was likely on the lower end.
In a retrospective study on screening data from the USA and UK, McKinney et al showed that a mammography-AI system could reduce false negatives by 9.4% and 2.7% (US and UK dataset, respectively) . In this study, including a larger number of cases, we found a larger reduction of interval cancer. As far as we are aware, no other published study includes an in-depth analysis of AI performance in relation to false negative interval cancers.
The majority (61%) of interval cancers were classified as true negative, of which 82% had dense breasts, a well-known risk factor for interval cancer [2, 24]. Over all, the study population had a high proportion of women with dense breasts, similar to a previously reported interval cancer cohort . Using a screening modality that is less affected by breast density than mammography could be one way of increasing the sensitivity of the screening examination. Breast tomosynthesis can reduce the problem with dense tissue although the results of screening with tomosynthesis in terms of reduction of interval cancer have been conflicting [26, 27]. Supplementary screening with ultrasound and magnetic resonance imaging has been shown to reduce interval cancer rate, but at the expense of false positives and increased cost [28, 29]. This study suggests that AI can be used in a simple way to enhance the sensitivity of mammography screening without introducing supplementary modalities.
We do not suggest that all screening exams with high AI risk should be recalled, which would result in an unacceptable high recall rate (10%). The cancer frequency in mammography screen exams with risk score 10 is about 44/1000 , which means that the majority of the exams are cancer-free. In a prior retrospective study on screening data, we found that the highest proportion of false positives were found in risk group 10, which implies that the mammograms were challenging to analyse both for humans and AI . In addition, reader awareness of high AI risk could influence radiologists to lower the threshold to recall, resulting in a reduction of false negatives at the expense of an increase in false positives . To address the potential clinical utility of using AI to lower interval cancer rate at a clinically acceptable specificity, we therefore chose to confine the potential interval cancer reduction to retrospectively visible cancers that were correctly CAD-marked as high risk. Roughly 1/3 of interval cancers received risk score 10, but only half of these were considered to have a suspicious finding that was correctly located with a CAD-mark. It is important to bear in mind that even if a cancer is correctly CAD-marked, it does not necessarily mean that it will be recalled by radiologists, as was shown in a retrospective reader study by Nishikawa et al , nor that a cancer necessarily will be diagnosed in the work-up [32, 33], which applies especially to those with minimal signs at screening.
The potential reduction of interval cancer using AI was modest, but involved women diagnosed with interval cancer with grave outcome that most likely would have benefitted from an early detection. Furthermore, even with the use of a high-sensitivity modality such as MRI, not all interval cancers will be detectable at screening . The tumour biology of certain subtypes of breast cancer has a rapid growth rate and/or with an initial subtle or benign radiographic appearance, such as the triple negative subtype [4, 23]. AI performance in relation to tumour biology and stage of interval cancers will be included in future studies.
Notably, the interval cancer cohort in this study included a high proportion of women with prior breast surgery, including surgery of cancer, benign lesions and breast reduction. The surgical deformation of normal breast parenchymal architecture can lead to a tumour masking effect that might compose an independent risk factor of interval cancer. Since we do not have data on how common surgical procedures are in a screening population, a conclusion cannot be drawn. To the best of our knowledge, prior breast surgery has not previously been reported as a risk factor for interval cancer and warrant further studies.
There was a significant correlation between classification groups of interval cancer and AI risk scores. This finding raises an intriguing question whether AI could be used in the clinical audit of interval cancers , taking advantage of AI as an interval cancer classifier that is free from hindsight bias. However, this has to be further studied, considering that the review process of interval cancers in this study was subjected to limitations, informed review of a cohort consisting solely of interval cancers. This review method has been shown to lead to a higher proportion of interval cancers classified as false negative compared with a review process that is blinded or with a mix of cases, or even seeded into routine screening [8, 9].
The limitations of this study are several. The informed review process of interval cancer could have inflated the number of false negatives, as mentioned above. The generalizability is further limited due to the use of a single AI system. A study comparing the performance of other AI systems on the same interval cancer cohort is ongoing. In addition, the AI algorithm used in this study has since study completion been updated to an improved version, implying that the potential reduction of interval cancers could be higher. The study was performed in a Swedish screening setting, e.g. starting at a younger age with initial shorter screening intervals than European recommendations . The recall rate, cancer detection rate and interval cancer rate in this screening setting are aligned with European recommendations (approx. 3%, 6/1000 screened women, and 2/1000, respectively). The screening exams were acquired using different mammography devices but did not cover all major mammography vendors. The main limitation is, however, the retrospective design that only provides a theoretical estimation on interval cancer reduction. The use of AI in screening and how the risk scores and CAD-marks influence radiologists’ decisions, and whether AI should be added to double reading or replace one reader, has to be further evaluated in a prospective setting, taking false positives into account.
In conclusion, this study has shown that an AI system detected 19% of interval cancers at the preceding screening mammograms that in addition showed at least minimal signs of malignancy. Importantly, these cancers were correctly located and classified as high risk by AI, thus obviating supplementary screening modalities. AI could therefore potentially aid radiologists in their screen reading to reduce the number of interval cancer and consequently contribute to a further reduction of breast cancer mortality. The implications in a screening programme have to be evaluated in a prospective study.