Abstract
Purpose
Artificial intelligence (AI) for reading breast screening mammograms could potentially replace (some) human-reading and improve screening effectiveness. This systematic review aims to identify and quantify the types of AI errors to better understand the consequences of implementing this technology.
Methods
Electronic databases were searched for external validation studies of the accuracy of AI algorithms in real-world screening mammograms. Descriptive synthesis was performed on error types and frequency. False negative proportions (FNP) and false positive proportions (FPP) were pooled within AI positivity thresholds using random-effects meta-analysis.
Results
Seven retrospective studies (447,676 examinations; published 2019–2022) met inclusion criteria. Five studies reported AI error as false negatives or false positives. Pooled FPP decreased incrementally with increasing positivity threshold (71.83% [95% CI 69.67, 73.90] at Transpara 3 to 10.77% [95% CI 8.34, 13.79] at Transpara 9). Pooled FNP increased incrementally from 0.02% [95% CI 0.01, 0.03] (Transpara 3) to 0.12% [95% CI 0.06, 0.26] (Transpara 9), consistent with a trade-off with FPP. Heterogeneity within thresholds reflected algorithm version and completeness of the reference standard. Other forms of AI error were reported rarely (location error and technical error in one study each).
Conclusion
AI errors are largely interpreted in the framework of test accuracy. FP and FN errors show expected variability not only by positivity threshold, but also by algorithm version and study quality. Reporting of other forms of AI errors is sparse, despite their potential implications for adoption of the technology. Considering broader types of AI error would add nuance to reporting that can inform inferences about AI’s utility.
Introduction
Artificial intelligence (AI), a rapidly evolving field of data science in which computer algorithms are developed to perform complex tasks, has been applied to screening mammography for the early detection of breast cancer with the aim of improving outcomes for screening participants [1]. AI has the potential to identify cancers in mammograms that are not perceptible to human readers, thereby potentially increasing the sensitivity of screening and improving outcomes for women through initiation of treatment for early-stage disease. Other proposed benefits of AI include fewer false positive findings that lead to anxiety and unnecessary investigations, and workforce efficiencies for screening programmes that may translate to lower programme costs and improvements in the screening experience for women. Such benefits assume that AI performs at least as accurately as human readers in detecting breast cancer, and research has therefore focussed on evaluating the comparative accuracy of algorithms and human readers. However, there is recognition that even when algorithms exhibit high performance in selected research datasets, AI errors in cancer detection (false positives, FP; false negatives, FN) may be greater when algorithms are applied in “real-world” settings or transferred between populations [2]. Furthermore, technological updates can produce subtle changes to medical images which may not be obvious to humans but can alter the AI’s output [3]. Such errors may be difficult to detect and explain by humans [4] and may strongly influence decision making by human readers (automation bias) [5]. Given the theoretical ease for AI algorithms to be scaled up and applied to large populations, unpredictable or unexpected errors may lead to harmful consequences.
Beyond the potential for FP or FN cancer findings, the concept of AI “error” in automated mammography interpretation has not been clearly delineated. Other types of error may include a (true positive) cancer detected in the wrong location, or technical errors that result in the algorithm failing to process images or generate a result. Earlier systematic reviews presented AI error as FP and FN, which is consistent with the focus on test accuracy in the literature [6,7,8]. However, imaging or lesion features associated with these FP and FN were not elaborated, and other potential forms of error were not reported. At present, it is unclear what forms of AI error are reported in the literature, as well as the frequency and lesion or imaging features of these AI errors.
In this study, we aim to identify the range of outcomes that have been reported as AI errors; to quantify the frequency of AI errors; and to describe the study, imaging, or lesion features associated with AI errors in practice. To meet this objective, we performed a systematic literature review of external validation studies of AI algorithms for independent mammographic interpretation using real-world screening data.
Materials and methods
This systematic review was performed and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) statement [9]. The review protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42022340331).
Information sources and literature search
A literature search was conducted without language restrictions for diagnostic accuracy studies published from 1 January 2010 to 11 July 2022. To capture contemporary AI algorithms, the search was limited from January 2010, coinciding with technical and hardware developments that facilitated efficient processing of machine learning [10]. Databases searched included MEDLINE, EMBASE, SCOPUS and a pre-print database, arXiv. We reviewed reference lists of relevant systematic reviews to identify additional studies. Details of the search strategy are listed in Online Resource 1.
Study selection
One reviewer (A.Z.) independently screened titles and abstracts and subsequent full-text articles against eligibility criteria (Online Resource 2). A second reviewer (M.L.M.) independently screened a 25% sample of titles and abstracts and the final set of included full-text studies for quality assurance.
Eligible study designs were external validation studies performed in population breast cancer screening settings where the AI algorithm acted as an independent reader (defined as a standalone system for replacement of radiologist reading, or as a pre-screen to triage whether the mammogram requires further review by a radiologist). Where studies included both conventional mammography and tomosynthesis, data on mammograms only were included.
Studies were excluded if more than 5% of included mammograms were incomplete; AI was used as a prediction tool (e.g. cancer subtypes, lesion characteristics or risk) or to assist radiologist reading (meaning the read was not solely from the AI algorithm); or AI was implemented for other imaging types (e.g. magnetic resonance imaging or ultrasound). Studies were excluded if outcomes did not include AI errors.
Data extraction and risk of bias assessment
Two reviewers (A.Z. and N.N.) independently extracted data on a pre-designed standardised collection form. Data that were systematically extracted included study design and setting, population characteristics, commercial availability, frequency and characteristics of pre-specified AI errors [false positives (FP), false negatives (FN), location error, technical error] and reference standard. Other forms of AI error were extracted when reported. FP was defined as incorrect presence of a suspicious finding (in cases where no cancer is found). FN was defined as cancer not detected by AI but detected by radiologist(s) or found at follow-up. Location error was defined as correct diagnosis of cancer, but the region of interest indicated by AI does not correspond with cancer location. Technical error was defined as failure of AI to process and interpret the mammogram or output a finding.
From studies reporting AI accuracy, we extracted raw values to derive 2 × 2 tables cross-classifying the AI result (positive or negative) and the reference standard finding (cancer present or absent). From these values, we calculated false positive proportions (FPP) and false negative proportions (FNP) per study (FP or FN divided by total number of examinations). When studies reported data at comparable positivity thresholds (including multiple possible thresholds per study), we extracted data and calculated FNP and FPP, and sensitivity and specificity estimates at those thresholds. Only common positivity thresholds across studies were reported.
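The derivation described above can be sketched in code. This is an illustrative sketch only (not the authors' analysis code), with hypothetical counts chosen to approximate the cancer prevalence reported in the included studies:

```python
def screening_metrics(tp, fp, fn, tn):
    """Derive accuracy measures from a 2x2 table cross-classifying the
    AI result (positive/negative) against the reference standard
    (cancer present/absent), as described in the review's methods."""
    total = tp + fp + fn + tn
    return {
        "FPP": fp / total,            # false positive proportion (FP / all examinations)
        "FNP": fn / total,            # false negative proportion (FN / all examinations)
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 100,000 screens at ~0.7% cancer prevalence
m = screening_metrics(tp=600, fp=10_000, fn=100, tn=89_300)
print(round(m["FPP"], 4))   # 0.1
print(round(m["FNP"], 4))   # 0.001
```

Note that FPP and FNP are expressed per examination (denominator = all screens), which is why FNP values in the results below are orders of magnitude smaller than (1 − sensitivity).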
Two reviewers (A.Z. and M.L.M.) independently assessed methodological quality using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool modified to the review question and QUADAS-AI preliminary extension [11] (Online Resource 3). Risk of bias of individual studies was assessed under four domains including (i) patient selection, (ii) index test, (iii) reference standard and (iv) flow and timing. The first three domains were assessed in terms of concerns regarding applicability. Reference standards were recorded to assess comparability across studies. All discrepancies were resolved by discussion and consensus.
Data synthesis
Narrative synthesis was conducted because of methodological variations between studies. FPP, FNP, location, technical and other errors were tabulated. FNP and FPP estimates and their 95% Wald confidence intervals (CIs) were plotted in a forest plot. Estimates were pooled within each positivity threshold using inverse variance random-effects meta-analysis with the restricted maximum likelihood estimator [12, 13]. Tests for subgroup differences between thresholds were not calculated because data in each subgroup were not independent.
Sensitivity and specificity of single and consensus readers were plotted against AI positivity thresholds (when reported) in receiver operating characteristic (ROC) space to complement FPP and FNP estimates.
Analyses were undertaken using the metafor package, and visual summaries were generated using the ggplot2 package [14, 15], in R version 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria).
Results
After deduplication, 1760 unique results were screened, of which 73 potentially eligible full-text articles were assessed. Seven studies were included in this review [16,17,18,19,20,21,22]. Figure 1 summarises the screening and eligibility process and documents reasons for exclusion.
Study characteristics of included studies
Characteristics of the seven included studies, comprising 447,676 examinations, are presented in Table 1. One study reported AI error as the location of false markings on the mammogram [20]. The remaining six studies reported AI sensitivity and specificity [16,17,18,19, 21, 22], and five of those reported AI errors as FN and FP according to the positivity threshold applied [16,17,18,19, 22]. One study reported additional error information as AI technical failures [19].
Two studies evaluated real-world screening datasets from Sweden, and one each evaluated screening populations from Denmark, Norway, Germany, Spain, and the United States (US). Evaluation datasets were sourced from screening programmes [18, 19, 21], sub-cohorts of randomised controlled trials [17, 22] and specialist cancer centres [16, 20]. Screening mammograms all had four views (two views per breast). Mammography units were Siemens Mammomat [17,18,19] or Hologic Selenia [20, 22], and one study reported the use of both [16]. All were retrospective cohort studies with consecutive screens. Years of enrolment ranged from 2008 to 2018. Study-level mean age of the women ranged between 53 and 60 years.
Of the six studies reporting FP and FN errors and/or sensitivity and specificity, two used a reference standard of screen-detected cancers only [16, 17], and four included interval cancers (in addition to screen-detected cancers) with follow-up of either 12 months [21] or 24 months [18, 19, 22]. Cancer prevalence in studies with screen-detected cancers only ranged between 0.64 and 0.71%, whereas it ranged from 0.71 to 1.22% in studies that included screen-detected and interval cancers. The reference standard in the additional study reporting cancer location was clinical review of the established (biopsied) cancer location (all screen-detected; ≥ 2-year follow-up confirmed no interval cancers). An AI cancer marking was considered correct if its location intersected with the geometric centre of the ground-truth (radiologists') region of interest (ROI) [20].
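The location-correctness rule used by that study can be sketched as a simple geometric check. This is a hypothetical simplification of the study's clinical review process, assuming both the AI marking and the ground-truth ROI are represented as axis-aligned pixel bounding boxes:

```python
def marking_is_correct(marking_box, roi_box):
    """Sketch of the location-correctness rule described in the review:
    an AI marking counts as correct if it covers the geometric centre of
    the radiologists' ground-truth ROI. Boxes are (x0, y0, x1, y1) in
    pixel coordinates; box representation is an assumption for illustration.
    """
    cx = (roi_box[0] + roi_box[2]) / 2   # geometric centre of the ground-truth ROI
    cy = (roi_box[1] + roi_box[3]) / 2
    x0, y0, x1, y1 = marking_box
    return x0 <= cx <= x1 and y0 <= cy <= y1

# Ground-truth ROI centred at (150, 200); the first AI marking covers
# that centre, the second is elsewhere on the image (a location error)
print(marking_is_correct((100, 150, 200, 250), (120, 170, 180, 230)))  # True
print(marking_is_correct((300, 300, 400, 400), (120, 170, 180, 230)))  # False
```

As the Discussion notes, there is no gold standard for defining such location errors; any operational rule like this one embeds subjective choices about region representation and overlap.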
Four studies using AI for triage reported that "high"- or "moderate"-risk cases would be reviewed by radiologists [16, 17, 19, 22]. "Low-risk" cases (i.e. cases deemed to have low suspicion of cancer) would have either no human reading or reading by a single radiologist. AI performance was compared to radiologists (single or consensus reading) with 3 to 15 years of experience [16, 19, 22]. Three studies used standalone AI to evaluate its accuracy compared to either double reading (1–20 + years' experience) [18, 21] or computer-aided detection (CAD) to reduce false positive markings on mammograms [20].
Table 2 summarises the risk of bias and applicability concerns of included studies. Overall, four studies had high risk of bias or applicability concerns in at least one of the four domains [16, 17, 20, 21]. Two studies had high risk of bias and applicability concerns for the reference standard [16, 17] and three studies had unclear risk of bias for patient selection [17, 20, 22]. Four studies had either high or unclear risk of bias in flow and timing [16, 17, 20, 21].
Characteristics of AI tools and positivity thresholds
Of the six studies reporting FP and FN errors, five evaluated different versions (V1.4.0 to V1.7.0) of a commercially available algorithm (Transpara, Screen Point Medical) [16,17,18,19, 22] and one used an ensemble model [21] (Online Resource 4). An additional study reporting location errors used a prototype AI-CAD system [20].
In studies applying AI for triage, two used a single prospective threshold [17, 22], one used multiple prospective thresholds [19] and one study used multiple retrospective thresholds [16]. The most commonly reported thresholds were Transpara Score 5 [17, 19] or 7 [16, 22], where 10 equates to the highest suspicion of cancer on a scale of 0–10.
In studies using standalone AI, one used a single prospective threshold (Transpara Score 9) in addition to retrospective thresholds [18] and one used a retrospective single threshold to match radiologists’ specificity [21]. The study that assessed the location of any AI markings on the mammogram did not specify an AI positivity threshold [20].
Reported AI errors and associated factors
Table 3 presents the frequency and features associated with reported AI error.
Six studies reported sensitivity and specificity [16,17,18,19, 21, 22], five of which also reported AI error as false negatives and false positives at comparable positivity thresholds [16,17,18,19, 22]. One study reported AI error as technical failure, defined as failure of the AI to process mammograms [19]. One study referred to location error as an AI cancer marking incorrectly highlighted on a mammogram (i.e. false positive markings) [20].
False positive proportion (FPP) and false negative proportion (FNP)
Pooled FPP decreased incrementally with increasing Transpara threshold (71.83% [95% CI 69.67, 73.90] at Transpara 3 to 10.77% [95% CI 8.34, 13.79] at Transpara 9). The most commonly reported prospective triage thresholds were Transpara 5 (pooled FPP 46.86% [95% CI 39.33, 54.53]) and Transpara 7 (pooled FPP 29.86% [95% CI 26.59, 33.35]) (Fig. 2).
The pooled FNP increased incrementally from 0.02% [95% CI 0.01, 0.03] at Transpara 3 to 0.12% [95% CI 0.06, 0.26] (Transpara 9), consistent with a trade-off with FPP.
There was heterogeneity within Transpara thresholds, reflecting study-level differences in Transpara version and reference standard (Fig. 2, Online Resource 5). For studies using later versions of Transpara (V1.6.0 and V1.7.0), FPP was lower (and FNP higher) within each threshold when the reference standard included screen-detected and interval cancers [18, 19, 22], compared with the study including only screen-detected cancers in the reference standard [16]. One study that evaluated an earlier Transpara version (V1.4.0) included only screen-detected cancers in the reference standard [17]; it reported lower FPP and higher FNP relative to a study using the same reference standard and a later Transpara version [16].
Table 3 describes the lesion or imaging features associated with cancers missed by AI (i.e. FN). Three studies reported the lesion or imaging characteristics associated with FN, each at a different Transpara score (i.e. 5, 7, 9). Two studies reported that the majority of FN were invasive cancers (88–100%) [17, 18]. A majority of FN (53–100%) were interval cancers [18, 22]. One study reported that FN cancers generally had a radiographic appearance of a spiculated mass and occurred in Breast Imaging Reporting and Data System density C and D breasts [17]. Two studies reported that median tumour size for cancers missed by AI ranged from ≤ 7 to 25 mm [17, 18]. Among cancers detected by AI, the majority (77–84%) were also invasive cancers [17, 18].
Sensitivity and specificity
In studies that compared AI performance to radiologists, two reported on single reading [16, 21] and four [16, 18, 21, 22] reported on consensus reading. Transpara was the AI tool used in five of these studies (the other used an ensemble system [21]). As expected, we observed a trade-off with higher specificity and lower sensitivity as the Transpara positivity threshold increased. Regardless of single or consensus reading, the radiologists’ specificity remained consistent relative to the varied specificity and sensitivity at different AI positivity thresholds (Fig. 3). The range in sensitivity of a single reader is comparable to Transpara Score 7 (0.83–0.92) or 9 (0.77–0.88).
Other types of error
Two studies reported other forms of AI error. One study reported that no technical failures (instances in which the AI model failed to process mammograms) were encountered [19]. A second study investigated the location of AI false markings on the mammogram [20]. Location error was weakly associated with radiographic features including calcifications or masses. Eight of the 18 location errors were ultimately confirmed as benign lesions. No other forms of AI error were identified from the included studies.
Discussion
This systematic review of externally validated AI algorithms for cancer detection in breast screening identified relatively few studies that report AI errors. Four types of AI errors were identified, the most commonly reported being false positive and false negative findings, which is consistent with a focus on diagnostic accuracy in studies of AI in breast cancer screening [16,17,18,19, 22]. Previous systematic reviews have assessed the diagnostic accuracy of AI in external validation studies; however, none have reported on AI error in detail [6,7,8]. This review is a novel attempt to report findings and imaging features associated with AI errors and identify other types of error. Technical and location errors were reported relatively infrequently and inconsistently, despite their importance in establishing the utility of AI in population breast cancer screening practice.
The findings highlight factors relating to algorithm, study, and imaging characteristics that may plausibly influence the FPP and FNP of AI in the breast screening context. The exploration of multiple AI positivity thresholds showed the expected trade-off between FPP and FNP, with progressively lower FPP (and higher FNP) as the positivity threshold increased. However, there was considerable heterogeneity of FPP and FNP within thresholds. Between-study comparisons suggested that the frequency of these errors varied according to the version of the AI algorithm. Lower FNP was observed with more recent (V1.6.0 and V1.7.0) compared with earlier (V1.4.0) versions of Transpara, suggesting that improvements in AI over time have resulted in a lower likelihood of errors leading to missed cancers. However, a corresponding increase in FPP was also found, indicating that technical improvements to enhance AI sensitivity have the potential to result in increased recall. Studies that have integrated AI into the screen-reading workflow as an adjunct to radiologist reading have used recent Transpara versions [18, 23]; absent comparison with earlier versions, it is unclear whether these observed differences in AI error rates may have translated to increased cancer detection and recall over time.
The comparisons also highlight the importance of appropriately defining the reference standard to classify FP and FN results. Studies that included both screen-detected and interval cancers reported lower FPP for AI compared with studies including screen-detected cancers only, logically reflecting the limitation of the latter design in validating true positive AI results that are deemed negative by radiologists [6]. "Missed" interval cancers by AI also contributed to higher FNP in such studies. Incompleteness of interval cancer ascertainment has been identified as a source of potential bias in studies of AI [6, 8], with empirical studies showing an inflation of overall accuracy [2]. Studies investigating AI errors should, at minimum, include all interval cancers (ideally through registry linkage to minimise the potential for bias) [8]; extended follow-up should also be considered [8, 24], acknowledging the desirability of aligning follow-up with screening intervals, which may differ between settings.
The above suggestion that studies of AI for mammography screening should use sufficient follow-up to include interval cancers is reinforced by the study from Larsen et al. [18], which used cancer registry linkage to ascertain interval cancers and showed that more of the cancers missed by the AI were interval (rather than screen-detected) cancers. It also showed that the AI was more likely to miss smaller tumours: the median tumour diameter of cancers missed by the AI ranged between 7 and 25 mm, whereas that of cancers correctly detected by the AI ranged between 9 and 30 mm.
Technical AI errors—in which the algorithm fails to generate output—may have important implications for implementation of AI in screening programmes. Such failures require remediation in the workflow, and systematic failures have the potential to have disproportionate impacts on different sub-populations [3]. Location errors—where AI identifies abnormality in an incorrect location of the breast—have potentially adverse clinical consequences for women and may erode radiologists’ confidence in AI findings. However, it should be noted that at present, there is no gold standard for defining these ‘location-specific’ errors which require clinical (imaging) review and subjective judgement, and this is an area worthy of further exploration. The one study identified in this review referred to location error as an AI cancer marking that is incorrectly highlighted on a mammogram and used a retrospective clinical review process [20]. Breast imaging fellowship-trained radiologists generated a region of interest to establish the location of the biopsied lesion, and AI markings were considered to be correct if the geometric centre lay within the region of interest.
Despite the importance of understanding the frequency and nature of both technical and location errors that occur when AI reads mammography, this review found that they were reported infrequently. In the case of technical errors, these were reported in only one instance, to confirm the absence of such errors [19]. Additional emphasis on enumerating and describing such AI errors would be a valuable complement to the current research focus on accuracy (including FP and FN errors), enabling better understanding of the potential impact of AI on screening workflow and clinical outcomes. Improving knowledge on these issues is highly relevant for potential implementation of AI in breast screening practice, noting that women have high expectations that AI will improve mammography screening accuracy and outcomes [25].
This review had limitations. Firstly, it focused on AI as a standalone reader, not as an aid to reader interpretation. This restriction may have excluded studies that are more likely to report location errors. However, such errors have been reported mostly in reader studies using cancer-enriched datasets and may not be generalisable to population breast cancer screening settings [7]. Furthermore, the review interprets between-study comparisons to explain heterogeneity in error estimates. Although the observed differences in FPP and FNP are in the expected directions, there are likely to be clinical and methodological differences between studies beyond those considered in our analyses. Within-study comparisons would provide stronger evidence from which to draw inferences. Where possible, authors should be encouraged to facilitate such comparisons by reporting FP and FN errors for screen-detected and interval cancers separately, and at multiple follow-up times for ascertaining interval cancers [2].
Conclusions
Current evidence on AI algorithms in breast cancer screening demonstrates that false positives and false negatives are the predominantly reported forms of AI errors, which is consistent with the focus on diagnostic accuracy in the literature. Further reporting of other types of errors, including technical errors, could provide a better understanding of AI’s utility in breast screening practice. Further studies on AI errors using real-world data could also allow future systematic reviews to explore plausible factors (e.g. clinical or radiological characteristics) associated with errors that are generalisable to real populations. This could complement co-existing AI accuracy research, to ensure the safe integration of AI into future screening practice.
Data availability
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
References
Marinovich ML, Wylie E, Lotter W, Pearce A, Carter SM, Lund H et al (2022) Artificial intelligence (AI) to enhance breast cancer screening: protocol for population-based cohort study of cancer detection. BMJ Open 12(1):e054005. https://doi.org/10.1136/bmjopen-2021-054005
Marinovich ML, Wylie E, Lotter W, Lund H, Waddell A, Madeley C et al (2023) Artificial intelligence (AI) for breast cancer screening: BreastScreen population-based cohort study of cancer detection. EBioMedicine 90:104498. https://doi.org/10.1016/j.ebiom.2023.104498
Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C (2020) Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. Proc ACM Conf Health Inference Learn 2020(2020):151–159. https://doi.org/10.1145/3368555.3384468
Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS (2019) Adversarial attacks on medical machine learning. Science 363(6433):1287–1289
Dratsch T, Chen X, Rezazade Mehrizi M, Kloeckner R, Mähringer-Kunz A, Püsken M et al (2023) Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307(4):e222176. https://doi.org/10.1148/radiol.222176
Anderson AW, Marinovich ML, Houssami N, Lowry KP, Elmore JG, Buist DSM et al (2022) Independent external validation of artificial intelligence algorithms for automated interpretation of screening mammography: a systematic review. J Am Coll Radiol 19(2):259–273. https://doi.org/10.1016/j.jacr.2021.11.008
Houssami N, Kirkpatrick-Jones G, Noguchi N, Lee CI (2019) Artificial Intelligence (AI) for the early detection of breast cancer: a scoping review to assess AI’s potential in breast screening practice. Expert Rev Med Devices 16(5):351–362. https://doi.org/10.1080/17434440.2019.1610387
Freeman K, Geppert J, Stinton C, Todkill D, Johnson S, Clarke A et al (2021) Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 374:n1872. https://doi.org/10.1136/bmj.n1872
McInnes MD, Moher D, Thombs BD, McGrath TA, Bossuyt PM, Clifford T et al (2018) Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement. JAMA 319(4):388–396. https://doi.org/10.1001/jama.2017.19163
Lee JH, Shin J, Realff MJ (2018) Machine learning: overview of the recent progresses and implications for the process systems engineering field. Comput Chem Eng 114:111–121. https://doi.org/10.1016/j.compchemeng.2017.10.008
Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R et al (2021) A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med 27(10):1663–1665. https://doi.org/10.1038/s41591-021-01517-0
Viechtbauer W (2005) Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat 30(3):261–293. https://doi.org/10.3102/10769986030003261
Raudenbush SW (2009) Analyzing effect sizes: random-effects models. The handbook of research synthesis and meta-analysis, 2nd edn. Russell Sage Foundation, New York, pp 295–315
Viechtbauer W (2010) Conducting meta-analyses in R with the metafor package. J Stat Softw 36(3):1–48. https://doi.org/10.18637/jss.v036.i03
Wickham H (2006) An implementation of the grammar of graphics in R: ggplot. Book of Abstracts
Balta C, Rodriguez-Ruiz A, Mieskes C, Karssemeijer N, Heywang-Köbrunner SH (2020) Going from double to single reading for screening exams labeled as likely normal by AI: what is the impact? Proceedings of SPIE. https://doi.org/10.1117/12.2564179
Lang K, Dustler M, Dahlblom V, Akesson A, Andersson I, Zackrisson S (2021) Identifying normal mammograms in a large screening population using artificial intelligence. Eur Radiol 31(3):1687–1692. https://doi.org/10.1007/s00330-020-07165-1
Larsen M, Aglen CF, Lee CI, Hoff SR, Lund-Hanssen H, Lang K et al (2022) Artificial intelligence evaluation of 122969 mammography examinations from a population-based screening program. Radiology 303:212381. https://doi.org/10.1148/radiol.212381
Lauritzen AD, Rodriguez-Ruiz A, von Euler-Chelpin MC, Lynge E, Vejborg I, Nielsen M et al (2022) An artificial intelligence-based mammography screening protocol for breast cancer: outcome and radiologist workload. Radiology 304:210948. https://doi.org/10.1148/radiol.210948
Mayo RC, Kent D, Sen LC, Kapoor M, Leung JWT, Watanabe AT (2019) Reduction of false-positive markings on mammograms: a retrospective comparison study using an artificial intelligence-based CAD. J Digit Imaging 32(4):618–624. https://doi.org/10.1007/s10278-018-0168-6
Schaffter T, Buist DSM, Lee CI, Nikulin Y, Ribli D, Guan Y et al (2020) Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open 3(3):e200265. https://doi.org/10.1001/jamanetworkopen.2020.0265
Raya-Povedano JL, Romero-Martin S, Elias-Cabot E, Gubern-Merida A, Rodriguez-Ruiz A, Alvarez-Benito M (2021) AI-based strategies to reduce workload in breast cancer screening with mammography and tomosynthesis: a retrospective evaluation. Radiology 300(1):57–65. https://doi.org/10.1148/radiol.2021203555
Larsen M, Aglen CF, Hoff SR, Lund-Hanssen H, Hofvind S (2022) Possible strategies for use of artificial intelligence in screen-reading of mammograms, based on retrospective data from 122,969 screening examinations. Eur Radiol 32(12):8238–8246. https://doi.org/10.1007/s00330-022-08909-x
Lee CI, Houssami N, Elmore JG, Buist DSM (2020) Pathways to breast cancer screening artificial intelligence algorithm validation. Breast 52:146–149
Lennox-Chhugani N, Chen Y, Pearson V, Trzcinski B, James J (2021) Women’s attitudes to the use of AI image readers: a case study from a national breast screening programme. BMJ Health Care Inform 28(1):e100293. https://doi.org/10.1136/bmjhci-2020-100293
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. Aileen Zeng was a recipient of The Daffodil Centre Postgraduate Research Scholarship. Dr Luke Marinovich is funded by a National Breast Cancer Foundation Investigator Initiated Research Scheme grant (2023/IIRS0028). Prof Nehamat Houssami is funded through a National Breast Cancer Foundation Chair in Breast Cancer Prevention grant (EC-21-001) and a National Health and Medical Research Council Investigator Leader grant (1194410). Dr Brooke Nickel is supported by a National Health and Medical Research Council (NHMRC) Emerging Leader Research Fellowship (1194108).
Author information
Contributions
Conceptualization: Nehmat Houssami, Aileen Zeng and Luke Marinovich; Literature Search: Aileen Zeng; Data Extraction: Aileen Zeng, Naomi Noguchi and Luke Marinovich; Data Analysis: Aileen Zeng and Luke Marinovich; Writing – original draft preparation: Aileen Zeng and Luke Marinovich; Writing – review and editing: Nehmat Houssami, Luke Marinovich, Naomi Noguchi and Brooke Nickel.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
This is a systematic review and does not require ethics approval.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Supplementary Information
Cite this article
Zeng, A., Houssami, N., Noguchi, N. et al. Frequency and characteristics of errors by artificial intelligence (AI) in reading screening mammography: a systematic review. Breast Cancer Res Treat 207, 1–13 (2024). https://doi.org/10.1007/s10549-024-07353-3