Introduction

Natural language processing (NLP) is increasingly used in healthcare, and especially in oncology, for its ability to analyze free text across diverse applications (Sorin et al. 2020a, b). Large language models (LLMs) such as GPT, LLaMA, PaLM, and Falcon represent the pinnacle of this development, leveraging billions of parameters for highly accurate text processing (Sorin et al. 2020a, b; Bubeck et al. 2023). Despite this, the integration of such sophisticated NLP algorithms into practical healthcare settings, particularly in the management of complex diseases like breast cancer, remains a technological, operational, and ethical challenge.

Breast cancer, the most common cancer among women, continues to pose significant challenges in terms of morbidity, mortality, and information overload (Kuhl et al. 2010; Siegel et al. 2019). While LLMs have shown promise in medical text analysis, with GPT-4 achieving a notable 87% success rate on the USMLE (Brin et al. 2023; Chaudhry et al. 2017) and extending its capabilities to image analysis (Sorin et al. 2023a, b, c), their practical application in medicine, and in oncology in particular, is still evolving.

We reviewed the literature on how LLMs might offer solutions in breast cancer care.

Methods

This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher 2009). We searched MEDLINE for applications of LLMs in breast cancer management.

The search included studies published up to December 22, 2023. Our search query was “(("large language models") OR (llm) OR (gpt) OR (chatgpt) OR (openAI)) AND (breast)”. The initial search identified 97 studies. To ensure thoroughness, we also examined the reference lists of the relevant studies; this, however, did not yield additional studies that met our inclusion criteria.
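
For context, an equivalent query can be run programmatically against PubMed through the NCBI E-utilities, for example via Biopython's Entrez module. The sketch below is illustrative only and assumes Biopython is installed; it is not the interface we used to perform the search.

```python
# Illustrative sketch: issuing the review's search string against PubMed via
# Biopython's Entrez wrapper for the NCBI E-utilities (not the interface used
# for this review).
from Bio import Entrez

Entrez.email = "your.email@example.org"  # NCBI requires a contact address; placeholder

query = '(("large language models") OR (llm) OR (gpt) OR (chatgpt) OR (openAI)) AND (breast)'

handle = Entrez.esearch(
    db="pubmed",
    term=query,
    datetype="pdat",          # filter by publication date
    mindate="1900/01/01",     # E-utilities requires mindate and maxdate together
    maxdate="2023/12/22",     # the review's cut-off date
    retmax=200,
)
record = Entrez.read(handle)
handle.close()

print(record["Count"])   # number of matching records
print(record["IdList"])  # PubMed IDs retrieved for screening
```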

The inclusion criteria were English-language, full-length publications that specifically evaluated the role of LLMs in breast cancer management. We excluded papers that addressed other, general applications of LLMs in healthcare or oncology without a specific focus on breast cancer.

Three reviewers (VS, YA, and EKL) independently conducted the search, screened the titles, and reviewed the abstracts of the articles identified in the search. A single discrepancy in the search results was discussed and resolved by consensus. Next, the reviewers assessed the full text of the relevant papers. In total, six publications met our inclusion criteria and were incorporated into this review. We summarized the results of the included studies in a table, detailing the specific LLMs used, the tasks evaluated, and the number of cases, along with publication details. Figure 1 provides a flowchart detailing the screening and inclusion procedure.

Fig. 1
figure 1

Flow Diagram of the Inclusion Process. Flow diagram of the search and inclusion process based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines

Quality was assessed by the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) criteria (Whiting 2011).

Results

All six included studies were published in 2023 (Table 1). All focused on either ChatGPT-3.5 or GPT-4 by OpenAI. Applications described include information extraction and question-answering. Three studies (50.0%) evaluated the performance of ChatGPT on actual patient data (Sorin et al. 2023a, b, c; Choi et al. 2023; Lukac et al. 2023), as opposed to two studies that used data from the internet (Rao et al. 2023; Haver et al. 2023). One study used fictitious patient profiles crafted by the head investigator (Griewing et al. 2023).

Table 1 Studies evaluating LLMs for breast cancer diagnosis and care

Rao et al. and Haver et al. evaluated LLMs for breast imaging recommendations (Rao et al. 2023; Haver et al. 2023). Sorin et al., Lukac et al., and Griewing et al. evaluated LLMs as supportive decision-making tools in multidisciplinary tumor boards (Sorin et al. 2023a, b, c; Lukac et al. 2023; Griewing et al. 2023). Choi et al. used an LLM for information extraction from ultrasound and pathology reports (Choi et al. 2023) (Fig. 2). Example cases for the applications from these studies are detailed in Table 2.
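
To illustrate the information-extraction task in general terms, a report can be submitted to an LLM together with instructions to return a small set of structured fields. The sketch below uses the OpenAI Python SDK with a hypothetical report, prompt, and field set; it does not reproduce the pipeline or prompts used by Choi et al. (2023).

```python
# Illustrative sketch only: structured field extraction from a breast
# ultrasound report with the OpenAI Python SDK. The report text, prompt, and
# output schema are hypothetical.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

report = (
    "Breast US: 1.2 cm irregular hypoechoic mass at the 10 o'clock position "
    "of the left breast, 4 cm from the nipple. Assessment: BI-RADS 4B. "
    "Recommendation: ultrasound-guided core biopsy."
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # favor deterministic, reproducible outputs
    messages=[
        {
            "role": "system",
            "content": (
                "You extract findings from breast imaging reports. Reply in "
                "JSON with keys: laterality, lesion_size_cm, birads_category, "
                "recommendation."
            ),
        },
        {"role": "user", "content": report},
    ],
)

print(response.choices[0].message.content)
```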

Fig. 2
figure 2

Applications of large language models in breast cancer care and the corresponding accuracies achieved in various tasks in the different studies

Table 2 Example cases

Accuracy of LLMs across applications ranged from 50 to 98%. The best performance was achieved for information extraction and question-answering, with correct responses ranging from 88 to 98% (Choi et al. 2023; Rao et al. 2023) (Table 3). The lowest performance was for clinical decision support in breast tumor boards, ranging between 50 and 70% (Sorin et al. 2023a, b, c; Lukac et al. 2023; Griewing et al. 2023). The range in performance on this task was wide between studies; however, the methods of the three studies also varied substantially (Sorin et al. 2023a, b, c; Lukac et al. 2023; Griewing et al. 2023). Sorin et al. and Lukac et al. used authentic patient data and compared ChatGPT-3.5 to the retrospective decisions of the breast tumor board (Sorin et al. 2023a, b, c; Lukac et al. 2023). In both studies, reviewers scored the responses of ChatGPT-3.5 (Sorin et al. 2023a, b, c; Lukac et al. 2023). Griewing et al. (2023) crafted 20 fictitious patient profiles that were then discussed by a multidisciplinary tumor board. Their assessment was based on a binary evaluation of various treatment approaches, including surgery, endocrine therapy, chemotherapy, and radiation therapy. Griewing et al. was the only study to provide insights into LLM performance on genetic testing for breast cancer treatment (Griewing et al. 2023). All three studies analyzed concordance between the tumor board and the LLM on different treatment options (Sorin et al. 2023a, b, c; Lukac et al. 2023; Griewing et al. 2023).
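
For orientation, the concordance analyses in these studies essentially amount to per-modality agreement rates between the tumor board's decision and the LLM's recommendation. The sketch below illustrates such a calculation on hypothetical binary decisions; each included study used its own scoring scheme, and this is not a reproduction of any of them.

```python
# Minimal sketch of a per-modality concordance calculation between tumor-board
# decisions and LLM recommendations. The cases and decisions are hypothetical.
from typing import Dict, List

MODALITIES = ["surgery", "endocrine", "chemotherapy", "radiation"]

# One entry per case: True means the treatment modality was recommended.
tumor_board: List[Dict[str, bool]] = [
    {"surgery": True, "endocrine": True, "chemotherapy": False, "radiation": True},
    {"surgery": True, "endocrine": False, "chemotherapy": True, "radiation": True},
]
llm: List[Dict[str, bool]] = [
    {"surgery": True, "endocrine": True, "chemotherapy": True, "radiation": True},
    {"surgery": True, "endocrine": False, "chemotherapy": True, "radiation": False},
]

for modality in MODALITIES:
    agree = sum(tb[modality] == lm[modality] for tb, lm in zip(tumor_board, llm))
    print(f"{modality}: {agree / len(tumor_board):.0%} concordance")
```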

Table 3 Summarization of performance of LLMs at different breast cancer care related tasks

All studies discussed the limitations of LLMs in the contexts in which the algorithms were evaluated (Table 4). In all studies, some of the information the models generated was false. When used as a support tool for tumor boards, the models in some instances overlooked relevant clinical details (Sorin et al. 2023a, b, c; Lukac et al. 2023; Griewing et al. 2023). Sorin et al. noted a complete absence of referral to imaging (Sorin et al. 2023a, b, c), while Rao et al., who evaluated the appropriateness of imaging, noted overutilization of imaging (Rao et al. 2023). Some of the studies also discussed whether the formulation of the prompt affects the outputs (Choi et al. 2023; Haver et al. 2023), and the difficulty of verifying the reliability of the answers (Lukac et al. 2023; Rao et al. 2023; Haver et al. 2023).

Table 4 Limitations of LLMs as described in each study

According to the QUADAS-2 tool, all papers except two scored as high risk of bias for index test interpretation. For the paper by Lukac et al., the risk was unclear, as the authors did not clearly state whether the evaluators were blinded to the reference standard. The study by Griewing et al. was the only one identified as having a low risk of bias across all categories (Griewing et al. 2023). The detailed assessment of the risk of bias is reported in Supplementary Table 1.

Discussion

We reviewed the literature on LLM applications related to breast cancer management and care. The applications described included information extraction from clinical texts, question-answering for patients and physicians, manuscript drafting, and clinical management recommendations.

A disparity in performance across tasks was evident. The models showed proficiency in information extraction and responding to structured questions, with accuracy rates between 88 and 98%. However, their effectiveness dropped to 50–70% for clinical decision-making, underscoring a gap in their application. In breast cancer care, attention to detail is crucial. LLMs excel at processing medical information quickly; however, at present they may be less adept at navigating complex treatment decisions. Breast cancer cases vary greatly, each distinguished by a unique molecular profile, clinical stage, and patient-specific requirements. It is vital for LLMs to adapt to the individual patient. While these models can assist physicians in routine tasks, they require further development for personalized treatment planning.

Interestingly, half of the studies used real patient data, as opposed to publicly available or fictitious data. In the overall published literature on LLMs in healthcare, there are more publications evaluating performance on public data, including performance on board examinations and question-answering based on guidelines (Sallam 2023). Such analyses may be affected by data contamination, since LLMs were trained on vast amounts of data from the internet. For commercial models such as ChatGPT, the training data are not disclosed. Furthermore, these applications do not necessarily reflect the performance of these models in real-world clinical settings.

While some claim that LLMs may eventually replace healthcare personnel, there are currently major limitations and ethical concerns that strongly suggest otherwise (Lee et al. 2023). Using such models to augment physicians’ performance is more practical, albeit also constrained by ethical issues (Shah et al. 2023). LLMs enable the automation of tasks that traditionally required human effort. The ability to analyze, extract, and generate meaningful textual information could potentially reduce physicians’ workload and human error.

Reliance on LLMs and their potential integration into medicine should be approached with caution. The limitations discussed in the studies further underscore this point. These models can generate false information (termed “hallucination”), which can be seamlessly and confidently blended with accurate information (Sorin et al. 2020a, b). They can also perpetuate disparities in healthcare (Sorin et al. 2021; Kotek et al. 2023). The inherent inability to trace the exact decision-making process of these algorithms is a major challenge for trust and clinical integration (Sorin et al. 2023a, b, c). LLMs can also be vulnerable to cyber-attacks (Sorin et al. 2023a, b, c).

Furthermore, this review highlights the absence of uniform assessment methods for LLMs in healthcare, underlining the need to establish methodological standards for evaluating them. The goal is to enhance the comparability and quality of research. The establishment of such standards is critical for the safe and effective integration of LLMs into healthcare, especially for complex conditions like breast cancer, where personalized patient care is essential.

This review has several limitations. First, due to the heterogeneity of tasks evaluated in the studies, we could not perform a meta-analysis. Second, all included studies assessed ChatGPT-3.5, and only one study evaluated GPT-4; we identified no publications on other available LLMs. Finally, generative AI is a rapidly expanding field, and manuscripts and applications may have been published after our review was performed. LLMs are continually being refined, and so is their performance.

To conclude, LLMs hold potential for breast cancer management, especially in text analysis and guideline-driven question-answering. Yet their inconsistent accuracy warrants cautious use, with thorough validation and ongoing supervision.