What does this study add to the clinical work?

This early feasibility study demonstrates that publicly available Large Language Models (LLM), especially GPT4, increasingly align with the decision-making of a multidisciplinary tumor board regarding high complexity breast cancer treatment choices. It indicates that methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies. At present, safe and evidence-based use of LLM in clinical breast cancer care is not yet feasible.

Introduction

In Germany, invasive breast cancer is the most prevalent cancer affecting women, with the annual national incidence exceeding 70,000 cases [1]. The implementation of a comprehensive nationwide mammography screening program between 2005 and 2009 resulted in an initial peak in detected breast cancer cases. Subsequently, the increased efforts led to a consistent reduction in the incidence of advanced tumors and a gradual decline of primary disease. Despite these improvements, a high disease burden of breast cancer persists and, given the aging demographic, a future increase in the incidence of breast cancer is anticipated. This development is accompanied by an intensifying shortage of healthcare professionals and care capacity [1, 2]. In addition, extensive research continuously expands the spectrum of treatment modalities, encompassing surgical interventions, endocrine therapy, radiotherapy, both neoadjuvant and adjuvant chemotherapy, and genetic testing for hereditary breast and ovarian cancer syndromes [3]. Moreover, the swift advancement in diagnostic and treatment technologies, including the increasing adoption of next-generation sequencing, genetic arrays for the prediction of disease prognosis or chemotherapy benefit, and the use of precision-targeted therapies such as antibody-drug conjugates, is shaping a transformative phase in gynecological oncology [4, 5]. This development is marked by an abundance of evidence-based knowledge and health data, whose complexity increasingly overwhelms practitioners [6]. There is growing optimism that technological innovations will bridge the gap between scientific possibilities and practical healthcare delivery by providing support to caregivers, and that they will enable more individualized and effective treatment strategies in an environment with high volumes of data [7].

High expectations are placed on artificial intelligence-based clinical decision support tools to augment physicians' intelligence in order to keep pace with this rapid development [8, 9]. Historically, the cumbersome digitization of German healthcare has led to a gap between technological capabilities and current practices, which keeps widening [10]. A nationwide survey conducted by the Commission Digital Medicine of the German Association of Gynecology and Obstetrics (DGGG) revealed high heterogeneity in digital infrastructure within the field of gynecology, characterized by low interoperability and outdated systems, leading to dissatisfaction among healthcare providers [11]. In contrast, most gynecology specialists are optimistic that digitization could ease their growing workloads and enhance patient care, and they foresee the adoption of smart algorithms to assist in patient treatment [12]. In the meantime, it has become normal for patients to assess new symptoms digitally before visiting a doctor, e.g., using online search engines and dedicated app-based symptom checkers [13,14,15]. This includes the recently widely available Large Language Models (LLM), with tech-savvy individuals increasingly turning to public chatbots for health-related inquiries [9, 16]. This shift toward relying on easily accessible online resources, evolving from simple Google keyword searches to consulting advanced tools like ChatGPT, highlights a new reality that likewise demands scientific evidence on the medical use of LLM.

The emergence of publicly available LLM in artificial intelligence has opened a new field in medical research, which still lacks the definition of methodological guard rails and best practices. Preliminary proof-of-concept analyses have indicated potential in using these models as supplementary tools in tumor boards [16,17,18,19,20,21]. In breast cancer care, a few preliminary assessments have explored the accuracy of LLM in supporting decision-making through the evaluation of brief clinical scenarios [22, 23] as well as high complexity cases [24]. The most recent literature increasingly challenges the consistency of LLM, highlighting significant changes in explanatory value over short intervals while emphasizing the necessity for their ongoing monitoring [25]. A scientific discussion has begun on whether LLM will facilitate the implementation of increasingly complex evidence-based treatment guidelines in clinical routine or may serve as a possible guideline navigator for the professional user [16,17,18,19,20,21]. Furthermore, the question remains as to how to direct the technological and methodological development of LLM before initiating larger preclinical and clinical trials to generate further evidence on the technology’s application in breast cancer care.

To date, there is no literature in breast cancer care that compares different LLM and considers their monitoring with regard to changes in accuracy over time, i.e., with access to more data or through technological upgrades. Therefore, this early feasibility study investigated five different versions of publicly available LLM regarding their concordance of recommendations for complex breast cancer case examples at different stages of development and points in time. Based on its findings, it aims to conclude on how to direct further development and the scientific approach to LLM in breast cancer care.

Methods

Patient profiles

Following the breast cancer guidelines of the German Association of Gynecology and Obstetrics (DGGG) (version 4.4, May 2021, AWMF-registration number 032/0456OL), 20 patient profiles (P1-20) were designed to reflect the patho- and immunomorphological variety of breast cancer in a comprehensive and structured manner (Tables 1 and 2) [24]. The use of publicly available LLM is currently limited to fictitious profiles, as data processing via international servers does not ensure data integrity in accordance with European (General Data Protection Regulation, GDPR) or German data protection standards (Datenschutz-Grundverordnung, DSGVO). This limits the current exploration of LLM to a preclinical simulation environment. Since no patient-related data was used, an ethics vote was waived by the Research Ethics Committee of Philipps-University Marburg (23-300 ANZ).

Table 1 Patient profiles 1–10 [24]
Table 2 Patient profiles 11–20 [24]

Prompting model

Prompting was carried out using a previously used, standardized input model for high complexity clinical cases (supplementary file 1) [24]. Prompts had to be slightly adjusted for patient profiles without previous surgical intervention (P14-16, P20) and ductal carcinoma in situ (DCIS) (P17-19).

Large language model selection

Five different LLM were utilized for comparison. GPT (ChatGPT Generative Pre-trained Transformer; by OpenAI LP, San Francisco, California, USA) was analyzed in three different development versions (GPT3.5 version September 2021, GPT3.5 version January 2022, GPT4 version April 2023) to trace the evolution over time and with access to more data or through technological upgrade. In addition, the selection of Llama2 70bn (version December 2022; Large Language Model Meta AI 2 70 billion parameters; by Meta, Menlo Park, California, USA) and Bard (version January 2023; by Google LLC, Mountain View, California, USA) enabled the comparison of two further commonly used LLM.

Model execution

On July 21, 2023, the high complexity cases were presented in a randomized and blinded order to the multidisciplinary tumor board (MTB) of the partnering accredited gynecologic oncology center (supplementary file 1). On the same date, prompting was carried out in GPT3.5 version September 2021. GPT3.5 version January 2022, Llama2, Bard and GPT4 (version April 2023) were queried on December 6, 2023 (supplementary file 2).

Comparative assessment

Different treatment modality recommendations were assessed: surgical treatment (ST), endocrine treatment (ET), systemic treatment or chemotherapy (CT), radiotherapy (RT) and genetic testing (GT). The determination of treatment was recorded on a binary scale for each modality (recommended versus not recommended). Since the initially chosen prompting model did not include a query of multi-gene assays for the prediction of disease prognosis or chemotherapy benefit, the LLM did not provide an answer in this regard. Hence, profiles that were advised by the MTB to undergo the respective tests were excluded from analysis. As LLM depend on effective prompting, the suggested treatment options were categorized as recommended treatments (see supplementary files 1 and 2). Concordance between LLM and MTB treatment suggestions was assessed using descriptive statistics for each individual patient profile and specific treatment option.
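The concordance measure described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the data layout and the three example profiles are hypothetical and serve only to show how binary recommendations per modality translate into per-profile and per-modality concordance, including the exclusion of individual profiles or modalities.

```python
from typing import Dict, Iterable, Tuple

# Modality abbreviations as used in the text.
MODALITIES = ("ST", "ET", "CT", "RT", "GT")

# One recommendation set: modality -> True (recommended) / False (not recommended).
Recommendation = Dict[str, bool]


def profile_concordant(llm: Recommendation, mtb: Recommendation,
                       modalities: Iterable[str] = MODALITIES) -> bool:
    """Return True if the LLM matches the MTB on every listed modality."""
    return all(llm[m] == mtb[m] for m in modalities)


def concordance_rate(llm_all: Dict[str, Recommendation],
                     mtb_all: Dict[str, Recommendation],
                     modalities: Iterable[str] = MODALITIES,
                     excluded: Iterable[str] = ()) -> Tuple[int, int, float]:
    """Count profiles with full agreement; return (hits, total, percent)."""
    modalities = tuple(modalities)
    skip = set(excluded)
    profiles = [p for p in mtb_all if p not in skip]
    hits = sum(profile_concordant(llm_all[p], mtb_all[p], modalities)
               for p in profiles)
    return hits, len(profiles), 100.0 * hits / len(profiles)


def modality_concordance(llm_all: Dict[str, Recommendation],
                         mtb_all: Dict[str, Recommendation],
                         modality: str,
                         excluded: Iterable[str] = ()) -> Tuple[int, int, float]:
    """Count profiles agreeing on a single modality; return (hits, total, percent)."""
    skip = set(excluded)
    profiles = [p for p in mtb_all if p not in skip]
    hits = sum(llm_all[p][modality] == mtb_all[p][modality] for p in profiles)
    return hits, len(profiles), 100.0 * hits / len(profiles)


# Hypothetical toy data (NOT taken from the study's patient profiles).
mtb = {
    "A": {"ST": True, "ET": True, "CT": False, "RT": True, "GT": False},
    "B": {"ST": True, "ET": False, "CT": True, "RT": True, "GT": True},
    "C": {"ST": True, "ET": True, "CT": True, "RT": False, "GT": False},
}
llm = {
    "A": {"ST": True, "ET": True, "CT": False, "RT": True, "GT": False},
    "B": {"ST": True, "ET": False, "CT": True, "RT": True, "GT": False},
    "C": {"ST": True, "ET": True, "CT": False, "RT": False, "GT": False},
}

overall = concordance_rate(llm, mtb)                                   # all five modalities
without_gt = concordance_rate(llm, mtb, ("ST", "ET", "CT", "RT"))      # GT removed
ct_only = modality_concordance(llm, mtb, "CT", excluded=("C",))        # one profile excluded
```

In this sketch, a profile counts as concordant only if all assessed modalities agree, which mirrors the per-profile evaluation, while `modality_concordance` mirrors the evaluation per treatment option. The `excluded` argument models the removal of profiles for which the MTB advised a test that was not queried.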

Results

Comparative assessment per patient profile

Overall concordance between LLM and MTB recommendations was highest for GPT4 with 60.0% (12/20), followed by GPT3.5 version September 2021 with 50.0% (10/20) and GPT3.5 version January 2022 with 35.0% (7/20) (see Table 3). For invasive breast cancer patients exclusively (CCBC), GPT4's concordance amounts to 70.6% (12/17). Removing GT from the assessment yields full concordance for invasive breast cancer of 82.4% (14/17) for both GPT4 and GPT3.5 version September 2021. P7 had to be excluded from the partial evaluation, as the MTB recommended performing a genetic array using Endopredict® (Myriad Genetics GmbH, Zurich, Switzerland) to assess the need for chemotherapy for this patient profile (see Fig. 1).

Table 3 Concordance according to patient profile per LLM
Fig. 1 Comparison of average performance according to type of LLM

Comparative assessment according to treatment option

GPT4 achieved full concordance for RT (100%; 20/20) and the highest concordance for ET and GT at 85.0% (17/20 each). Regarding CT, GPT3.5 scored highest with 94.7% (18/19), followed by GPT4 with 89.5% (17/19) (see Fig. 2).

Fig. 2 Comparative assessment according to type of LLM and treatment option

Longitudinal assessment of GPT versions

Figure 3 shows the fluctuating accuracy of GPT versions regarding the concordance on breast cancer patient profiles (CCBC). Concordance increased by 11.8 percentage points with GPT4 and decreased by 17.6 percentage points between the two GPT3.5 versions.
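The reported changes are differences between concordance rates. A minimal Python sketch of this arithmetic follows; it is illustrative only. The GPT4 count of 12/17 is reported in the results, whereas the counts of 10/17 and 7/17 for the two GPT3.5 versions are assumptions back-calculated from the reported changes and are not stated explicitly in the text.

```python
from typing import Tuple

Count = Tuple[int, int]  # (concordant profiles, assessed profiles)


def percent(count: Count) -> float:
    """Convert a (hits, total) pair into a percentage."""
    hits, total = count
    return 100.0 * hits / total


def change_in_points(before: Count, after: Count) -> float:
    """Difference between two concordance rates, rounded to one decimal."""
    return round(percent(after) - percent(before), 1)


gpt4 = (12, 17)           # reported for invasive breast cancer profiles
gpt35_sep2021 = (10, 17)  # ASSUMPTION: inferred from the reported increase
gpt35_jan2022 = (7, 17)   # ASSUMPTION: inferred from the reported decrease

increase_with_gpt4 = change_in_points(gpt35_sep2021, gpt4)
decrease_within_gpt35 = change_in_points(gpt35_sep2021, gpt35_jan2022)

print(increase_with_gpt4)     # 11.8
print(decrease_within_gpt35)  # -17.6
```

Under these assumed counts, each step of 1/17 corresponds to roughly 5.9 points, so a shift of two or three profiles already accounts for the observed changes, which illustrates how sensitive the rates are in a sample of this size.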

Fig. 3 Development of concordance for breast cancer patient profiles for GPT versions

Discussion

In a novel research field that still lacks methodological best practices, this work presents an early feasibility study (EFS) that uses a structured approach for comparing different publicly accessible LLM for complex decision-making in a simulated environment in breast cancer care. Based on the definition provided by the United States Food and Drug Administration (FDA), an EFS represents a preliminary clinical assessment of a technological application early in its development [26]. This study type involves examining a small group of cases to assess a new technological application, focusing on its initial safety for clinical use and its functional performance. The objective of this evaluation is to gather insights that could inform potential modifications to the application before initiating larger preclinical and clinical trials. EFS constitute an essential step in the evidence generation process, allowing innovative technologies to be tested and accompanied into a healthcare setting where they could bring value to patients. In the European Union, there is neither a common standardized definition of EFS nor a regulatory framework on how such studies should be methodologically designed [27]. Due to the increasing importance of evaluating technological applications for their use in the medical sector, the Europe-wide project “Harmonized Approach to Early Feasibility Studies for Medical Devices in the European Union” (HEU-EFS) was launched in October 2023 [27]. It aims to develop a validated standardized approach for EFS in the European Union to provide early insights into technology evidence. In reference to the recommendations of the FDA and the initial results and objectives of HEU-EFS, the present study was conducted to guide adaptations of LLM technology and the scientific approach to it in the context of breast cancer care.

Principal findings

To our knowledge, this is the first dedicated early feasibility study (EFS) in breast cancer care that investigates different publicly available LLM and illustrates how they have advanced over a short time with access to more data or successive technological upgrade. It highlights a growing alignment for the GPT algorithm with complex decision-making processes in treating breast cancer, with GPT4 providing the highest concordance with the current gold standard of a multidisciplinary tumor board. This improvement appears to be primarily linked to the upgrade from GPT3.5 to GPT4 in the underlying technology. A comparison with Llama2 and Bard underscored GPT4’s superior algorithm accuracy. Furthermore, the findings support recent scientific critique of a prevailing challenge of LLM consistency over time by illustrating a declining accuracy of GPT3.5 within a six-month time period despite updated and enlarged data access, underlining the necessity for ongoing scientific monitoring of LLM [25]. These findings are important as they expand upon previous research, comparing the concordance of various LLM in managing breast cancer scenarios and monitoring advancements in accuracy over time and through continuous updates. Against the background of prior work, the results can contribute to the methodological and technological development of LLM application in breast cancer care.

Comparison to prior work

Previous analyses pointed toward the potential of LLM in providing clinical decision support for professional users, offering medical knowledge for different specialties throughout the entire clinical process [28]. In breast cancer care, few studies have explored LLM areas of use.

Rao and colleagues showed the promising use of GPT3.5 in radiologic evaluations and screening, proving its value in mammographic imaging [29]. Additionally, Haver et al. illustrated the chatbot’s capability in providing patient education on breast cancer prevention and screening [30]. Moreover, Choi et al. demonstrated the efficiency of using tailored prompts for LLM in extracting clinical insights from pathology and ultrasound reports in extensive breast cancer medical records [31]. The quality of AI-generated abstracts has reached a level of medical appropriateness at which experts find it challenging to distinguish them from specialist-written content in a blinded review process [32].

With regard to tumor board decision-making, Lukac et al. and Sorin et al. retrospectively compared the answers of GPT3.5 (version September 2021) to the past treatment recommendations of a single tumor board [22, 23]. These studies represent initial explorations of this technology rather than definitive benchmarks for evaluating the capabilities of ChatGPT3.5. Their experiments only included the LLM ChatGPT3.5, involved a constrained and unstructured collection of patient profiles with restricted health data, and utilized a short and limited prompting strategy. Additionally, their assessments were based on a self-developed scoring system. Notably, the studies omitted genetic testing for most cases, which is a crucial factor in the characterization of breast cancer. Both preliminary assessments inferred from their findings that the advice given by language model-based systems could align with that of a tumor board, but refrained from definitive statements about the specific performance level of LLM in their conclusions. Our research builds on the findings of Lukac et al. and Sorin et al. and seeks to extend them in a systematic manner [22, 23]. Therefore, we confirmed GPT3.5’s potential for managing high complexity cases by employing a standardized prompting model and using comprehensive health data profiles as described in the methodology [24]. This subsequent EFS provides further insights by comparing different LLM versions and monitoring development over time, with access to more data and technological upgrade. It matches a generic observation by Eriksen et al. of superior performance by GPT4 for diagnosing complex clinical cases and confirms this finding in the field of breast cancer care [33]. Furthermore, it confirms the most recently raised critique of LLM regarding a persisting challenge in answer consistency in the field of breast cancer treatment [25]. This relates to the deterioration in GPT3.5’s accuracy over the observation period. It points toward the possible issue that an extension of data access with uncontrolled sources used for decision-making does not necessarily lead to an improvement in LLM accuracy but could lead to confusion in the models.

Limitations and implications for methodological and technological development of LLM application in breast cancer care

By monitoring the evolution of LLM, this study shows that especially the update to the GPT4 algorithm enables an increasing alignment with the recommendations of the MTB. It indicates that technological applicability rapidly develops toward technological maturity to provide clinical decision support, even for complex decision-making in breast cancer care. Nevertheless, at present, the study also underlines that clinical use of LLM is not yet feasible. Several unresolved regulatory hurdles and missing evidence on the peculiarities of clinical application should forbid their current use in clinical care. The current level of evidence regarding the use of LLM in breast cancer therapy leaves crucial questions unanswered, which can also be derived from this study.

The initially chosen prompting model only required the LLM to indicate whether chemotherapy should be given or not. However, the recommendation of multi-gene assays to assess disease prognosis and predict chemotherapy benefit in patients was not queried. Due to the increasing use of such tests and the associated increasing clinical relevance, future prompting models should include a query relating to the need for multi-gene assays to assess the chemotherapy necessity. This finding underscores the methodological need to develop sophisticated prompting models that should be tailored to the specifics of the oncologic entity being investigated in order to improve the consistency in LLM answering.

Furthermore, the study uses the recommendations by a single MTB as gold standard for comparing concordance in LLM decision-making. Large-scale observational studies, conducted by several international study groups, have revealed notable disparities in breast cancer treatment choices and outcomes [34, 35]. There is often considerable scope for decision-making on available treatment options, such as varying intensities of chemotherapy regimens, which reflects the diversity in national standards and respective guidelines. This issue also explains the rather moderate results for DCIS profiles in this study. The LLM have consistently recommended endocrine therapy, as, for example, suggested in a meta-analysis by Yan et al. from 2020 [36]. In contrast, the MTB in the study decided against endocrine therapy in the DCIS cases, a decision that was taken in interdisciplinary discourse in the MTB and within the decision-making scope of the German guidelines. However, as this is a dedicated EFS with a small group of 20 cases, no conclusions should be drawn regarding LLM accuracy for different cancer subtypes, stages of the disease, i.e., precancerous or advanced metastasized illness, and treatment options. Hence, in order to ensure the evidence-based and safe use of LLM in breast cancer care, these open questions must be adequately addressed by further research. Subsequent studies should incorporate larger study populations and multicenter study designs to expand findings from a preclinical simulation environment into clinical care.

At a technological level, a lack of control over the sources used for decision-making and a lack of security in the processing of health data have so far prevented the use of LLM in clinical care. The deterioration in GPT3.5’s accuracy over the observation period, which appears to be connected to the extension of data access, underlines how uncontrolled and enlarged input of sources can contribute to confusion in the models. It remains unclear which sources the open LLM use for decision-making, a problem that can also be seen in the moderate DCIS results, as it cannot be derived from the LLM answering which evidence is used by the LLM to recommend endocrine therapy. In alignment with the Explainable AI approach, the technological application should offer the possibility of gaining control over the sources used for decision-making while ensuring security in the processing of personal health data, i.e., by limiting it to local servers.

Opportunities for breast cancer care

Considering the findings of the national survey conducted by the Commission Digital Medicine of the German Association of Gynecology and Obstetrics (DGGG), 61.4% of specialists either agree or strongly agree that intelligent algorithms will support clinicians in treating patients, and the majority share the perception that this will improve patient care (65.1% agree or strongly agree) and help to reduce the increasing workload (78.4% agree or strongly agree) [12]. These expectations are accompanied by the aforementioned intensified care complexity due to the rapid increase in evidence-based knowledge and case load in gynecological oncology [4, 5]. From this perspective, easily accessible and user-friendly publicly available LLM may provide a prospective solution in breaking down prevailing barriers [37]. As presented in this study, clinical use of LLM is not yet feasible. Nevertheless, the controlled and evidence-based adaptation of LLM, i.e., the optimization of prompting techniques or enabling data input control and secure data processing, offers the potential for LLM to bring value to patients in clinical breast cancer care.

Conclusion

This early feasibility study demonstrates that publicly available LLM, especially GPT4, increasingly align with the decision-making of a multidisciplinary tumor board and confirms decision consistency to remain a major issue for the application of LLM in breast cancer care. The findings underline that clinical use of LLM is not yet feasible. Nevertheless, the study gathers insights that could inform potential modifications to the LLM application. Methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multi-centric studies. These will subsequently provide further essential evidence on the safe and reliable application of LLM in breast cancer care to maximize benefits for providers and patients alike.