Introduction

Large language models (LLMs) have emerged as a transformative force in artificial intelligence (AI), generating significant interest across various sectors. The 2022 launch of OpenAI’s ChatGPT demonstrated their groundbreaking capabilities, revealing the current state of development to a wide audience. Since then, public availability and scientific interest have resulted in a flood of scientific papers exploring possible areas of application1 as well as their ethical and social implications from a practical perspective2. A particularly rapid adoption of LLMs is seen in medicine and healthcare3, encompassing clinical, educational and research applications3,4,5,6,7,8,9. This development may present a case where a general-purpose technology swiftly integrates into specific domains. According to Lipsey, such technologies are characterized by their potential for extensive refinement and expansion, a wide array of applications across various processes, and significant synergies with existing technologies10,11. In a brief span, a significant number of publications have investigated the potential uses of LLMs in medicine and healthcare12, indicating a positive trajectory for the integration of medical AI. Present-day LLMs, such as ChatGPT, are considered to show promising accuracy in clinical decision-making13,14, diagnosis15, symptom assessment, and triage advice16. In patient communication, it has been posited that LLMs can also generate empathetic responses17. LLMs specifically trained on biomedical corpora promise even greater capacities for clinical application and patient care18 in the foreseeable future.

Conversely, the adoption of LLMs is entwined with ethical and social concerns19. In their seminal work, Bender et al. anticipated real-world harms that could arise from the deployment of LLMs20. Scholars have delineated potential risks across various application domains21,22. The healthcare and medical fields, being particularly sensitive and heavily regulated, are notably susceptible to ethical dilemmas. This sector is also underpinned by stringent ethical norms, professional commitments, and societal role recognition. Despite the potential benefits of employing advanced AI technology, researchers have underscored various ethical implications associated with using LLMs in healthcare and health-related research4,6,7,23,24,25,26. Paramount concerns include the propensity of LLMs to disseminate inadequate information, the input of sensitive health information or patient data, which raises significant privacy issues24, and the perpetuation of harmful gender, cultural or racial biases27,28,29,30, a problem well known from machine learning algorithms31, especially in healthcare32. Case reports have documented that ChatGPT has already caused actual, potentially life-threatening harm to patients33.

While individual instances have drawn attention to ethical concerns surrounding the use of LLMs in healthcare, there appears to be a deficit in comprehensive, systematic overviews addressing these ethical considerations. This gap is significant, given the ambitions to rapidly integrate LLMs and foundational models into healthcare systems34. Our intention is to bridge this lacuna by mapping out the ethical landscape surrounding the deployment of LLMs in this field. To this end, we conducted a systematic review of the current literature including relevant databases and preprint servers. Our inquiry was structured around two research questions: Firstly, we sought to delineate the ethically relevant applications, interventions, and contexts where LLMs have been tested or proposed within the realms of medicine and healthcare. Secondly, we aimed to identify the principal outcomes as well as the opportunities, risks, benefits, and potential harms associated with the use of LLMs in these sectors, as deemed significant from an ethical standpoint. Through this, we aspire not only to outline the current ethical discourse but also to inform future dialogue and policy-making at the intersection of LLMs and healthcare ethics.

Results

Our search yielded a total of 796 database hits. After removal of duplicates, 738 records went through title/abstract screening, and 158 full texts were assessed. A total of 53 records were included in the dataset, encompassing 23 original articles25,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56, including theoretical or empirical work, 11 letters57,58,59,60,61,62,63,64,65,66,67, six editorials68,69,70,71,72,73, four reviews8,74,75,76, three comments24,77,78, one report79 and five unspecified articles80,81,82,83,84. The flow of records through the review process can be seen in Fig. 1. Most works focus on applications utilizing ChatGPT across various healthcare fields, as indicated in Table 1. Regarding the affiliation of the first authors, 25 articles come from North America, 11 from Europe, six from West Asia, four from East Asia, three from South Asia and four from Australia.

Fig. 1: Flow of records through the screening process.

Diagram following PRISMA guidelines, showing the flow of records through the screening process.

Table 1 Overview of the included records

During analysis, four general themes emerged in our dataset, which we used to structure our reporting. These themes include clinical applications, patient support applications, support of health professionals, and public health perspectives. Table 2 provides exemplary scenarios for each theme derived from the dataset.

Table 2 Exemplary applications of LLMs

Clinical applications

To support initial diagnosis and triaging of patients39,52, several authors discuss the use of LLMs in the context of predictive patient analysis and risk assessment in or prior to clinical situations as a potentially transformative application74,80. The role of LLMs in this scenario is described as that of a “co-pilot” using available patient information to flag areas of concern or to predict diseases and risk factors44.

Currie, in line with most authors, notes that predicting health outcomes and relevant patterns is very likely to improve patient outcomes and contribute to patient benefit80. For example, overcrowded emergency departments present a serious issue worldwide and have a significant impact on patient outcomes. From a perspective of harm avoidance, using LLMs with triage notes could lead to reduced length of stay and a more efficient utilization of time in the waiting room52.

All authors note, however, that such applications might also be problematic and require close human oversight39,44,51,80. Although LLMs might be able to reveal connections between disparate knowledge40, generating inaccurate information would have severe negative consequences44,74. This could lead to direct harm to patients or provide clinicians with false and dangerous justifications and rationales for their decisions74. These problems are tightly connected to inherent biases in LLMs, their tendency to “hallucinate” and their lack of transparency52. The term “hallucination” refers to an LLM generating plausible and often confident statements that are factually incorrect in the sense of not being grounded in the data85. In addition, uncertainties are increased by the use of unstructured data: medical notes often differ from the data on which models were pretrained, which makes it difficult to predict the accuracy of outputs when such data are used in prompts or for fine-tuning LLMs52. The limited interpretability of results and recommendations introduces additional complexity and further sources of potential harm52. Currie notes that despite such difficulties, the use of LLMs proceeds largely in the absence of guidelines, recommendations and control. The outcome, hence, ultimately depends on clinicians’ ability to interpret findings and identify inaccurate information80.

In patient consultation and communication, LLMs can offer a novel approach to patient-provider interaction by facilitating informational exchange and bridging gaps between clinical and preclinical settings, such as self-management measures or community aids8. This includes easing the transition between settings by removing barriers to communication44,60,80,83 or removing barriers in the clinical workflow to facilitate timely and efficient support. As is suggested, LLMs can collect information from patients or provide additional information, enabling well-informed decisions and increasing satisfaction in patients56,60,80. Provision of language translation and simplification of medical jargon may allow patients to become more engaged in the process and enhance patient-provider communication80,83. However, it remains unclear in our dataset what such applications would look like in practice — specifically where, when and how LLMs could actually be integrated.

These suggestions necessitate consideration of ethically relevant boundaries regarding the protection of patient data and safety36,60,77,83, potentially unjust disparities36,60,83, and the broader dimensions of care, such as the therapeutic relationship36,59,61,64,77. Robust measures to avoid incorrect information in the technological mediation of communication and the need to strike a balance with “the human touch” of care60 are stressed. With regard to the former, Buzzaccarini et al. argue for robust expert oversight. Regarding the latter, Li et al. note a potential shift in power dynamics between patients and providers, in which providers might lose their authoritative position and might be seen as less knowledgeable64. Others warn against a loss of personal care36,61,77 and a lack of contextual understanding of individual health challenges42,77. Open communication and consent to the technical mediation of patient-provider communication are required to promote trust but might be difficult to achieve69,78.

Many studies in our dataset discuss the possible use of LLMs for diagnosis8,36,39,44,59,61,66,67,74,75,78,80. It is suggested that the LLMs’ ability to analyze large amounts of unstructured data provides pathways to timely, efficient and more accurate diagnosis to the benefit of patients35,36,67,75,78. It might also enable the discovery of hidden patterns39 and reduce healthcare costs36,49.

An ethical problem emerges with potentially negative effects on patient outcomes due to biases in the training data36,39,41,74,75,78, especially where a lack of diverse datasets risks underrepresenting marginalized or vulnerable groups. Biased models may result in unfair treatment of disadvantaged groups, leading to disparities in access, exacerbating existing inequalities, or harming persons through selective accuracy41. Based on an experimental study setup, Yeung et al. deliver an insightful example showing that ChatGPT and Foresight NLP exhibit racial bias towards Black patients28. Issues of interpretability, hallucinations, and falsehood mimicry exacerbate these risks35,36,44,74. With regard to transparency, two sources suggest that LLM-supported diagnoses hamper the process of providing adequate justification due to their opacity36,74. This is understood to threaten the authoritative position of professionals, leaving them at risk of being unable to provide a rationale for a diagnosis35, and might lead to an erosion of trust between both parties. This is in line with others noting that LLMs are not able to replicate the process of clinical reasoning in general and, hence, fail to comprehend the complexity of the process44,59,75. Based on the principle of harm avoidance, it is an important requirement to subject each generated output to clinical validation as well as to develop “ethical and legal systems” to mitigate these problems36,39,59.

It needs to be noted, however, that the technically unaided process of diagnosis is also known to be subjective and prone to error67. This implies that an ethical evaluation should be carried out in terms of relative reliability and effectiveness compared to existing alternatives. Whether and under what circumstances LLM-supported diagnosis outperforms these alternatives is a question that the literature does not yet address.

Six studies in our dataset highlight the use of LLMs to provide personalized recommendations for treatment regimens or to support clinicians in treatment decisions based on electronic patient information or history58,60,61,66,67,80, offering a quick and reliable course of action to clinicians and patients. However, as with diagnostic applications, biases and the perpetuation of existing stereotypes and disparities are a recurrent theme60,61,67. Ferrara also cautions that LLMs will likely prioritize certain types of treatments or interventions over others, disproportionately benefiting certain groups while disadvantaging others41.

Additionally, it is highlighted that processing patient data raises ethical questions regarding confidentiality, privacy, and data security58,60,61,66,67. This especially applies to commercial and publicly available models such as ChatGPT. Inaccuracies in potential treatment recommendations are also noted as a concerning source of harm58,60,61,66,67. In a broader context, several authors suggest that for some LLMs, the absence of internet access, insufficient domain-specific data, limited access to treatment guidelines, lack of knowledge about local or regional characteristics of the healthcare system, and outdated research significantly heighten the risk of inaccurate recommendations24,37,38,40,47,55.

Patient support applications

Almost all authors concerned with patient-facing applications highlight the benefits of rapid and timely information access that users experience with state-of-the-art LLMs. Kavian et al. compare patients’ use of chatbots with the shifts that accompanied the emergence of the internet as a patient information source69. Such access can improve laypersons’ health literacy by providing needs-oriented access to comprehensible medical information68, which is regarded as an important precondition of autonomy, enabling more independent health-related decisions8,74. In their work on the use of ChatGPT 4 in overcoming language barriers, Yeo et al. highlight an additional benefit, as LLMs could provide cross-lingual translation and thus contribute to reducing healthcare and racial disparities56.

Regarding ethical concerns and risks, biases are seen as a significant source of harm8,39,74,75. The literature also highlights a crucial difference in the ethical acceptability of patient support applications, with a more critical stance taken when LLMs are used by laypersons rather than by health professionals28,53. However, ethical acceptability varies across fields; for instance, studies in otolaryngology and infectious disease find that ChatGPT’s responses to patients lack detail but are not harmful53, whereas studies in pharmacology and mental health indicate greater potential risks67,68.

LLMs can offer laypersons personalized guidance, such as lifestyle adjustments during illness80, self-assessment of symptoms61,63, self-triaging, and emergency management steps8,57. Although current systems seem to perform well and generate compelling responses8,47,63, a general lack of situational awareness is noted as a common problem that might lead to severe harm8,61,63. Situational awareness here means the ability to generate responses based on contextual criteria such as a person’s individual circumstances, medical history or social situation. The inability of most current LLMs to seek clarification by asking questions and their lack of sensitivity to query variations can lead to imprecise answers45,63. For instance, research by Knebel et al. on self-triaging in ophthalmologic emergencies indicates that ChatGPT’s responses cannot reliably prioritize urgency, reducing their usefulness45.

Support of health professionals and researchers

LLMs could automate administrative or documentation tasks such as medical reporting80 or summarizing patient interactions8, including the automatic population of forms or discharge summaries. The consensus is that LLMs could streamline clinical workflows8,36,43,51,52,60,68,74,80,81,83, offering time savings for health professionals currently burdened with extensive administrative duties68,83. By automating these repetitive tasks, professionals could dedicate more time to high-quality medical tasks83. Crucially, such applications would require the large-scale integration of LLMs into existing clinical data systems49.

In health research, LLMs are suggested to support the summarization of text, evidence or data54,64,82, identify research targets8,61,72,83, design experiments or studies72,83, facilitate knowledge sharing between collaborators37,70,80, and communicate results74. This highlights the potential for accelerating research46,79 and relieving researchers of workload8,40,64,74,75,83, leading to more efficient research workflows and allowing researchers to spend less time on burdensome routine work8,80. According to certain authors, this could involve condensing crucial aspects of their work, such as crafting digestible research documents for ethics reviews or consent forms82. However, LLMs’ capacities are also critically examined, with Tang et al. emphasizing ChatGPT’s tendency to produce attribution and misinterpretation errors, potentially distorting original source information. This echoes concerns over interpretability, reproducibility, uncertainty handling, and transparency54,74.

Some authors fear that using LLMs could compromise research integrity by disrupting traditional trust factors like source traceability, factual consistency, and process transparency24. Additionally, concerns about overreliance and deskilling are raised, as LLMs might diminish researchers’ skills and overly shape research outcomes46. Given that using such technologies inevitably introduces biases and distortions into the research process, Page et al. suggest that researchers must maintain vigilance to prevent undue influence from these biases, advocating for strict human oversight and revalidation of outputs70.

Public health perspectives

The dataset encompasses studies that explore the systemic implications of LLMs, especially from a public health perspective50,61,75. This includes using LLMs in public health campaigns, for monitoring news and social media for signs of disease outbreaks61, and for targeted communication strategies50. Additionally, research examines the potential for improving health literacy or access to health information, especially in low-resource settings. Access to health information through LLMs can be maintained free of charge or at very low cost for laypersons55. In the case of mental health, low- and middle-income countries in particular might benefit71. These countries often face a substantial treatment gap driven by a shortage of professionals and inequitable resource distribution. Using LLMs could mitigate accessibility and affordability issues, potentially offering a more favorable alternative to the current lack of access71.

However, a number of authors raise doubts about overly positive expectations. Schmälzle & Wilcox highlight the risks of the dual use of LLMs50. While LLMs might advance equal access to information, malicious actors can use them, and appear already to be using them, to spread false information and devise health messages harmful to societies at an unprecedented scale50,51,75. De Angelis et al. take this concern one step further, presenting the concept of an AI-driven infodemic46 in which the overwhelming spread of imprecise, unclear, or false information leads to disorientation and potentially harmful behavior among recipients. Health authorities have often seen AI technologies as solutions to information overload. However, the authors caution that an AI-driven infodemic could exacerbate future health threats. While infodemic issues in social media and grey literature are noted, AI-driven infodemics could also inundate scientific journals with low-quality, excessively produced content46.

The commercial nature of most current LLM systems presents another critical consideration. The profit-driven nature of the field can lead to concentrations of power among a limited number of companies and a lack of transparency. This economic model, as highlighted by several studies, can have negative downstream effects on accessibility and affordability24,36,43. Developing, using, or refining models can be expensive, limiting accessibility and customization for marginalized communities. Power concentration also means pricing control lies with LLM companies, with revenues predominantly directed towards them44. These questions are also mirrored in the selection of training data and knowledge bases24, which typically encompass knowledge from well-funded, English-speaking countries and thus significantly underrepresent knowledge from other regions. This could exacerbate health disparities by reinforcing biases rather than alleviating them.

Discussion

Our analysis has unveiled an extensive range of LLM applications currently under investigation in medicine and healthcare (see Fig. 2). This surge was largely driven by the advent and ease of use of ChatGPT, a platform not originally tailored for professional healthcare settings, yet widely adopted within them12,83. This presents a striking instance of a general-purpose technology rapidly permeating healthcare and research to an unprecedented extent.

Fig. 2: Fields of application of LLMs in medicine and healthcare.

This figure shows the categories and subcategories of applications of LLMs.

Our review highlights a vibrant testing phase of LLMs across various healthcare domains12. Despite the lack of real-world applications, especially in the clinic, there is an overarching sense of the promise LLMs hold. It is posited that these tools could increase the efficiency of healthcare delivery and research, with the potential to benefit patient outcomes while alleviating the burdensome workload of healthcare professionals. These advantages of LLMs are largely attributed to their capabilities in data analysis, personalized information provision, and support in decision-making, particularly where quick analysis of voluminous unstructured data is paramount. Moreover, by mitigating information loss and enhancing medical information accessibility, LLMs stand to significantly bolster healthcare quality.

However, our study has also surfaced recurrent ethical concerns associated with LLMs. These concerns echo the wider discourse on AI ethics86,87,88, particularly in healthcare89, and touch on issues of fairness, bias, non-maleficence, transparency, and privacy. Yet, LLMs introduce a distinctive concern linked to a dimension of epistemic values, that is, their tendency to produce harmful misinformation or convincing but inaccurate content through hallucinations, as illustrated in Fig. 390. The effects of such misinformation are particularly severe in healthcare, where the outcome could be dire. The inherently statistical and predictive architecture of LLMs, combined with their opacity, presents significant hurdles in validating the clinical accuracy and reliability of their outputs91,92,93.

Fig. 3: Discussed dimensions of impact of LLMs.

This figure shows recurring ethical issues and their relative weight in each field of application based on the number of codes extracted during the analysis.

The inclination of LLMs to output erroneous information underscores the need for human oversight and continual validation of machine-generated output, as our dataset demonstrates. This need is accentuated by the lack of professional guidelines or regulatory oversight within this field23. Consequently, there is a noticeable demand for ethical guidelines, as evidenced within the literature surrounding healthcare applications of LLMs46,60,64,70,71,74,75,78.

Future directions of ethics research

While we concur with the need for such guidance, our analysis suggests that the real challenge lies not in the articulation of such a need but in comprehending the scope of what this entails. There are inherent and contextual limitations and benefits associated with LLMs that warrant consideration. Inherently, state-of-the-art LLMs carry the risks of biases, hallucinations, and challenges in validity assessment, reliability testing, and reproducibility. Contextually, the effectiveness of LLM usage hinges on various situational factors, including who uses the LLM, their level of expertise and epistemic position (e.g. expert versus layperson), the specific domain of application, its risk profile, and the potential alternatives against which the LLM is compared.

A nuanced ethical discourse must recognize the multilayered nature of LLM usage, from the epistemic stance of the user to the varying degrees of potential harm due to misinformation or bias, and the diverse normative benchmarks for performance and acceptable levels of uncertainty. Our recommendation is to reframe the ethical guidance debate to focus on defining what constitutes acceptable human oversight and validation across the spectrum of applications and users. This involves considering the diversity of epistemic positions of users, the varying potentials for harm, and the different acceptable thresholds for performance and certainty in diverse healthcare settings. Such an approach should align with context-sensitive and participatory strategies for advancing technological development.

Given these questions, a critical inquiry is necessary into the extent to which the current experimental use of LLMs is both necessary and justified. Our dataset exemplifies a diversity of perspectives, methodologies, and applications of LLMs, revealing a significant degree of ambiguity and uncertainty about the appropriate engagement with this technology. Notably, a portion of current research seems propelled more by a sense of experimental curiosity than by well-defined methodological rigor, at times pushing the boundaries of ethical acceptability, particularly when sensitive real patient data are utilized to explore capabilities of systems like ChatGPT.

To frame these developments, it is instructive to consider the implementation of LLMs as a form of “social experiment”94,95. We employ this concept in a descriptive sense to denote a situation in which – according to van de Poel – the full benefits, risks, and ethical issues of a technology become evident only after its widespread adoption95. This perspective acknowledges the inherent uncertainties associated with the deployment of LLMs in medicine and healthcare due to their novelty, complexity, and opacity. Consequently, it necessitates that these technologies be introduced through an iterative process, which constitutes a learning endeavor. This approach facilitates a gradual understanding of the actual consequences of LLM use, thereby mitigating uncertainties. Furthermore, framing the current developments as a social experiment also reinforces the need to establish and respect ethical limits – especially within the healthcare domain, where professional duties and responsibilities towards patients are foundational.

With this in mind, we suggest that understanding how we acquaint ourselves with disruptive technologies must be central to any future ethical discourse. There is a compelling need for additional research to ascertain the conditions under which LLMs can be appropriately utilized in healthcare, but also to establish conditions of gradual experimentation and learning that align with principles of health ethics.

Limitations

This review addresses ethical considerations of using LLMs in healthcare at the current developmental stage. However, several limitations are important to acknowledge. Ethical examination of LLMs in healthcare is still nascent and struggles to keep pace with rapid technical advancements. Thus, the review offers a starting point for further discussions. A significant portion of the source material originated from preprint servers and did not undergo rigorous peer review, which can limit quality and generalizability. Additionally, the findings’ generalizability may be limited due to variations in researched settings, applications, and interpretations of LLMs. Finally, we note a potential underrepresentation of non-Western perspectives in our dataset. Most articles are affiliated with North American or European institutions. This might have an impact on the scope of ethical issues discussed as well as on how certain issues are addressed and evaluated. For example, many authors express hopes that LLMs might help to mitigate issues of global health justice such as unequal distribution of access to healthcare or treatment gaps in disadvantaged countries. However, a lack of more critical perspectives, potentially informed by non-Western experiences with and explorations of LLMs, needs to be noted. This includes, for example, addressing the implications of Western economic dominance or the effects of training data that predominantly represent Western populations. With this in mind, we do not regard our overview of ethical issues as exhaustive.

Methods

Protocol and registration

A review protocol focusing on practical applications and ethical considerations grounded in experience was designed by the authors and registered in the International Prospective Register of Systematic Reviews (PROSPERO)96. Ethical approval or consent to participate was not required for this systematic review.

Information sources and search strategy

Relevant publication databases and preprint servers were queried (see Table 3 for sources).

Table 3 Overview on sources and search string

The decision to include preprint servers as well as databases was based on the assumption that preprints are very common in technology-oriented fields. In addition, we hypothesized that even a mild publication delay would have prevented relevant work from being indexed in the databases by the time of our search.

Study selection

Records were screened and extracted in a two-stage process following a modified rapid review approach97. Inclusion and exclusion criteria were based on the three key concepts of intervention, application setting, and outcomes (see Supplementary Note 1). No additional inclusion or exclusion criteria (e.g. publication type) were applied. However, we excluded work that was solely concerned with (ethical) questions of medical education, academic writing, authorship and plagiarism. While we recognize that these issues are affected by the use of LLMs in significant ways6,98,99, these challenges are not specific to health-related applications.

Data collection and extraction

Database searches were conducted in July 2023. Subsequently, the authors independently screened titles and abstracts of 10% of all database hits (73 records) to test and refine inclusion and exclusion criteria. After a joint discussion of the results, the remaining 90% were screened by the first author. Data was extracted using a self-designed extraction form (see Supplementary Note 2). The extraction categories were transformed into a coding tree using MaxQDA. Both authors independently coded 10% of the material to develop and refine the coding scheme in more detail. The remaining material was extracted by J.H. Results were iteratively discussed in three joint coding sessions.

Synthesis

A final synthesis was conducted following a meta-aggregative approach. Based on our extraction fields, we first developed preliminary categories encompassing actors, values, device properties, arguments, recommendations and conclusions. These categories were then iteratively refined and aggregated through additional coding until saturation was reached.

Quality appraisal

Given the constraints of normative quality appraisal100 and in line with our research goal to portray the landscape of ethical discussions, we decided to take a hybrid approach to the quality question. We descriptively report on procedural quality criteria (see Table 1) to distinguish material that underwent processual quality control (such as peer review) from other material. In addition, we critically engage with the findings during reporting to appraise the comprehensiveness and validity of the extracted information.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.