Introduction

Recently, a promising autoregressive large language model (LLM), the Generative Pre-trained Transformer (GPT)-3, trained with 175 billion parameters via cloud computing [1], was made available to the public online (released by OpenAI on November 30, 2022; https://chat.openai.com/). Its size makes it one of the largest deep learning models ever created [2,3,4,5]. ChatGPT’s global uptake has been exponential: 40 days after launch, it had 10 million daily users, surpassing the social media giant Instagram [6] and becoming an overnight “cultural sensation” [7]. GPT-4, released on March 14, 2023, is capable of performing better than humans on high-level professional school exams and is perceived as a general-purpose artificial intelligence (AI) suitable for multiple economic sectors, including healthcare.

ChatGPT (GPT-3/4) is able to generate de novo textual outputs that are grammatically and semantically fluent. Interestingly, human performance on a task does not define the upper bound of LLMs [8,9,10]. GPT-3 and higher iterations can write computer code, compose poetry, generate unique musical compositions, and even create cooking recipes. Importantly, ChatGPT can be used in a multitude of healthcare-relevant scenarios.

Uses in medicine pertaining to clinical care and education should be considered as target applications for this disruptive technology. Herein, we explore how LLMs can be applied to health literacy, decision-making, and written task formation. Potential pitfalls of LLMs are also discussed.

Understanding large language models (LLMs)

LLMs are a specific application of Natural Language Processing (NLP), which is a subfield of AI that focuses on the interaction between computers and human language. It involves processing and analyzing natural language data (text or speech) to enable machines to understand, generate, and respond to human language [11].

In the field of AI, LLMs are a subset of deep learning that overlaps with generative AI. Generative AI is capable of producing text, images, audio, and synthetic data. Current LLMs are most useful for four types of outputs: (a) text classification (including sentiment analysis); (b) question answering; (c) document summarization; and (d) text generation.

While GPT-4 is the most advanced iteration at the time of this writing (2023), there are other LLMs and LLM-powered tools, including PaLM (Pathways Language Model), LaMDA, BERT, T5, and the chat interfaces Microsoft Bing and Google Bard. All of these models are based on transformer neural networks [12], as opposed to the convolutional neural networks that underlie modern computer vision [13].

One important aspect of transformers is that they are highly parallelizable [1]. By using thousands of graphics processing units (GPUs) in parallel, GPT-3 was trained in just 1 month; without parallelization, training would have required an estimated 355 years on a single GPU. GPUs were initially developed for video gaming, because rendering graphics requires parallelized computation. Transformers thus allow much faster scale-up of system training than previous AI architectures [14]. OpenAI’s original GPT (2018) [15] and GPT-2 (2019) [16, 17] were LLMs developed a few years before GPT-3 and GPT-4, but they did not possess the power of the current iterations.
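
These two figures convey the scale of parallelism involved. As a back-of-the-envelope check using only the numbers above (and assuming, for simplicity, perfect linear scaling across GPUs, which real training runs do not achieve):

$$\frac{355\ \mathrm{years}}{1\ \mathrm{month}} = \frac{355 \times 12\ \mathrm{months}}{1\ \mathrm{month}} = 4260$$

That is, a 1-month training time implies on the order of four thousand GPUs working in parallel, consistent with the “thousands of GPUs” noted above.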

LLM architecture is based on being able to probabilistically predict the next word in a sentence: For example, the colors of the US flag are red, white, and _____. Mathematically, this can be expressed as \(P(X_{n} | X_{n-1})\), whereby the probability \(P\) of the next word \(X_{n}\) is based on the word appearing immediately before it in the sentence, \(X_{n-1}\). But the probability of the next word can also be based on more than the immediately prior word, thus more generally:

$$P(X_{n} | X_{n - 1} , X_{n - 2} , X_{n - 3} , X_{n - 4} , \ldots )$$

While transformer neural networks are highly complex, a critical part of their operation is assigning weights to the words in a sentence (the “attention” mechanism), because, generally speaking, certain words in a sentence are more important than others. The model learns parameters, \(\varphi\), that maximize the probability that \(X_{n}\) will be accurate and grammatically sensible. Thus, transformer LLMs generally use this formula:

$$P_{\varphi } (X_{n} | X_{n - 1} , X_{n - 2} , X_{n - 3} , X_{n - 4} , \ldots )$$

This is a simplified mathematical representation of how transformer LLMs determine the next word in a response to a text query, and how they perform generative AI functions. LLMs do not actually “think”; rather, the model generates textual outputs based on next-word probability and by “paying attention” to key words in the text [12]. In computer science this is termed autoregression: a weighted sample of past data (text) is used to predict future results, or textual outputs.
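
To make the autoregressive loop concrete, the following toy sketch samples next words from a hand-built bigram table standing in for a trained model; the table, its probabilities, and the example sentence are illustrative assumptions on our part, not output from any real LLM.

```python
import random

# Toy conditional probability table standing in for a trained model:
# bigram_probs[w][v] approximates P(X_n = v | X_(n-1) = w).
bigram_probs = {
    "red":   {"white": 0.90, "blue": 0.10},
    "white": {"and": 0.85, "stripes": 0.15},
    "and":   {"blue": 0.95, "red": 0.05},
}

def next_word(previous):
    """Sample the next word from P(X_n | X_(n-1)), or None at a dead end."""
    candidates = bigram_probs.get(previous)
    if not candidates:
        return None
    words = list(candidates)
    weights = list(candidates.values())
    return random.choices(words, weights=weights, k=1)[0]

# Autoregressive generation: each sampled word is fed back in as the
# conditioning context for the next prediction.
sentence = ["red"]
for _ in range(3):
    word = next_word(sentence[-1])
    if word is None:
        break
    sentence.append(word)

print(" ".join(sentence))  # most often: "red white and blue"
```

A real transformer conditions on the entire preceding context with learned parameters \(\varphi\) rather than on a single prior word, but the generate-then-feed-back loop is the same.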

The architecture of the earlier and current iterations of GPT (GPT-3/GPT-4) has remained fundamentally the same; however, the initial pre-training process was much smaller for GPT-1 and GPT-2, and approximately 20,000 times more computation was used for the current models’ training [18]. In computing, so-called scaling laws describe how LLM capability grows rapidly as models are fed more data and computation [19, 20]. Importantly, GPT-3/4 are capable of so-called few-shot or even zero-shot learning, whereby a task is learned from just a few examples, or none at all. In addition, GPT-3/4 can perform chain-of-thought prompting to demonstrate reasoning, e.g., showing the steps for solving a complicated problem in calculus or physics [8, 21]. According to S. Bowman, experts are not yet able to interpret the inner workings of LLMs: the models generate outputs based on the vast textual data they are fed, and we have no satisfactory method of knowing “what kinds of knowledge, reasoning, or goals a model is using when it produces some output” [8].
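
To illustrate these two prompting styles, the sketch below assembles a few-shot prompt and a chain-of-thought prompt as plain strings; the clinical shorthand examples and wording are our own illustrative assumptions, not prompts drawn from the cited studies.

```python
# Few-shot prompting: two worked examples establish the pattern,
# and the model is expected to complete the third item.
few_shot_prompt = """\
Translate the clinical shorthand into plain English.

Shorthand: SOB on exertion
Plain English: shortness of breath during physical activity

Shorthand: c/o HA x3 days
Plain English: complains of a headache for three days

Shorthand: NPO after midnight
Plain English:"""

# Chain-of-thought prompting: an added instruction elicits the
# intermediate reasoning steps rather than a bare final answer.
cot_prompt = (
    "A patient takes 250 mg of a drug every 8 hours. "
    "How many mg does the patient take per day? "
    "Let's think step by step."
)

print(few_shot_prompt)
print(cot_prompt)
```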

LLMs in healthcare

One of the most intriguing aspects of GPT-3 and higher versions is that they can not only analyze voluminous amounts of textual data but also compose it, making them functioning generative AI models [2, 8, 22]. A potential use of GPT-3/4 by healthcare providers is as an informatics support system aimed at reducing staff workload and patient wait times. It can rapidly synthesize high-volume, complex patient data and generate summative reports. Theoretically, healthcare IT could integrate LLMs into the electronic medical record (EMR) to use their generative capability to write discharge summaries [23], assist with data entry, and optimize patient check-in for visits (e.g., by assimilating necessary patient data prior to consultation and treatment). Through EMR–LLM integration and, in the future, app embedding, physicians and medical staff could conserve time, reallocating it to more patient-centric tasks that require interpersonal interaction, including face-to-face consultation and personalized treatment. In this manner, LLMs could have a transformative impact on productivity and well-being, ultimately decreasing provider burnout rates in the medical field [24, 25] while enriching the patient experience. The advantages and limitations of LLMs in healthcare are summarized in Table 1.

Table 1 Summary of large language model applications in healthcare

Real-world example

In the USA, medical insurers not infrequently deny payment for procedures and services rendered to patients, such as the “off-label” use of medications [26]. LLMs can be tooled to sift through patient data in the EMR and generate articulate medical insurance appeal letters written in proper prose; this is one example of how GPT-3/4 can offset a growing clerical and logistical burden that ultimately delays the delivery of healthcare. The reallocated time could, in turn, be used for direct patient care. In surgery and research, GPT-3 also appears to have a role in automating written tasks [27], including creating grant proposals [28] and generating procedure-specific consent forms [29, 30].
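
As a sketch of how such tooling might be wired up, the snippet below assembles an appeal-letter prompt from a handful of EMR fields; every field name, value, and clinical detail is hypothetical, and no particular EMR system or vendor API is implied.

```python
# Hypothetical EMR fields; in practice these would be pulled from
# the chart, not hard-coded.
denial_reason = "MRI of the lumbar spine deemed not medically necessary"
patient_record = {
    "diagnosis": "lumbar radiculopathy",
    "failed_treatments": ["6 weeks of physical therapy", "NSAIDs"],
    "red_flags": ["progressive left leg weakness"],
}

# Assemble a structured prompt for the LLM to draft the appeal letter.
prompt = (
    "Write a formal appeal letter to a medical insurer.\n"
    f"Denial reason: {denial_reason}\n"
    f"Diagnosis: {patient_record['diagnosis']}\n"
    "Conservative measures already tried: "
    f"{', '.join(patient_record['failed_treatments'])}\n"
    f"Clinical red flags: {', '.join(patient_record['red_flags'])}\n"
    "Cite medical necessity and request reconsideration."
)

print(prompt)  # this prompt would then be submitted to the model
```

The assembled prompt, rather than raw chart data, is what would be submitted to the model; a clinician would still need to review the drafted letter before submission.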

In the following example, GPT-3 was used to generate a response to a hypothetical insurance denial letter. The response was generated within 10 s of the online query; the query and response are shown verbatim in Fig. 1.

Fig. 1 ChatGPT-3’s rebuttal letter for denial of services. Both query and response are unedited. GPT-3 Generative Pre-trained Transformer-3, MRI Magnetic Resonance Imaging, CT Computed Tomography

One can see from this practical example that GPT-3’s textual output is in fluent, native-speaker English with appropriate syntax. This is just one example of how ChatGPT can be utilized to improve healthcare delivery in the real world.

Educational resource for patients and providers

The ability of GPT-3 (and higher iterations) to abstract large volumes of data and present them succinctly makes its output more useful than that of classical search engines, and this is a key reason why the technology has gained users so rapidly. ChatGPT could serve as a tool to enhance health literacy, i.e., the capacity to seek, understand, and act on health information [31]. With the advent of this technology, patients have a way to probe an intelligent chatbot for healthcare knowledge. Since ChatGPT often provides in-depth yet concise responses to health inquiries, individuals interested in furthering their knowledge of disease and health can ask questions and obtain responses that are easily understood by laypersons. Whether advisable or not, patients will learn from ChatGPT’s responses, shaping their knowledge base and allowing users an “invisible hand” in the algorithm of healthcare [32], since they are able to ask specific questions and receive immediate, personalized responses. One can predict that LLMs will eventually dethrone “Dr. Google” [33] to become the first-choice source of reliable healthcare information.

Limitations and future considerations

When obtaining knowledge from ChatGPT, it is critical for individuals to also be aware of the model’s limitations [2], such as its ability to generate factually incorrect information and to produce potentially harmful or biased content. At least currently, GPT-3/4 lack training on current events and on developments from the recent past (approximately the last 3 years). Despite these limitations, ChatGPT serves as a dynamic tool from which patients can learn.

Individuals who reside in remote locales often face a significant barrier: lack of easy access to medical professionals and health resources. This disparity has a substantial impact on an individual’s health awareness, and GPT-3/4 could benefit persons residing in such underserved communities.

It does not take much to imagine an amalgam of technologies that includes speech recognition, recent developments in human-expressive robotics [34], and generative AI via LLMs. Combining those elements appears to be a natural next step, and a future in which a patient can be evaluated and triaged by a speaking, intelligent humanoid robot is not inconceivable. Once the stuff of science fiction, such a construct could become a reality in the near future.

Ethics and safety governance for LLMs remain challenging, especially because ChatGPT is growing at an exponential rate and is prone to misuse [2, 8]. With proper human oversight, future iterations aim to minimize bias and potential harm to humans. Current renditions have proven suboptimal in certain settings and conversational threads. It has been shown in simulation, for example, that GPT-3 was capable of encouraging suicidal ideation in a mock patient [35], raising serious public health concerns given GPT’s unregulated, free access.

GPT-3 delivered a functional AI to the masses, which is in itself an historic achievement. We must be prepared to shepherd its use to prevent malicious application and/or harmful outputs. Isaac Asimov famously set forth cornerstone principles for artificially intelligent systems. While intended specifically for robots, Asimov’s Three Laws [36] can be broadened to include all AI, including GPT and other LLMs. Expanded in this way, Asimov’s Laws read as follows:

1st law:

A robot/AI may not injure a human being or, through inaction, allow a human being to come to harm.

2nd law:

A robot/AI must obey the orders given it by human beings except where such orders would conflict with the 1st law.

3rd law:

A robot/AI must protect its own existence as long as such protection does not conflict with the 1st or 2nd law.

Recall that computers have been making critical decisions in healthcare for decades. Perhaps the best example is the computerized, human-independent algorithm that controls the automated external defibrillator (AED), first developed by Paul Zoll at Harvard Medical School in the 1950s [37]. AED implementation has been crucial not only because the device is accessible in public areas, but also because it can be operated without medical expertise and, by design, without human input [38].

While humans have become accustomed to specific computerized applications in medicine such as AEDs, generative AI models are far more complex and may not be as easily adopted. Modern LLMs do not always get it right and on occasion produce fluent, grammatically correct textual outputs that are categorically false, and in some instances fictitious, a phenomenon known as “hallucination” [39].

Experts, including OpenAI CEO Sam Altman, have voiced concern about the potential misuse of LLMs to write computer code for carrying out cyberattacks, as well as GPT’s potential to propagate biased information and misinformation. Computer scientists have suggested that LLMs could manipulate humans in order to acquire power. A majority of more than 700 computer science researchers surveyed believe there is more than a 10% probability that humans will be unable to control further advancements in AI, leading to “human extinction” [8, 40]. The gravity of these statements suggests that while LLMs are likely here to stay, human oversight and governance will be absolutely crucial.