1 Introduction

Currently, the amount of computation used for artificial intelligence (AI) is doubling every 6 months, far exceeding the pace of Moore's Law.Footnote 1 Natural language processing (NLP), a subfield of AI, focuses on the interaction between human language and computer systems (Nath et al. 2022). NLP has advanced significantly with the advent of transformers, an AI architecture that has improved NLP without requiring any recurrent or convolutional layer (Vaswani et al. 2017). Transformers exploit the attention mechanism to determine which parts of the input are most relevant. Hence, transformer-based models have been used for language translation and text completion tasks. More recently, transformers have been applied successfully to domains other than NLP, e.g., computer vision (Hatamizadeh et al. 2021).

Large language models (LLMs) constitute one of the most successful applications of transformers. In essence, they are large pre-trained AI systems, based on knowledge gained from huge datasets, that use language as the tool for human-AI interaction and can be adapted easily across several domains and for diverse tasks (Singhal et al. 2023a). The impressive performance of LLMs on NLP tasks has been amply demonstrated over the past few years (Singhal et al. 2023a). Kingston et al. demonstrated that LLMs' performance and data efficiency increase with both model and dataset size (Kingston et al. 2021). LLMs have shown promising results across a wide range of tasks, including those requiring specialized scientific knowledge and reasoning, and they can generalize rapidly and even exhibit reasoning abilities when given appropriate prompting strategies (few-shot, chain-of-thought, and self-consistency) (Brown et al. 2020; Cobbe et al. 2021; Li et al. 2021; Wang et al. 2022; Wei et al. 2022). Through prompt engineering, LLMs can be adapted to downstream tasks without the need for fine-tuning (Liu et al. 2023b).

In 2018, OpenAI (San Francisco, CA, United States) released Generative Pre-trained Transformer-1 (GPT-1), a 117 million-parameter autoregressive LLM.Footnote 2 It was trained using unsupervised pre-training followed by supervised fine-tuning on Common Crawl (a large body of publicly available text from the Internet) and Book Corpus (a set of thousands of books of various genres). During unsupervised pre-training, GPT-1 learned the statistical patterns and structures present in the text data in order to predict the next word in a sentence. During supervised fine-tuning, GPT-1 was trained with input–output pairs on specific tasks (natural language inference, question answering, semantic similarity, and classification) on smaller datasets.Footnote 3 For instance, if the task was text classification, GPT-1 was trained with labeled text samples to predict the correct labels. Through fine-tuning, GPT-1 specialized in a particular task. GPT-1 was followed by larger models: GPT-2 in 2019 with 1.5 billion parameters and GPT-3 in 2020 with 175 billion parametersFootnote 4 (Brown et al. 2020).
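In compact notation along the lines of the original GPT-1 report (a standard formulation, not a quotation of it), with u_i denoting the i-th token of the unlabeled corpus U, k the context window, Θ the model parameters, and (x, y) an input–output pair from the labeled task dataset C, the two training stages maximize:

```latex
% Unsupervised pre-training: next-token prediction over the corpus U
L_1(U) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

% Supervised fine-tuning: label prediction on the task-specific dataset C
L_2(C) = \sum_{(x, y) \in C} \log P\left(y \mid x^1, \ldots, x^m\right)
```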

Unfortunately, LLMs like GPT-3 may amplify social biases present in the training data and generate incorrect outputs (hallucinations) or reflect negative sentiments (Liévin et al. 2022). For instance, LLMs can associate different occupations and levels of respect with different genders, reinforcing the idea that intellectual "brilliance" belongs to one gender only (Shihadeh et al. 2022). In part, this is because LLMs are trained to predict the next (sequential) word in a large dataset of Internet text, and hence the results may not always align with users' expectations (Ouyang et al. 2022). InstructGPT by OpenAI, a fine-tuned version of GPT-3 incorporating reinforcement learning from human feedback (RLHF), has enhanced the performance of LLMs significantly (Stiennon et al. 2022). InstructGPT was trained in three stages. First, a dataset of human-written prompts submitted to the OpenAI application programming interface (API) was labeled with the desired outputs by human annotators; this dataset was used to fine-tune GPT-3 with supervised learning. Second, a dataset was collected on a larger set of API prompts, with human annotators ranking the outputs of different models for each prompt; a reward model was trained on this dataset to predict which output the annotators preferred. Third, the supervised baseline was further fine-tuned with reinforcement learning, using the reward model as the reward signal and optimizing the policy with a proximal policy optimization (PPO) algorithm (Ouyang et al. 2022).
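In simplified notation (our summary of the formulation in Ouyang et al. 2022, omitting the per-prompt normalization and the pretraining-mix term used in practice), the second stage trains a reward model r_θ so that, for a prompt x, the annotator-preferred output y_w scores higher than the rejected one y_l; the third stage then maximizes this reward while a penalty keeps the policy π_φ close to the supervised fine-tuned policy π_SFT:

```latex
% Stage 2: pairwise ranking loss for the reward model
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}
  \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]

% Stage 3: RL objective optimized with PPO
\mathrm{objective}(\phi) = \mathbb{E}_{x \sim D,\; y \sim \pi_\phi}
  \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} \right]
```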

LLaMA (Large Language Model Meta AI), an LLM smaller than InstructGPT (7 to 65 billion vs. 175 billion parameters) but trained on a larger number of tokens, outperformed InstructGPT on several benchmarks (Touvron et al. 2023). The latest version, LLaMA 3, is available in configurations ranging from 8 to 70 billion parameters.

ChatGPT by OpenAI, the successor of InstructGPT, was launched on November 30, 2022.Footnote 5 It was trained using RLHF following the same methods as InstructGPT but with a different dataset, which combined the InstructGPT dataset with conversations in which human trainers played both the user and the AI assistant. ChatGPT was trained on Internet data up to the end of 2021.

Bard, the response of Google (Mountain View, CA, United States) to ChatGPT, was unveiled on February 8, 2023. It was capable of answering multimodal questions, e.g., mixing text and images. At the end of the output, it also appended web links to a Google search based on keywords from the input question. Google has also been working on Sparrow, another RLHF-based LLM, as well as on Gemma and Gemini. Galactica, a decoder-only transformer LLM by Meta (Menlo Park, CA, United States), was developed to organize scientific literature. It was trained on over 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, and encyclopedias (Taylor et al. 2022).

Claude by Anthropic (San Francisco, CA, United States) was designed to rely on Constitutional AI, a set of human-provided principles, to improve the performance of LLMs (Bai et al. 2022). Its use is currently limited to users in the United States and the United Kingdom.

On March 14, 2023, OpenAI launched GPT-4, the successor of ChatGPT, which accepts both image and text inputs and generates text output. As with ChatGPT, it was trained on publicly available Internet data and fine-tuned with RLHF (OpenAI 2023). GPT-4 also extended the maximum length of the text that can be provided as input, at the cost of increased computational complexity. In GPT models, computational complexity grows quadratically with the length of the token sequence, because the self-attention mechanism in the transformer entails pairwise comparisons between all tokens in the sequence. The maximum token sequence increased from 512 in GPT-1 to 4,096 in ChatGPT and up to 128,000 in GPT-4.Footnote 6 For comparison, the latest version of Claude (Claude 3) can manage a sequence of 200,000 tokens.
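To make the quadratic dependency concrete (our own illustration based on the transformer formulation of Vaswani et al. 2017, not a result from the cited GPT reports), scaled dot-product attention over a sequence of n tokens forms an n × n score matrix, so computation scales roughly as O(n² d) and memory as O(n²), i.e., the attention cost quadruples whenever the context length doubles:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d_k},\; Q K^{\top} \in \mathbb{R}^{n \times n}
```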

Other LLMs handling multimodal data (text and images) include Flamingo, Bard, BLIP-2 (Bootstrapping Language Image Pretraining), CM3Leon, PaLM-E, and LLaVAFootnote 7 (Alayrac et al. 2022; Driess et al. 2023; Li et al. 2023b; Liu et al. 2023a).

LLMs can potentially store, combine, and explore scientific knowledge to find hidden connections between different searches and produce systematic reviews or meta-analyses on specific topics automatically (Taylor et al. 2022).

Because of the potential of LLMs to acquire useful knowledge encoded in medical databases, they are likely to have applications in healthcare, including knowledge retrieval, clinical decision support, synopsis of key findings, and triaging patients attending primary care clinics (Singhal et al. 2023a).

The ability to answer medical questions requires full comprehension of medical text, recall of appropriate medical knowledge, and reasoning with expert information. LLMs like ChatGPT, GPT-4, Google Bard, and Claude by Anthropic were not specifically trained for healthcare applications, since they were developed for general-purpose cognitive capability. The healthcare data used to train LLMs came from openly available medical texts, research papers, health system websites, and openly available health information podcasts and videos (Lee et al. 2023b). The training data did not include access-restricted data, e.g., data contained in electronic health records, or any medical information that exists only on the private network of a medical organization (Lee et al. 2023b).

Several LLMs have been developed for healthcare, including the LLM by Hippocratic (Kolkata, India), ChatDoctor, DoctorGLM, Clinical Camel (derived from LLaMA), Med-Alpaca (derived from LLaMA), PMC-LLaMA, HuaTuo, and ChatCAD (Han et al. 2023; Li et al. 2023d; Toma et al. 2023; Wang et al. 2023a; Wang et al. 2023b; Wu et al. 2023; Xiong et al. 2023). Recent efforts have led to multimodal LLMs for medicine like Med-PaLM M and LLaVA-Med (Li 2023a; Tu et al. 2023).

Research on LLMs for healthcare applications is expanding rapidly. A recent review highlighted that current studies are mostly focused on: i) medical education, such as assessing the performance of LLMs in medical examinations and their ability to provide information support to learners and teachers; ii) clinical practice, e.g., generating clinical reports and summarizing clinical discussions; and iii) research, e.g., developing LLM-based applications to collect and analyze medical literature (Wu et al. 2024). A list of applications of LLMs in healthcare is reported in Table 1. Radiology is currently the medical specialty in which LLMs have been applied most, followed by surgery and dentistry (Wu et al. 2024).

Table 1 Applications of LLMs in healthcare

This work focused on LLMs in the medical education domain, in particular on the assessment of LLM performance in medical examinations.

The capability of LLMs to pass (or fail) a medical examination may open new opportunities in medical education for both teachers and learners. If LLMs demonstrate proficiency in answering questions correctly and reasoning within a given medical area, trainees could use them to learn about a specific topic or ask them to explain concepts they did not understand. If LLMs prove reliable on the knowledge related to a medical field, teachers could trust them to prepare new teaching content such as lectures, to prepare examinations, and to serve as assessment frameworks for evaluating students' examination responses, thus saving a considerable amount of time.

1.1 Work motivation

Since LLMs have exceptional natural language comprehension abilities and are trained on massive datasets, they are ideal candidates for professional benchmarks, including those related to healthcare (Holmes et al. 2023). Several studies have tested LLMs on medical examinations, for instance on the United States Medical Licensing Examination (USMLE), a three-step examination assessing the clinical competence required for licensure to provide healthcare independently in the United States (Gilson et al. 2023; Han et al. 2023; Kung et al. 2023; Nori et al. 2023; Shama et al. 2023; Singhal et al. 2023a; Singhal et al. 2023b; Toma et al. 2023; Tu et al. 2023; Wu et al. 2023). Step 1 of the USMLE is taken by medical students after completing their preclinical training. Step 2 is taken after graduation and its scores are considered for admission into residency programs. Passing Step 3 is required to be licensed to practice medicine without supervision. However, at present there is no published systematic review of LLMs on healthcare examinations.

Although surgery is a medical specialty that generates some of the largest volumes of data in healthcare that can be processed by AI algorithms, there is currently no published evidence on how LLMs perform on tests related to robot-assisted surgery (RAS). RAS has surged ahead of traditional direct manual operations given its improved efficacy, such that the global RAS market is predicted to grow at a compound annual growth rate (CAGR) of 16.8%, reaching USD 21 billion by 2030.Footnote 8 Assuming this prediction materializes, there is an urgent need to train an increasing number of surgeons in RAS. Recognizing this need, several training curricula have been proposed, but none has received universal acceptance and widespread adoption (Satava et al. 2020). Fundamentals of Robotic Surgery (FRS) is a multi-specialty, proficiency-based curriculum of basic cognitive and technical skills to train and assess surgeons in performing RAS safely and efficiently. The threshold for attaining proficiency in FRS was computed as the mean score of the expert surgeons participating in a multicenter randomized controlled trial involving 12 surgical training centers accredited by the American College of Surgeons (Satava et al. 2020).

1.2 Contributions

The first purpose of this work was to perform a systematic review of published literature on LLMs on healthcare examinations. The second aim was to see whether ChatGPT, GPT-4, and Bard are capable of passing the FRS test. The main contributions of this paper are as follows:

  • The studies on LLMs in medical examinations are presented;

  • The role of prompt engineering is discussed for each group of studies;

  • A comparative analysis of ChatGPT, GPT-4, and Bard on the FRS test is performed;

  • The future challenges of LLMs in medical education are presented.

The rest of the paper is structured as follows: Sect. 2 describes the literature search strategy and the process used to extract and analyze studies. Section 3 states the research questions of this work. The applications of LLMs, reported in Sect. 4, are subdivided into three groups: National Qualifying examinations, Medical Specialty examinations, and other tests in medicine. Section 5 presents a comparative analysis of ChatGPT, GPT-4, and Bard on the FRS test: FRS is first described, including the source of the online question set; the consistency of the LLMs' performance over trials is then reported; and, for ChatGPT, scores over multiple releases are presented. Section 6 discusses the comparison of ChatGPT, GPT-4, and Bard on the FRS test, underlining similarities and differences with the published evidence resulting from our systematic review. The challenges of LLMs in medical education are then discussed. Conclusions are reported in Sect. 7.

2 Literature search

2.1 Search strategy

In August 2023, a literature search was conducted on PubMed, Web of Science, Scopus, and arXiv following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) statement and the AMSTAR 2 tool for critical appraisal of systematic reviews (Appendix A) (Page et al. 2021; Shea et al. 2017). The search was limited to articles in English with an abstract, published from January 1, 2018 to July 31, 2023. The following search terms were used:

“Large language models medical education”

OR “ChatGPT medical education”

OR “large language models medical exam”

OR “ChatGPT medical exam”

OR “large language models medical examination”

OR “ChatGPT medical examination”

OR “large language models medical license”

OR “ChatGPT medical license”

OR “large language models surgical education”

OR “ChatGPT surgical education”

OR “large language models medicine”

OR “ChatGPT medicine”

OR “large language models healthcare”

OR “ChatGPT healthcare”

OR “large language models surgery”

OR “ChatGPT surgery”

OR “Large language models surgical exam”

OR “ChatGPT surgical exam”

OR “large language models surgical examination”

OR “ChatGPT surgical examination”

OR “large language models medical test”

OR “large language models surgical test”

OR “ChatGPT medical test”

OR “ChatGPT surgical test”

OR “Large language models surgical license”

OR “ChatGPT surgical license”

Reviews, letters, non-peer-reviewed articles, and conference abstracts and proceedings were excluded from the analysis.

2.2 Data extraction

Identified articles were screened by title and abstract, followed by full-text review, data extraction, and review of references. Two reviewers (AM and KG) independently screened titles and abstracts for relevance. The sample, phenomenon of interest, design, evaluation, and research type (SPIDER) tool was used to structure qualitative research questions (Cooke et al. 2012). In case of insufficient information, the corresponding authors of the articles concerned were contacted for further details. References were checked to retrieve further studies.

2.3 Data analysis

Since the studies concerned many medical examinations, they were subdivided into three distinct groups: National Qualifying Examinations, Medical Specialty Examinations, and other studies. For each group, a table was prepared to visually present the data of the studies. The SPIDER tool was applied to the studies of each group, reporting: the number of questions of the examinations (Sample), the name of the examination (Phenomenon of Interest), the LLM, datasets, and prompt engineering technique (Design), the passing score and results (Evaluation), and whether the study was qualitative or quantitative (Research type).

2.4 Risk of bias

Given the nature of the review, a bias analysis according to the Cochrane tool for assessing risk of bias was not applicable. Bias was instead rated in terms of memory retention of the LLM, overlap between the test data and the training data of the LLMs, management of missing data, and type of funding (e.g., private and/or public).

3 Research questions

By using the SPIDER tool, the following research questions were formulated to serve as a roadmap for the scientific investigation, ensuring a thorough analysis of the published literature and of the performance of major LLMs like ChatGPT, GPT-4, and Bard on the FRS test.

  • RQ1: What are the medical examinations where LLMs were applied? How do different LLMs compare on the same exam? What are the performances of these LLMs on other medical examinations?

  • RQ2: Which type of prompt engineering techniques were used to improve the reasoning of LLMs?

  • RQ3: Do ChatGPT, GPT-4, and Bard pass the FRS test on cognitive skills? How consistent are their performances across subsequent attempts? How do their scores vary over multiple releases?

  • RQ4: What is the variability of ChatGPT, GPT-4, and Bard, not only in terms of the overall score but also in terms of how many times the individual FRS questions were answered correctly or erroneously across trials?

  • RQ5: What are the main challenges of LLMs in the different stages of medical education, e.g., preparation for medical examinations?

4 Applications of LLMs in medical examinations

4.1 Results of the literature search

The database search retrieved 2393 results. After title and abstract screening, the full texts of 106 records were screened, of which 57 were found eligible for inclusion. A total of 45 studies were retained for full-text analysis, including 10 additional studies identified through reference checking. A list of the articles excluded from the 106 screened ones, along with the reason for exclusion, is provided in Online Appendix B. The flowchart based on the PRISMA statement is shown in Fig. 1 (Page et al. 2021).

Fig. 1
figure 1

Flow chart of the study selection process according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) (Page et al. 2021)

The 45 studies included in the review comprised 16 on national qualifying examinations (Fang et al. 2023; Gilson et al. 2023; Han et al. 2023; Jang et al. 2023; Kasai et al. 2023; Kung et al. 2023; Shama et al. 2023; Singhal et al. 2023a; Singhal et al. 2023b; Nori et al. 2023; Taira et al. 2023; Takagi et al. 2023; Thirunavukarasu et al. 2023; Toma et al. 2023; Tu et al. 2023; Wu et al. 2023), four on neurology and neurosurgery (Ali et al. 2023a; Ali et al. 2023b; Giannos et al. 2023a; Hopkins et al. 2023), three on orthopedics (Cuthbert et al. 2023; Saad et al. 2023; Lum et al. 2023), and two each on anesthesiology (Angel et al. 2024; Shay et al. 2023), ophthalmology (Antaki et al. 2023; Mihalache et al. 2023), general surgery (Beaulieu-Jones et al. 2024; Oh et al. 2023), and radiology (Bhayana et al. 2023; Huang et al. 2023). The others included one study each on examinations in the following specialties: emergency medicine, family medicine, clinical informatics, cardiology, urology, gynecology, general practice, dermatology, gastroenterology, otolaryngology, and pharmacy (Kumah-Crystal et al. 2023; Hoch et al. 2023; Huynh et al. 2023; Li et al. 2023d; Liu et al. 2023c; Passby et al. 2023; Skalidis et al. 2023; Smith et al. 2023a; Suchman et al. 2023; Wang et al. 2023c; Weng et al. 2023). One study concerned admission to university, while two concerned exams at a single institution (Giannos et al. 2023b; Huh et al. 2023; Strong et al. 2023).

4.2 Prompt engineering strategies

The following prompt engineering strategies were applied in the reviewed studies; a minimal illustrative sketch is provided after the list.

  • Few-shot: the model is given a few demonstrations of the task at inference time as conditioning (Brown et al. 2020).

  • One-shot: similar to few-shot but with one demonstration (Brown et al. 2020).

  • Zero-shot: similar to few-shot but with a natural language description of the task instead of any examples (Brown et al. 2020).

  • Chain of thought: demonstrations of intermediate natural language reasoning steps are provided in the exemplars for few-shot prompting (Wei et al. 2022).

  • Self-consistency: an LLM is first prompted with a set of chain-of-thought exemplars. Then, a set of outputs from the LLM, generating a diverse set of reasoning paths, is sampled. Finally, the most consistent answer is chosen among the generated outputs (Wang et al. 2022).

  • Ensemble refinement: in the first stage, an LLM is prompted with a set of chain-of-thought exemplars to generate a set of outputs, similarly to self-consistency; in this case, each output includes an explanation of the answer. In the second stage, the LLM is conditioned on the original prompt, the question, and the concatenated generations, and is prompted to produce a refined explanation and answer. The second stage is performed multiple times. Finally, a plurality vote over the generated answers determines the final answer (Singhal et al. 2023b).
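To make these strategies concrete, the sketch below is a minimal illustration in Python (our own, not taken from the reviewed studies); `ask_llm` is a hypothetical stand-in for whatever model API or web interface is used, and the prompt formats and answer parsing are assumptions for a single multiple-choice question.

```python
# Minimal sketch of the prompting strategies above for a single MCQ.
# `ask_llm` is a hypothetical callable (prompt -> model output), not a real API.
from collections import Counter
from typing import Callable, List


def zero_shot(question: str) -> str:
    # Task description only, no worked examples.
    return ("Answer the following multiple-choice question with a single letter.\n\n"
            f"{question}\nAnswer:")


def few_shot(question: str, exemplars: List[str]) -> str:
    # A few solved demonstrations are prepended to the test question
    # (one-shot is the special case of a single exemplar).
    return "\n\n".join(exemplars) + f"\n\n{question}\nAnswer:"


def chain_of_thought(question: str, reasoned_exemplars: List[str]) -> str:
    # Exemplars include intermediate reasoning steps before the final answer.
    return "\n\n".join(reasoned_exemplars) + f"\n\n{question}\nLet's think step by step."


def self_consistency(question: str,
                     reasoned_exemplars: List[str],
                     ask_llm: Callable[[str], str],
                     n_samples: int = 5) -> str:
    # Sample several chain-of-thought completions and keep the majority answer.
    prompt = chain_of_thought(question, reasoned_exemplars)
    # Assumes each completion ends with the chosen option letter.
    answers = [ask_llm(prompt).strip()[-1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Ensemble refinement follows the same pattern, with a second round of prompts that condition the model on its own concatenated first-stage generations before the final plurality vote.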

4.3 Studies on national qualifying examinations

The published studies on National Qualifying Examinations are reported in Table 2.

Table 2 Studies on National Qualifying examinations

Ten studies concerned USMLE, one the Applied Knowledge Test of the Membership of the Royal College of General Practitioners, one the Japanese Medical Licensing Examination, one the National Medical Practitioners Qualifying Examination in Japan, one the National Nursing Licensing Examinations in Japan, one the Korean National Licensing Examination for Korean Medicine Doctors, and one the Chinese National Medical Licensing Examination.

Three studies concerned ChatGPT, one GPT-3, one InstructGPT, one Med-PaLM, one Med-PaLM 2, one Med-PaLM M, one GPT-4, one Clinical Camel, one Med-Alpaca, and one PMC-LLaMA. Two studies used original questions from a real edition of the USMLE examination (Kung et al. 2023; Nori et al. 2023), while the others used online question banks. Four of these used 1,273 USMLE-style questions from the MedQA dataset (Nori et al. 2023; Singhal et al. 2023a; Singhal et al. 2023b; Tu et al. 2023). As a consequence, the number of questions varied among studies, from 114 to 1,649 (Nori et al. 2023; Shama et al. 2023). The passing score of the USMLE varies over the years but is generally close to 60.0%. In the study by Gilson et al., only ChatGPT met this threshold, in contrast with GPT-3 and InstructGPT. The study by Kung et al. investigated three different strategies to prompt USMLE questions: (i) as open-ended questions; (ii) as multiple-choice questions; and (iii) as multiple-choice questions with forced justification, i.e., asking ChatGPT to provide the rationale for its response. With the first method ChatGPT passed all steps, with the second method only Step 3, and with the third method both Step 1 and Step 3.

One study on ChatGPT reached 60.0% on 45 USMLE questions concerning critical reasoning (Shama et al. 2023). Prompt engineering techniques have proved successful in increasing LLM performance. Med-PaLM was the first LLM to pass the USMLE, thanks to few-shot, chain-of-thought, and self-consistency prompting (Singhal et al. 2023a). It reached 67.6% of correct answers and was surpassed by Med-PaLM 2, which reached 86.5% thanks to ensemble refinement (Singhal et al. 2023b). Med-PaLM M, a multimodal LLM for medicine, outperformed Med-PaLM, achieving 69.7% using the few-shot technique. GPT-4 was tested with zero-shot and five-shot prompting in two different configurations, namely the base and the released model, with the latter aligned for safety (Nori et al. 2023). The base GPT-4 model reached 88.3% vs. 86.6% for the released one with five-shot prompting on the real USMLE exam (Nori et al. 2023). Scores on the USMLE-like questions provided by MedQA were in line with Med-PaLM 2, i.e., 86.1% for the base GPT-4 model and 81.4% for the released one. None of the LLMs based on LLaMA passed the USMLE, except Med-Alpaca on Step 3; this LLM used the zero-shot technique (Wu et al. 2023).

In the United Kingdom, general practitioners must pass the Applied Knowledge Test of the Membership of the Royal College of General Practitioners to complete their training. ChatGPT was tested in two different trials on questions from online question banks but did not meet the passing threshold in either trial (Thirunavukarasu et al. 2023).

The Japanese Medical Licensing Examination is a mandatory exam for certifying medical practitioners in Japan. A study on 254 questions from the real 2023 edition showed that GPT-4 passed it, in contrast with ChatGPT (Takagi et al. 2023). The National Medical Practitioners Qualifying Examination in Japan is taken by sixth-year medical students. It consists of a compulsory and a general part, for a total of 400 questions. Kasai et al. assessed GPT-3, ChatGPT (in the gpt-3.5-turbo configuration), and GPT-4 on six real editions (from 2018 to 2023) with questions in Japanese. All questions were prompted with three in-context examples. GPT-4 passed both parts in all editions, ChatGPT passed only the compulsory part and not in all editions, while GPT-3 did not pass any part (Kasai et al. 2023). In a study on five real editions (from 2019 to 2023) of the National Nursing Licensing Examinations in Japan, ChatGPT passed the part on general questions (Taira et al. 2023). Neither ChatGPT nor GPT-4 passed the 2022 edition of the Korean National Licensing Examination for Korean Medicine Doctors when questions were prompted in Korean with a brief preamble about the type of examination (Jang et al. 2023). In contrast, GPT-4 passed the Chinese National Medical Licensing Examination with questions in Chinese from an online question bank (Fang et al. 2023). Questions were reformatted by deleting all the choices and adding a variable lead-in imperative or interrogative phrase, as in the study by Kung et al. (2023).

4.4 Studies on medical specialties examinations

The published studies on medical specialties examinations are shown in Table 3.

Table 3 Studies on medical specialties examinations

LLMs were tested on neurology/neurosurgery in four studies. GPT-4 passed both the UK Specialty Certificate Examination and the written part of the American Board of Neurological Surgery examination, while ChatGPT passed only the latter (Ali et al. 2023b; Giannos et al. 2023a). Concerning the oral part of the American Board of Neurological Surgery examination, GPT-4 outperformed ChatGPT and Bard. In none of the three studies on orthopedics did either GPT-4 or ChatGPT reach the passing score. Concerning the American Board of Anesthesiology examination, only GPT-4 reached the threshold, in contrast with GPT-3, ChatGPT, and Bard (Angel et al. 2024; Shay et al. 2023). The study on the American Academy of Ophthalmology's Basic and Clinical Science Course did not specify the passing score (Antaki et al. 2023). ChatGPT did not pass the Ophthalmic Knowledge Assessment Program examination (Mihalache et al. 2023).

Since the passing score was not specified in a study on the American Board of Surgery Qualifying Exam and another on the Korean general surgery board exams, it was not possible to know whether or not LLMs passed them (Beaulieu-Jones et al. 2024; Oh et al. 2023).

It is interesting to note that none of the published studies on neurology, orthopedics, anesthesiology, ophthalmology, and general surgery assessed LLMs on real examinations; they used either online question banks or mock versions of actual exams (Table 3). Two studies on neurology, one on ophthalmology, two on surgery, and one on gynecology did not specify the passing score (Table 3).

Both ChatGPT and GPT-4 passed the real version of the American College of Radiology Radiation Oncology in-training (TXIT) examination, while ChatGPT scored slightly below the threshold of the Canadian Royal College Examination in Radiology using online question banks (Bhayana et al. 2023; Huang et al. 2023). The remaining published studies included one report for each medical specialty. Question banks or mock-up versions were used in six studies, while questions from real examinations were used for the Australian College of Emergency Medicine examination, the Family Medicine Board Examination, the American Urological Association Self-Assessment Study Program examination, the American College of Gastroenterology self-assessment tests, and the Taiwanese Pharmacist Licensing Examination (Table 3). GPT-4, Bard, and Bing passed the Australian College of Emergency Medicine exam, in contrast with ChatGPT, which did not meet the threshold (Smith et al. 2023a, b). ChatGPT did not pass the Family Medicine Board Examination (with questions in Chinese) or the American Urological Association Self-Assessment Study Program examination (Huynh et al. 2023; Weng et al. 2023). Neither ChatGPT nor GPT-4 passed the American College of Gastroenterology self-assessment tests (Suchman et al. 2023). On the Taiwanese Pharmacist Licensing Examination, ChatGPT met the threshold only on the part on pharmaceutical laws (questions in English) but not in pharmacology (questions in both Chinese and English) (Wang et al. 2023c).

Different prompt engineering strategies were applied to medical specialty examinations. Zero-shot prompting, with questions formatted as both open-ended and multiple-choice, was used in the studies on the Ophthalmic Knowledge Assessment Program examination (Antaki et al. 2023; Mihalache et al. 2023). In the study on preparation for the American Board of Surgery Qualifying Exam, all questions were prompted as both open-ended and multiple-choice single-answer without forced justification, as done in previous studies on the USMLE and the Chinese National Medical Licensing Examination (Beaulieu-Jones et al. 2024; Fang et al. 2023; Kung et al. 2023). A brief preamble specifying whether the question was multiple-choice or single-choice was used in the studies on the Chinese Clinical Medicine Entrance Examination, the Family Medicine Board Examination, and preparation for the German otolaryngology board certification (Hoch et al. 2023; Liu et al. 2023c; Weng et al. 2023). A brief preamble requesting justification for the generated responses was applied to questions for the Clinical Informatics Board examination (Kumah-Crystal et al. 2023). Questions were prompted as both open-ended and multiple-choice in the study on the American Urological Association Self-Assessment Study Program examination (Huynh et al. 2023). Overall, GPT-4 outperformed ChatGPT in all examinations except the American College of Gastroenterology self-assessment test (Suchman et al. 2023). GPT-4 scored higher than Bard on the American Board of Neurological Surgery oral board examination, the American Board of Anesthesiology examination, and the Australian College of Emergency Medicine examination (Ali et al. 2023a; Angel et al. 2024; Smith et al. 2023a, b).

4.5 Other studies

The remaining reviewed studies are reported in Table 4.

Table 4 Studies on other examinations

They concerned the UK BioMedical Admissions Test, the clinical reasoning exams administered to pre-clerkship medical students at Stanford University, and a parasitology exam at Hallym University (South Korea), as shown in Table 4. In all three studies, the LLMs were assessed on real exams. ChatGPT scored slightly below the passing score of the clinical reasoning exams at Stanford University, while for the other two studies the passing threshold was not specified (Giannos et al. 2023b; Huh et al. 2023; Strong et al. 2023) (Table 4). No prompt engineering strategies were applied in these studies (Table 4).

4.6 Analysis of bias

The analysis of bias is reported in Table 5.

Table 5 Analysis of bias in the reviewed studies

To reduce memory retention bias, a new chat session of ChatGPT was started for each question in two studies on the USMLE, one on the Korean National Licensing Examination for Korean Medicine Doctors, one on the Chinese National Medical Licensing Examination, two on orthopedics examinations, one on ChatGPT in ophthalmology, one on the Chinese Clinical Medicine Entrance Examination, and one on the American Urological Association Self-Assessment Study Program examination (Table 5). A new chat session of GPT-4 was started for each question in one study on the Korean National Licensing Examination for Korean Medicine Doctors and one on the Chinese National Medical Licensing Examination (Fang et al. 2023; Jang et al. 2023). A new chat session of ChatGPT and GPT-4 was started after every five questions in one study on the American College of Radiology Radiation Oncology in-training examination (Huang et al. 2023). The other studies did not specify whether the chat history of the LLM was cleared or not.

The high scores of GPT-4, Med-PaLM, and Med-PaLM 2 on the USMLE suggest the hypothesis of a memorization effect, which can arise when benchmark data are included in an LLM's training set. However, specific tests using the Memorisation effects Levenshtein detector (MELD) method revealed no evidence of a memorization effect of GPT-4 on USMLE questions (Nori et al. 2023). No overlap was found between the USMLE-like questions of MultiMedQA and the Med-PaLM training set, while little overlap was observed on MCQs (Singhal et al. 2023a). An overlap ranging from 0.9 to 11.15% was found between the USMLE-like questions of MedQA and the training set of Med-PaLM 2 (Singhal et al. 2023b). The other studies did not check the overlap between the test set and the training set of the LLMs. Finally, a source of bias could be funding from private companies developing LLMs, which occurred in four qualitative studies (Nori et al. 2023; Singhal et al. 2023a; Singhal et al. 2023b; Tu et al. 2023).

5 Comparative analysis on fundamentals of robotic surgery

5.1 Description of FRS

FRS was developed through four consensus conferences to establish the tasks, metrics, and curriculum content, involving 66 subject matter experts (surgeons, psychologists, psychometricians, engineers, simulation experts, and medical educators) from the Department of Defense and Veterans Administration, the American Board of Surgery, and 14 international surgical societies (Satava et al. 2020). FRS was designed to be agnostic to any particular platform and therefore applies to basic robotic skills independent of the platform used. The FRS section on cognitive skills consists of online modules with educational videos and text. At the end of this part, trainees must successfully pass a questionnaire before progressing to the technical skills part, which consists of a set of tasks of increasing difficulty on a virtual reality simulator (Satava et al. 2020). Learners must reach proficiency in each task before moving to the next.

5.2 Input source of questions

The FRS part on cognitive skills consists of four online modules providing basic knowledge. It starts with an introduction to surgical robotic systems, then moves on to didactic instructions on robotic surgery and the psychomotor skills curriculum, and ends with team training and communication skills. It can be accessed after downloading an app available for iOS and Android devices.Footnote 9 At the end of the modules, the learners are required to pass a test consisting of 44 multiple choice questions (MCQs), each with four options, except one with three and one with two. The breakdown of the questions is: i) introduction to surgical robotic systems (n = 18), ii) didactic instructions for robotic surgery (n = 13), iii) psychomotor skills curriculum (n = 6), and iv) team training and communication skills (n = 7). The proficiency level for passing the questionnaire is equivalent to 35 correct answers (79.5%) (Satava et al. 2020).

5.3 LLMs testing

The protocol involved manually prompting the ChatGPT, GPT-4, and Bard web interfaces with all the original MCQs of the FRS (Fig. 2).

Fig. 2
figure 2

Protocol of experiments (all LLMs generated answers were stored in Word files)

We chose this technique as it is the closest to human test-taking. Only one of the 44 MCQs included both text and an image. Even though in our study ChatGPT could only manage text information, this question was retained. The questions and answers obtained with ChatGPT, GPT-4, and Bard were saved in Microsoft Word files. All MCQs were analyzed manually to determine the selected response, which was marked as correct, incorrect, or no option selected (i.e., when the LLM did not choose an answer). Answers with no option selected were considered incorrect.

Data were then imported into a Microsoft Excel spreadsheet. After the FRS test had been completed seven times with the January 30, 2023 release, a new version of ChatGPT became available online. The FRS test was then repeated seven times with each of the following 2023 versions: February 13, March 14, May 3, and May 24. Likewise, the FRS test was submitted seven times to GPT-4 (March 14, 2023 release) and Bard (July 13, 2023 release). After completing a full questionnaire, the queue of prompts and answers was cleared to avoid memory effects on subsequent trials. Statistically significant differences were tested with the Kruskal–Wallis and Wilcoxon tests (p < 0.001 considered statistically significant).
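The sketch below is a minimal illustration of this analysis pipeline (the mark and score arrays are invented placeholders, not the data of this study, and `trial_score` is our own helper): raw marks are converted into a percentage score, with answers lacking a selected option counted as incorrect, and the per-trial scores are compared with SciPy's Kruskal–Wallis and Wilcoxon signed-rank tests.

```python
# Minimal sketch of the scoring and statistical comparison described above.
# All values below are invented placeholders, NOT the data of this study.
from scipy.stats import kruskal, wilcoxon


def trial_score(marks):
    """Percentage of correct answers; 'N' (no option selected) counts as incorrect."""
    return 100.0 * sum(m == "C" for m in marks) / len(marks)


# Example: one trial of 44 marks ('C' correct, 'E' erroneous, 'N' not selected).
example_trial = ["C"] * 30 + ["E"] * 12 + ["N"] * 2
print(f"Example trial score: {trial_score(example_trial):.1f}%")  # 68.2%

# Hypothetical per-trial scores over seven attempts (one list per model/release).
chatgpt_scores = [54.5, 61.4, 63.6, 65.9, 68.2, 65.9, 63.6]
gpt4_scores = [88.6, 90.9, 93.2, 90.9, 90.9, 95.4, 90.9]
bard_scores = [79.5, 77.3, 81.8, 75.0, 79.5, 81.8, 81.8]

# Kruskal-Wallis test across groups (the study takes p < 0.001 as significant).
h_stat, p_kw = kruskal(chatgpt_scores, gpt4_scores, bard_scores)

# Wilcoxon signed-rank test for a pairwise, trial-matched comparison.
w_stat, p_w = wilcoxon(gpt4_scores, bard_scores)

print(f"Kruskal-Wallis p = {p_kw:.4f}; Wilcoxon (GPT-4 vs Bard) p = {p_w:.4f}")
```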

5.4 Performances of the five different releases of ChatGPT

The results comparing the five releases of ChatGPT on FRS test are reported in Table 6.

Table 6 Comparison among different releases of ChatGPT on FRS test

The performance of ChatGPT over trials is shown in Fig. 3. On the first attempt (baseline), the third release achieved the highest score (79.5%), vs. 77.3% for the fifth and 72.7% for the fourth, followed by 54.5% for the first two releases. At baseline, only the third version reached the proficiency level for passing the FRS test. The average score of correct answers, depicted in Fig. 3, improved slightly from the first (64.6%) to the second release (65.6%), and more substantially with the third (75.0%) and fourth (78.9%) releases. Surprisingly, the mean score of the fifth tested version of ChatGPT decreased to 72.7%.

Fig. 3
figure 3

Comparison on trend of the five different releases of ChatGPT, GPT-4, and Bard on FRS test

Kruskal–Wallis tests confirmed a statistically significant difference (p < 0.001). ChatGPT did not achieve the benchmark in any attempt with the first two and the fifth releases, but reached the proficiency level in two trials with the third release and in three attempts with the fourth. ChatGPT answered the question containing both text and an image correctly three, six, two, and four times from the first to the fourth tested version, respectively; it always provided an erroneous response with the fifth version. The second release had the highest average rate of answers without any option selected (14.6%), followed by the third (7.8%), the first and fourth (6.5%), and the fifth (3.2%), as shown in Fig. 4.

Fig. 4
figure 4

Rate of correct, erroneous, and answers without option selected for the five releases of ChatGPT, GPT-4, and Bard on the entire FRS test

5.5 Performances of GPT-4

Scores of GPT-4 are shown in Table 7 and Fig. 3.

Table 7 Results of GPT-4 on FRS test

At baseline, it always outperformed the five releases of ChatGPT (Table 6). GPT-4 successfully passed the FRS test in all seven attempts, with an average of 91.5% correct answers. It achieved a maximum of 95.4% in the sixth trial. It always performed better than all five tested versions of ChatGPT in each of the corresponding attempts, with the difference being statistically significant (p < 0.001). In sharp contrast with ChatGPT, it did not generate any response without choosing an option, except for the question containing both text and an image (Fig. 5). In this case, it always specified its inability to interpret images, since this functionality was not available at the time of testing. Moreover, GPT-4 provided concise answers without additional explanations.

Fig. 5
figure 5

Raw marks for each attempt of all tested LLMs with correct answers in white, erroneous in red, and not selected answers in yellow

5.6 Performances of Bard

Scores of Bard are reported in Table 8 and in Fig. 3.

Table 8 Results of Bard on FRS test

Bard successfully passed the FRS test in five out of seven attempts, with an average of 79.5% correct answers. It achieved a maximum of 81.8% in the third and sixth trials (Table 8). GPT-4 always performed better than Bard in each of the corresponding attempts, but the difference was not statistically significant (p = 0.002). Bard always provided explanations for the responses it generated and answered the question with text and image correctly in six out of seven trials.

5.7 Consistency over multiple attempts and releases

In addition to showing variability in the overall score among attempts and, in the case of ChatGPT, also among releases, the LLMs demonstrated variability in the number of times each question was answered correctly or not, as depicted in Fig. 5.

The first release of ChatGPT provided the correct answer to 17 questions (38.6%) in all seven trials, vs. 13 (29.5%) for the second, 25 (56.8%) for the third, 23 (52.3%) for the fourth, 21 (47.7%) for the fifth, 35 (79.5%) for GPT-4, and 30 (68.2%) for Bard. ChatGPT generated an erroneous answer in all attempts for two questions (4.5%) with the second, third, and fourth versions, vs. one (2.2%) with the fifth release, one (2.2%) for GPT-4, and four (9.1%) for Bard.

5.8 Answers related to the knowledge domain

The answers provided by ChatGPT, GPT-4, and Bard relating to the knowledge domain are shown in Table 9 and Figs. 6, 7, 8, 9.

Table 9 Breakdown of questions topic
Fig. 6
figure 6

Rate of correct, erroneous, and answers without selected option for the three releases of ChatGPT, GPT-4, and Bard on domain specific questions of the FRS test (Introduction to surgical robotic systems)

Fig. 7
figure 7

Rate of correct, erroneous, and answers without option selected for the three releases of ChatGPT, GPT-4, and Bard on domain-specific questions of the FRS test (Clinical steps in robot-assisted surgery procedures)

Fig. 8
figure 8

Rate of correct, erroneous, and answers without selected option for the three releases of ChatGPT, GPT-4, and Bard on domain-specific questions of the FRS test (Psychomotor skills)

Fig. 9
figure 9

Rate of correct, erroneous, and answers without option selected for the three releases of ChatGPT, GPT-4, and Bard on domain-specific questions of the FRS test (Team training and communication skills)

GPT-4 outperformed all versions of ChatGPT in all domains by a substantial margin. It achieved the highest rate of correct answers on team training and communication skills (100.0%) vs. 85.7% for Bard, while ChatGPT ranged from 77.5 to 91.8%. On questions on the robotic system GPT-4 obtained 96.7% vs 73.0% for Bard, while the correct answers of ChatGPT ranged from 62.7 to 73.1%. On the clinical steps involved in a procedure of RAS, GPT-4 answered correctly on 84.6% of cases vs. 80.2% for Bard, whereas the correct answers of ChatGPT ranged from 53.8 to 79.1%. On psychomotor skills GPT-4 obtained 80.9% vs. 83.3% for Bard, while correct answers of ChatGPT ranged from 57.1 to 76.2%.

6 Discussion

6.1 Main findings

Launched in November 2022, ChatGPT is a generic LLM trained on information available on the Internet until the end of 2021. It was released free to users for testing and immediately generated viral interest, reaching 100 million users within the first two months, the fastest uptake of any consumer Internet app until the launch of Threads, the microblogging app by Meta, in July 2023.Footnote 10 Since then, new LLMs have been launched by giants like Google and Meta, by start-ups like Anthropic and Hippocratic AI, and by research groups.

In this systematic review, a comprehensive search strategy based on a wide array of search terms was applied to PubMed, Web of Science, Scopus, and arXiv. The choice of arXiv was motivated by the need to discover studies in preprint format that are missing from the other databases. Additionally, arXiv is used by an increasing number of research groups publishing their work on computer science applications, including LLMs. The literature search was strengthened by using the SPIDER tool, which was also used to formulate the research questions. The results of this review have shown that a generic LLM like GPT-4 is capable of passing national qualifying examinations such as the USMLE, as well as others with questions in languages other than English, using simple prompt engineering strategies like zero-shot and few-shot learning (Fang et al. 2023; Kasai et al. 2023; Nori et al. 2023; Takagi et al. 2023). It achieved a score slightly below the threshold only for the Korean National Licensing Examination for Korean Medicine Doctors (Jang et al. 2023). Novel LLMs designed specifically for the healthcare domain, namely Med-PaLM and Med-PaLM 2, passed the USMLE thanks to refined prompt engineering techniques like chain of thought, self-consistency, and ensemble refinement (Singhal et al. 2023a; Singhal et al. 2023b).

In the present study, we reported the performance over time of three different LLMs on a test for surgical education: ChatGPT and GPT-4 by OpenAI, and Bard by Google. They were assessed on the standardized cognitive questionnaire of the FRS, consisting of four knowledge domains, which has been adopted by an increasing number of surgical training and education centers in the United States and the European Union. In total, 2,156 LLM-generated answers were analyzed in the present study. As in the study by Antaki et al. on the Ophthalmic Knowledge Assessment Program examination, we prompted the questions in their original form, since this technique is the closest to human test-taking.

As in the recent study by Jang et al. on the Korean National Licensing Examination for Korean Medicine Doctors, we assessed the LLMs over multiple attempts to evaluate their consistency; in contrast with that study, we performed seven trials instead of five. As in the study by Mihalache et al. on the Ophthalmic Knowledge Assessment Program examination, we tested several ChatGPT releases (in our case five instead of two).

Although LLMs are expected to improve their performance over time, there is no published evidence quantifying the progress of LLMs in surgery and, more broadly, in the medical domain. In fact, in the study by Mihalache et al., the questions were prompted in different modalities for the two releases, namely as MCQs with the first and as open-ended questions with the second (Mihalache et al. 2023).

Our findings demonstrated that the mean performance of ChatGPT on the FRS test improved from 64.6% to 78.9% from the first to the fourth tested release, but unexpectedly dropped to 72.7% with the fifth version. In particular, ChatGPT was unable to pass the FRS test in any of the seven trials with the first two versions and the fifth one. In contrast, with the third and fourth releases it passed the FRS test, although not consistently.

The results of the present study confirmed that GPT-4 outperformed ChatGPT and Bard in every attempt, in agreement with the results of our systematic review (Ali et al. 2023a; Ali et al. 2023b; Angel et al. 2024; Giannos et al. 2023a; Huang et al. 2023; Jang et al. 2023; Kasai et al. 2023; Oh et al. 2023; Passby et al. 2023; Smith et al. 2023a, b). The performance difference between ChatGPT and GPT-4 is in agreement with the reports included in our systematic review on the National Medical Practitioners Qualifying Examination in Japan, the Korean National Licensing Examination for Korean Medicine Doctors, the UK Specialty Certificate Examination in Neurology, the American Board of Neurological Surgery board examination (both oral and written parts), the Korean general surgery board exams, the American College of Radiology Radiation Oncology in-training examination, the Australian College of Emergency Medicine examination, and the Specialty Certificate Examination in Dermatology (Ali et al. 2023a; Ali et al. 2023b; Giannos et al. 2023a; Huang et al. 2023; Jang et al. 2023; Kasai et al. 2023; Oh et al. 2023; Passby et al. 2023; Smith et al. 2023a, b).

Of the four domains on cognitive skills of FRS, all LLMs achieved the highest score on team training and communication skills, probably because this is a topic with a large amount of information publicly available online. Bard was the only one of the tested LLMs able to answer multimodal questions with mixed text and images. Although GPT-4 is equipped with the same functionality, unfortunately, it was not available at the time of testing.

LLMs learn the statistical patterns of language from massive datasets of online text. They may produce errors and misleading information, especially for technical topics on which they have been trained with only small datasets (Stokel-Walker et al. 2023). In this regard, the present study identified questions that were consistently answered erroneously: for example, in 24 out of 35 attempts (68.6%) on one MCQ, ChatGPT incorrectly responded that current RAS systems are semi-autonomous, performing part of an operation independently while the surgeon performs a different part. We believe that this may depend on some bias within the training dataset. It may well refer to a study published in 2016 reporting a prototype of an autonomous surgical robot performing anastomosis on animal tissue (Shademan et al. 2016); however, this robot has never been adopted for clinical practice on humans. GPT-4 answered incorrectly in all seven attempts on a question about errors committed during a knot-tying task. Bard provided an erroneous response in all seven trials to two questions on the ergonomics of the surgeon console and to one on communication skills.

Unfortunately, none of these LLMs provide references to support the generated answers, despite the consensus that the sources of information should always be verifiable (Stokel-Walker et al. 2023).

Even though the FRS test covers knowledge available before 2021 and ChatGPT was trained with data up to 2021, the findings of the present study demonstrated that three of the 2023 releases of ChatGPT were unable to pass the FRS test in any trial, and that the benchmark was reached only with the third and fourth versions.

At present, there are no studies assessing LLMs focused on the biomedical domain on the FRS test. Recent evidence has shown that LLMs designed specifically for healthcare, like Med-PaLM, Med-PaLM 2, and Med-PaLM M, outperformed PubMedGPT on the USMLE (Singhal et al. 2023a; Singhal et al. 2023b; Tu et al. 2023).

Overall, considering that ChatGPT, GPT-4, and Bard are generic LLMs, we believe that the scores observed in the present study on the FRS represent a remarkable result. The impressive performance of LLMs on competency examinations may contribute to the perception that artificial intelligence forays into healthcare will eventually devalue human intelligence (Nori et al. 2023). Additionally, the growing prowess of LLMs may influence decisions about medicine as a career path and, for medical students, their choice of specialty (Nori et al. 2023). According to a recent survey of 32 medical schools in the United States, artificial intelligence had a negative impact on the choice of radiology as a career path among medical students (Reeder et al. 2022).

6.2 Limitations

We acknowledge some limitations in the present work. In the systematic review, published studies in non-English languages were excluded. The paucity of studies on the same examination prevented a comparison of their results, except for those on the USMLE. Additionally, since official questions of medical examinations are generally not freely available, most studies used databases with surrogate questions.

The most important limitation of the comparative study on the FRS test was the inability to compare the scores of surgical trainees with the performance of the LLMs. Secondly, the FRS questions were submitted in their original form, without prompt engineering strategies to elicit some form of reasoning and increase the probability of the LLMs generating the correct answer. However, we selected this technique because it is the closest to human test-taking. We are aware of the importance of prompt engineering in guiding LLMs to generate more accurate output text; in a future study we will investigate its role in helping trainees prepare for surgical examinations.

6.3 Open challenges

The hype behind LLMs has led to unwarranted speculations on their potential to transform medical education at different stages, including preparation for medical examinations. Firstly, they may be used to conduct needs assessments, helping teachers to identify content gaps in education (Abd-Alrazaq et al. 2023). Secondly, they may develop measurable learning objectives and tailor the curriculum to meet the diverse needs of trainees. Thirdly, they may help instructors in preparing teaching materials (e.g., written simulated case reports and lecture content) (Lee 2023a). A recent study reported positively on the application of ChatGPT to simulate standardized patients (Liu et al. 2023d), supporting the belief that in the future LLMs may help design clinical scenarios for surgical simulations by integrating medical imaging, electronic health records, and virtual reality content. Furthermore, LLMs can play the role of tutors by providing trainees with real-time, customized feedback, identifying areas of strength and weakness, and offering targeted suggestions for improvement (Abd-Alrazaq et al. 2023; Lee et al. 2023a). This could be helpful in the self-study phase before taking a real examination. Alternatively, LLMs may be employed as mentors to explain difficult topics in simple terms, thus streamlining the education process for struggling trainees (Abd-Alrazaq et al. 2023).

However, there remain some challenges to be addressed, the most critical being ensuring the accuracy and reliability of the information generated by LLMs. Since they predict the probability distribution of text, the risk of getting misleading answers is significant, as highlighted by the present study. Due to their non-deterministic nature, the text generated by LLMs can change over time, leading to confusion in some scenarios, e.g., students obtaining different responses to the same question over time, or students within the same class obtaining different responses to the same question asked at the same time.

The present study identified different responses to the same questions of the FRS curriculum, especially for ChatGPT; in some instances, they were even contradictory. As a result, ChatGPT was unable to confirm proficiency on the FRS test after achieving it once. In contrast, Bard and GPT-4 showed lower variability on FRS testing. The present study indicates that these models do not currently possess the reliability required to act as mentors of trainees in complex subjects exemplified by surgery, despite the huge amount of information freely accessible on the Internet that might have been used to train them. Although the improvement in LLM performance demonstrated by our research supports the belief that their potential in the medical education sector is vast, human expertise and guidance remain essential. In essence, the take-home message of the present study is that human experts should always check and scrutinize AI-generated content before approval, to ensure the highest efficacy and reliability, before LLMs are integrated into future surgical education.

The introduction of LLMs in healthcare education might replicate the path traced by simulation, which, after initial skepticism as in the case of surgery, has become an integral part of training. Simulation has represented a paradigm shift in the training of healthcare professionals, allowing trainees to practice and commit errors in an environment that is risk-free for patient safety. For instance, in surgical simulation, trainees are allowed to commit errors and learn from them, in sharp contrast to actual surgery (Gallagher and O'Sullivan 2012). This is the main strength of surgical simulation and the main reason it has become an integral component of surgical training programs. However, it is imperative to validate surgical simulations in order to demonstrate their effectiveness (Zevin et al. 2012). Likewise, LLMs may become a new tool in the armamentarium of the next generation of medical trainees at the different stages of the learning process, including preparation for real examinations, provided that their validity is demonstrated. Guidelines on the use of LLMs in medical education should therefore be developed to ensure safety, reliability, efficacy, and privacy protection.

7 Conclusions

In this work, the authors presented the first systematic review of LLMs on medical examinations. The results have shown that GPT-4 passed several national qualifying examinations by a large margin, including the USMLE and others with questions in Chinese and Japanese, using zero-shot and few-shot learning. Med-PaLM 2 obtained similar scores on the USMLE using a more refined prompt engineering approach, ensemble refinement. GPT-4 outperformed ChatGPT and Bard on several national qualifying and medical specialty examinations, namely the National Medical Practitioners Qualifying Examination in Japan, the Korean National Licensing Examination for Korean Medicine Doctors, the UK Specialty Certificate Examination in Neurology, the American Board of Neurological Surgery board examination, the American Board of Anesthesiology examination, the Korean general surgery board exams, the Australian College of Emergency Medicine examination, and the Specialty Certificate Examination in Dermatology.

Our findings on the FRS test have shown that the performance of ChatGPT improved from the initial release, although this trend was reversed with the latest tested version. GPT-4 showed impressive performance, passing the FRS test and outperforming ChatGPT and Bard in all seven trials. The 95.4% of correct answers to the FRS questionnaire represents the highest score achieved by GPT-4 in a high-stakes examination in surgery. In comparison, Bard reached a maximum score of 81.8% on the FRS test.

Hence, it seems more than likely that LLMs will continue to improve their performance in medical examinations. In addition to collecting larger datasets with more up-to-date information and integrating search over the latest data available on the Internet, research should focus on improving RLHF to reduce the risk of generating harmful output, and on prompt engineering to improve the reasoning capabilities of LLMs so as to address challenging unmet needs in healthcare.