Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students

This study evaluates the proficiency of ChatGPT-4 across various medical specialties and assesses its potential as a study tool for medical students preparing for the United States Medical Licensing Examination (USMLE) Step 2 and related clinical subject exams. ChatGPT-4 answered board-level questions with 89% accuracy but showed significant discrepancies in performance across specialties. Although it excelled in psychiatry, neurology, and obstetrics & gynecology, it underperformed in pediatrics, emergency medicine, and family medicine. These variations may be attributed to the depth and recency of training data as well as the scope of the specialties assessed. Specialties with significant interdisciplinary overlap had lower performance, suggesting that complex clinical scenarios pose a challenge to the AI. The overall efficacy of ChatGPT-4 indicates a promising supplemental role in medical education, but performance inconsistencies across specialties in the current version lead us to recommend that medical students use AI with caution.


Introduction
Artificial intelligence (AI) has seen a substantial uptick in utilization for medical education since the release of ChatGPT and other large language models (LLMs) [1][2][3]. As medical school curricula continue to incorporate new innovations and changes to clinical recommendations, AI has helped meet a demand for more efficient and effective learning resources [4][5]. In particular, medical students have been exploring the potential of LLMs to reinforce learned information, provide clarification on complex clinical topics, and aid in preparation for tests such as the United States Medical Licensing Examination (USMLE) series and related subject exams [1,6]. Initial studies have demonstrated that ChatGPT-4 can process board-level exam questions and provide useful clinical insights [7][8][9][10][11][12]. No studies, however, have stratified these capabilities within the specialty-specific domains appearing on the USMLE Step 2 examination and associated clinical subject exams. This study addresses this gap with an assessment of ChatGPT-4 performance on questions derived from the following specialties: Internal Medicine, Surgery, Family Medicine, Ambulatory Medicine, Emergency Medicine, Obstetrics and Gynecology, Pediatrics, Neurology, and Psychiatry. Primary objectives include an evaluation and comparison of accuracy across specialty domains. Secondary objectives include comparison to student performance and determination of whether the AI's understanding is authentic or a result of random correct answer selection. Furthermore, we evaluate whether ChatGPT-4 should be recommended as a study tool in its current version for medical students preparing for USMLE Step 2 and related subject examinations.

Question Acquisition
Authors were provided access to utilize questions from AMBOSS for the purpose of this study. AMBOSS is a comprehensive medical education platform that serves as a reference database for medical topics and offers question banks for various medical exams. The custom session interface within the question bank section allows users to extract practice questions from particular specialties and examinations.
For our analysis, questions were extracted from the "Clinical Shelf Exam" section. This section comprises nine specialties: Internal Medicine, Surgery, Pediatrics, Obstetrics & Gynecology, Neurology, Psychiatry, Family Medicine, Emergency Medicine, and Ambulatory Medicine. Each specialty was toggled "on" individually, and from each, 100 questions were randomly selected for a total of 900 questions.

Conversation Input
Each question, along with its provided multiple-choice answer options, was individually entered into the ChatGPT-4 interface. The output was assessed for accuracy by comparing it to the correct answer choice provided by the AMBOSS question bank. Additional variables included the number of multiple-choice options provided by each question and the percentage of students who correctly answered the question. After each question, the ChatGPT input and output conversations were deleted, and a new conversation interface was initiated. This was done to avoid the possibility of feedback knowledge from prior questions, which could potentially influence ChatGPT's algorithmic reasoning in subsequent entries.
Questions that included media interpretation, such as diagnostic imaging, electrocardiograms, or lesion appearances, were excluded because ChatGPT could not process images at the time of our analysis.
Questions that incorporated all other modalities, such as text or tables with lab results, were included. If ChatGPT did not provide a conclusive answer to a question or provided multiple answer choices, the question was omitted.
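The per-question protocol described above (a fresh, context-free conversation for each item, with inconclusive responses excluded) can be sketched programmatically. This is a hypothetical illustration only: the study used the ChatGPT web interface manually, and the `ask_model` callable and question-dictionary layout here are assumptions, not the authors' actual tooling.

```python
def grade_questions(questions, ask_model):
    """Grade each question via an independent, stateless call so no
    conversational context carries over between items (mirroring the
    study's practice of starting a fresh chat per question).

    questions: list of dicts with "stem", "options" (label -> text), "key".
    ask_model: callable taking a prompt string and returning a single
               answer letter, or None if the model is inconclusive.
    """
    results = []
    for q in questions:
        # Build the prompt from the stem plus labeled answer options.
        prompt = q["stem"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in q["options"].items()
        )
        answer = ask_model(prompt)  # one fresh, context-free call per item
        if answer is None or len(answer) != 1:
            # Inconclusive output or multiple picks: omit, per the methods.
            results.append("omitted")
        else:
            results.append("correct" if answer == q["key"] else "incorrect")
    return results
```

A stub standing in for the model makes the bookkeeping easy to check: `grade_questions(qs, lambda p: "A")` marks each item correct or incorrect against its key, and a `None` return marks it omitted.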

Statistical Analysis
IBM SPSS 29 was used for all statistical analyses. Accuracy percentages for each specialty and for the total question set were determined. Differences in accuracy across specialties were assessed by unpaired chi-square tests. Additional unpaired chi-square tests were then conducted to compare the accuracy of individual specialties to one another. The mean accuracy for student performance was determined for each specialty and for the total question set. Additionally, accuracy was compared to the number of multiple-choice options available using unpaired chi-square tests.
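As a worked illustration of the pairwise specialty comparisons described above, the Pearson chi-square statistic for a 2×2 contingency table (correct vs. incorrect counts for two specialties) can be computed as follows. The counts used here are hypothetical, and the study itself ran these tests in IBM SPSS 29, not Python; this sketch only shows the arithmetic behind such a comparison.

```python
def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square statistic for a 2x2 contingency table
    comparing two proportions (e.g. per-specialty accuracy)."""
    wrong_a, wrong_b = total_a - correct_a, total_b - correct_b
    n = total_a + total_b
    col_correct = correct_a + correct_b
    col_wrong = wrong_a + wrong_b
    stat = 0.0
    # Sum (observed - expected)^2 / expected over all four cells,
    # with expected = (row total * column total) / grand total.
    for obs, row_total, col_total in [
        (correct_a, total_a, col_correct),
        (wrong_a, total_a, col_wrong),
        (correct_b, total_b, col_correct),
        (wrong_b, total_b, col_wrong),
    ]:
        expected = row_total * col_total / n
        stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical example: 94/100 correct in one specialty vs. 78/100 in another.
stat = chi_square_2x2(94, 100, 78, 100)
significant = stat > 3.841  # chi-square critical value at alpha = 0.05, df = 1
```

With these illustrative counts the statistic exceeds the 3.841 critical value, so the two accuracies would differ significantly at the 0.05 level. In practice, `scipy.stats.chi2_contingency` provides the same test with an exact p-value.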

Discussion
ChatGPT-4 was found to be proficient in answering clinical subject exam questions, with an overall accuracy of 89%. Further investigation, though, revealed specialty-specific performance discrepancies. This is of particular importance to medical students considering the use of AI-based tools to enhance their preparation for both shelf subject and clinical knowledge board examinations.
ChatGPT-4 performed impressively in psychiatry, neurology, and obstetrics and gynecology, but its accuracy was notably lower in pediatrics, emergency medicine, and family medicine. Several factors should be considered to explain this performance variation. Inherently, performance is influenced by the comprehensiveness and timeframe of the data used to train the model.
For example, responses for specialties that are more represented in the data would be expected to be more accurate than those that are not. Furthermore, because the training data for ChatGPT-4 was extracted in 2021, specialties that have changed their recommendations for medical conditions over the past two years would be expected to yield outdated AI responses [13]. Although the depth of the training data remains a point of discussion, it was encouraging to note no variation in accuracy based on the number of multiple-choice options provided. This suggests that the AI's responses are authentic and not random answer selections.
Building upon this principle, we observed that the specialties with the lowest AI performance were those with significant interdisciplinary overlap in their questions. For instance, family medicine, emergency medicine, and pediatrics, which frequently feature complex, multifaceted clinical scenarios requiring intricate clinical reasoning skills, displayed the lowest performance. Conversely, psychiatry, obstetrics and gynecology, and neurology had the narrowest scope and showcased the best AI performance among the fields we assessed.
Given these findings, one can infer the broader implications for the application of AI in medical education.
Although the current version of ChatGPT displays variable performance across specialties, its overall efficacy suggests immense potential as an adjunct in medical school curricula. However, given these observed inconsistencies, it is crucial for medical students, especially those in the clinical phase of their studies, to exercise caution when incorporating AI into their studies. As we make this recommendation, it remains pivotal to consider the limitations of our study methods. A notable constraint was our question selection: not all topics within each specialty were examined, and image-based questions were not assessed. Furthermore, our study did not assess the specific characteristics of the questions on which ChatGPT-4 faltered. Future studies should take these considerations into account to further enhance recommendations for medical students who intend to use AI to support their education.

Conclusion
LLMs like ChatGPT-4 hold promise for the future landscape of medical education. Nonetheless, our findings emphasize the need for caution. At this time, medical students should be aware of the model's specialty-specific strengths and weaknesses and take a well-rounded approach to their clinical curriculum.

Statements and Declarations
No funding was received for this study. The authors have no financial or non-financial interests to disclose.

Data Availability
Data obtained and analyzed in this study are available from the corresponding author upon reasonable request.

Figures
Figures 1 and 2 demonstrate an example of a ChatGPT input and output conversation. This sample question was obtained from NBME's free online USMLE Step 2 Orientation.