Introduction

As artificial intelligence (AI) continues to improve, AI models are being rapidly introduced and integrated into medicine. Recent studies have demonstrated the potential benefits of ChatGPT in medical education, including our previous study, which found that ChatGPT answered 51% of 150 Otolaryngology board-style questions correctly [1, 2]. In March 2023, OpenAI released GPT-4, which improved upon ChatGPT with increased reliability and expanded capabilities, including the ability to design user-specified, customized models. This study assesses the performance of standard and customized GPT-4 models on 150 Otolaryngology board-style questions, compared with previous ChatGPT versions.

Methods

The same 150 Otolaryngology board-style questions used in our previous study of GPT-3.5 were obtained from BoardVitals (https://www.boardvitals.com/), spanning ten topics and three difficulty levels [2]. These questions were input into standard GPT-4 (https://openai.com/gpt-4) and a custom GPT-4 model (https://chat.openai.com/g/g-PyDG5N7Ko-ent-expert-mcq-solver). Using the built-in model customization interface, the custom GPT-4 model was instructed to specialize in Otolaryngology board-style questions, emphasizing precision, selecting a single answer, providing evidence-based explanations, and validating answers using the internet. Independent samples t-tests and multivariable binary logistic regression were performed with SPSS version 25 (IBM).
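As an illustration only, the same analysis could be reproduced outside of SPSS with a short script; the sketch below uses Python with scipy and statsmodels, and the file name, dataset layout, and column names (e.g., answered_correctly, response_length, model_type) are hypothetical rather than taken from our data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical per-question dataset: one row per question per GPT-4 model
df = pd.read_csv("board_questions.csv")

# Independent samples t-test, e.g., GPT-4 response length for correct vs. incorrect answers
correct = df.loc[df["answered_correctly"] == 1, "response_length"]
incorrect = df.loc[df["answered_correctly"] == 0, "response_length"]
t_stat, p_value = stats.ttest_ind(correct, incorrect)

# Multivariable binary logistic regression predicting a correct answer,
# adjusting for the covariates listed in the text
model = smf.logit(
    "answered_correctly ~ C(subject) + C(difficulty) + question_length"
    " + answer_length + percent_trainees_correct + C(model_type) + response_length",
    data=df,
).fit()

# Adjusted odds ratios with 95% confidence intervals
aor = np.exp(model.params)
aor_ci = np.exp(model.conf_int())
print(pd.concat([aor.rename("aOR"), aor_ci.rename(columns={0: "CI 2.5%", 1: "CI 97.5%"})], axis=1))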

Results

Of the 150 board-style questions assessed, standard GPT-4 answered 108 (72.0%) correctly and custom GPT-4 answered 122 (81.3%) correctly (Table 1). Both standard GPT-4 (90.0% vs. 46.0%) and custom GPT-4 (98.0% vs. 62.0%) showed a decline in performance from “easy” to “hard” questions (P < 0.001). Standard and custom GPT-4 selected the answer option most commonly chosen by Otolaryngology trainees for 111 (74.0%) and 113 (73.5%) questions, respectively. On multivariable analysis adjusting for subject, difficulty, question length, answer length, percentage of trainees answering correctly, standard vs. custom GPT-4, and GPT-4 response length, custom GPT-4 (adjusted odds ratio [aOR] 2.19, 95% confidence interval [CI] 1.16–4.11, P = 0.015) and the plastic and reconstructive subject category (aOR 7.41, 95% CI 1.44–38.05, P = 0.016) remained associated with answering correctly.

Table 1 Characteristics of 150 multiple-choice questions

For standard GPT-4, questions answered correctly and incorrectly differed in mean correct answer option length (33 vs. 50 characters, P = 0.016) and GPT-4 response length (1,204 vs. 1,565 characters, P < 0.001), but not in mean question length (251 vs. 229 characters, P = 0.502). For custom GPT-4, mean question length (248 vs. 229 characters), correct answer option length (37 vs. 42 characters), and GPT-4 response length (1,460 vs. 1,449 characters) were similar between questions answered correctly and incorrectly.

Discussion

Overall, our study demonstrated improved performance by standard and custom GPT-4, with more correct answers on ‘easy,’ longer, and plastic and reconstructive questions. Standard GPT-4 (72.0%) and custom GPT-4 (81.3%) demonstrated higher accuracy than GPT-3.5 on the same 150 questions included in our previous study (51.3%) and on similar Otolaryngology board-style questions (53%) [2, 3]. Custom GPT-4 also outperformed Otolaryngology trainees, who averaged 72.7% accuracy. These findings align with recent studies showing GPT-4 outperforming GPT-3.5 on board-style questions for the United States Medical Licensing Examination, Plastic Surgery Inservice Training Examination, and National Board of Medical Examiners Surgery Subject Examination, demonstrating the broad applicability of AI in medical education [4,5,6].

The higher accuracy demonstrated by custom GPT-4 may be attributable to its instructions to specialize in Otolaryngology board-style questions, select a single answer, and validate answers using the internet. Whereas custom GPT-4 always selected one answer, standard GPT-4 selected multiple answers or no answer in 8.7% of questions. Custom models may therefore enhance the utility of ChatGPT in medical education. There is, however, a risk of intellectual dependency on AI and a perceived decrease in the need to learn complex pathophysiology. Nonetheless, the present state of AI still requires biomedical knowledge of diseases and critical thinking to appraise newly developed AI systems [7].

This study has several limitations, including the lack of repeated trials to account for variance in model output, as GPT-4 may provide a different response to each query. Questions with images were excluded because GPT-3.5 is limited to text input. The subject matter of our 150 board-style questions may not generalize to other fields. Finally, the functionality of GPT-4 may be limited in certain environments by the requirement for internet access.

Conclusion

Our study found performance improvements of GPT-4 over GPT-3.5 on Otolaryngology board-style questions. The customization capabilities of GPT-4 allowed us to create a model specializing in Otolaryngology board-style questions, which demonstrated higher accuracy than standard GPT-4. It is important to heed the risk of intellectual dependency on AI and to approach AI models with critical thinking and a foundation of knowledge. Future studies should explore the benefits conferred to Otolaryngology trainees who utilize AI for medical education. With the ability to interact with users, provide explanations, and adjust to user customizations, AI-based text models may continue to improve as tools for medical education.