Introduction

As artificial intelligence (AI) continues to improve, AI models are being rapidly introduced and integrated into medicine. Recent studies have demonstrated the potential benefits of ChatGPT in medical education, including our previous study, which found that ChatGPT answered 51% of 150 Otolaryngology board-style questions correctly [1, 2]. In March 2023, OpenAI released GPT-4, which improved upon ChatGPT with increased reliability and expanded capabilities, including the ability to design user-specified, customized models. This study assesses the performance of standard and customized GPT-4 models on 150 Otolaryngology board-style questions, compared with previous ChatGPT versions.

Methods

The same 150 Otolaryngology board-style questions used in our previous study of GPT-3.5 were obtained from BoardVitals (https://www.boardvitals.com/), spanning ten topics and three difficulty levels [2]. These questions were input into standard GPT-4 (https://openai.com/gpt-4) and a custom GPT-4 model (https://chat.openai.com/g/g-PyDG5N7Ko-ent-expert-mcq-solver). Using the built-in model customization interface, the custom GPT-4 model was instructed to specialize in Otolaryngology board-style questions, emphasizing precision, selecting a single answer, providing evidence-based explanations, and validating answers using the internet. Independent samples t-tests and multivariable binary logistic regression were performed with SPSS version 25 (IBM).
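As an illustration only, the same analysis could be reproduced outside of SPSS with a short script; the sketch below uses Python with scipy and statsmodels, and the file name, dataset layout, and column names (e.g., answered_correctly, response_length, model_type) are hypothetical rather than taken from our data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical per-question dataset: one row per question per GPT-4 model
df = pd.read_csv("board_questions.csv")

# Independent samples t-test, e.g., GPT-4 response length for correct vs. incorrect answers
correct = df.loc[df["answered_correctly"] == 1, "response_length"]
incorrect = df.loc[df["answered_correctly"] == 0, "response_length"]
t_stat, p_value = stats.ttest_ind(correct, incorrect)

# Multivariable binary logistic regression predicting a correct answer,
# adjusting for the covariates listed in the text
model = smf.logit(
    "answered_correctly ~ C(subject) + C(difficulty) + question_length"
    " + answer_length + percent_trainees_correct + C(model_type) + response_length",
    data=df,
).fit()

# Adjusted odds ratios with 95% confidence intervals
aor = np.exp(model.params)
aor_ci = np.exp(model.conf_int())
print(pd.concat([aor.rename("aOR"), aor_ci.rename(columns={0: "CI 2.5%", 1: "CI 97.5%"})], axis=1))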

Results

Of the 150 board-style questions assessed, standard GPT-4 answered 108 (72.0%) correctly and custom GPT-4 answered 122 (81.3%) correctly (Table 1). Both standard GPT-4 (90.0% vs. 46.0%) and custom GPT-4 (98.0% vs. 62.0%) showed a decline in performance from “easy” to “hard” questions (P < 0.001). Standard and custom GPT-4 selected the answer option most commonly chosen by Otolaryngology trainees for 111 (74.0%) and 113 (73.5%) questions, respectively. On multivariable analysis adjusting for subject, difficulty, question length, answer length, percentage of trainees answering correctly, standard vs. custom GPT-4, and GPT-4 response length, custom GPT-4 (adjusted odds ratio [aOR] 2.19, 95% confidence interval [CI] 1.16–4.11, P = 0.015) and the plastic and reconstructive subject category (aOR 7.41, 95% CI 1.44–38.05, P = 0.016) remained associated with answering correctly.

Table 1 Characteristics of 150 multiple-choice questions

For standard GPT-4, questions answered correctly and incorrectly differed in mean correct answer option length (33 vs. 50 characters, P = 0.016) and GPT-4 response length (1,204 vs. 1,565 characters, P < 0.001), but not in mean question length (251 vs. 229 characters, P = 0.502). For custom GPT-4, mean question length (248 vs. 229 characters), correct answer option length (37 vs. 42 characters), and GPT-4 response length (1,460 vs. 1,449 characters) were similar between questions answered correctly and incorrectly.

Discussion

Overall, our study demonstrated improved performance by standard and custom GPT-4, with more correct answers on ‘easy,’ longer, and plastic and reconstructive questions. Standard GPT-4 (72.0%) and custom GPT-4 (81.3%) demonstrated higher accuracy than GPT-3.5 on the same 150 questions included in our previous study (51.3%) and on similar Otolaryngology board-style questions (53%) [2, 3]. Custom GPT-4 also outperformed Otolaryngology trainees, who averaged 72.7% accuracy. These findings align with recent studies showing GPT-4 outperforming GPT-3.5 on board-style questions for the United States Medical Licensing Examination, Plastic Surgery Inservice Training Examination, and National Board of Medical Examiners Surgery Subject Examination, demonstrating the broad applicability of AI in medical education [4,5,6].

The higher accuracy demonstrated by custom GPT-4 may be attributable to its instructions to specialize in Otolaryngology board-style questions, select a single answer, and validate answers using the internet. Whereas custom GPT-4 always selected one answer, standard GPT-4 selected multiple answers or no answer in 8.7% of questions. Custom models may therefore enhance the utility of ChatGPT in medical education. There is, however, a risk of intellectual dependency on AI and a perceived decrease in the need to learn complex pathophysiology. Nonetheless, the present state of AI still requires biomedical knowledge of diseases and critical thinking to appraise newly developed AI systems [7].

This study has several limitations, including the lack of repeated trials to account for variance in model output, as GPT-4 may provide a different response to each query. Questions with images were excluded because GPT-3.5 is limited to text input. The subject matter of our 150 board-style questions may not generalize to other fields. Finally, the functionality of GPT-4 may be limited in certain environments by the requirement for internet access.

Conclusion

Our study found performance improvements of GPT-4 over GPT-3.5 on Otolaryngology board-style questions. The customization capabilities of GPT-4 allowed us to create a model specializing in Otolaryngology board-style questions, which demonstrated higher accuracy than standard GPT-4. It is important to heed the risk of intellectual dependency on AI and to approach AI models with critical thinking and a foundation of knowledge. Future studies should explore the benefits conferred to Otolaryngology trainees who utilize AI for medical education. With the ability to interact with users, provide explanations, and adjust to user customizations, AI-based text models may continue to improve as tools for medical education.