Abstract
Purpose
Previous studies of ChatGPT performance in the field of medical examinations have reached contradictory results. Moreover, the performance of ChatGPT in other languages other than English is yet to be explored. We aim to study the performance of ChatGPT in Hebrew OBGYN-‘Shlav-Alef’ (Phase 1) examination.
Methods
A performance study was conducted using a consecutive sample of text-based multiple choice questions, originated from authentic Hebrew OBGYN-‘Shlav-Alef’ examinations in 2021–2022. We constructed 150 multiple choice questions from consecutive text-based-only original questions. We compared the performance of ChatGPT performance to the real-life actual performance of OBGYN residents who completed the tests in 2021–2022. We also compared ChatGTP Hebrew performance vs. previously published English medical tests.
Results
In 2021–2022, 27.8% of OBGYN residents failed the ‘Shlav-Alef’ examination and the mean score of the residents was 68.4. Overall, 150 authentic questions were evaluated (one examination). ChatGPT correctly answered 58 questions (38.7%) and reached a failed score. The performance of Hebrew ChatGPT was lower when compared to actual performance of residents: 38.7% vs. 68.4%, p < .001. In a comparison to ChatGPT performance in 9,091 English language questions in the field of medicine, the performance of Hebrew ChatGPT was lower (38.7% in Hebrew vs. 60.7% in English, p < .001).
Conclusions
ChatGPT answered correctly on less than 40% of Hebrew OBGYN resident examination questions. Residents cannot rely on ChatGPT for the preparation of this examination. Efforts should be made to improve ChatGPT performance in other languages besides English.
Similar content being viewed by others
Data availability
The data that support the findings of this study are available from the authors on a reasonable request.
References
Cox SM et al (1994) Assessment of the resident in-training examination in obstetrics and gynecology. Obstet Gynecol 84(6):1051–1054
Hollier LM et al (2002) Effect of a resident-created study guide on examination scores. Obstet Gynecol 99(1):95–100
Withiam-Leitch M, Olawaiye A (2008) Resident performance on the in-training and board examinations in obstetrics and gynecology: implications for the ACGME outcome project. Teach Learn Med 20(2):136–142
Association IM Residency information booklet. Available at: https://www.ima.org.il/internesnew/viewcategory.aspx?categoryid=7016#.UnoBaEoUGJA. Accessed 22 August 2023
Pekar Zlotin M et al (2022) Preparation for final board exam in obstetrics and gynecology following the outbreak of the COVID 19 pandemic. Harefuah 161(2):125–126
Soong TK, Ho CM (2021) Artificial Intelligence in medical OSCEs: reflections and future developments. Adv Med Educ Pract 12:167–173
van Dis EAM et al (2023) ChatGPT: five priorities for research. Nature 614(7947):224–226
ChatGPT, Available at: https://openai.com/blog/chatgpt. Accessed 22 August 2023
Arif TB, Munaf U, Ul-Haque I (2023) The future of medical education and research: Is ChatGPT a blessing or blight in disguise? Med Educ Online 28(1):2181052
Gilson A et al (2023) How Does ChatGPT Perform on the United States medical licensing examination? the implications of large language models for medical education and knowledge assessment. JMIR Med Educ 9:e45312
Kung TH et al (2023) Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2(2):e0000198
Humar P, et al (2023) ChatGPT is Equivalent to First Year Plastic Surgery Residents: Evaluation of ChatGPT on the Plastic Surgery In-Service Exam. Aesthet Surg J sjad130. https://doi.org/10.1093/asj/sjad130
Gupta R, et al (2023) Performance of ChatGPT on the Plastic Surgery Inservice Training Examination. Aesthet Surg J sjad128. https://doi.org/10.1093/asj/sjad128
Mihalache A, Popovic MM, Muni RH (2023) Performance of an artificial intelligence Chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. https://doi.org/10.1001/jamaophthalmol.2023.2754
Giannos P, Delardas O (2023) Performance of ChatGPT on UK standardized admission tests: insights from the BMAT, TMUA, LNAT, and TSA examinations. JMIR Med Educ 9:e47737
Nakhleh A, Spitzer S, Shehadeh N (2023) ChatGPT’s response to the diabetes knowledge questionnaire: implications for diabetes education. Diabetes Technol Ther 25(8):571–573
Subramani M, Jaleel I, Krishna Mohan S (2023) Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Adv Physiol Educ 47(2):270–271
Hopkins BS et al (2023) ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. J Neurosurg 1:8
Fijačko N et al (2023) Can ChatGPT Pass the life support exams without entering the american heart association course? Resuscitation 185:109732
Huh S (2023) Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof 20:1
Wang YM, Shen HW, Chen TJ (2023) Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc 86(7):653–658
Lum ZC (2023) Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT, Clin Orthop Relat Res
Suchman K, Garg S, Trindade AJ (2023) ChatGPT Fails the multiple-choice American college of gastroenterology self-assessment test. Am J Gastroenterol. https://doi.org/10.14309/ajg.0000000000002320
Birkett L, Fowler T, Pullen S (2023) Performance of ChatGPT on a primary FRCA multiple choice question bank. Br J Anaesth 131(2):e34–e35
Shay D et al (2023) Assessment of ChatGPT success with specialty medical knowledge using anaesthesiology board examination practice questions. Br J Anaesth 131(2):e31–e34
Bhayana R, Krishna S, Bleakney RR (2023) Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology 307(5):230582
Deng J, Lin Y (2023) The benefits and challenges of ChatGPT: an overview. Front Comput Intell Syst. 2(2):81–83
Levin G et al (2023) Identifying ChatGPT-written OBGYN abstracts using a simple tool. Am J Obstet Gynecol MFM 5(6):100936
Levin G et al (2023) ChatGPT-written OBGYN abstracts fool practitioners. Am J Obstet Gynecol MFM 5(8):100993
Biswas S (2023) ChatGPT and the future of medical writing. Radiology 307(2):223312
Funding
No funding or support was obtained for this study.
Author information
Authors and Affiliations
Contributions
GL, AC and RA: project development, data collection and management, data analysis, manuscript writing/editing. NL, RM and YB: data collection and management, manuscript writing/editing. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
We used American Association for Public Opinion Research (AAPOR) reporting guidelines. This study did not require ethics approval, as we used only publicly accessible data and no human participants were involved.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cohen, A., Alter, R., Lessans, N. et al. Performance of ChatGPT in Israeli Hebrew OBGYN national residency examinations. Arch Gynecol Obstet 308, 1797–1802 (2023). https://doi.org/10.1007/s00404-023-07185-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00404-023-07185-4