Abstract
This study evaluated how well language-generative AI based on large language models answers the Japanese medical physicist examination, with the aim of providing a benchmark of its knowledge of medical physics. We used questions from Japan’s 2018, 2019, 2020, 2021 and 2022 medical physicist board examinations, which covered various question types, including multiple-choice questions, and mainly focused on general medicine and medical physics. ChatGPT-3.5 and ChatGPT-4 (OpenAI) were used, and their answers were compared with the correct ones. The average accuracy rates were 42.2 ± 2.5% (ChatGPT-3.5) and 72.7 ± 2.6% (ChatGPT-4), showing that ChatGPT-4 was more accurate than ChatGPT-3.5 (p < 0.05 in all categories except radiation-related laws and recommendations/medical ethics). Even with ChatGPT-4, the more accurate model, accuracy was below 60% in two categories: radiation metrology (55.6%) and radiation-related laws and recommendations/medical ethics (40.0%). These data provide a benchmark of ChatGPT's knowledge of medical physics and can serve as basic data for developing medical physics tools that use ChatGPT (e.g., radiation therapy support tools with Japanese input).
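As an illustration of the scoring described in the abstract, the following minimal Python sketch compares model answers with an answer key for each exam year and then summarizes the per-year accuracy rates as mean ± standard deviation. It is not the authors' code. The abstract does not state how the ± values were computed, so averaging over exam years with the sample standard deviation is an assumption, as are the function names `accuracy` and `summarize`, the single-string answer format, and the toy data, which are hypothetical and not taken from the study.

```python
from statistics import mean, stdev


def accuracy(model_answers, answer_key):
    """Return the percentage of questions where the model answer matches the key."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must have the same length")
    correct = sum(m == k for m, k in zip(model_answers, answer_key))
    return 100.0 * correct / len(answer_key)


def summarize(per_year_accuracy):
    """Return (mean, sample standard deviation) of the per-year accuracy rates."""
    rates = list(per_year_accuracy.values())
    return mean(rates), stdev(rates)


# Hypothetical toy data (NOT the study's data): two exam years, four questions each.
answer_key = {
    2018: ["a", "c", "b", "e"],
    2019: ["d", "d", "a", "b"],
}
model_answers = {
    2018: ["a", "c", "d", "e"],  # 3 of 4 match the key
    2019: ["d", "a", "a", "c"],  # 2 of 4 match the key
}

# Score each year separately, then summarize across years.
per_year = {
    year: accuracy(model_answers[year], answer_key[year])
    for year in answer_key
}
avg, sd = summarize(per_year)
print(f"accuracy: {avg:.1f} ± {sd:.1f}%")
```

Scoring each year separately before averaging keeps one unusually long or short exam from dominating the summary. A pooled accuracy over all questions would be an equally reasonable choice if the source specified it.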
Data availability
The data used in this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request.
Acknowledgements
The authors are grateful to the Japanese Board for Medical Physicist Qualification for permission to use the exam questions.
Ethics declarations
Conflicts of interest
Inoue is an employee of Elith, Inc.
Ethical approval
There are no human subjects in this article and informed consent is not applicable.
About this article
Cite this article
Kadoya, N., Arai, K., Tanaka, S. et al. Assessing knowledge about medical physics in language-generative AI with large language model: using the medical physicist exam. Radiol Phys Technol (2024). https://doi.org/10.1007/s12194-024-00838-2
DOI: https://doi.org/10.1007/s12194-024-00838-2