Introduction

In recent years, the field of artificial intelligence (AI) has witnessed rapid advancements, particularly in the domain of natural language processing (NLP) [1]. The development of advanced NLP models has revolutionized the way humans interact with computers, enabling machines to better understand and respond to complex linguistic inputs. As AI systems become increasingly intuitive and capable, they have the potential to transform a multitude of industries and improve the quality of life for millions of people worldwide [1].

The advent of ChatGPT, and specifically the GPT-4 architecture, has opened up a multitude of applications and research opportunities [2, 3]. GPT-4 has demonstrated superior capabilities in language processing and generation, substantially surpassing its predecessors in both capability and versatility [4, 5]. Its ability to process context, generate coherent and contextually relevant responses, and adapt to a wide range of tasks has made it an effective tool in numerous domains. As researchers and industries continue to explore the potential of GPT-4, its role in shaping the future of human–computer interaction becomes increasingly apparent.

Despite the attention that GPT-4-based ChatGPT has received for its superior performance compared with GPT-3 and GPT-3.5 [6], there is a significant gap in publications exploring its potential clinical applications, which others have claimed will revolutionize healthcare and improve patient outcomes [7,8,9]. This gap underscores the need for more in-depth investigation of ChatGPT's clinical capabilities, including its use as a diagnostic support tool or a source of second opinions.

To assess the clinical applicability of ChatGPT, we employed the New England Journal of Medicine (NEJM) quiz as a benchmark. This rigorous quiz, designed for healthcare professionals, tests the ability to analyze clinical scenarios, synthesize information, and make appropriate decisions. By analyzing ChatGPT's performance on the NEJM quiz, we sought to determine its potential to assist clinicians in their daily practice, contribute to the ever-growing field of AI-driven healthcare, and help transform the way healthcare professionals approach decision-making and patient care. This study is a preliminary examination of the usefulness of ChatGPT for differential diagnosis, aiming to determine the model's performance at healthcare question answering in well-defined question formats such as the NEJM quiz.

Materials and methods

Study design

In this study, our primary hypothesis was that ChatGPT, based on the GPT-4 architecture, could accurately evaluate and respond to the clinical scenarios presented in the NEJM quiz. We examined its potential clinical applications by using it as a tool for evaluating clinical scenarios and making appropriate diagnostic decisions. Because ChatGPT is currently unable to process images, images were not used as input. The requirement for informed consent was waived by the Ethical Committee of Osaka Metropolitan University Graduate School of Medicine because this study used only published material. All authors, including participating physicians, consented to the study. The study design was based on the Standards for Reporting Diagnostic Accuracy (STARD) guidelines, where applicable [10].

Data collection

The NEJM offers a weekly quiz called "Image Challenge" (https://www.nejm.org/image-challenge). Although the training data is not publicly available, ChatGPT was developed using data available up to September 2021. To account for the possibility that earlier NEJM quizzes may have been used for training, we collected quizzes published from October 2021 to March 2023. Each quiz consists of an image and clinical information, with readers selecting their answer from five candidate choices. Although images are undoubtedly important, many questions can be answered from the clinical information alone. Two author physicians read all the quizzes and commentaries and excluded questions that were impossible to answer without images; a third author physician was consulted when consensus could not be reached. We categorized the quiz types as Diagnosis, Finding, Treatment, Cause, and Other based on what the reader was asked to identify. Case commentaries for each quiz are featured on the "Images in Clinical Medicine" website, which displays tags for the specialty relevant to each case. These specialty tags were also extracted for our analysis.

Processes for input and output into the ChatGPT interface

We used the GPT-4-based ChatGPT (Mar 23 Version; OpenAI; https://chat.openai.com/). The quizzes were entered one case at a time. For each case, we first obtained ChatGPT's answer to the clinical information alone (Step 1: generate answer without choices). We then entered the answer choices and asked ChatGPT to choose one of them (Step 2: generate answer with choices). Two author physicians confirmed whether the answer generated by ChatGPT matched the ground truth; if there was a discrepancy, a third author physician made the decision. This confirmation step was introduced so that purely linguistic differences would not be counted as errors.
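The two-step procedure was carried out manually in the ChatGPT web interface. For readers who wish to script a similar workflow, the following is a minimal sketch against the OpenAI chat API; the use of the API rather than the web interface, the model identifier, and the prompt wording are assumptions for illustration and were not part of the study protocol.

```python
# Hypothetical sketch of the two-step prompting procedure via the OpenAI API.
# The study itself entered cases manually into the ChatGPT web interface;
# model name, prompts, and placeholder case text below are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(messages):
    """Send the running conversation and return the model's reply text."""
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content


clinical_text = "A 45-year-old man presents with ..."  # quiz text, images omitted
choices = ["A) ...", "B) ...", "C) ...", "D) ...", "E) ..."]

# Step 1: free-text answer generated from the clinical information alone.
history = [{"role": "user", "content": clinical_text}]
answer_without_choices = ask(history)

# Step 2: supply the five candidate choices in the same conversation
# and ask the model to select one of them.
history += [
    {"role": "assistant", "content": answer_without_choices},
    {"role": "user",
     "content": "Choose one of the following options:\n" + "\n".join(choices)},
]
answer_with_choices = ask(history)

print(answer_without_choices)
print(answer_with_choices)
```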

Statistical analysis

The percentage of correct responses generated by ChatGPT with and without candidate choices was evaluated by quiz type and specialty. We verified reproducibility by obtaining the responses again using the same prompts and comparing the results with Fisher's exact test for paired data and the chi-square test. We extracted the percentage of correct choices for each case from the NEJM Image Challenge website and compared it with ChatGPT's accuracy using Spearman's correlation analysis. Cases with a lower percentage of correct choices were considered more difficult for medical professionals, while those with a higher percentage were considered easier. A p-value below 0.05 was considered statistically significant. All analyses were performed using R (version 4.0.0, 2020; R Foundation for Statistical Computing; https://R-project.org).
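All analyses were performed in R; purely for illustration, the sketch below shows comparable tests in Python with scipy. The 2 × 2 counts are taken from the Results section, the per-question difficulty values are invented placeholders, and the simple contingency-table comparison does not reproduce the paired structure of the original Fisher's exact test.

```python
# Illustrative sketch of the reproducibility and difficulty analyses.
# The study used R; these scipy calls are assumed equivalents, and the
# per-question difficulty values below are placeholders, not study data.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, spearmanr

# Correct/incorrect counts for the initial run vs. the retest (with choices):
# 60/62 correct initially, 58/62 correct on retest.
table = np.array([[60, 2],
                  [58, 4]])
chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)
print(f"chi-square p = {p_chi2:.2f}, Fisher exact p = {p_fisher:.2f}")

# Perceived difficulty vs. ChatGPT accuracy: per-question human percent-correct
# (from the Image Challenge site) against a 0/1 indicator of whether ChatGPT
# answered correctly. Values here are illustrative only.
human_pct_correct = [0.42, 0.55, 0.61, 0.78, 0.83, 0.90]
chatgpt_correct = [1, 1, 0, 1, 1, 1]
rho, p_rho = spearmanr(human_pct_correct, chatgpt_correct)
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.2f}")
```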

Results

Evaluation

In our study, we assessed ChatGPT's performance on NEJM quiz questions spanning different question types and medical specialties. The results, summarized in Table 1, demonstrate varying levels of accuracy depending on the specific context. Eligibility is shown in Fig. 1. After excluding 16 quizzes that required images, ChatGPT correctly answered 87% (54/62) of the questions without candidate choices, and accuracy increased to 97% (60/62) when the choices were provided. Diagnosis was the best-performing category, although the other categories contained too few cases for reliable accuracy estimates. This is shown in Fig. 2.

Table 1 Accuracy summary
Fig. 1 Eligibility flowchart

Fig. 2 Results by answer type and specialty. These are the accuracy rates for various types and specialties of quizzes from the New England Journal of Medicine. The blue bar is the accuracy without choices and the green bar is the accuracy with choices. Dotted lines show total accuracy with and without choices.

Overall, ChatGPT performed well on the NEJM quiz across a range of medical specialties, and in most cases the model's accuracy improved when choices were given. Several specialties achieved 100% accuracy in both scenarios, while Genetics had the lowest accuracy at 67% (2/3) both with and without choices. Accuracy for a few specialties, including Otolaryngology, Allergy/Immunology, and Rheumatology, improved when choices were provided (Fig. 2). In assessing ChatGPT's reproducibility, the initial test yielded accuracies of 97% (60/62) and 87% (54/62) for tasks with and without choices, respectively, while the retest produced accuracies of 94% (58/62) and 84% (52/62). Chi-square tests showed no statistically significant differences between the two tests, with p-values of 0.5 and 0.69 for tasks with and without choices, respectively. No significant correlation was found between the percentage of correct choices and ChatGPT's accuracy, either with choices (r = −0.0034, p = 0.98) or without choices (r = 0.075, p = 0.52). As shown in Fig. 3, the percentage of correct choices by those who attempted the Image Challenge did not significantly correlate with ChatGPT's accuracy; ChatGPT maintained consistent performance regardless of the perceived difficulty.

Fig. 3 Relationship between perceived difficulty of quiz questions and ChatGPT's accuracy. The x-axis represents the percentages of correct choices made by the participants, grouped in quartiles, from most difficult (Q1) to easiest (Q4). The number of correct and total choices published on the New England Journal of Medicine Image Challenge website was used as a proxy for perceived difficulty. The y-axis represents the accuracy of ChatGPT's responses, both with choices (blue line) and without choices (green line). ChatGPT's accuracy remained consistent across the perceived difficulty quartiles of the quiz questions.

Discussion

Our study assessed ChatGPT's performance on the NEJM quiz, encompassing various medical specialties and question types. The sample size was relatively small, limiting the generalizability of the findings; nevertheless, the study provides a preliminary assessment of the potential clinical applications of GPT-4-based ChatGPT. Overall, ChatGPT achieved 87% accuracy without choices and 97% accuracy with choices after excluding image-dependent questions. When examining performance by quiz type, ChatGPT excelled in the Diagnosis category, achieving 89% accuracy without choices and 98% accuracy with choices. Although the other categories contained fewer cases, ChatGPT's performance remained consistent across the spectrum. ChatGPT exhibited high accuracy in most specialties; however, Genetics registered the lowest at 67%. This could be due to the amount of Genetics-related data available for training, or to the complexity and specificity of the language used in this field. While this analysis highlighted the potential for clinical applications of ChatGPT, it also revealed strengths and weaknesses, emphasizing the importance of understanding and leveraging these performance insights to optimize its use.

This is our initial investigation of the potential clinical applications of GPT-4-based ChatGPT, applied here to clinical decision-making quizzes, and it marks an important milestone. Our study highlights the novelty of assessing GPT-4-based ChatGPT's potential for clinical applications, specifically its ability to handle well-defined problems in the medical field, setting it apart from earlier research on GPT-3-based ChatGPT. Compared with GPT-3, GPT-4 demonstrates improved performance in processing complex linguistic inputs and generating contextually relevant responses, making it more suitable for specialized domains such as healthcare [2, 3]. A previous study applied GPT-3-based ChatGPT to the United States Medical Licensing Examination and found that it achieved 60% accuracy [11]. This outcome hinted at its potential for medical education and future incorporation into clinical decision-making. Another study evaluated the diagnostic accuracy of GPT-3-based ChatGPT in generating differential diagnosis lists for common clinical vignettes [12]. The results showed that it can generate diagnosis lists with good accuracy, but physicians still outperformed the AI chatbot (98.3% vs. 83.3%, p = 0.03).

The results of this study reveal that ChatGPT, based on the GPT-4 architecture, demonstrates promising potential in various aspects of healthcare. With an accuracy rate of 97% for answers with choices and 87% for answers without choices, ChatGPT has shown its capability in analyzing clinical scenarios and making appropriate diagnostic decisions. There is no evident correlation between the proportion of respondents choosing the correct answer, which is believed to reflect the difficulty of the quiz, and the accuracy of ChatGPT. This suggests that ChatGPT might be able to provide correct answers regardless of the question's difficulty. One key implication is the potential use of ChatGPT as a diagnostic support tool. Healthcare professionals may utilize ChatGPT to help them with differential diagnosis after taking into consideration its strengths and weaknesses as demonstrated in this study. By streamlining workflows and reducing cognitive burden, ChatGPT could enable more efficient and accurate decision-making [13, 14]. In addition to supporting diagnostic decisions, ChatGPT's performance on the NEJM quiz suggests that it could be a valuable resource for medical education [15,16,17,18,19,20]. By providing students, professionals, and patients with a dynamic and interactive learning tool, ChatGPT could enhance understanding and retention of medical knowledge.

This study has several limitations that should be considered when interpreting the results. Firstly, it focused solely on text-based clinical information, which might have affected ChatGPT's performance due to the absence of crucial visual data. The sample size was relatively small and limited to the NEJM quizzes, which may not fully represent the vast array of clinical scenarios encountered in real-world medical practice, limiting the generalizability of the findings. Additionally, the study did not evaluate the impact of ChatGPT's use on actual clinical outcomes, patient satisfaction, or healthcare provider workload, leaving the real-world implications of using ChatGPT in clinical practice uncertain. Another limitation is the absence of a comparative analysis with human performance on the same quiz. Lastly, potential biases in GPT-4’s training data, as well as potential clinician biases for or against AI-provided results, may lead to disparities in the quality and accuracy of AI-driven recommendations for specific clinical scenarios or populations [21].

In conclusion, this study demonstrates the potential of GPT-4-based ChatGPT for diagnosis by evaluating its performance on the NEJM quiz. While the results show promising accuracy rates, several limitations highlight the need for further research. Future studies should focus on expanding the range of clinical scenarios, assessing the impact of ChatGPT on actual clinical outcomes and healthcare provider workload, and exploring the performance of ChatGPT in diverse language settings and healthcare environments. Additionally, the importance of incorporating image analysis in future models should not be overlooked. By addressing these limitations and integrating image analysis, the potential of ChatGPT to revolutionize healthcare and improve patient outcomes can be more accurately understood and harnessed.