Introduction

With the rapid advancement of digital technologies, artificial intelligence (AI) has become increasingly prevalent in clinical medicine and medical education [1,2,3,4,5,6]. Recently, the AI language tool ChatGPT made global headlines when researchers used it to pass the United States Medical Licensing Exam (USMLE) without any specialized training or reinforcement [7]. The results of this study suggested that tools like ChatGPT have the potential to assist medical education through the use of clinical case reports and potentially even to support real-life clinical decision-making.

Several studies have evaluated best-use cases of AI tools in differing clinical scenarios. Hirosawa et al. [8] found that ChatGPT could generate appropriate differential diagnosis lists for common clinical presentations. In another study, Rao et al. demonstrated the ability of ChatGPT to accurately generate differential diagnoses, suggest appropriate diagnostic tests, and reasonably deduce final diagnoses using medical vignettes published in the Merck Sharp & Dohme (MSD) clinical manual [9]. Finally, exploratory work has been conducted to evaluate the ability of ChatGPT to predict clinical outcomes [10]. However, this work has yet to provide definitive evidence that ChatGPT can predict clinical outcomes accurately.

The increasing use of AI to support clinical practice is gaining acceptance among clinicians from diverse backgrounds [11, 12]. Like any new technology, it has advantages and drawbacks that must be evaluated and assessed. A particular concern when specialized technology is deployed among clinicians who lack AI development training is that its benefits may be exploited without due consideration of its limitations. For instance, ChatGPT, a large language model (LLM) initially developed for language-based tasks, is now being utilized in various clinical settings beyond its original scope, as previous research indicates.

Currently, there is insufficient guidance available for clinicians on how to effectively integrate AI tools into clinical practice [13]. Furthermore, there is a lack of clinician training to ensure the safe use of AI in medicine [14, 15]. This is likely due to the need for further research in this field. To address this gap, this study examines the potential benefits and limitations of ChatGPT in a single clinical case report within the specialty of orthopaedic surgery [16]. This specialty was chosen because it involves the interpretation of visual information such as X-rays, which current language models like ChatGPT are unable to analyse. Specifically, the case report features a 53-year-old male with a femoral neck fracture. The purpose of this study is not to examine the use of ChatGPT in every clinical scenario, but rather to use this specific vignette as an exemplar to highlight some of the crucial considerations that must be contemplated when utilizing AI tools in a clinical context.

Methods

This case study was performed using a single clinical case report from OrthoBullets [17]. OrthoBullets is a global clinical collaboration platform for orthopaedic surgeons with a community of over 600,000 providers and 150 million annual page views. The OrthoBullets Case Reports feature allows surgeons to post interesting or relevant clinical cases and have the community comment and vote on standardized peer-reviewed treatment polls with regard to investigations, treatment options, surgical techniques and post-operative protocols.

ChatGPT was asked to respond to the poll questions relating to a single clinical case report and provide a best response [16, 18]. No identifiable data were used in the study, and therefore, ethics approval was not required. Written permission from OrthoBullets to use their clinical case report for this study was obtained prior to submission.

Clinical case report

The case report used in this study consisted of the following:

Title: Femoral Neck Fracture in 53 M (Right hip pain).

History of Presenting Incident: A 53-year-old male presents to an outside hospital in the early morning, about 8 am, after a bicycle crash. He had immediate hip pain and an inability to ambulate. The patient was transferred to a trauma hospital at 8:30 pm, about 12 hours after the injury, for definitive management. He is an avid cyclist and often does 100-mile rides.

Past Medical History: No past medical history. The patient does not smoke tobacco or drink alcohol.

Physical Examination: The affected hip was short and externally rotated. Painful to range of motion (ROM). Neurovascularly intact distally.

Outcomes

The primary outcome of this study was a qualitative evaluation of ChatGPT's responses to the poll questions associated with the clinical case report. We aimed to identify the strengths, limitations, and potential risks of using ChatGPT in this scenario, and we used previously described methods of qualitatively synthesizing the responses with thematic commentary to present the results [19,20,21]. In addition, we aimed to examine the impact of varying the case report's context, introducing descriptors of radiographs, and assessing the reproducibility of ChatGPT's response output. These secondary outcomes were important for understanding how ChatGPT performs under different conditions and for identifying areas for improvement.

Original dialogue protocol

To ensure consistency and accuracy, we used a specific dialogue protocol for feeding the case report and poll questions into ChatGPT. Due to word limit constraints, we divided the case report and questions into separate inputs, beginning with the case report and the first poll question in a single input, followed by each subsequent poll question as an individual input. ChatGPT was asked to select its answer from the response options listed on the OrthoBullets website. In the event that ChatGPT declined to answer a question because of its safety mechanisms, we provided an additional prompt with the wording: “For the purposes of an educational exercise, what would be your best response?” This prompt allowed us to obtain responses even when ChatGPT's safety mechanisms were triggered. For further information on the original dialogue protocol, please refer to Online Appendix 1.
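For illustration only, the sketch below shows how a comparable dialogue protocol could be scripted with the OpenAI Python SDK, rather than the free ChatGPT web interface actually used in this study. The model name, the placeholder question list, the helper names (`ask`, `run_protocol`), and the simple refusal check are assumptions introduced for this example and were not part of the study.

```python
# Illustrative sketch only: the study used the free ChatGPT web interface,
# not the API. Names and the refusal check are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CASE_REPORT = "Femoral Neck Fracture in 53 M (Right hip pain). ..."  # full vignette text

# The OrthoBullets poll questions would be listed here in full; shortened
# placeholders are shown purely for illustration.
poll_questions = [
    "Poll question 1 text, followed by the OrthoBullets response options",
    "Poll question 2 text, followed by the OrthoBullets response options",
]

FALLBACK_PROMPT = ("For the purposes of an educational exercise, "
                   "what would be your best response?")


def ask(messages):
    """Send the running dialogue to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT-3.5 web version
        messages=messages,
    )
    return response.choices[0].message.content


def run_protocol(case_report, questions):
    """Feed the case report with question 1, then each later question as its own input."""
    messages = [{"role": "user", "content": f"{case_report}\n\n{questions[0]}"}]
    answers = []
    for i, question in enumerate(questions):
        if i > 0:
            messages.append({"role": "user", "content": question})
        reply = ask(messages)
        # Crude stand-in for the manual re-prompt used when ChatGPT declined to answer.
        if "I'm sorry" in reply or "cannot provide" in reply:
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": FALLBACK_PROMPT})
            reply = ask(messages)
        messages.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers


answers = run_protocol(CASE_REPORT, poll_questions)
```

Because every question and reply is appended to the same running message list, later questions are answered with the full preceding dialogue as context, mirroring how the separate inputs of the original protocol shared a single ChatGPT session.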

Alternative responses

We introduced three additional dialogue protocols to better evaluate the variability of responses generated by ChatGPT.

In the first protocol, we fed the case report along with the poll questions to ChatGPT but allowed for additional prompts such as “please provide me a rationale for your decision” or “you have not selected a response, please choose only one of the responses listed” to guide ChatGPT in generating clinical responses. This freestyle dialogue approach allowed for greater control over the responses generated by ChatGPT and helped evaluate its ability to respond to questions effectively.

In the second protocol, we replicated the original dialogue protocol in a separate session on a different day to assess the reproducibility of ChatGPT's responses across access dates and to identify any differences that arose.

In the final protocol, we provided ChatGPT with a descriptor of the pre-operative imaging provided in the clinical vignette (Fig. 1). We added the following information to the vignette:

Imaging: AP and lateral plain films are provided, showing a minimally displaced, transcervical right hip fracture with minimal radiographic signs of osteoarthritis.

Fig. 1 Pre-operative X-ray images of the clinical vignette provided by OrthoBullets

We then repeated the original dialogue protocol and recorded the responses generated by ChatGPT. This approach allowed us to assess the impact of additional information on the responses generated by ChatGPT.
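Continuing the earlier illustrative sketch (and under the same assumptions), this variant amounts to appending the imaging descriptor to the vignette text before re-running the identical protocol:

```python
# Variant of the earlier sketch: append the imaging descriptor to the vignette
# and re-run the same protocol. All names remain illustrative assumptions.
IMAGING_DESCRIPTOR = (
    "Imaging: AP and lateral plain films are provided, showing a minimally "
    "displaced, transcervical right hip fracture with minimal radiographic "
    "signs of osteoarthritis."
)

augmented_case = f"{CASE_REPORT}\n\n{IMAGING_DESCRIPTOR}"
answers_with_imaging = run_protocol(augmented_case, poll_questions)
```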

Technical specifications

The clinical vignette was published on OrthoBullets on 1 April 2023, while access to the vignette and poll responses was obtained on 24 April 2023. The study utilized the free version of ChatGPT-3.5, accessed on the same day (24 April 2023), which had received its most recent update on 23 March 2023. This version of ChatGPT was trained only on internet data available up to September 2021. The device used to access ChatGPT was a MacBook Pro 2021 (Apple Inc., USA) running macOS Monterey (version 12.6), while Google Chrome (version 112.0.5615.49) was the browser used to access both OrthoBullets and ChatGPT. To avoid any potential biases from previous interactions, a new account was created when accessing ChatGPT for the first time.

Results

Original dialogue protocol responses

Responses to the original dialogue protocol are presented in Table 1 along with the OrthoBullets community responses to the poll questions. Using the original dialogue protocol, ChatGPT typically produced one of four types of responses when answering the questions:

  1. Clinically appropriate responses which are relevant and applicable to the question asked and align with established medical guidelines and best practices.

  2. Clinically appropriate responses that lack sufficient justification or explanation for their recommendation. These responses may still be relevant and helpful but could benefit from additional detail or reasoning to support their advice.

  3. Clinically inappropriate responses that do not align with established medical guidelines or best practices. These responses may be inaccurate, outdated, or potentially harmful and should be avoided.

  4. Responses that do not directly provide a clinical suggestion, but instead offer insight into the decision-making process behind a particular recommendation. While these responses may not directly answer the question, they can still clarify the reasoning and considerations that inform medical decision-making.

Table 1 Original dialogue protocol responses with OrthoBullets poll results

Type 1 responses from ChatGPT are characterized by clinically appropriate and evidence-based answers that are consistent with established medical guidelines and best practices. For example, in Table 1, questions 1, 2, and 3 all received type 1 responses, where ChatGPT provided sensible and well-supported answers. It is worth noting, however, that these questions were less "controversial" and had a larger body of available evidence to draw from that was generally consistent. This may have influenced the quality of the responses provided by ChatGPT. Nonetheless, the fact that ChatGPT provided appropriate and evidence-based responses to these questions is a positive indication of its potential usefulness as a tool for clinical decision-making.

Type 2 responses from ChatGPT were characterized by clinically appropriate answers that were insufficiently justified or contained inappropriate justification. For example, consider question 4 in Table 1, where ChatGPT recommended performing a total hip arthroplasty (THA) on the patient in the morning, even if this meant bumping elective cases. While this recommendation may not be unreasonable in specific contexts, the evidence cited by ChatGPT to support the claim that delaying surgery by a few hours could increase mortality and morbidity is unfounded in this specific case. Furthermore, this response highlights a limitation of ChatGPT in that it fails to consider the practical consequences of bumping elective cases and the potential morbidity cost to patients whose surgeries are delayed.

In addition, questions 5 and 6 in Table 1 should also be categorized as type 2 responses. The management of femoral neck fractures is a complex area in which there is often no clear consensus or evidence-based guideline, and decisions are sometimes based on surgeon preference. In such cases, ChatGPT's provision of a rationale for a particular response may introduce bias and overlook other valid perspectives and approaches. For question 6, moreover, the suggestion to use a proximal femoral locking plate deviates significantly from the most common surgical practice [22,23,24,25], as evidenced by the observation that only 1% of the OrthoBullets member base selected this option. Additionally, questions 7 and 8 in Table 1 were answered by ChatGPT without any explanation or justification. As a result, these responses should also be considered type 2, as they fail to provide sufficient information to support the recommendation and may lack clinical relevance.

Type 3 responses, characterized by clinically inappropriate answers, were not identified using the original dialogue protocol. However, it is worth noting that subsequent responses from ChatGPT using different dialogue protocols and prompts did yield clinically inappropriate responses, which will be discussed in later sections of the results and discussion.

The final type of response, type 4, is observed in Table 1 questions 9–13. These responses did not provide a direct clinical recommendation, but instead presented reasoning and rationale behind the response options. These responses are likely a result of ChatGPT's built-in safety mechanisms, which prevent it from providing clinical recommendations [26]. Some type 4 responses were more detailed than others. For example, question 9 simply deferred the decision to an orthopaedic surgeon. In contrast, question 12 provided references to academic institutions and evidence to support its rationale for extending anti-coagulant use post-THA up to 35 days. However, upon closer examination, it became apparent that the references used in the rationales generated by ChatGPT were outdated, as the referenced guidelines were published in 2008 [27]. Since then, numerous studies have been conducted that challenge the duration required for prophylactic anti-coagulant use after THA, with some suggesting that aspirin may be a sufficient option [28,29,30]. This suggests that in addition to ChatGPT's evidence base being potentially outdated, there may be biases in how evidence is prioritized and used in generating responses.

Freestyle dialogue responses

When using the freestyle dialogue protocol, ChatGPT generated responses that could also be grouped into the same response types as the original dialogue protocol. For most questions, similar responses were generated (Table 2). However, some significant differences also emerged. Notably, in Table 2 question 4, ChatGPT provided a clinically inappropriate response (type 3). The statement that "performing a total hip arthroplasty (THA) in the setting of an acute traumatic hip fracture is not a recommended first-line management option" was incorrect [31,32,33,34]. This response could be potentially harmful if followed by an inexperienced clinician who relies solely on ChatGPT's advice. The response overlooks essential aspects of the patient's case, such as the fracture pattern, which is critical in making treatment decisions in orthopaedic surgery.

Table 2 Comparison of original dialogue protocol responses with freestyle dialogue responses

Furthermore, our analysis revealed inconsistencies between the ChatGPT responses generated by the original and freestyle dialogue protocols. For example, in Table 2, question 13 elicited different suggestions for managing the weight-bearing status of a patient following divergent screw plate surgery. While this specific question may not have a clear evidence-based answer, the differences in responses suggest that ChatGPT can be influenced by the user's prompts, which raises concerns about the reliability of ChatGPT for clinical decision-making. This highlights one of the limitations of using ChatGPT to generate consistent, appropriate, and reasoned clinical responses.

Reproducibility of responses on alternative day

The responses generated by ChatGPT were found to be inconsistent when the same original dialogue protocol was run on separate days (Table 3). Responses provided by ChatGPT on 25 April 2023 conflicted with those provided the previous day (Questions 3–5, 7, 8, and 12). For example, in response to question 5, ChatGPT recommended open reduction on 24 April 2023, but suggested closed reduction on 25 April 2023. This is a concerning finding because the prompts given to ChatGPT were identical on both days, indicating that the responses seemingly depended on the day, or even the time, at which ChatGPT was queried.

Table 3 Original dialogue protocol responses recorded on an alternative day (24 April 2023 versus 25 April 2023)

Responses after X-ray description input

When presented with a brief description of pre-operative X-ray findings, ChatGPT generated responses that differed from those produced by the original dialogue protocol (Table 4). The discrepancies were most notable for questions 3–7. For example, in response to question 6, the description of a "minimally displaced transcervical right hip fracture" in conjunction with the patient's age in the vignette may have influenced ChatGPT to recommend a sliding hip screw instead of a proximal femoral locking plate. However, given the observed inconsistencies in ChatGPT responses based on various other factors, it is difficult to determine whether the improved response was solely due to the X-ray information or other variables.

Table 4 Original dialogue protocol responses recorded with X-ray descriptors

In addition, we noticed a concerning inconsistency with question 3. In the dialogue protocol that included X-ray information, the time to theatre recommended by ChatGPT was 24–32 h for fracture reduction and internal fixation. This is generally considered too long to wait for an orthopaedic emergency, which this clinical vignette describes. The response to this specific question can be classified as a type 3 response, as the information presented is clinically inaccurate and poses a significant risk to patient safety. If an inexperienced clinician were to follow this modified advice, it could result in serious harm to patients through poorer outcomes following surgery [35, 36].

Discussion

The study has established that ChatGPT holds promise in providing satisfactory responses to specific clinical queries arising from a clinical case report involving a 53-year-old male with a femoral neck fracture. However, the results of this study also reveal that ChatGPT's responses were at times inadequate and even hazardous. Additionally, a lack of consistency was observed in the responses generated by ChatGPT, which varied depending on the nature of the dialogue, the date on which the interaction occurred, and the information inputs provided. Notably, radiographic data, such as X-rays, could not be directly incorporated into ChatGPT and necessitated human interpretation before being transformed into textual prompts. The study's implications highlight that ChatGPT, in its present form, may not be a reliable tool for widespread use as a clinical decision aid or an educational resource. The study also highlights the potential risks associated with untrained clinicians relying on AI-based technologies, such as ChatGPT, without considering their limitations and inherent dangers. The findings of this study underscore the need for continued research to enhance the reliability, safety, and applicability of ChatGPT in a clinical setting.

From this study, we have identified five fundamental limitations that significantly restrict the use of ChatGPT in a clinical scenario. Firstly, the responses generated by ChatGPT can be inconsistent and lack reliability, leading to suboptimal clinical decision-making. For example, the same dialogue prompts on different days resulted in substantially different responses. This inconsistency suggests that ChatGPT's performance may not be entirely dependable, and users must be cautious when relying on it for clinical recommendations. Additionally, ChatGPT's responses can be limited in scope, meaning they may not provide a comprehensive range of options, particularly for complex or nuanced questions.

Secondly, ChatGPT's data input is restricted and constrained. Each version has a cut-off point beyond which it cannot access new data, leading to potential limitations in the currency and quality of information available for generating responses. In the context of clinical decision-making, ChatGPT may not be able to provide the latest and most relevant data, compromising the validity and accuracy of its responses.

Thirdly, the study highlights that ChatGPT cannot assess the quality of the available evidence, which can lead to the provision of inappropriate clinical recommendations. ChatGPT does not consider the level of evidence or the quality of the literature it draws on, which can have significant consequences in clinical practice. For example, it may be unable to identify high-quality clinical evidence that could inform the best treatment approach for a specific patient.

Fourthly, ChatGPT's limitations extend to its inability to process imaging information, which is critical in many medical specialties, including orthopaedic surgery. The ability to interpret images accurately and provide the correct diagnosis and treatment is crucial for making informed clinical decisions. ChatGPT's inability to handle imaging information could lead to significant clinical errors and potentially jeopardize patient safety.

Finally, this study also notes that ChatGPT exhibits signs of memory fatigue within longer dialogues, with earlier responses being more relevant and better justified than later ones. This limitation highlights the importance of ensuring that ChatGPT's responses are regularly reviewed and updated to reflect any changes in the patient's condition or clinical context.

Conclusions

In conclusion, AI tools like ChatGPT show promise for improving clinical decision-making and patient outcomes in orthopaedics. The results of this study suggest that ChatGPT can provide clinically appropriate and evidence-based recommendations in specific contexts. Still, it also has significant limitations and requires ongoing refinement and improvement to optimize its performance. ChatGPT's strengths include its ability to quickly synthesize vast amounts of clinical data, thereby potentially reducing the burden on healthcare professionals. However, the data it presents may be outdated, biased, and in some cases inappropriate. The study also highlights the need to combine AI tools with human input and clinical judgment. Ultimately, ChatGPT and other AI tools could serve as a valuable aid in clinical decision-making in the future. However, in their current form, these tools are not appropriate for safe clinical decision-making and are not recommended for use in a clinical context.