Key Summary Points

Why carry out this study?

Optic disc drusen are acellular deposits in the optic nerve head, which affect approximately 2% of the population.

Currently, no evidence-based treatments are available to prevent or ameliorate optic disc drusen-related vision loss. An important role of the ophthalmologist is therefore to provide information on the various aspects of this condition.

We explored the accuracy of ChatGPT responses to typical patient questions on optic disc drusen.

What was learned from the study?

ChatGPT often provides relevant answers to typical patient questions on optic disc drusen. On topics related to therapy, however, ChatGPT can provide potentially harmful advice.

It is important for clinicians to understand the quality and accuracy of the information to which patients are exposed, in order to better understand patient requests and patient behavior.

Introduction

Optic disc drusen (ODD) are acellular deposits in the optic nerve head that are often benign and asymptomatic and affect approximately 2% of the population [1,2,3]. Some patients develop visual field defects, and ODD increase the risk of vascular co-morbidities [4,5,6,7,8,9]. ODD are not necessarily visible on fundoscopy, and in these cases ODD can mimic potentially life-threatening papilledema, leading to a range of unnecessary clinical workups. The diagnosis of ODD is best established using enhanced depth imaging optical coherence tomography [10].

There are currently no evidence-based treatments available to prevent or ameliorate ODD-related visual loss. Therefore, when patients undergo a neuro-ophthalmological examination and receive a diagnosis of ODD, one of the most important roles for the clinician is to inform the patient about the various aspects of the disease and answer the patient’s questions.

Seeking information on health and disease on the Internet is a natural part of coping and living with a disease [11]. Interestingly, one Danish study reported that when information on the Internet was not in alignment with the information from the physician, patients reported dissatisfaction with consultations [12]. This means that physicians need to understand what information patients are exposed to in order to understand the context from which the patients’ questions arise. Unfortunately, information on the Internet is rarely developed in collaboration with clinical experts, which puts patients at risk of receiving not only inaccurate but also potentially dangerous information [13, 14].

ChatGPT (OpenAI, San Francisco, CA, USA), which stands for Chat Generative Pre-trained Transformer, was launched in December 2022 and is an artificial intelligence-based chatbot [15]. User statistics revealed a record-fast adoption rate across the world, and it is currently being used for a variety of applications [15]. For ODD, considering that no treatments are currently available and that access to a neuro-ophthalmologist is rarely immediate, one potentially very useful application would be providing reasonable answers to common patient questions. Given the current and increasing adoption of ChatGPT, it becomes increasingly important to understand the accuracy of the answers it provides.

The objective of this study was to evaluate the accuracy of ChatGPT responses to common patient questions regarding ODD. This was done by reaching consensus on the most common questions between two ODD experts, obtaining ChatGPT responses to these questions, and finally evaluating the accuracy of these responses.

Methods

This study was designed as an evaluation of publicly accessible software (ChatGPT); we neither obtained nor analyzed clinical records or samples from patients or other study subjects. According to Danish law, such studies do not require institutional review board approval. All aspects of this study adhered to the tenets of the Declaration of Helsinki. According to the ChatGPT Terms of Use, studies that do not deal with security flaws or inappropriate content can be conducted without a priori or a posteriori approval [16].

Question Development

The author group, including two authors (L.M. and S.H.) who are Optic Disc Drusen Studies consortium members (both with years of experience in ODD research and management) [17], developed the questions. All questions were sent to all authors and discussed until consensus was reached, to obtain face validity. We had neither a minimum nor a maximum number of questions for this study, but potential questions were discussed in the author group until consensus could be reached regarding their relevance and importance. The following 20 questions were included in the analyses:

  1. What is optic disc drusen?

  2. How is optic disc drusen diagnosed?

  3. What is the best way for diagnosing optic disc drusen?

  4. Why do I have optic disc drusen?

  5. Are my optic disc drusen inherited from my parents?

  6. How are optic disc drusen formed?

  7. Do optic disc drusen enlarge when I get older?

  8. Is optic disc drusen preventable?

  9. Is optic disc drusen treatable?

  10. Can optic disc drusen impact my vision?

  11. Will my optic disc drusen prevent me from driving a car?

  12. Will I get blind from optic disc drusen?

  13. Is there anything I should avoid when I have optic disc drusen?

  14. Is there anything I can do to avoid the optic disc drusen from getting worse?

  15. My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?

  16. Do I need regular eye examinations for my optic disc drusen?

  17. I have a relative with optic disc drusen. Do I need to get an eye exam?

  18. Should I be worried that my children will inherit optic disc drusen?

  19. What are the symptoms of optic disc drusen?

  20. Can optic disc drusen lead to other diseases?

ChatGPT Responses, Evaluation, and Data Analysis

We used ChatGPT version 4, which was released in March 2023 [18]. We accessed the Application Programming Interface (API) using a custom Python (v. 3.8.13) script to ask the 20 questions separately, so that the answer to one question could not influence the answer to any other; this was ensured by starting a new conversation for each question (a minimal sketch of the querying procedure is provided after the scale definitions below). Each question was asked five times to allow evaluation of the spectrum of different answers to the same question. In total, 100 answers were obtained for evaluation. The accuracy of the obtained answers was evaluated on a 5-point Likert scale by two Optic Disc Drusen Studies consortium members (L.M. and S.H.). We defined the individual points of the Likert scale to ensure a consistent evaluation:

  1. Irrelevant response/no response

  2. Relevant response with major inaccuracies and potential for harm

  3. Relevant response with major inaccuracies and no potential for harm

  4. Relevant response with minor inaccuracies and no potential for harm

  5. Relevant response without any inaccuracies
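The study does not publish its querying script; the following is a minimal sketch under stated assumptions, namely the pre-1.0 "openai" Python package that was current at the time of the study and the "gpt-4" model identifier, neither of which is confirmed by the text:

```python
# Minimal sketch of the querying procedure (assumptions: pre-1.0 "openai"
# package, e.g., pip install openai==0.28, and the "gpt-4" model identifier).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, not the study's actual key

QUESTIONS = [
    "What is optic disc drusen?",
    "How is optic disc drusen diagnosed?",
    # ... the remaining 18 questions from the list above
]

REPEATS = 5  # five independent answers per question, 100 answers in total

answers = {}
for question in QUESTIONS:
    answers[question] = []
    for _ in range(REPEATS):
        # A fresh messages list per call starts a new conversation, so
        # earlier answers cannot influence later ones.
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
        )
        answers[question].append(response["choices"][0]["message"]["content"])
```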

The two experts provided their evaluations independently. We used Cronbach’s alpha to evaluate the inter-rater agreement between the experts. Responses were summarized using descriptive statistics to understand their distribution. The scores for the five responses to each question from each expert were summarized using median and range. Statistical analyses were performed using SPSS version 28.0.1.0 (IBM, Armonk, NY, USA).
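The analysis itself was run in SPSS; for readers who wish to reproduce the agreement statistic, the following is an illustrative reimplementation of Cronbach’s alpha in Python with NumPy, treating the two raters as the "items". The array shape and example values are hypothetical:

```python
# Illustrative sketch of Cronbach's alpha for two raters (the study used
# SPSS; this is not the authors' analysis code). Rows are rated responses,
# columns are raters.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of raters ("items")
    item_vars = scores.var(axis=0, ddof=1)      # variance per rater's column
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical example: Likert scores (1-5) from two raters for six responses
example = np.array([[5, 4], [4, 4], [3, 4], [5, 5], [2, 3], [4, 5]])
print(round(cronbach_alpha(example), 2))  # prints 0.79 for this toy data
```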

Results

All ChatGPT responses are summarized in the Supplementary Materials. The two experts exhibited fair-to-substantial agreement in the evaluation of responses (Cronbach’s alpha: 0.64; 95% confidence interval 0.46–0.76). A comparison of the distribution of scores between the two experts is provided in Table 1.

Table 1 Comparison of the distribution of scores between the two experts for all ChatGPT responses collected (n = 100)

There were no absent or completely irrelevant responses. All responses were relevant but had varying degrees of inaccuracy.

Seventeen responses were rated as being without any inaccuracies by both experts. Below is one example of such a response to the question “My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?”:

“To determine if someone with optic disc drusen also has true papilledema, an ophthalmologist would look for signs such as increased intracranial pressure or other symptoms of a neurological condition. Imaging tests such as an MRI or CT may be done to rule out any underlying brain issues that could be causing the papilledema. Additionally, they may perform regular follow-up visits and evaluations to monitor any changes in your symptoms and condition over time. It is important to consult with your doctor and get proper diagnosis and treatment for your condition.”

In 33 other responses, one or both experts noted the presence of minor inaccuracies without potential for harm. One example of such a minor inaccuracy is illustrated in the following response to the question “How are optic disc drusen formed?”:

“Optic disc drusen are formed due to the aggregation of calcium and protein deposits in the optic nerve head (ONH). The exact cause behind this process is not known, but it is believed that the accumulation of debris may be related to genetic factors or problems with the normal processes of protein metabolism and drainage within the eye. As these deposits build up over time, they can result in optic nerve abnormalities and potential vision problems. Although there is currently no cure for optic disc drusen, early detection and appropriate monitoring of patients can help reduce the risk of vision loss.”

This response suggested that early detection and monitoring could alter the natural course of the disease, which was deemed inaccurate. The experts rated this as a minor inaccuracy without any potential for harm. In total, 50 responses, or half of all responses, were either without any inaccuracies or contained only minor inaccuracies.

In 45 responses, one or both experts noted the presence of major inaccuracies without any potential for harm. One example of a major inaccuracy is illustrated in the following response to the question “Will I get blind from optic disc drusen?”:

“Optic disc drusen typically do not cause blindness. However, they can cause visual field defects or symptoms such as blurry vision or blind spots that may interfere with daily activities. In rare cases, optic disc drusen can lead to complications such as papilledema or ischemic optic neuropathy, which can lead to vision loss. It is important to discuss any concerns about your eye health with a trusted healthcare professional.”

This response presented papilledema as a rare complication of ODD, which was deemed incorrect. The experts rated this as a major inaccuracy, but without any potential for harm.

Finally, five responses were evaluated as having major inaccuracies by both experts, with one expert indicating a potential for harm. These responses are all provided, with a detailed explanation of the potential for harm, in Table 2. Thus, five responses, or 1 in 20, had potential for harm.

Table 2 Responses deemed as having major inaccuracies with potential for harm

The accuracy of the responses to each question is summarized in Table 3. The combined scores ranged from a mean of 4.8 ± 0.4 for the question with the highest scores to a mean of 3.1 ± 0.3 for the question with the lowest scores. The highest scores were obtained for the questions “Do I need regular eye examinations for my optic disc drusen?”, “Will my optic disc drusen prevent me from driving a car?”, “Is optic disc drusen preventable?”, “My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?”, and “How is optic disc drusen diagnosed?”. The lowest scores were obtained for the questions “Is there anything I can do to avoid the optic disc drusen from getting worse?”, “Can optic disc drusen impact my vision?”, “What are the symptoms of optic disc drusen?”, “Can optic disc drusen lead to other diseases?”, and “Is optic disc drusen treatable?”.

Table 3 Accuracy scores for ChatGPT responses for each of the 20 questions

Discussion

In this study, we evaluated the accuracy of ChatGPT responses to typical patient questions regarding ODD. Overall, ChatGPT provided relevant responses to all questions. The accuracy of the answers was high for a substantial proportion, and major inaccuracies with potential for harm were an issue for only a small minority of the responses. Nevertheless, it is important for clinicians to understand the nature of such inaccuracies.

In previous studies, we explored the accuracy of ChatGPT responses to questions dealing with common retinal diseases and with vernal keratoconjunctivitis [19, 20]. For retinal diseases, we found that inaccuracies and the potential for harm were related to questions dealing with treatment [19]. Similarly, treatment-related inaccuracies were observed in responses related to vernal keratoconjunctivitis [20]. However, in both studies, ChatGPT was able to provide accurate information on disease definition and diagnosis [19, 20]. These patterns, which we also observed in the present study, highlight the key issues in obtaining health information from an artificial intelligence-based model developed as a large language model [22, 23]. Large language models consist of a neural network trained on very large datasets of text [22, 23]. These models can accurately reproduce and restructure text that exists in relatively large quantities, e.g., textbooks, general information on the web, Wikipedia, etc., but are challenged when responses require details related to treatment and monitoring [22, 23]. Treatment and monitoring often change faster than disease definition and pathophysiology, and clinical knowledge in these areas is often based on current expert consensus or guidelines. These aspects are difficult to incorporate into an artificial intelligence-based large language model, as one important limitation of these generative systems is that they rely on pre-existing data for learning, which they then use to synthesize new output.

The literature on health information-seeking behaviors outlines a complex field with many influencing factors [11, 23]. Studies also find that trends change over time [11, 23], which to some extent may be explained by increasing technology adoption. Thus, health information-seeking behavior is likely a dynamic phenomenon over time, and chatbot-based information may represent a new topic within this field. Advantages of current artificial intelligence-based large language models include prompt engineering (i.e., providing a setting prior to asking the questions, defining the output style of the responses, or holding a dialogue-based conversation), which may allow for more precise input and more relevant output, at least from the patients’ perspective (a brief illustration is sketched below).
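To make the notion of prompt engineering concrete, the following is a hypothetical sketch, again assuming the pre-1.0 openai package and the "gpt-4" model identifier; the system message and its wording are illustrative only and were not part of this study:

```python
# Hypothetical illustration of prompt engineering (not part of this study):
# a system message sets the context and output style before the user question.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",  # assumed model identifier
    messages=[
        {"role": "system",
         "content": "You are an ophthalmologist. Answer in plain language "
                    "suitable for patients, in no more than 150 words."},
        {"role": "user", "content": "What is optic disc drusen?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```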

Strengths and limitations need to be acknowledged when interpreting the results of our study. Questions were developed using a consensus and face-validity approach by two content experts. That the two content experts are affiliated with the same ophthalmological department constitutes a limitation, as they would theoretically be more likely to agree than two experts from different departments. Developing questions based on expert input is feasible but may not always capture the questions that patients would prioritize; thus, the lack of patient involvement in question development is a further limitation. Another important limitation is that ChatGPT responses vary. To better understand this variety, we retrieved five responses for each question, which in our experience allows for a certain saturation in the quality of responses. Finally, one source of inaccuracy is that differences in the experts’ interpretation of text may lead to different ratings on the Likert scale, which was also seen in our study, as the experts were not in full agreement.

Conclusions

In conclusion, this study explored the accuracy of ChatGPT responses to typical patient questions on ODD and found that ChatGPT overall provides relevant answers. In 1 in 20 responses, we detected inaccuracies with potential for harm. Considering that the neuro-ophthalmologist, or even the ophthalmologist, is rarely readily available, whereas ChatGPT can respond within seconds, patients will likely obtain information from ChatGPT or similar services. As clinicians, what we can do is understand the nature of the information provided to patients in our effort to guide and advise them most appropriately.