Abstract
Introduction
Optic disc drusen (ODD) are acellular deposits in the optic nerve head, which are most often benign and asymptomatic. Patients may develop visual field defects and be at increased risk of ischemic co-morbidities. As ODD can be difficult to distinguish from papilledema, patients are at risk of unnecessary clinical workups. Patient information is a key aspect of ODD management. In this study, we explored the accuracy of ChatGPT responses for typical patient questions on ODD.
Methods
Two content experts reached consensus on 20 typical patient questions. We retrieved five separate responses to each question from ChatGPT, totaling 100 responses. Each content expert independently rated the accuracy of each response on a 5-point Likert scale.
Results
The two experts were in fair to substantial agreement in their evaluation of responses (Cronbach’s alpha: 0.64). Of the 100 responses, 17 were relevant and without any inaccuracies, 78 were relevant with inaccuracies that were not harmful, and five were relevant with potentially harmful inaccuracies. The lowest accuracy scores were obtained for questions dealing with treatment and prognosis.
Conclusions
ChatGPT often provides relevant answers for patient questions on ODD, but inaccuracies become potentially harmful when questions deal with treatment and prognosis.
Why carry out this study?
- Optic disc drusen are acellular deposits in the optic nerve head, which affect approximately 2% of the population.
- Currently, no evidence-based treatments are available to prevent or ameliorate optic disc drusen-related vision loss. The important role of the ophthalmologist here is to provide information on various aspects of this condition.
- We explored the accuracy of ChatGPT responses to typical patient questions on optic disc drusen.
What was learned from the study?
- ChatGPT often provides relevant answers to typical patient questions on optic disc drusen. On topics related to therapy, ChatGPT may provide potentially harmful advice.
- It is important for clinicians to understand the quality and accuracy of the information to which patients are exposed to better understand patient requests and patient behavior.
Introduction
Optic disc drusen (ODD) are acellular deposits in the optic nerve head, which are often benign and asymptomatic, and affect approximately 2% of the population [1,2,3]. Some patients develop visual field defects, and ODD increase the risk of vascular co-morbidities [4,5,6,7,8,9]. ODD are not always visible on fundoscopy, and in these cases ODD can mimic potentially life-threatening papilledema, leading to a range of unnecessary clinical workups. The diagnosis of ODD is best established using enhanced depth imaging optical coherence tomography [10].
There are currently no evidence-based treatments available to prevent or ameliorate ODD-related visual loss. Therefore, when a patient receives a neuro-ophthalmological examination and a diagnosis of ODD, one of the clinician’s most important roles is to inform the patient about the various aspects of the disease and answer the patient’s questions.
Seeking information on health and disease on the Internet is a natural part of coping and living with a disease [11]. Interestingly, one Danish study reported that when information on the Internet was not in alignment with the information from the physician, patients reported dissatisfaction with consultations [12]. This means that, as a physician, one needs to understand what information patients are exposed to in order to understand the context from which their questions arise. Unfortunately, information on the Internet is rarely developed in collaboration with clinical experts, which puts patients at risk of not only inaccurate but also potentially dangerous information [13, 14].
ChatGPT (OpenAI, San Francisco, CA, USA), which stands for Chat Generative Pre-trained Transformer, was launched in late 2022 and is an artificial intelligence-based chatbot [15]. User statistics revealed a record-fast adoption rate across the world, and it is currently being used for a variety of applications [15]. For ODD, considering that no treatments are currently available and that access to a neuro-ophthalmologist is rarely immediate, one potentially very useful application would be providing reasonable answers to common patient questions. Given the current and increasing adoption of ChatGPT, it becomes increasingly important to understand the accuracy of the answers it provides.
The objective of this study was to evaluate the accuracy of ChatGPT responses to common patient questions regarding ODD. We did this by reaching consensus between two ODD experts on the most common questions, obtaining ChatGPT responses to these questions, and finally evaluating the accuracy of those responses.
Methods
This study was designed as a study of publicly accessible software (ChatGPT) and we neither obtained nor analyzed clinical records or samples from patients or other study subjects. According to Danish law, such studies do not require institutional review board approval. All aspects of this study adhered to the tenets of the Declaration of Helsinki. According to ChatGPT Terms of Use, studies that do not deal with security flaws or inappropriate content can be conducted without a priori or a posteriori approval [16].
Question Development
The author group, including two authors (L.M. and S.H.) who are Optic Disc Drusen Studies consortium members (both with years of experience in ODD research and management) [17], developed the questions. All questions were sent to all authors and discussed until consensus to obtain face validity. We neither had a minimum nor maximum number of questions for this study, but discussed potential questions in the author group until consensus could be reached regarding relevance and importance of the questions. The following 20 questions were included for the analyses:
1. What is optic disc drusen?
2. How is optic disc drusen diagnosed?
3. What is the best way for diagnosing optic disc drusen?
4. Why do I have optic disc drusen?
5. Are my optic disc drusen inherited from my parents?
6. How are optic disc drusen formed?
7. Do optic disc drusen enlarge when I get older?
8. Is optic disc drusen preventable?
9. Is optic disc drusen treatable?
10. Can optic disc drusen impact my vision?
11. Will my optic disc drusen prevent me from driving a car?
12. Will I get blind from optic disc drusen?
13. Is there anything I should avoid when I have optic disc drusen?
14. Is there anything I can do to avoid the optic disc drusen from getting worse?
15. My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?
16. Do I need regular eye examinations for my optic disc drusen?
17. I have a relative with optic disc drusen. Do I need to get an eye exam?
18. Should I be worried that my children will inherit optic disc drusen?
19. What are the symptoms of optic disc drusen?
20. Can optic disc drusen lead to other diseases?
ChatGPT Responses, Evaluation, and Data Analysis
We used ChatGPT version 4, which was released in March 2023 [18]. We accessed the Application Programming Interface using a custom Python (v. 3.8.13) script to ask the 20 questions. A new conversation was started for each question, ensuring that the answer to one question could not influence the answer to any other question. Each question was asked five times separately to allow evaluation of the spectrum of different answers to the same question. In total, 100 answers were obtained for evaluation. The accuracy of the obtained answers was evaluated on a 5-point Likert scale by two Optic Disc Drusen Studies consortium members (L.M. and S.H.). We defined the individual points of the Likert scale to ensure a consistent evaluation:
1. Irrelevant response/no response
2. Relevant response with major inaccuracies and potential for harm
3. Relevant response with major inaccuracies and no potential for harm
4. Relevant response with minor inaccuracies and no potential for harm
5. Relevant response without any inaccuracies
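The retrieval procedure described above (a fresh, single-turn conversation per call, five calls per question) can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the authors' published script; the `ask_once` helper and the model identifier are hypothetical.

```python
# Sketch of the response-retrieval procedure: each question is asked in its
# own single-turn conversation, five times, yielding 20 x 5 = 100 records.
from typing import Callable, Dict, List

QUESTIONS: List[str] = [
    "What is optic disc drusen?",
    "How is optic disc drusen diagnosed?",
    # ... the remaining 18 questions listed above
]

REPEATS = 5  # five independent responses per question


def collect_responses(questions: List[str],
                      ask_once: Callable[[str], str],
                      repeats: int = REPEATS) -> List[Dict[str, object]]:
    """Ask each question `repeats` times; each call is an independent
    single-turn conversation, so no answer can influence another."""
    records: List[Dict[str, object]] = []
    for question in questions:
        for run in range(1, repeats + 1):
            records.append({"question": question,
                            "run": run,
                            "response": ask_once(question)})
    return records


# A hypothetical ask_once using the OpenAI Python client (not run here;
# the model name "gpt-4" is an assumption):
# from openai import OpenAI
# client = OpenAI()
# def ask_once(question: str) -> str:
#     reply = client.chat.completions.create(
#         model="gpt-4",
#         messages=[{"role": "user", "content": question}],
#     )
#     return reply.choices[0].message.content
```

With 20 questions this loop produces the 100 responses evaluated in the study; starting a new message list on every call is what prevents cross-question influence.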
The two experts provided their evaluations independently. We used Cronbach’s alpha to evaluate inter-rater agreement between the experts. Responses were summarized using descriptive statistics to understand their distribution. The scores for the five responses to each question from each expert were summarized using median and range. Statistical analyses were performed using SPSS version 28.0.1.0 (IBM, Armonk, NY, USA).
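For reference, Cronbach's alpha for k raters is alpha = k/(k-1) * (1 - sum of per-rater variances / variance of the per-response totals). A minimal sketch of this computation (illustrative only; the study used SPSS, not this code):

```python
def cronbach_alpha(ratings):
    """Cronbach's alpha for inter-rater agreement.

    ratings: one list of scores per rater, all of equal length
    (here: two raters, 100 Likert scores each).
    """
    k = len(ratings)        # number of raters
    n = len(ratings[0])     # number of rated responses

    def var(xs):            # sample variance (ddof = 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # total score per response, summed across raters
    totals = [sum(r[j] for r in ratings) for j in range(n)]
    return k / (k - 1) * (1 - sum(var(r) for r in ratings) / var(totals))
```

Perfect agreement yields alpha = 1; the 0.64 reported below reflects substantial but imperfect consistency between the two raters.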
Results
All ChatGPT responses are summarized in the Supplementary Materials. The two experts exhibited a fair/substantial agreement in the evaluation of responses (Cronbach’s alpha: 0.64; 95% confidence interval 0.46–0.76). A comparison of the distribution of scores between the two experts is provided in Table 1.
No responses were absent or completely irrelevant. All responses were relevant but had varying degrees of inaccuracies.
Seventeen responses were rated as being without any inaccuracies by both experts. Below is one example of such a response to the question “My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?”:
“To determine if someone with optic disc drusen also has true papilledema, an ophthalmologist would look for signs such as increased intracranial pressure or other symptoms of a neurological condition. Imaging tests such as an MRI or CT may be done to rule out any underlying brain issues that could be causing the papilledema. Additionally, they may perform regular follow-up visits and evaluations to monitor any changes in your symptoms and condition over time. It is important to consult with your doctor and get proper diagnosis and treatment for your condition.”
In 33 other responses, one or both experts stated the presence of minor inaccuracies without potential for harm. One example of such minor inaccuracy is illustrated in the following response to the question “How are optic disc drusen formed?”:
“Optic disc drusen are formed due to the aggregation of calcium and protein deposits in the optic nerve head (ONH). The exact cause behind this process is not known, but it is believed that the accumulation of debris may be related to genetic factors or problems with the normal processes of protein metabolism and drainage within the eye. As these deposits build up over time, they can result in optic nerve abnormalities and potential vision problems. Although there is currently no cure for optic disc drusen, early detection and appropriate monitoring of patients can help reduce the risk of vision loss.”
This response suggested that early detection and monitoring could alter the natural course of the disease, which was deemed inaccurate. The experts considered this a minor inaccuracy without any potential for harm. In total, 50 responses, or half of all responses, were either without any inaccuracies or with only minor inaccuracies.
In 45 responses, one or both experts stated the presence of major inaccuracies without any potential for harm. One example of major inaccuracy is illustrated in the following response to the question “Will I get blind from optic disc drusen?”:
“Optic disc drusen typically do not cause blindness. However, they can cause visual field defects or symptoms such as blurry vision or blind spots that may interfere with daily activities. In rare cases, optic disc drusen can lead to complications such as papilledema or ischemic optic neuropathy, which can lead to vision loss. It is important to discuss any concerns about your eye health with a trusted healthcare professional.”
This response described papilledema as a rare complication of ODD, which was deemed incorrect. The experts evaluated this as a major inaccuracy, but without any potential for harm.
Finally, five responses were evaluated as having major inaccuracies by both experts, with one expert stating a potential for harm. These responses are all provided with a detailed explanation of the potential for harm in Table 2. Thus, five responses, or 1 in 20, had a potential for harm.
The accuracy of the responses to each question is summarized in Table 3. The combined scores ranged from a mean of 4.8 ± 0.4 for the question with the highest scores to a mean of 3.1 ± 0.3 for the question with the lowest scores. The highest scores were obtained for the questions “Do I need regular eye examinations for my optic disc drusen?”, “Will my optic disc drusen prevent me from driving a car?”, “Is optic disc drusen preventable?”, “My doctor tells me I have pseudo-papilledema due to optic disc drusen. How can my doctor be sure that I don't have true papilledema in addition to my optic disc drusen?”, and “How is optic disc drusen diagnosed?”. The lowest scores were obtained for the questions “Is there anything I can do to avoid the optic disc drusen from getting worse?”, “Can optic disc drusen impact my vision?”, “What are the symptoms of optic disc drusen?”, “Can optic disc drusen lead to other diseases?”, and “Is optic disc drusen treatable?”.
Discussion
In this study, we evaluated the accuracy of ChatGPT responses to typical patient questions regarding ODD. Overall, ChatGPT provided relevant responses to all questions. Accuracy was high for a substantial proportion of the answers, and major inaccuracies with potential for harm were an issue only for a small minority of the responses. Nevertheless, it is important for clinicians to understand the nature of such inaccuracies.
In previous studies, we explored the accuracy of ChatGPT responses to questions on common retinal diseases and on vernal keratoconjunctivitis [19, 20]. For retinal diseases, we found that inaccuracies and the potential for harm related to questions dealing with treatment [19]. Similarly, treatment-related inaccuracies were observed in responses on vernal keratoconjunctivitis [20]. However, in both studies, ChatGPT was able to provide accurate information on disease definition and diagnosis [19, 20]. These patterns, which we also observed in this study, highlight the key issues in obtaining health information from an artificial intelligence-based model developed as a large language model [22, 23]. Large language models consist of neural networks trained on very large datasets of text [22, 23]. Such models can provide accurate responses and restructure text that exists in relatively large quantities, e.g., textbooks, general information on the web, Wikipedia, etc., but are challenged when responses require details related to treatment and monitoring [22, 23]. Treatment and monitoring often change faster than disease definition and pathophysiology, and clinical knowledge is often based on current expert consensus or guidelines. These aspects are difficult to incorporate into an artificial intelligence-based large language model, as one important limitation of these generative systems is that they rely on pre-existing data to learn from and then synthesize new output.
The literature on health information-seeking behaviors outlines a complex field with many influencing factors [11, 23]. Studies also find that trends change over time [11, 23], which may to some extent be explained by increasing technology adoption. Thus, health information-seeking behavior is likely a dynamic phenomenon, and chatbot-based information may represent a new topic within this field. Advantages of current artificial intelligence-based large language models include prompt engineering (i.e., providing a setting prior to asking the questions, defining the output style of the responses, or a dialogue-based conversation), which may allow for more precise input and more relevant output, at least from the patients’ perspective.
Strengths and limitations need to be acknowledged when interpreting the results of our study. Questions were developed through a consensus and face-validity approach by two content experts. That both content experts are affiliated with the same ophthalmological department constitutes a limitation, as they would theoretically be more likely to agree than two experts from different departments. Developing questions based on expert input is feasible but may not always capture the questions that patients themselves would prioritize; the lack of patient involvement in question development is thus a further limitation. Another important limitation is that ChatGPT responses vary. To better understand this variety, we retrieved five responses to each question, which in our experience allows for a certain saturation in the quality of responses. Finally, one source of inaccuracy is that differences in the experts’ interpretation of text may lead to different Likert-scale ratings, as seen in our study, where the experts were not in full agreement.
Conclusions
In conclusion, this study explored the accuracy of ChatGPT responses to typical patient questions on ODD and found that ChatGPT overall provides relevant answers. In 1 of 20 responses, we detected inaccuracies with potential for harm. Considering that a neuro-ophthalmologist, or even an ophthalmologist, is not always readily available, whereas ChatGPT can respond within seconds, patients will likely obtain information from ChatGPT or similar services. As clinicians, what we can do is understand the nature of the information provided to patients in our effort to guide and advise them most appropriately.
Data Availability
All data generated or analyzed during this study are included as supplementary information files.
References
Hamann S, Malmqvist L, Costello F. Optic disc drusen: understanding an old problem from a new perspective. Acta Ophthalmol. 2018;96:673–84.
Skougaard M, Heegaard S, Malmqvist L, Hamann S. Prevalence and histopathological signatures of optic disc drusen based on microscopy of 1713 enucleated eyes. Acta Ophthalmol. 2020;98:195–200.
Mukriyani H, Malmqvist L, Subhi Y, Hamann S. Prevalence of optic disc drusen: a systematic review, meta-analysis and forecasting study. Acta Ophthalmol. 2023; ePub ahead of print. https://doi.org/10.1111/aos.15690.
Fraser JA, Rueløkke LL, Malmqvist L, Hamann S. Prevalence of optic disc drusen in young patients with nonarteritic anterior ischemic optic neuropathy: a 10-year retrospective study. J Neuroophthalmol. 2021;41:200–5.
Hamann S, Malmqvist L, Wegener M, et al. Young adults with anterior ischemic optic neuropathy: a multicenter optic disc drusen study. Am J Ophthalmol. 2020;217:174–81.
Purvin V, King R, Kawasaki A, Yee R. Anterior ischemic optic neuropathy in eyes with optic disc drusen. Arch Ophthalmol. 2004;122:48–53.
Rueløkke LL, Malmqvist L, Wegener M, Hamann S. Optic disc drusen associated anterior ischemic optic neuropathy: prevalence of comorbidities and vascular risk factors. J Neuroophthalmol. 2020;40:356–61.
Newman NJ, Lessell S, Brandt EM. Bilateral central retinal artery occlusions, disk drusen, and migraine. Am J Ophthalmol. 1989;107:236–40.
Rothenbuehler SP, Maloca PM, Belmouhand M, Hamann S, Larsen M. Branch retinal vein occlusion precipitated by compression between a major retinal artery and underlying optic disc drusen. Acta Ophthalmol. 2021;99:931–3.
Youn S, Loshusan B, Armstrong JJ, Fraser JA, Hamann S, Bursztyn LLCD. A comparison of diagnostic accuracy of imaging modalities to detect optic disc drusen: the age of enhanced depth imaging optical coherence tomography. Am J Ophthalmol. 2023;248:137–44.
Mirzaei A, Aslani P, Luca EJ, Schneider CR. Predictors of health information-seeking behavior: systematic literature review and network analysis. J Med Internet Res. 2021;23: e21680.
Riiskjær E, Ammentorp J, Nielsen JF, Kofoed PE. Hospitals need to customise care according to patients’ differing information-seeking behaviour. Dan Med J. 2014;61:A4787.
Subhi Y, Bube SH, Rolskov Bojsen S, Skou Thomsen AS, Konge L. Expert involvement and adherence to medical evidence in medical mobile phone apps: a systematic review. JMIR Mhealth Uhealth. 2015;3: e79.
Borgersen NJ, Henriksen MJ, Konge L, Sørensen TL, Thomsen AS, Subhi Y. Direct ophthalmoscopy on YouTube: analysis of instructional YouTube videos’ content and approach to visualization. Clin Ophthalmol. 2016;10:1535–41.
Reuters: ChatGPT sets record for fastest-growing user base - analyst note. Available online: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/ (accessed on 8 June 2023).
OpenAI: Terms of Use. Available online: https://openai.com/policies/terms-of-use (accessed on 8 June 2023).
Malmqvist L, Bursztyn L, Costello F, et al. The optic disc drusen studies consortium recommendations for diagnosis of optic disc drusen using optical coherence tomography. J Neuroophthalmol. 2018;38:299–307.
OpenAI: ChatGPT Release Notes. Available online: https://help.openai.com/en/articles/6825453-chatgpt-release-notes (accessed on 8 June 2023).
Potapenko I, Boberg-Ans LC, Stormly Hansen M, Klefter ON, van Dijk EHC, Subhi Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol. 2023. https://doi.org/10.1111/aos.15661.
Rasmussen MLR, Larsen AC, Subhi Y, Potapenko I. Artificial intelligence-based ChatGPT chatbot responses for patient and parent questions on vernal keratoconjunctivitis. Graefes Arch Clin Exp Ophthalmol. 2023. https://doi.org/10.1007/s00417-023-06078-1.
Binz M, Schulz E. Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci U S A. 2023;120: e2218523120.
Duong D, Solomon BD. Analysis of large-language model versus human performance for genetics questions. Eur J Hum Genet. 2023. https://doi.org/10.1038/s41431-023-01396-8.
Li F, Li M, Guan P, Ma S, Cui L. Mapping publication trends and identifying hot spots of research on Internet health information seeking behavior: a quantitative and co-word biclustering analysis. J Med Internet Res. 2015;17: e81.
Funding
No funding or sponsorship was received for this study or publication of this article.
Author information
Authors and Affiliations
Contributions
Conceptualization, Ivan Potapenko, Lasse Malmqvist, Yousif Subhi, and Steffen Hamann; Methodology, Ivan Potapenko and Yousif Subhi; Formal analysis and investigation, Ivan Potapenko, Lasse Malmqvist, Yousif Subhi, and Steffen Hamann; Writing—original draft preparation, Ivan Potapenko, Lasse Malmqvist, Yousif Subhi, and Steffen Hamann; Writing—review and editing, Ivan Potapenko, Lasse Malmqvist, Yousif Subhi, and Steffen Hamann; Supervision, Steffen Hamann. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of Interest
Yousif Subhi has received speakers fees from Bayer and Roche, not related to this work. Ivan Potapenko, Lasse Malmqvist, and Steffen Hamann have nothing to disclose.
Ethical Approval
This study was designed as a study of publicly accessible software (ChatGPT) and we neither obtained nor analyzed clinical records or samples from patients or other study subjects. According to Danish law, such studies do not require institutional review board approval. All aspects of this study adhered to the tenets of the Declaration of Helsinki.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which permits any non-commercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc/4.0/.
About this article
Cite this article
Potapenko, I., Malmqvist, L., Subhi, Y. et al. Artificial Intelligence-Based ChatGPT Responses for Patient Questions on Optic Disc Drusen. Ophthalmol Ther 12, 3109–3119 (2023). https://doi.org/10.1007/s40123-023-00800-2