Abstract
Purpose
The aim of this study was to define the capability of ChatGPT-4 and Google Gemini in analyzing detailed glaucoma case descriptions and suggesting an accurate surgical plan.
Methods
We retrospectively analyzed 60 medical records of surgical glaucoma patients, divided into “ordinary” (n = 40) and “challenging” (n = 20) scenarios. Case descriptions were entered into the ChatGPT and Gemini interfaces with the question “What kind of surgery would you perform?”, which was repeated three times to assess the consistency of the answers. After collecting the answers, we assessed their level of agreement with the unified opinion of three glaucoma surgeons. Moreover, we graded the quality of the responses from 1 (poor quality) to 5 (excellent quality), according to the Global Quality Score (GQS), and compared the results.
Results
ChatGPT’s surgical choices were consistent with those of the glaucoma specialists in 35/60 cases (58%), compared to 19/60 (32%) for Gemini (p = 0.0001). Gemini was unable to complete the task in 16 cases (27%). Trabeculectomy was the most frequent choice for both chatbots (53% and 50% for ChatGPT and Gemini, respectively). In “challenging” cases, ChatGPT agreed with the specialists in 9/20 choices (45%), outperforming Google Gemini (4/20, 20%). Overall, GQS scores were 3.5 ± 1.2 for ChatGPT and 2.1 ± 1.5 for Gemini (p = 0.002). This difference was even more marked when focusing only on “challenging” cases (3.0 ± 1.5 vs. 1.5 ± 1.4, p = 0.001).
Conclusion
ChatGPT-4 showed good analysis performance for glaucoma surgical cases, both ordinary and challenging. In contrast, Google Gemini showed strong limitations in this setting, presenting high rates of imprecise or missed answers.
Introduction
Artificial intelligence (AI) natural language processing has witnessed a significant transformation with the advent of advanced large language models (LLMs) [1]. Two of the most prominent LLMs are Chat Generative Pretrained Transformer (ChatGPT, created by OpenAI, San Francisco, CA, USA) and Google Gemini (formerly Google Bard, Google, Mountain View, CA). Recently, the role of these models in medical settings has been explored [2].
In the ophthalmological panorama, ChatGPT has shown good results in comprehending clinical information and generating appropriate replies [3, 4]. ChatGPT models are trained on a textual database and are able to produce responses that are both coherent and contextually appropriate, based on the abstract relationships between words (tokens) within the neural network [1]. The capabilities of this LLM in answering multiple-choice questions from the US Medical Licensing Examination (USMLE) have been studied, showing that ChatGPT not only answered more than half of the questions correctly but also supplied sound supporting justifications for the chosen options [5]. Additional features and restrictions of ChatGPT in ophthalmology have been discussed in the literature [6].
On March 21, 2023, Google debuted its AI chatbot, Google Bard, which uses machine learning and natural language processing to simulate human-like dialogue. Bard is accessible on several digital platforms, providing accurate answers and assistance in fields such as public health and disaster relief [7]. In recent months, some research has compared the performance of Google Bard to that of ChatGPT in other medical subspecialties, such as neurology, neurosurgery, or emergency medicine [8,9,10]. Recently, on February 8, 2024, Google presented an update to a new AI-based platform, called Gemini, embracing optimized features and enhanced multimodal analysis [11].
In the glaucoma setting, several studies have shown the efficacy of AI models [12,13,14]. Globally, glaucoma is a prevalent cause of permanent blindness, with a single modifiable risk factor: intraocular pressure (IOP) [15,16,17]. Current treatment options include pharmacologic topical therapies that reduce aqueous humor production or increase its outflow, lasers targeting the trabecular meshwork and ciliary body, and surgical treatment [18]. The latter includes a variety of techniques, ranging from the gold-standard trabeculectomy to the newly developed microinvasive glaucoma surgeries (MIGS) and, in more challenging cases, glaucoma drainage devices (GDDs). A tailored approach to surgery, including the search for risk factors for surgical failure, is now required to ensure long-term IOP control. To select the optimal approach, certain risk variables, including age, race, conjunctival status, and past medical or surgical procedures, must be addressed specifically for each patient [19].
The aim of this study was to define the capability of two different LLMs (ChatGPT-4 and Google Gemini) in analyzing detailed case descriptions in a glaucoma setting and suggesting the best possible surgical choice. To evaluate the accuracy of the two chatbots, their answers were compared to glaucoma specialists’ responses, and the level of agreement between them was assessed.
Materials and methods
Case collection
We conducted a retrospective review of the files of glaucomatous patients who underwent any type of surgical or para-surgical glaucoma treatment at Policlinico Universitario Agostino Gemelli between May 2019 and October 2023. This study adhered to the tenets of the Declaration of Helsinki, and institutional review board (IRB) approval was obtained from the Policlinico Universitario Agostino Gemelli Institutional Ethics Committee.
We selected a total of 66 medical records from a pool of over 250. Patient demographics, medical and ocular anamnesis, current ocular diseases and topical medications, referred symptoms, and examination results were all described in detail for each case. Finally, this information was summarized in a complete case description and submitted simultaneously to the ChatGPT and Google Gemini interfaces.
Primary open-angle glaucoma, normal-tension glaucoma, primary angle-closure glaucoma, pseudoexfoliation glaucoma, pigment dispersion glaucoma, glaucomatocyclitic crisis glaucoma, aphakic glaucoma, neovascular glaucoma, and uveitic glaucoma were among the phenotypes represented.
Next, we divided the cases into two categories:
1. “Ordinary” scenarios (n = 42): patients who had no previous eye surgery other than cataract surgery and no severe concurrent illnesses or ocular disorders. Only topical medication and/or prior laser treatments were used to control IOP.
2. “Challenging” scenarios (n = 24): individuals with a considerably more complicated medical history, including past ocular operations such as unsuccessful glaucoma procedures, vitreoretinal surgeries, or anterior segment surgeries, or other ocular diseases influencing the eye’s homeostasis. Furthermore, cases with concurrent systemic illnesses that may have influenced surgical decisions were included in this sample.
ChatGPT
Built on the “gpt-3.5-turbo” model from the GPT-3.5 series, ChatGPT (OpenAI, https://chat.openai.com/) is an optimized LLM. Generative Pretrained Transformer 3 was trained on billions of text data taken from online articles up to September 2021 [1]. The model is trained to minimize the difference between the predicted and actual words in the training data set. Once the model is trained, fresh text can be produced by giving it an instruction and letting it predict the next word. The process is then repeated, using each predicted word as context for the next prediction, until the model generates a whole sentence or paragraph [20]. In comparison to its predecessor, the most recent version, GPT-4, is recognized for greater accuracy and efficiency. Its main breakthroughs were an increased knowledge base, better language proficiency, and better contextual comprehension [21].
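The predict-append-repeat loop described above can be illustrated with a minimal sketch, in which a hard-coded bigram table stands in for the trained neural network; this is purely illustrative and not how GPT models are implemented internally:

```python
# Toy sketch of the autoregressive loop: predict the most likely next token,
# append it, and repeat. A hand-written bigram table replaces the trained
# neural network (illustrative only).
def generate(prompt_tokens, next_token_table, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-1]                 # condition on the latest token
        candidates = next_token_table.get(context)
        if not candidates:                   # no known continuation: stop
            break
        # pick the highest-probability continuation (greedy decoding)
        next_token = max(candidates, key=candidates.get)
        tokens.append(next_token)
        if next_token == "<eos>":            # end-of-sequence marker
            break
    return tokens

table = {
    "the": {"patient": 0.6, "surgery": 0.4},
    "patient": {"underwent": 0.9},
    "underwent": {"trabeculectomy": 0.8},
    "trabeculectomy": {"<eos>": 1.0},
}
print(generate(["the"], table))
# → ['the', 'patient', 'underwent', 'trabeculectomy', '<eos>']
```

Real LLMs score every token in a large vocabulary with a neural network and often sample rather than always taking the top candidate, but the append-and-repeat structure is the same.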
Google Gemini AI
Google Bard AI (Google, https://bard.google.com/chat/) is built on the Pathways Language Model 2 (PaLM 2), a massive language model designed to be highly effective at understanding facts, logical reasoning, and arithmetic [22]. It can respond to a broad range of user queries and prompts by simulating human-like interactions, also providing thorough and educational answers to user inputs after being trained on a large corpus of text data [22]. Due to its integration with Google’s vast search network, Bard AI has access to up-to-date online data [23].
Gemini 1.0, a new AI model built in three sizes (Ultra, Pro, and Nano) and the result of extensive collaborative efforts by the Google Team, was announced on December 6, 2023, marking a major platform upgrade. Gemini Pro (Google, https://gemini.google.com/app/) is the first model currently available; it responds to text-based prompts, with multimodal reasoning skills planned for the near future [24].
Although Gemini and ChatGPT are comparable in several ways, Gemini has a remarkable aptitude for comprehending and responding to inquiries that call for exact details, while ChatGPT can effectively produce a wide range of imaginative and varied texts, including articles, scripts, and code, all of which may or may not be entirely correct [22].
Chatbot-aided surgical choice
We entered every case description into both the ChatGPT and Google Gemini interfaces. Both platforms were accessed on March 8 and 9, 2024. First, we let the LLM analyze the entire case and highlight key findings. Next, we probed each model’s ability to provide a coherent surgical choice by writing the question “What kind of surgery would you perform?” (Figs. 1 and 2). In some instances, when the answers were overly generic, we asked the model for further explanations regarding the treatment of choice, based on the specific conditions of the clinical case, asking the chatbot to choose only one treatment. We logged all responses to our initial inquiry about provisional and differential surgical choices so that the two chatbots could draw on the preceding dialogue. Finally, to assess the consistency of the chatbots’ answers, all questions were repeated three times, and any changes in surgical suggestions were recorded.
Screenshot of Google Gemini responses in the same “challenging” case as Fig. 1. A Case description and Gemini’s analysis of the case; B when asked for surgical advice, Google Gemini provided more synthetic answers rich in web sources; C when asked to choose only one treatment, Gemini frequently answered “I can’t choose one treatment for this case.” However, it was able to present a list of surgical options, even though none of them was analyzed in detail
In addition, three senior glaucoma specialists (G.G., A.B., and S.R.) were asked to analyze the same 66 cases and to define a shared surgical choice. Cohen’s coefficient analysis was used to ensure that concordance was at least 0.90. If all three operators agreed, the ideal surgical treatment for that specific case was identified. Otherwise, the experts discussed the case, and if a unanimous choice could not be reached, the case was excluded from the study. Six patients were excluded because a comprehensive scenario outline could not be obtained: they had incomplete medical histories or anamnesis, missing preoperative data or surgical records in the computer system, or had received surgery straight from the emergency department. Indeed, the lack of complete preoperative data raised doubts among the specialists and led to inconsistent chatbot answers.
Agreement between a chatbot and the specialists was considered complete only when both chose the same treatment. If the chatbot was not able to pick one treatment from the presented list, but the specialists’ answer was indeed present in the list, we defined it as “partial agreement.” In the remaining cases, the answer was considered incorrect. At the end of the dialogue with ChatGPT and Gemini, the 7 ophthalmologists included in this research graded the quality of the answers provided by the two chatbots from 1 (poor quality) to 5 (excellent quality), according to the Global Quality Score (GQS) [25].
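As a sketch, the three-level grading rule above can be expressed as a small decision function; the function and variable names are illustrative and not part of the study’s actual workflow:

```python
# Sketch of the agreement grading described above (illustrative names).
def grade_agreement(chatbot_choice, chatbot_listed_options, specialist_choice):
    """Return 'full', 'partial', or 'incorrect' per the study's rule."""
    if chatbot_choice == specialist_choice:
        return "full"       # chatbot's single pick matches the specialists
    if specialist_choice in chatbot_listed_options:
        return "partial"    # correct option listed but not chosen
    return "incorrect"      # correct option neither chosen nor listed

print(grade_agreement("trabeculectomy", ["trabeculectomy", "MIGS"], "trabeculectomy"))  # full
print(grade_agreement(None, ["MIGS", "GDD"], "GDD"))                                    # partial
print(grade_agreement("MIGS", ["MIGS"], "trabeculectomy"))                              # incorrect
```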
As the main outcome of this research, we wanted to define the frequency of accurate surgical choice made by ChatGPT and Google Gemini, compared to those of glaucoma specialists facing the same clinical cases. As secondary outcome, we compared the GQS of the two chatbots.
Global Quality Score
The GQS, introduced by Bernard et al., is a subjective rating of the overall quality of a web resource [25]. It consists of a 5-point Likert scale that takes into account the flow and ease of use of each web site [25]. The complete scale is shown in Table 1.
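As an illustration of how GQS ratings can be aggregated (per-answer mean across raters, then an overall mean ± SD across answers), here is a minimal sketch with made-up scores; the numbers are not study data:

```python
# Sketch of GQS aggregation: each rater assigns 1-5 per answer; compute the
# per-answer mean, then the overall mean and SD (illustrative numbers only).
from statistics import mean, stdev

rater_scores = [            # rows: answers; columns: the raters' 1-5 scores
    [4, 5, 4, 4, 5, 4, 4],
    [3, 3, 4, 2, 3, 3, 3],
    [5, 4, 5, 5, 4, 5, 5],
]
per_answer = [mean(scores) for scores in rater_scores]
overall = mean(per_answer)
print(f"GQS = {overall:.2f} ± {stdev(per_answer):.2f}")
```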
Statistical analysis
The statistical analysis was conducted using GraphPad PRISM software (version 9.5; GraphPad, La Jolla, CA). Data are presented as mean ± standard deviation. The χ2 test, Fisher’s exact test, and Student’s t-test were conducted where appropriate. Correlation was calculated using Spearman’s coefficient. For the GQS, the average score of each answer was collected, and an overall average score was then calculated. In all cases, p < 0.05 was considered statistically significant.
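As an illustrative re-computation (the authors used GraphPad PRISM), the 2×2 chi-square statistic for the overall agreement counts reported in this study (ChatGPT 35/60 vs. Gemini 19/60) can be computed in pure Python; treating agree/disagree as the two outcome columns is an assumption of this sketch:

```python
# Pure-Python chi-square statistic for a 2x2 contingency table, applied to
# the overall agreement counts (ChatGPT 35/60 vs. Gemini 19/60). This is an
# illustrative re-computation, not the authors' actual analysis script.
def chi_square_2x2(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# rows: ChatGPT, Gemini; columns: agree with specialists, disagree
chi2 = chi_square_2x2([[35, 25], [19, 41]])
print(f"chi2 = {chi2:.2f}")  # compare against 3.84, the p = 0.05 cutoff at df = 1
```

A statistic well above the 3.84 cutoff indicates a significant difference between the two chatbots’ agreement rates at the 0.05 level.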
Results
Overall, total agreement between specialists, in terms of surgical choices, was reached in 60 cases (40 in the “ordinary” and 20 in the “challenging” subgroups). Glaucoma phenotypes of the studied cases are summarized in Table 2.
Overall, when the questions were entered three times into the chatbots’ interfaces, ChatGPT gave consistent results in 75% of cases. In 18% of cases, the answers were consistent in 2 out of the 3 repetitions. By contrast, repeating the question to Google Gemini led to a 72% rate of inconsistency. Google’s chatbot often answered “I cannot choose a specific treatment for this case,” “I cannot definitively recommend one specific treatment for glaucoma surgery,” or “As a large language model, I am not a medical professional and cannot perform surgery.” In these cases, the medical scenarios were presented up to 5 times to the chatbots, and the most frequent answer was taken into account for statistical analysis. When answers were completely inconsistent, the scenario analysis was considered incorrect.
Overall, ChatGPT’s definitive surgical choice was consistent with that of the glaucoma specialists in 35/60 cases (58%), while Google Gemini showed concordance in surgical choices in only 19/60 cases (32%). In the remaining 42% of cases, the specialists’ choice was among the list proposed by ChatGPT in 21 cases (35%), while a completely incorrect answer was given in 4 cases (7%). Conversely, Google Gemini was not able to complete the task, when asked to make a surgical choice, in 16 cases (27%). This statement was presented 11 times among “ordinary” cases and 5 times among “challenging” cases. Of the remaining 25 cases, the specialists’ choice was among the list proposed by Gemini in 18 cases (30%), while 7 answers were considered incorrect (12%).
ChatGPT’s most frequent advice was trabeculectomy (n = 32, 53%), followed by glaucoma drainage devices (n = 11, 18%), MIGS (n = 9, 15%), and laser trabeculoplasty (n = 4, 7%). Similarly, out of the 44 completed tasks, Gemini’s most frequent answer was trabeculectomy (n = 22, 50%), followed by MIGS and GDDs (n = 9 each, 20%).
Contingency analysis showed significantly better results for ChatGPT when compared to Gemini’s performances (p = 0.0001).
ChatGPT vs. Google Gemini in “Ordinary” cases
When considering the 40 aforementioned “ordinary” scenarios, ChatGPT showed 65% of consistency with glaucoma specialists (26/40), while Gemini was consistent in 15/40 cases (38%). The difference in performance between the two was statistically significant (p = 0.004).
A comparison of the answers in some “ordinary” cases is visible in Table 3.
ChatGPT vs. Google Gemini in “Challenging” cases
Similar results were reported when considering “challenging” cases. In particular, ChatGPT was consistent with the specialists’ opinion in 9 out of 20 surgical choices, 45%, while Google Gemini performances were significantly lower (4/20 concordant answers, 20%), and in 10 cases (50%), the chatbot was not able to complete the task. Once again, contingency analysis showed significant differences regarding the performances of the two chatbots (p = 0.001).
A comparison of the answers of the two chatbots in “challenging” cases is presented in Table 4.
The percentage of agreement between the specialists’ opinion and ChatGPT’s and Google Gemini’s answers is summarized in Fig. 3.
Histograms showing A the level of agreement between ChatGPT and Google Gemini’s answers and those provided by glaucoma specialists in all cases and in “ordinary” and “challenging” scenarios. Complete agreement was assessed when the final choice of the chatbot was consistent with the one provided by specialists, while partial agreement included cases in which the correct answer was listed but not picked as the preferred choice by the chatbot; B the comparison between the Global Quality Scores assigned by ophthalmologists to the two chatbots’ performance and usability (shown as mean and standard deviation). One asterisk (*) stands for statistical difference < 0.05; two asterisks (**) stand for p < 0.01
Global Quality Score results
Overall, ChatGPT’s GQS was 3.5 ± 1.2, being significantly higher than Google Gemini’s score (2.1 ± 1.5, p = 0.002). If only “ordinary” cases were considered, ChatGPT scored 3.8 ± 0.7, while Google Gemini 2.6 ± 0.9, highlighting the significant difference between the two (p = 0.02). Similarly, a significant difference was reported regarding “challenging” conditions, in which the Google Gemini score was significantly lower when compared to ChatGPT (1.5 ± 1.4 vs. 3.0 ± 1.5, p = 0.001) (Fig. 3).
The aforementioned Google Gemini missing answers detrimentally affected the GQS, since the platform appeared significantly less user-friendly to ophthalmologists seeking advice on surgical planning.
Discussion
Artificial intelligence–powered chatbots, such as ChatGPT and Google Bard, have been heralded as revolutionary turning points in the current AI revolution. These chatbots are LLMs that employ machine learning and natural language processing to interact with people via text or voice interfaces. The possible application of LLMs in medicine has garnered a lot of attention in recent months [2, 26]. Healthcare professionals may get evidence-based, real-time advice from chatbots to enhance patient outcomes [27]. For complicated medical situations, they may provide clinical recommendations, identify possible medication interactions, and recommend suitable courses of action [28]. Chatbots are able to identify possible issues that human providers would not immediately notice, thanks to their rapid access to vast volumes of data and their processing speed. They might also provide the most recent information on recommendations and treatment alternatives, guaranteeing patients the best possible care [27, 28].
In this research, we gathered 60 medical records to examine ChatGPT-4’s and Google Gemini’s capacity to define a correct surgical plan for glaucomatous patients. In particular, to test the two chatbots’ capacity to perform correct differential diagnoses and derive coherent surgical planning, we divided our sample into “ordinary” and “challenging” scenarios and compared the given answers to those of glaucoma experts, who were asked to analyze the same pool of clinical cases. Furthermore, we used the 5-point Global Quality Score as a subjective parameter of chatbot quality, in terms of user-friendliness, speed of use, and accuracy and exhaustiveness of responses.
Overall, ChatGPT showed an acceptable rate of agreement with the glaucoma specialists (58%), while being able to provide a coherent list of answers in another 35% of cases, limiting completely incorrect answers to 7%. ChatGPT’s results significantly outperformed Gemini’s in this setting. Google’s chatbot indeed showed high unreliability, with only a 32% rate of agreement with specialists and a 27% rate of non-completed tasks. In particular, Gemini often stated “As a large language model, I am not a medical professional and cannot perform surgery” or “I cannot definitively recommend one specific treatment for glaucoma surgery.” Moreover, while GPT-4 was almost always able to analyze the clinical case in detail and develop coherent reasoning to offer an unambiguous choice, Gemini’s answers were much more generic and synthetic, even though the cited literature appeared coherent.
As expected, in the analysis of “ordinary” clinical cases, the agreement of the two LLMs with the specialists’ opinion was higher, reaching 65% for ChatGPT and 38% for Google Gemini. However, ordinary cases often allow multiple surgical or parasurgical options depending on the surgeon’s preference. When analyzing only clearly incorrect answers, GPT-4 missed only 8% of cases, while Gemini had higher rates of errors (13%) and missing answers (15%).
When focusing on “challenging” scenarios, both chatbots’ performance decreased, but ChatGPT proved much more accurate than Gemini (45% vs. 20% agreement with specialists). Certainly, it is important to acknowledge that complex cases require multimodal management and that, at the same time, different kinds of treatments should be evaluated. Nevertheless, even when not concordant with the specialists’ opinions, ChatGPT showed a remarkable capability to analyze all parts of the scenario, as visible in the answers in Table 4, being able to propose combined surgeries (e.g., cataract, vitrectomy, artificial iris implantation) in a comprehensive treatment plan. On the other hand, Gemini’s analysis in these cases was often scarce and incomplete, failing to define a thorough surgical plan.
These results were also reflected in the GQS, which was significantly higher for ChatGPT (3.5 ± 1.2) than for Gemini (2.1 ± 1.5), and this difference widened when focusing on “challenging” cases. ChatGPT was indeed more user-friendly and able to perform a more in-depth analysis of the clinical cases, giving specific and coherent answers even in the highly specific context of surgical glaucoma. Moreover, when asked to choose only one treatment, it was always able to pick one of the previously listed treatments, unlike Gemini.
The effectiveness of ChatGPT in ophthalmology has already been investigated: a recent study showed that ChatGPT achieved an accuracy of 59.4% answering questions from the Ophthalmic Knowledge Assessment Program (OKAP) test, while on the OphthoQuestions testing set it achieved 49.2% [6]. ChatGPT was also much more accurate than Isabel Pro, one of the most popular and accurate diagnostic assistance systems, in terms of diagnostic accuracy in ophthalmology cases [29]. Similarly, the diagnostic capabilities of this LLM were studied by Delsoz et al. in glaucoma patients, reporting 72.7% accuracy in preliminary diagnosis [30]. Furthermore, compared to ophthalmology residents, ChatGPT consistently produced a higher number of differential diagnoses [30]. Recently, Kianian et al. highlighted ChatGPT’s effectiveness in creating content and rewriting information about uveitis to make it easier to digest and help patients learn more about this pathology and treat it more adequately [31].
Since its recent introduction, few studies have compared the performance of the former Google Bard with that of ChatGPT in healthcare settings. Gan et al. compared the performance of the two chatbots in triaging patients in mass casualty incidents, showing that Google Bard achieved a 60% rate of correct triages, similar to that of medical students, while ChatGPT had a significantly higher rate of over-triage [9]. Koga et al. analyzed LLMs’ ability to generate differential diagnoses of neurodegenerative disorders based on clinical summaries: ChatGPT-3.5, ChatGPT-4, and Google Bard included the correct diagnosis in 76%, 84%, and 76% of cases, respectively, demonstrating that LLMs can predict pathological diagnoses with reasonable accuracy [8].
The theoretical advantage of using an LLM in a clinical setting lies in more accurate, fast, and impartial answers, available at any time. Moreover, these models are able to train actively, based on reinforcement-learning capabilities, allowing them to improve over time and rectify prior errors [30]. Although these models require less human monitoring and supervision for active training than supervised learning models, their training data contain non-peer-reviewed sources that may include factual inaccuracies [32]. It is important to note that, due to a large overlap between physiological and pathological parameters, glaucoma management is extremely subjective and has low agreement even among highly competent glaucoma experts [33]. In this setting, AI chatbots may not be able to handle circumstances requiring human judgment, empathy, or specialized knowledge; regular human oversight and evaluation thus remain essential before any final judgment is made and the necessary steps are taken [34]. In our research, ChatGPT outperformed Google Gemini in terms of surgical choice, giving more specific answers in the majority of clinical cases. These results are consistent with previous research conducted by our group, in which ChatGPT and Google Gemini were asked to face surgical cases of retinal detachment. In that setting, GPT-4 reached an 84% rate of agreement with vitreoretinal surgeons, while Gemini reached only 70% [35]. Notably, those results are significantly higher than those reported here in a glaucoma setting. We hypothesize that surgical decisions in vitreoretinal surgery are much more limited and follow more precise criteria, compared to the wide range of glaucoma surgical treatments and the possible overlap between them.
However, in both studies, ChatGPT showed better performances in terms of scenario analysis and surgical planning, suggesting that the newly presented Google Gemini still lacks optimization in the medical field.
Furthermore, although ChatGPT has demonstrated encouraging results, its immediate application in clinical care settings may be constrained. Its inability to interpret diagnostic data, such as eye fundus pictures or visual fields, may impair its capacity to conduct comprehensive examinations and furnish glaucoma specialists with accurate diagnoses. Considering how much ophthalmology depends on visual examination and imaging for patient diagnosis and treatment, it appears necessary to include additional transformer models, such as the Contrastive Language-Image Pretraining model, that can handle different data sources [36].
To our knowledge, this is the first investigation in which AI chatbots were asked to outline a surgical approach in a glaucoma setting. Moreover, we focused on a very specific subject, surgical glaucoma, to analyze the computing performance of ChatGPT and Google Gemini. Nevertheless, our research has several limitations. First, since the case descriptions were retrospective in nature, missing data could have influenced the chatbots’ answers. Second, we concentrated on a limited sample size, making further studies necessary to assess the large-scale applicability of our findings. Additionally, the comparison with glaucoma surgeons’ choices does not always define the best possible treatment, possibly limiting the repeatability of these results.
In conclusion, LLMs have the potential to revolutionize ophthalmology. In the future, particularly with the implementation of new inputs such as video or images, AI-based chatbots might become reliable companions in clinical and surgical practice. LLMs have already shown their value in ophthalmology teaching, and we demonstrated that ChatGPT-4 has the potential to coherently analyze medical records of glaucomatous patients, showing a good level of agreement with knowledgeable glaucoma experts. On the other hand, Google Gemini showed strong limitations in this setting, presenting high rates of imprecise or missed answers, and thus still requires significant updates before effective application in the clinic.
Data availability
The data that support the findings of this study are available from the corresponding author, MMC, upon reasonable request.
References
Ozdemir S (2023) Quick start guide to large language models: strategies and best practices for using ChatGPT and other LLMs. Addison-Wesley Professional
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW (2023) Large language models in medicine. Nat Med 29:1930–1940
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Scharli N, Chowdhery A, Mansfield P, Demner-Fushman D, Aguera YAB, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V (2023) Large language models encode clinical knowledge. Nature 620:172–180. https://doi.org/10.1038/s41586-023-06291-2
Nath S, Marie A, Ellershaw S, Korot E, Keane PA (2022) New meaning for NLP: the trials and tribulations of natural language processing with GPT-3 in ophthalmology. Br J Ophthalmol 106:889–892. https://doi.org/10.1136/bjophthalmol-2022-321141
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepano C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V (2023) Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2:e0000198. https://doi.org/10.1371/journal.pdig.0000198
Antaki F, Touma S, Milad D, El-Khoury J, Duval R (2023) Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci 3:100324. https://doi.org/10.1016/j.xops.2023.100324
Siad S (2023) The promise and perils of Google’s Bard for scientific research. AI 1:1–5
Koga S, Martin NB, Dickson DW (2023) Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders. Brain Pathol 8:e13207
Gan RK, Ogbodo JC, Wee YZ, Gan AZ, González PA (2024) Performance of Google Bard and ChatGPT in mass casualty incidents triage. Am J Emerg Med 75:72–78
Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Sullivan PLZ, Cielo D, Oyelese AA, Doberstein CE, Telfeian AE (2022) Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank. Neurosurgery. https://doi.org/10.1227/neu.0000000000002551
Team G (2024) Bard becomes Gemini: try Ultra 1.0 and a new mobile app today. Google, Inc. https://blog.google/products/gemini/bard-gemini-advanced-app/
Yousefi S, Pasquale LR, Boland MV, Johnson CA (2022) Machine-identified patterns of visual field loss and an association with rapid progression in the ocular hypertension treatment study. Ophthalmology 129:1402–1411. https://doi.org/10.1016/j.ophtha.2022.07.001
Medeiros FA, Jammal AA, Thompson AC (2019) From machine to machine: an OCT-trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology 126:513–521. https://doi.org/10.1016/j.ophtha.2018.12.033
Yousefi S (2023) Clinical applications of artificial intelligence in glaucoma. J Ophthalmic Vis Res 18:97–112. https://doi.org/10.18502/jovr.v18i1.12730
European Glaucoma Prevention Study G, Miglior S, Pfeiffer N, Torri V, Zeyen T, Cunha-Vaz J, Adamsons I (2007) Predictive factors for open-angle glaucoma among patients with ocular hypertension in the European Glaucoma Prevention Study. Ophthalmology 114:3–9. https://doi.org/10.1016/j.ophtha.2006.05.075
Le A, Mukesh BN, McCarty CA, Taylor HR (2003) Risk factors associated with the incidence of open-angle glaucoma: the visual impairment project. Invest Ophthalmol Vis Sci 44:3783–3789. https://doi.org/10.1167/iovs.03-0077
Jonas JB, Aung T, Bourne RR, Bron AM, Ritch R, Panda-Jonas S (2017) Glaucoma. Lancet 390:2183–2193. https://doi.org/10.1016/S0140-6736(17)31469-1
Bovee CE, Pasquale LR (2017) Evolving surgical interventions in the treatment of glaucoma. Semin Ophthalmol 32:91–95. https://doi.org/10.1080/08820538.2016.1228393
Sunaric Megevand G, Bron AM (2021) Personalising surgical treatments for glaucoma patients. Prog Retin Eye Res 81:100879. https://doi.org/10.1016/j.preteyeres.2020.100879
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901
OpenAI (2023) GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. OpenAI San Francisco, CA, USA. https://chat.openai.com/
Singh SK, Kumar S, Mehra PS (2023) Chat GPT & Google Bard AI: a review. In: 2023 International Conference on IoT, Communication and Automation Technology (ICICAT). IEEE, pp 1–6
Thoppilan R, De Freitas D, Hall J, Shazeer N, Kulshreshtha A, Cheng H-T, Jin A, Bos T, Baker L, Du Y (2022) LaMDA: language models for dialog applications. arXiv:2201.08239
Pichai S, Hassabis D (2023) Introducing Gemini: our largest and most capable AI model. Google. Retrieved December 8, 2023. https://blog.google/intl/en-africa/company-news/technology/introducing-gemini-our-largest-and-most-capable-ai-model/
Bernard A, Langille M, Hughes S, Rose C, Leddin D, Van Zanten SV (2007) A systematic review of patient inflammatory bowel disease information resources on the World Wide Web. Am J Gastroenterol 102:2070–2077
Dave T, Athaluri SA, Singh S (2023) ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell 6:1169595. https://doi.org/10.3389/frai.2023.1169595
Pryss R, Kraft R, Baumeister H, Winkler J, Probst T, Reichert M, Langguth B, Spiliopoulou M, Schlee W (2019) Using Chatbots to support medical and psychological treatment procedures: challenges, opportunities, technologies, reference architecture. Digital Phenotyping and Mobile Sensing: New Developments in Psychoinformatics 1:249–260
Zagabathuni Y (2022) Applications, scope, and challenges for AI in healthcare. Int J 10:195–199
Ren LY (2019) Product: Isabel Pro – the DDX generator. J Can Health Libr Assoc 40:63–69
Delsoz M, Raja H, Madadi Y, Tang AA, Wirostko BM, Kahook MY, Yousefi S (2023) The use of ChatGPT to assist in diagnosing glaucoma based on clinical case reports. Ophthalmol Ther 12:3121–3132. https://doi.org/10.1007/s40123-023-00805-x
Kianian R, Sun D, Crowell EL, Tsui E (2023) The use of large language models to generate education materials about uveitis. Ophthalmol Retina 8(2):195–201. https://doi.org/10.1016/j.oret.2023.09.008
Alser M, Waisberg E (2023) Concerns with the usage of ChatGPT in academia and medicine: a viewpoint. Am J Med Open 9(100036):1–2
Marks J, Harding A, Harper R, Williams E, Haque S, Spencer A, Fenerty C (2012) Agreement between specially trained and accredited optometrists and glaucoma specialist consultant ophthalmologists in their management of glaucoma patients. Eye 26:853–861
Fisher S, Rosella LC (2022) Priorities for successful use of artificial intelligence by public health organizations: a literature review. BMC Public Health 22:2146
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, D’Onofrio NC, Rizzo S (2024) Exploring AI-chatbots’ capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol. https://doi.org/10.1136/bjo-2023-325143
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. arXiv:2103.00020
Acknowledgements
None.
Funding
Open access funding provided by Università Cattolica del Sacro Cuore within the CRUI-CARE Agreement. No funding was received for this research.
Author information
Authors and Affiliations
Contributions
Conceptualization, M.M.C.; methodology, M.M.C. and G.G.; validation, A.S.; formal analysis, E.C.; investigation, M.M.C.; writing—original draft preparation, M.M.C; writing—review and editing, F.G. and F.B.; project administration, S.R. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Not applicable.
Consent for publication
No sensitive data were published.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Carlà, M.M., Gambini, G., Baldascino, A. et al. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol (2024). https://doi.org/10.1007/s00417-024-06470-5