ChatGPT fails challenging the recent ESCMID brain abscess guideline

Background With artificial intelligence (AI) on the rise, it remains unclear whether AI is able to professionally evaluate medical research and give scientifically valid recommendations. Aim This study aimed to assess the accuracy of ChatGPT’s responses to ten key questions on brain abscess diagnostics and treatment in comparison to the guideline recently published by the European Society of Clinical Microbiology and Infectious Diseases (ESCMID). Methods All ten PECO (Population, Exposure, Comparator, Outcome) questions that had been developed during the guideline process were presented directly to ChatGPT. Next, ChatGPT was additionally fed with data from the studies selected for each PECO question by the ESCMID committee. The AI’s responses were subsequently compared with the recommendations of the ESCMID guideline. Results For 17 out of 20 challenges, ChatGPT was able to give recommendations on the management of patients with brain abscess, including grade of evidence and strength of recommendation. Without data prompting, 70% of questions were answered very similarly to the guideline recommendation. In the answers that differed from the guideline recommendations, no patient hazard was present. Data input slightly improved the clarity of ChatGPT’s recommendations but led to fewer correct answers, including two recommendations that directly contradicted the guideline and were associated with a possible hazard to the patient. Conclusion ChatGPT seems to be able to rapidly gather information on brain abscesses and, in most cases, give recommendations on key questions about their management. Nevertheless, single responses could possibly harm patients. Thus, the expertise of an expert committee remains indispensable.


Introduction
Brain abscesses represent a critical and potentially life-threatening central nervous system (CNS) infection [1]. They pose significant diagnostic and therapeutic challenges, often requiring urgent medical intervention to prevent severe neurological complications or even death [2]. Historically, the management of brain abscesses has largely been guided by clinical experience and only limited studies; furthermore, no international guideline existed until recently [1]. Recognizing this gap, the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) Study Group for Infections of the Brain (ESGIB) took the initiative to develop a structured clinical guideline, addressing this need for a standardized approach [3]. Central to this guideline's creation was the formulation and assessment of ten PECO (Population, Exposure, Comparator, Outcome) questions, intended to cover the most pertinent and debated aspects of brain abscess management [3].
At the same time, the rise of artificial intelligence (AI) has heralded a transformative era in various scientific domains, including medicine [4]. The ability of AI to rapidly assimilate, process, and interpret vast data sets offers a tantalizing prospect: are time-consuming processes to create guidelines still necessary, or could AI models, trained on a wealth of medical literature, rival clinical experts in answering complex clinical questions? With AI poised to become a fundamental part of clinical research and decision-making, this study sought to evaluate its potential by pitting ChatGPT, a state-of-the-art AI language model, against the newly published ESCMID guideline on the management of brain abscesses. Specifically, we were interested in discerning whether ChatGPT could competently answer the same ten PECO questions that were central to the guideline's formation, thereby providing insights into AI's capability to support evidence-based clinical decision-making. While other studies have already pitted ChatGPT against medical guidelines, our study is the first not only in the neurological field but also to directly compare the recommendations of the AI program with those of medical experts by feeding the algorithm the same scientific literature that was used for the guideline development.

Methods
The primary aim of this study was to assess the concordance between ChatGPT's recommendations on brain abscess diagnostics and treatment, derived from two different approaches, with the recommendations of the ESCMID guideline.
The study used the ten key questions that were initially developed and appraised by the ESCMID committee for their brain abscess guideline, covering diagnostic strategies and therapeutic modalities pertinent to brain abscesses.
The first approach involved direct querying of ChatGPT. For each key question, a new chat was used. Each of the ten questions was posed directly to ChatGPT (version 4.0) without any additional context or information. To achieve greater comparability between responses, ChatGPT was then prompted to answer the key question in two sentences. ChatGPT's responses were documented verbatim for subsequent comparison. Next, ChatGPT was questioned on the grade of evidence and the strength of its recommendation, each in one sentence.
The second approach represented informed querying of ChatGPT: before posing the same ten questions to ChatGPT (version 4.0), the AI was primed with data extracted from the studies that the ESCMID committee used in formulating their guideline (literature was identified through a structured literature review process [3]). This priming involved presenting the text from these studies to the AI model. Once primed, the same questions were asked, and responses were again documented verbatim.
The responses obtained from both the direct and informed approaches were independently compared against the recommendations from the ESCMID guideline. This comparison was carried out by three independent reviewers with expertise in infectious CNS diseases (MB, JB, MK) who, as ESGIB members, also played a leading role in the development of the ESCMID guideline.
For a comprehensive evaluation of the AI's recommendations on questions about brain abscess, three scores were obtained (Table 1): (i) The first criterion reflected the clarity of the AI model's recommendation: (a) a concrete recommendation was provided, (b) a recommendation was given, but incomplete, and (c) no clear recommendation was provided. (ii) Next, we employed an alignment score that was adapted from Cakir et al. [5] to match our study design: 1 point: completely correct match with the ESCMID guideline, 2 points: correct, but inadequate (some overlap, but lacking the complete depth of the ESCMID guideline), 3 points: a mix of correct and misleading information (significant divergence from the guideline with minor overlap), and 4 points: completely incorrect (direct contradiction of the ESCMID guideline). A mean score ≤ 2.0 was rated as correct, while a mean score > 3.3 was rated as completely incorrect. Scores > 2.0 and ≤ 3.3 indicated mixed answers with correct and incorrect parts. (iii) The last assessment concerned the risk of patient harm due to ChatGPT's recommendation: (a) the recommendation presents no patient hazard, (b) a patient hazard cannot be ruled out, (c) high risk of patient harm.
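The alignment thresholds described above can be illustrated with a short sketch. This is not the authors' analysis code; the function name `classify_alignment` and the example scores are hypothetical, and the sketch simply assumes that each reviewer assigns an integer score from 1 to 4 and that the mean over reviewers is compared with the stated cut-offs.

```python
from statistics import mean


def classify_alignment(reviewer_scores):
    """Map the reviewers' alignment scores (1-4 each) to a category.

    Hypothetical helper: applies the cut-offs stated in the text
    (mean <= 2.0: correct; mean > 3.3: completely incorrect;
    anything in between: mixed).
    """
    if not reviewer_scores:
        raise ValueError("at least one reviewer score is required")
    for score in reviewer_scores:
        if score not in (1, 2, 3, 4):
            raise ValueError(f"alignment scores must be 1-4, got {score!r}")

    mean_score = mean(reviewer_scores)
    if mean_score <= 2.0:
        return "correct"
    if mean_score > 3.3:
        return "completely incorrect"
    return "mixed"


# Illustrative (made-up) scores from three reviewers for one response.
print(classify_alignment([1, 2, 2]))  # mean 1.67 -> correct
print(classify_alignment([3, 3, 4]))  # mean 3.33 -> completely incorrect
```

With three reviewers, a mean above 3.3 can only be reached if at least one reviewer assigns 4 points and the other two assign at least 3 points.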
Scores attributed to the recommendations by the three reviewers were averaged for each response, providing a consensus score for each of the two approaches per question. The scores for response clarity, alignment, and patient risk were analyzed for each approach, thus indicating the quality of ChatGPT's recommendations. Fleiss kappa values for interrater reliability were calculated using SPSS (version 29).
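For readers unfamiliar with the statistic, Fleiss' kappa can be sketched in a few lines of Python. The study itself used SPSS; the function below is only an illustrative implementation of the standard formula, and the function name and example ratings are hypothetical. It assumes every item is rated by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per rater). Assumes the same number of raters per item."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(ratings)
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items  # mean observed agreement
    total = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)


# Made-up alignment scores: one row per question, one column per reviewer.
example = [[1, 1, 2], [2, 2, 2]]
print(fleiss_kappa(example))
```

Values near 1 indicate near-perfect agreement among raters, whereas values near 0 indicate agreement no better than chance.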

ChatGPT provided mostly clear responses to key questions on brain abscess
The clarity of ChatGPT's recommendations was rated between a (yes, concrete) and b (yes, but incomplete) in 80% of responses (Table 2). When answering key questions #2 on withholding antimicrobials until neurosurgery and #10 on primary prophylactic antiepileptic treatment, the AI's responses were deemed not clear enough to guide physicians with respect to the question asked. In the survey with data prompting, the clarity of answers was overall slightly better (90%), including the answers to key questions #2 and #10.

Without data prompting, ChatGPT gave more correct recommendations than with data input
Regarding the alignment of ChatGPT's responses with the ESCMID guideline, a score from 1 (completely correct) to 4 points (completely incorrect) was assigned. Overall, the mean score without data input (2.1 points) was significantly better than with data input (2.6 points). Without data prompting, the AI answered 70% of the key questions correctly (score ≤ 2.0). ChatGPT's recommendations on withholding of antimicrobials (#2), consolidation therapy (#8), and primary prophylaxis with antiepileptics (#10) did not align with the ESCMID guideline. No recommendation by ChatGPT directly contradicted the ESCMID guideline. In the second survey after data entry, only 40% of key questions were answered correctly (score ≤ 2.0). In 60% of questions, alignment with the ESCMID guideline was lower after data entry than without data entry. Moreover, after data entry, the responses on the appropriate duration of antimicrobial therapy (#6) and on early transition to oral antimicrobials (#7) even contradicted the ESCMID guideline directly and were considered completely incorrect (score > 3.3).
Fleiss kappa values for interrater reliability in the assessment of the alignment score were 0.419 (without data entry) and 0.453 (with data entry), indicating moderate agreement (Table 2).

Patient hazard was possible in two recommendations by ChatGPT
Finally, ChatGPT's recommendations were analyzed for their potential to constitute a patient hazard. Overall, almost all recommendations by ChatGPT were judged to present no patient hazard. Interestingly, one reviewer assessed ChatGPT's answer without data prompting on the use of dexamethasone in brain abscess (#9) as even better than the ESCMID guideline's recommendation. However, for the AI's responses to key questions #6 and #7 after data input, which directly contradicted the ESCMID guideline, two out of three experts judged that a patient hazard could not be ruled out if ChatGPT's recommendation were followed.

ChatGPT provided the grade of evidence and strength of its recommendations
When asked, ChatGPT provided estimations on the grade of evidence and strength of recommendation for most of its recommendations. In longer versions of ChatGPT's answers (data not shown), the AI model repeatedly used the GRADE (Grading of Recommendations, Assessment, Development and Evaluation) system [6] to evaluate the strength of its recommendation and grade of evidence. Without data input, the grade of evidence was rated similarly to the ESCMID rating in six out of nine questions, and the strength of recommendation in seven out of nine questions. The ESCMID guideline did not provide a rating for key question #7. After data input, ChatGPT only provided the grade of evidence and strength of recommendation for seven of its recommendations. The grade of evidence was similar to the ESCMID rating in four out of six questions, but the strength of recommendation in only one out of six recommendations. In both surveys, alignment of grade of evidence and strength of recommendation with the ESCMID rating was not associated with the alignment of the content of the recommendation.

Additional remarks
As the study was conducted before the publication of the new ESCMID guideline on brain abscesses, ChatGPT stated frequently that it was working with data up until September 2021 and did not have access to any more current data. Moreover, at the end of each recommendation (in the longer versions, data not shown), the AI model stated that these decisions in patients with brain abscesses should be made in consultation with a multidisciplinary team, including infectious disease specialists, neurologists, and neurosurgeons. It also added that as medical knowledge and practices evolve, the most current guidelines should be consulted.

Discussion
In summary, ChatGPT was able to give recommendations on the management of patients with brain abscess for most of the key questions, including an assessment of grade of evidence and strength of recommendation. Without data prompting, 70% of questions were answered correctly and no patient hazard was present. However, in 30% of the cases, it did not come up with correct or nearly correct advice. Although data input slightly improved the clarity of ChatGPT's recommendations, it led to fewer correctly answered key questions, and two recommendations were found to directly contradict the guideline. It is alarming that a patient hazard seemed possible if ChatGPT's advice was followed.
The AI's knowledge was from before September 2021 and it had no access to more current data such as the new ESCMID guideline from 2023. It must be added that the key questions in this study cover extremely complex medical issues, some of which are controversial even among experts and for some of which hardly any robust data are available, which was one of the reasons for drawing up the guideline. As we knew which studies had been included in the ESCMID committee's answers to the key questions, we tried to optimize ChatGPT's outcome by entering the studies into ChatGPT, assuming that it would result in answers closer to the ESCMID guideline. However, our results clearly showed that this was not the case; on the contrary, the recommendations after data entry aligned less well with the ESCMID guideline. For the first approach, ChatGPT presumably drew on a wider pool of literature, including non-scientific literature. For the second approach, in contrast, the same scientific studies that were screened, reviewed, and evaluated for the guideline development following a strict protocol were fed into the AI algorithm. The fact that ChatGPT's recommendations were inferior after data entry, especially in two PECOs, might be due to an overvaluation of the few observational studies provided for key questions 6 and 7, for one of which even the guideline panel was not able to give a recommendation as the evidence was rated insufficient to answer the question. As the exact operating procedures of ChatGPT remain opaque, we hypothesized that while the AI model is able to process large amounts of data quickly, it may lack the ability to correctly classify and weight the data based on their scientific quality. Moreover, ChatGPT only seemed to take the last chat entry into consideration for answering the key question (#1), leading to a wrong response. It remains unclear exactly which data are used for ChatGPT's responses, as the exact proceedings of the AI could not be traced. It can, therefore, be concluded that entering study data into ChatGPT does not necessarily improve its medical recommendations. It should be noted, though, that our findings are a temporary observation and that the quality of ChatGPT's recommendations should be re-evaluated on an ongoing basis as the model evolves and is developed further.
Of note, kappa values for interrater reliability showed only moderate agreement in the assessment of alignment between ChatGPT and the guideline. The fact that the three reviewers took part in creating the ESCMID guideline might have influenced their assessments of the concordance of recommendations. To mitigate this risk, predefined scores were used to render the assessments more objective.
On the topic of post-colonoscopy management, ChatGPT provided responses with 90% adherence to guidelines and 85% accuracy [13], suggesting beneficial use for healthcare providers and patients. ChatGPT's recommendations on the management of lumbar spinal stenosis were also in line with findings in the current literature [9]. When asked guideline-based questions on urological topics, ChatGPT provided only 60% appropriate responses [10]. The authors of that study criticized the AI's misinterpretation of clinical care guidelines as well as its dismissal of important context. Similarly, the agreement between answers by ChatGPT and guideline recommendations on five hepato-pancreatico-biliary conditions was also 60% [14].
In this context, the accuracy rate of ChatGPT in our study appears to be in the range of previous comparisons of AI's recommendations with medical guidelines. Since ChatGPT's recommendations had not previously been assessed for any other neurological disease, our findings add value to the previous results, as the medical knowledge of the AI program should be assessed across a broad spectrum of diseases and medical departments.
Inconsistencies in repeated responses did not only occur in our survey with ChatGPT but were also observed in other studies on the efficacy and reproducibility of ChatGPT [11,15].
The most important limitation of current AI models lies in the lack of transparency: because ChatGPT does not disclose the sources of its answers, there is a risk that dubious literature is used that the user can neither track, verify, nor control. We hypothesize that ChatGPT might be more accurate in interpreting randomized controlled trials (RCTs) than observational studies, which would lead to less precise answers, particularly in the brain abscess field, where large RCTs are lacking. It also remains unclear to what extent ChatGPT is able to analyze and interpret data and to assign them different levels of credibility depending on the risks of bias and confounding. In longer versions of its responses to the key questions (not shown), ChatGPT added that medical experts should be consulted and that the most recent knowledge and guidelines should be used for clinical decision-making, thus attenuating its recommendations and acknowledging that blindly relying on AI might put patients at risk.

Conclusion
While ChatGPT appears at first sight to be a valuable adjunctive tool in broad clinical contexts, wrong recommendations were given for single questions. This is alarming, as it appears too dangerous to rely on recommendations given by ChatGPT in a medical context. The nuanced expertise of specialized committees remains essential, especially for complex clinical queries. As ChatGPT continues to evolve, it will be necessary to re-evaluate this question in the future.

Table 1
Scores evaluating the response quality of ChatGPT

Table 2
10 PECO questions on brain abscess with recommendations, grade of evidence and strength of recommendation by ChatGPT without and after data entry, compared to the ESCMID guideline. In another column, the score assessments of 3 reviewers on clarity, alignment with the ESCMID guideline and patient hazard are added. Fleiss kappa values at the bottom of the table indicate interrater reliability