Introduction

Clinical practice is rapidly changing in response to digital and technological advances, including artificial intelligence (AI) [1,2,3,4]. ChatGPT [5] is a freely accessible generative AI chatbot that proficiently replies to users’ queries in a human-like manner [6]. Since its first public release (November 2022), ChatGPT’s popularity has grown exponentially: by January 2023, its user base had exceeded 100 million [7]. ChatGPT was not specifically developed to provide healthcare advice, but it replies to a wide range of questions, including health-related ones [8]. Thus, ChatGPT could represent an easily accessible resource for seeking health information and advice [9, 10].

The search for health information is particularly relevant for people living with chronic diseases [11], who inevitably face innumerable challenges, including communication with their healthcare providers. We focused on people living with Multiple Sclerosis (PwMS) as a model of chronic disease that can provide insights applicable to the broader landscape of chronic conditions. The young age at onset of MS results in a highly digitalized patient population, familiar with mobile health apps, remote monitoring devices, and AI-based tools [12]. The increasing engagement of patients with AI platforms to ask questions about their MS is a reality that clinicians will likely need to confront.

We conducted a comparative analysis of the perspective of PwMS on two alternative responses to four frequently asked health-related questions. The responses were authored by a group of neurologists and by ChatGPT, with PwMS unaware of which were formulated by neurologists and which were generated by ChatGPT. The aim was to assess patients’ preferences, overall satisfaction, and perceived empathy for the two options.

Methods

Study design and form preparation

This was an Italian multicenter, cross-sectional study conducted from 09/01/2023 to 09/20/2023. Study conduct and data reporting followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [13].

The study was conducted within the activities of the “Digital Technology, Web and Social Media” study group [14], which includes 205 neurologists, affiliated with the Italian Society of Neurology (SIN). The study invitation was disseminated to all members of the group via the official mailing list.

Following the invitation to participate, neurologists were required to meet the following criteria:

  i. Dedicate over 50% of their clinical time to MS care and be active outpatient practitioners;

  ii. Regularly receive and respond to patient-generated emails or engage with patients on web platforms or other social media.

Only the 34 neurologists who met the specified criteria were included and, using Research Randomizer [15], were randomly assigned to four groups. Demographic information is presented in Table 1 of Online Resources.

  • Group A, including 5 neurologists, was required to compile a list of frequently asked questions based on actual queries from PwMS received via e-mail in the preceding 2 months. The final list drafted by Group A comprised fourteen questions.

  • Group B, including 19 neurologists, had to identify, among the fourteen questions elaborated by Group A, the four they deemed most common and most relevant for clinical practice. Group B was deliberately designed as the largest group to ensure a more comprehensive and representative selection of the questions. The four identified questions were: 1. “I feel more tired during the summer season, what shall I do?”; 2. “I have had a new brain MRI, and there is one new lesion. What should I do?”; 3. “Recently, I’ve been feeling tired more easily when walking long distances. Am I relapsing?”; 4. “My primary care physician has given me a prescription for an antibiotic for a urinary infection. Is there any contraindication for me?”.

  • Group C, including 5 neurologists, elaborated the responses to the four questions identified as the most common by Group B. The responses were collaboratively formulated through online meetings, and any discrepancies were resolved through discussion and consensus.

    Afterwards, the same four questions were submitted to ChatGPT 3.5, which provided its own version of the answers. Then:

  • Group D, including 5 neurologists, carefully reviewed the responses generated by ChatGPT to identify any inaccuracies in the medical information or discrepancies from current recommendations before submission to PwMS (none were identified and, thus, no changes were required).

Questions and answers are presented in the full version of the form in Online Resources.

Subsequently, we designed an online form to explore the perspective of PwMS on the two alternative responses to the four common questions: those authored by the Group C neurologists and those generated by the freely accessible AI tool (ChatGPT). PwMS were unaware of whether the responses were formulated by neurologists or generated by ChatGPT. The workflow is illustrated in Fig. 1 in Online Resources.

The study was conducted in accordance with the Declaration of Helsinki guidelines for research involving human subjects, and patients’ informed consent was obtained at the outset of the survey. The Ethical Committee of the University of Campania “Luigi Vanvitelli” approved the study (protocol number 0014460/i).

MS population and variables

PwMS were invited to participate in the study by their neurologists through different communication tools, such as institutional e-mail and instant messaging platforms. A total of 2854 invitations were sent from 09/01/2023 to 09/20/2023.

The study covariates included demographic information (year of birth, sex, area of residence in Italy, and level of education, defined as elementary school, middle school, high school graduate, or college graduate, the latter encompassing both a bachelor’s degree and post-secondary education) and clinical characteristics (depressed mood, subjective memory and attention deficits, year of MS diagnosis, and MS clinical descriptors: relapsing–remitting—RRMS, secondary-progressive—SPMS, primary-progressive—PPMS, or “I don’t know”). The occurrence of depressed mood was surveyed using the Patient Health Questionnaire-2 (PHQ-2) scale [16], chosen for its rapidity of completion and its widespread use in previous online studies including PwMS [17, 18]. Subjective memory and attention deficits were investigated by directly asking patients about their experience of these symptoms (yes/no).
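For context, the PHQ-2 sums two items (little interest or pleasure in doing things; feeling down, depressed, or hopeless), each rated from 0 (“not at all”) to 3 (“nearly every day”) over the preceding two weeks. The sketch below illustrates conventional scoring only; the cut-off of ≥ 3 is the standard screening threshold and is an assumption here, as the exact operationalization used in our form is reported in Online Resources.

```python
# Minimal sketch of conventional PHQ-2 scoring; the >= 3 cut-off is the standard
# screening threshold and is assumed here, not taken from the study form itself.
def phq2_screen(interest_item: int, mood_item: int, cutoff: int = 3) -> bool:
    """Return True if the summed PHQ-2 score meets the screening cut-off."""
    if not (0 <= interest_item <= 3 and 0 <= mood_item <= 3):
        raise ValueError("PHQ-2 items are rated 0-3")
    return (interest_item + mood_item) >= cutoff

print(phq2_screen(1, 2))  # True: a total of 3 meets the conventional cut-off
```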

Preference between the two alternative responses was investigated by asking patients to express their choice. Furthermore, for each response, a Likert scale ranging from 1 to 5 was provided to assess overall satisfaction (higher scores indicating higher satisfaction). The Consultation and Relational Empathy (CARE) scale [19] was employed to evaluate the perceived empathy of each response (higher scores indicating higher empathy). The CARE scale measures empathy within the context of the doctor-patient relationship and was ultimately selected for its intuitiveness, ease of completion, and prior use in online studies [20, 21]. Given the digital nature of our research, we made a single wording adjustment to the CARE scale to better align it with our study. Further details on the form (Italian original and English translation) and the measurement scales are presented in Online Resources.

Prior to submitting the form to the patients, the overall readability level was assessed. All responses elaborated by the neurologists and by ChatGPT were analysed with two tools validated for the Italian language: the Gulpease index [22] and Read-IT (version 2.1.9) [23]. This step was deemed meaningful for a thorough and comprehensive appraisal of all possible factors that could influence patients’ perceptions.
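For reference, the Gulpease index is conventionally computed from counts of letters, words, and sentences and is calibrated for Italian text:

\(\text{Gulpease} = 89 + \dfrac{300 \times \text{sentences} - 10 \times \text{letters}}{\text{words}}\)

Scores range from 0 to 100, with higher values indicating easier text; values below 60 are conventionally considered difficult for readers with a middle-school education, and values below 40 difficult for readers with a high-school education.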

Statistical analysis

The study variables were described as mean (standard deviation), median (range), or number (percent), as appropriate.

The likelihood of selecting the ChatGPT response for each question was evaluated through a stratified analysis employing logistic regression models; this approach was adopted to address variations in the nature of the questions. The selection rate of answers generated by ChatGPT was assessed using Poisson regression models. The continuous outcomes (average satisfaction and average CARE scale scores for ChatGPT responses) were assessed using mixed linear models with robust standard errors accounting for heteroskedasticity across patients. Covariates were age, sex, treatment duration, clinical descriptors, presence of self-reported cognitive deficits, presence of depressive symptoms, and educational attainment. For each categorical variable, the software used the first level (ordered alphabetically or numerically) as the reference group, to ensure straightforward interpretation of coefficients and effects.
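A minimal sketch of this modelling strategy is given below, written in Python with statsmodels as a stand-in for the Stata 17.0 code actually used. All dataset, variable, and column names are hypothetical, the Poisson model is assumed to be fitted on the per-patient count of ChatGPT answers selected, and the robust standard errors applied in Stata are not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

covars = ("age + C(sex) + treatment_duration + C(ms_course) + "
          "C(cognitive_deficit) + C(depressive_symptoms) + C(education)")

# One row per patient and question, with a 0/1 flag for choosing the ChatGPT answer.
choices = pd.read_csv("choices.csv")

# 1) Stratified logistic models, one per question: odds of selecting the ChatGPT answer.
for question, sub in choices.groupby("question"):
    logit_fit = smf.logit(f"chose_chatgpt ~ {covars}", data=sub).fit(disp=False)
    print(question, np.exp(logit_fit.params))  # exponentiated coefficients = odds ratios (OR)

# 2) Poisson model on the per-patient count of ChatGPT answers selected (0-4).
per_patient = (choices.groupby("patient_id")
               .agg(n_chatgpt=("chose_chatgpt", "sum"),
                    **{c: (c, "first") for c in
                       ["age", "sex", "treatment_duration", "ms_course",
                        "cognitive_deficit", "depressive_symptoms", "education"]})
               .reset_index())
poisson_fit = smf.poisson(f"n_chatgpt ~ {covars}", data=per_patient).fit(disp=False)
print(np.exp(poisson_fit.params))  # incidence rate ratios (IRR)

# 3) Mixed linear model for a continuous outcome (satisfaction or CARE score):
#    one row per patient, question, and response source; random intercept per patient.
#    MixedLM is used here only to illustrate the model form.
ratings = pd.read_csv("ratings.csv")
mixed_fit = smf.mixedlm(f"care_score ~ C(is_chatgpt) + {covars}",
                        data=ratings, groups=ratings["patient_id"]).fit()
print(mixed_fit.summary())
```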

The results were reported as adjusted coefficient (Coeff), odds ratio (OR), incidence rate ratio (IRR), 95% confidence intervals (95% CI), and p values, as appropriate. The results were considered statistically significant for p < 0.05. Statistical analyses were performed using Stata 17.0.

Results

The study included 1133 PwMS (age, 45.26 \(\pm\) 11.50 years; females, 68.49%), with an average response rate of 39.70% (1133/2854 invitations). Demographic and clinical characteristics are summarized in Table 1.

Table 1 Demographic and clinical characteristics of the study population (N = 1133)

Table 2 provides an overview of participant preferences, mean satisfaction (rated on a Likert scale ranging from 1 to 5), and CARE scale scores for each response by ChatGPT and neurologists to the four questions.

Table 2 Participant preferences, mean satisfaction (rated on a Likert scale ranging from 1 to 5), and CARE scale mean scores for each response

Univariate analyses did not show significant differences in preferences. However, after adjusting for factors potentially influencing the outcome, the rate of selecting ChatGPT responses was lower for college graduates than for respondents with a high school education (IRR = 0.87; 95% CI = 0.79, 0.95; p < 0.01).

Further analysis of each individual question yielded additional findings, summarized in Table 3.

Table 3 Multivariate logistic regressions

Although there was no association between ChatGPT responses and satisfaction (Coeff = 0.03; 95% CI = − 0.01, 0.07; p = 0.157), ChatGPT responses showed higher CARE scale scores (Coeff = 1.38; 95% CI = 0.65, 2.11; p < 0.01) than the responses elaborated by neurologists. The findings are summarized in Table 4.

Table 4 Mixed linear regressions

The readability of the answers provided by ChatGPT and by the neurologists was medium, as assessed with the Gulpease index. Although similar, the Gulpease indices were slightly higher (i.e., slightly more readable) for ChatGPT’s responses than for the neurologists’ (ChatGPT: 47 to 52; neurologists: 40 to 44). These results were corroborated by the Read-IT scores, which are inversely correlated with the Gulpease index. Table 5 shows the readability of each response.

Table 5 Mean readability indices for each question

Discussion

The Internet and other digital tools, such as AI, have become a valuable source of health information [24, 25]. Seeking answers online requires minimal effort and guarantees immediate results, making it more convenient and faster than contacting healthcare providers. AI tools like ChatGPT can be viewed as a new, well-structured search engine with a simplified, intuitive interface, allowing patients to submit questions and receive direct answers without needing to navigate multiple websites [26]. However, there is a risk that internet- and AI-based sources provide incomplete or incorrect health information, along with potentially less empathetic communication [27,28,29]. Our study examined participant preferences, satisfaction ratings, and perceived empathy regarding responses generated by ChatGPT compared with those from neurologists. Interestingly, although ChatGPT responses did not significantly affect satisfaction levels, they were perceived as more empathetic than responses from neurologists. Furthermore, after adjusting for confounding factors, including education level, our results revealed that college graduates were less inclined to choose ChatGPT responses than individuals with a high school education. This highlights how individual preferences are not deterministic but may instead be influenced by a variety of factors, including age, education level, and others [30, 31].

In line with a previous study [32], ChatGPT provided sensible responses, which were deemed more empathetic than those authored by neurologists. A plausible explanation for this outcome may lie in the observation that ChatGPT’s responses showed a more informal tone when addressing patients’ queries. Furthermore, ChatGPT tended to include empathetic remarks and language implying solidarity (e.g., a welcoming remark of gratitude and a sincere-sounding invitation for further communication). Thus, PwMS, especially those with a lower level of education, might perceive confidentiality and informality as empathy. Moreover, the lower empathy shown by neurologists could be related to job well-being factors, such as feeling overwhelmed and work overload (e.g., limited time allocated to respond to patients’ queries). Even though our study did not aim to identify the reasons behind participants’ ratings, these findings might represent a potential direction for future research.

Another relevant finding was that PwMS with higher levels of education were less likely to prefer the responses developed by ChatGPT; this suggests that educational level could be a key factor in health communication. Several studies suggest that a higher degree of education is associated with a better predisposition towards AI [33] and, in general, towards online information seeking [34, 35]. Although it may seem contradictory, a predisposition toward digital technology does not necessarily align with the perception of communicative messages within the doctor-patient relationship. Moreover, people with higher levels of education may have developed greater critical skills, enabling them to better appreciate the appropriateness and precision of the language employed by neurologists [30].

In addition, in our study, the responses provided by ChatGPT showed adequate overall readability, using simple words and clear language. This could be one of the reasons making them potentially more favourable to individuals with lower levels of education and to younger people.

When examining the four questions individually, we observed varying results without a consistent pattern; however, no contradictory findings emerged.

For questions N° 2 and N° 4, PwMS who reported subjective memory and attention deficits were more likely to select the AI response. Similarly, for questions N° 1 and N° 4, a higher likelihood of preferring the ChatGPT response emerged for subjects with PPMS. This result could be attributed to the distinct cognitive profile observed in PPMS [36], which is characterized by moderate-to-severe impairment.

In addition, for question N° 1, the probability of preferring the response generated by ChatGPT decreased with increasing participant age. This result is in line with previous findings [34, 35] and further highlights the digital divide between “Digital Natives”, those born into the digital age, and “Digital Immigrants”, those who experienced the transition to digital [37, 38].

Finally, for question N° 1, PwMS with depressive symptoms showed a lower propensity to select the response generated by ChatGPT. This could suggest that ChatGPT employs a type of language and vocabulary that is less well-received by individuals with depressive symptoms than that used by neurologists, leading them to prefer the latter. Further research combining quantitative and qualitative tools is needed to deepen this insight.

Our results point to the need to tailor digital resources, including ChatGPT, to make them more accessible and user-friendly for all users, taking into account their needs and skills. This could help bridge the present gap and enable digital resources to be effective for a wide range of users, regardless of their age, education, and medical and digital background. A further significant issue is that ChatGPT lacks knowledge of the details and background data of a patient’s medical record, which could lead to inaccurate or incorrect advice.

Furthermore, as our findings showed greater perceived empathy in ChatGPT’s responses to PwMS queries, the concern is that patients may over-rely on AI rather than consulting their neurologist. Given the potential risks associated with the unsupervised use of AI, physicians are encouraged to adapt to the progressive digitization of patients. This includes not only providing proper guidance on the use of digital resources for health information seeking, but also addressing the potential drawbacks of relying solely on AI-generated results. Moreover, future research should address the possible integration of chatbots into a mixed-mode framework (an AI-assisted approach). Integrating generative AI software into the neurologist’s clinical practice could facilitate efficient communication while maintaining the human element.

Our study has several strengths and limitations. The objective was to explore the potential of AI tools, such as ChatGPT, in the interaction with PwMS by engaging patients themselves in the evaluation [9, 10]. To this aim, we replicated a patient-neurologist or patient-AI interaction scenario. The freely accessible ChatGPT version 3.5 was preferred over the newer and more advanced version 4.0 (pay per use), since most users of online services will likely seek information free of charge [39, 40].

The main limitation of our study is the relatively small number of questions; however, we deliberately selected a representative sample of queries to enhance compliance and avoid discouraging PwMS from responding [41], as form length can affect response rates. Moreover, while our study adopted MS as a model of chronic disease for its core features, the findings are not necessarily generalizable to other chronic conditions. We employed subjective self-reports of cognitive impairment, and further studies adopting objective cognitive screening measures will be needed to confirm our findings. Because a high level of education does not always correspond to high digital literacy, it will be essential to assess digital literacy in future studies. Moreover, given the predominantly digital nature of the research, we preferred recruitment via institutional e-mail and instant messaging platforms over in-person participation; still, the average response rate was in line with previous research [42]. We acknowledge that using stratified models may entail a risk of Type I error. However, stratification was applied within homogeneous subgroups and on outcomes that are contextually differentiated by the nature of the different questions; this targeted approach allows more accurate associations between predictors and outcomes to be detected, thereby minimizing the risk of Type I error. Given the nonuniform distribution of patients across education classes and the direct relationship between statistical significance and sample size, it is conceivable that this influenced the outcomes for the lower education classes (elementary and high school). However, the statistically significant difference already observed among the higher education classes suggests that similar results could be achieved by standardizing the distribution of patients within the “education” variable. This highlights the potential importance of education level in health communication, a crucial aspect warranting further exploration. Finally, we tested ChatGPT on the perceived quality of communication and did not assess its ability to make actual clinical decisions, which would require further specific studies.

Future developments should include (a) guiding the development of AI-based systems that better meet the needs and preferences of patients, taking into consideration their cultural, social, and digital backgrounds; (b) educating healthcare professionals and patients on AI’s role and capabilities for informed and responsible use; and (c) advancing research methodology in the field of remote healthcare communication.

Conclusion

Our study showed that PwMS found ChatGPT’s responses more empathetic than those of neurologists. However, ChatGPT does not yet seem fully ready to meet the needs of some categories of patients (e.g., those with high educational attainment). While physicians should prepare to interact with increasingly digitized patients, ChatGPT’s algorithms need to focus on tailoring responses to individual characteristics. Therefore, we believe that AI tools may pave the way for new perspectives in chronic disease management, serving as valuable support elements rather than alternatives.