Background

Arabic is a culturally diverse language spoken daily by over 400 million people [1]. Consequently, the Arabic language is considered an important medium for delivering health-related information to a substantial number of native speakers [2]. Ensuring access to accurate health information in a person’s native language is essential for effective communication and better health outcomes [3, 4].

From a global health perspective, the “big three” infectious diseases, namely malaria, tuberculosis (TB), and human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), remain prominent health concerns [5, 6]. Additionally, the profound impact of the coronavirus disease 2019 (COVID-19) pandemic highlighted the need for effective health communication [7]. Furthermore, influenza continues to pose significant public and global health risks with the potential to cause epidemics and pandemics; therefore, effective public health measures are needed to address influenza threats [8].

In the current digital era, lay individuals increasingly seek health information via various online platforms [9]. While these online channels, including the recently emerged artificial intelligence (AI)-based chatbots, offer convenient access to information, they also raise significant concerns about the reliability of the information provided [10,11,12]. The prevalence of misinformation or even disinformation on these platforms can pose significant risks. Lay individuals may encounter and act upon inaccurate health-related content, potentially compromising their health and well-being [13,14,15].

ChatGPT (by OpenAI, San Francisco, CA), Bing (by Microsoft Corporation, Redmond, WA), and Bard (by Google, Mountain View, CA) are AI-based conversational models that emerged as promising tools for various purposes, including facilitating the acquisition of health information [16,17,18]. These chatbots garnered notable user attention due to their ease of use and perceived effectiveness in delivering a broad spectrum of information [19]. This includes health-related content and self-diagnosis options, marking a significant advance in digital health communication and information accessibility [20,21,22,23]. Consequently, a notable surge in research interest regarding the utility of generative AI models in healthcare has been observed [24, 25]. This interest was motivated by the ability of generative AI models to rapidly synthesize and analyze large volumes of medical data, offering possibilities for personalized medicine and enhanced diagnostic accuracy [26,27,28,29,30]. Studies have focused on evaluating AI effectiveness in tasks such as generating patient education materials, simulating physician-patient interactions, and automating parts of the diagnostic process [16, 31,32,33,34,35,36,37,38,39].

The impact of linguistics on the evolution and efficacy of Large Language Models (LLMs) is profound [40, 41]. To enhance the accessibility of LLMs across various cultural and linguistic contexts, linguistic insights are important for developing LLMs capable of competent performance across multiple languages and dialects [42]. In healthcare, multilingual LLMs can equalize access to medical information and healthcare services by circumventing language barriers [16, 27, 43]. Patients and healthcare professionals who speak different languages can benefit from real-time translation services, ensuring that crucial health information is both accessible and understandable to diverse populations [44]. Such advancements are important for the successful integration of AI technologies into healthcare, improving operational efficiency and enhancing patient care [27]. The application of deep learning and AI within the healthcare sector has led to transformative developments. Examples include the development of models capable of accurately differentiating between cancerous and normal blood cells, determining the severity of COVID-19 through the analysis of radiographic images, and improving the accuracy of malaria parasite detection in blood samples [45,46,47].

While generative AI-based models like ChatGPT, Bing, and Bard are promising in disseminating health information and improving health literacy, it is crucial to recognize their limitations [16, 48]. For example, a notable issue is the occurrence of “hallucinations” where AI models generate plausible but incorrect responses [49]. This is particularly concerning in the context of health-related information, where such inaccuracies could lead to severe negative consequences [50]. Understanding and addressing the limitations of AI-based models is essential for the safe and effective use of AI in healthcare communication [16, 33, 48, 51].

The performance of generative AI-based models is highly influenced by the quality of the underlying training data [52]. Therefore, variations in AI-based model performance would reasonably be anticipated across different languages and cultural contexts [53]. Consequently, a thorough assessment of AI-based model performance across a variety of languages is needed to ensure the accuracy and reliability of these models.

To address this critical issue, this study aimed to evaluate the performance of a group of popular AI-based models, namely ChatGPT, Bing, and Bard, in English and Arabic. The study focused on one aspect of health-related information, namely queries on five infectious diseases (HIV/AIDS, TB, malaria, COVID-19, and influenza). By exploring the capabilities and shortcomings of these AI-based models in the context of health information dissemination in Arabic, the study aimed to highlight the need to enhance the quality of healthcare content provided to native speakers of Arabic for better health outcomes within Arab communities. Additionally, the study sought to identify potential disparities in the language performance of AI-based models, which are predominantly trained on English datasets.

The evaluation of generative AI-based models across Arabic and English, particularly within the context of infectious diseases, is important for several reasons. First, infectious diseases remain a global health concern, necessitating rapid communication and dissemination of accurate information, as was evident during the COVID-19 pandemic [54]. The deployment of AI models can facilitate immediate health guidance and insights, but consistent performance across languages is needed to ensure effective public health communication [55, 56]. Additionally, variability in the performance of generative AI models across languages may create disparities in access to accurate and dependable health information [57]. Such disparities have the potential to amplify health inequities. Therefore, evaluating the applicability of generative AI models in handling queries related to infectious diseases across different linguistic contexts is essential to identify and address potential deficiencies and to ensure equitable access to health information across the globe.

Methods

Study design

This descriptive study was designed following the METRICS checklist for AI-based studies in healthcare [58, 59]. This framework involves careful consideration of the features and settings of the AI models, a detailed evaluation methodology, and clear specifications of prompts, languages, and data sources. Additionally, the study rigorously addressed factors such as the count of queries, the individual factors in query selection, and the subjectivity inherent in evaluation of the generated content. The study design also considered the issues of randomization and the range of topics tested, adhering to the principles of transparency and thoroughness.

Ethics statement

This study was approved by the Institutional Review Board (IRB) at the Faculty of Pharmacy, Applied Science Private University (Approval number: 2023-PHA-51, on 23 November 2023).

Features of the AI models tested

Four AI-based models were employed in this study: two versions of ChatGPT (the publicly available GPT-3.5 and the more advanced, subscription-based GPT-4), Microsoft Bing using the more balanced conversational style, and Google Bard Experiment. Bing and Bard were both available for free. To ensure content replicability, each model was tested under its default configuration. The prompting of these AI models was carried out on a single day, 23 December 2023, by the first author (M.S.) to maintain consistency and control for time-sensitive variables in their performance assessment.

Features and count of the queries used to test the AI models

In this study, 15 distinct queries were executed on each AI model. This query count was based on the calculated sample size necessary for comparing means between two groups: $n = \frac{2\sigma^{2}\,(Z_{\alpha/2} + Z_{\beta})^{2}}{d^{2}}$, considering a 90% confidence level, an 80% desired power, and an assumed difference and variance of 1 [60]. This yielded a minimum of 13 queries to elucidate possible differences between the two languages effectively. This decision was guided by the aim to effectively examine the AI-generated responses, while also accommodating the operational constraints imposed by the rate limits of the AI models.
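
The arithmetic behind the minimum of 13 queries can be checked with a short sketch. It assumes a two-sided critical value for the 90% confidence level and takes the stated variance and difference of 1 at face value; the function name `queries_per_language` is illustrative and does not come from the study.

```python
import math
from statistics import NormalDist


def queries_per_language(confidence=0.90, power=0.80, sigma=1.0, d=1.0):
    """Minimum sample size per group for comparing two means.

    Implements n = 2 * sigma^2 * (Z_{alpha/2} + Z_beta)^2 / d^2,
    assuming a two-sided test (an assumption of this sketch).
    """
    alpha = 1.0 - confidence
    z_alpha_half = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.645
    z_beta = NormalDist().inv_cdf(power)                     # about 0.842
    n = 2.0 * sigma ** 2 * (z_alpha_half + z_beta) ** 2 / d ** 2
    return math.ceil(n)  # round up to the next whole query


print(queries_per_language())  # 13
```

With these inputs the unrounded value is roughly 12.4, which rounds up to the 13 queries reported; the 15 queries actually used sit slightly above that minimum.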

Sources of data to formulate the infectious disease queries

The queries purposefully examined five common infectious diseases, focusing on transmission, treatment, diagnosis, prevention, and epidemiology. For each disease, three queries were randomly selected using Excel’s randomize function from a pool of 15 questions per topic to minimize selection bias. The initial pool of queries was retrieved from credible English sources and covered key questions on HIV/AIDS, malaria, TB, COVID-19, and influenza as follows [61,62,63,64,65,66,67,68,69,70]. For HIV/AIDS, the three questions were: (1) What is the extent of risk of HIV transmission through French kiss? (2) What is the extent of risk of HIV transmission through hijama? (3) Why gays have higher chance of getting HIV infection? For malaria, the three questions were: (1) Is malaria a contagious disease? (2) Is it considered safe for me to breastfeed while taking an antimalarial drug? (3) How do I know if I have malaria for sure? For TB, the three questions were: (1) Who doesn’t get sick from tuberculosis? (2) How can TB be tested for? (3) Is BCG vaccination recommended for all children? For COVID-19, the three questions were: (1) Can COVID-19 be passed through breastfeeding? (2) Can COVID-19 infection affect HIV test result? (3) What is long COVID-19 condition? Finally, for influenza, the three questions were: (1) Can I get a COVID-19 vaccine and flu vaccine at the same visit? (2) Is it possible to have both COVID-19 and flu at the same time? (3) When will flu activity begin and when will it peak?
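
The selection step can be mirrored in code as a rough sketch. The study used Excel’s randomize function; the Python version below is only an analogous illustration, and the pool contents, the seed value, and the name `draw_queries` are hypothetical placeholders rather than study materials.

```python
import random

TOPICS = ["HIV/AIDS", "malaria", "TB", "COVID-19", "influenza"]


def draw_queries(pools, per_topic=3, seed=None):
    """Draw `per_topic` questions without replacement from each topic pool."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return {topic: rng.sample(questions, per_topic)
            for topic, questions in pools.items()}


# Placeholder pools: 15 labelled stand-ins per topic, not the real questions.
pools = {t: [f"{t} question {i}" for i in range(1, 16)] for t in TOPICS}

selected = draw_queries(pools, seed=2023)  # seed chosen arbitrarily
total = sum(len(qs) for qs in selected.values())
print(total)  # 15 queries: 3 for each of the 5 topics
```

Sampling without replacement within each topic guarantees three distinct questions per disease, which matches the 15 queries per model described in the Methods.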

The questions were translated into Arabic by one bilingual author (M.B.) and back-translated by another (M.S.), with subsequent discussions between the two authors leading to minor modifications for clarity.

Specificity of prompts used

The prompting approach for each AI model involved using the prompts as exact questions without any feedback. This was ensured by selecting the “New Chat” or “New Topic” options for each query. The “Regenerate Response” feature was not utilized to maintain the integrity of first responses. Additionally, each query was initiated as a new chat or topic when switching languages to prevent any carryover effects between languages. This approach was critical to ensure that the responses for the same query in different languages were independent and not influenced by previous interactions.

Evaluation of the AI generated content

The evaluation of the AI-generated content was conducted independently by two authors with expertise in infectious disease from clinical microbiology (M.S.) and pharmacy (M.B.) perspectives. To minimize subjectivity in the evaluation process, a consensus key response was formulated prior to assessment based on the query sources. The evaluation was based on the CLEAR tool across 5 components as follows: Completeness, Lack of false information (accuracy), Evidence-based content, Appropriateness, and Relevance [71]. Each component was assessed using a 5-point Likert scale ranging from 5 (excellent) to 1 (poor).

Statistical and data analyses

Statistical analyses were conducted using IBM SPSS Statistics for Windows, Version 26, with a significance level set at P < .050. The average CLEAR scores across the two raters were utilized, including both component-specific and overall CLEAR scores. Based on the non-normal distribution of the scale variables assessed using the Shapiro-Wilk test, the Kruskal-Wallis H (K-W) and Mann-Whitney U (M-W) tests were used to test differences between groups. The overall CLEAR scores were categorized for descriptive analysis of content quality as follows: 1–1.79 as “poor”, 1.80–2.59 as “below average”, 2.60–3.39 as “average”, 3.40–4.19 as “above average”, and 4.20–5.00 as “excellent”.
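
As a minimal sketch of the descriptive labeling, the code below averages two raters’ component scores and maps the result to the categories listed above. Two points are assumptions of this sketch and are not stated in the text: the overall CLEAR score is taken as the plain mean of the five component means, and the category bounds are treated as half-open intervals so that values such as 1.795 still receive a label. The dictionary keys and function names are illustrative.

```python
from statistics import mean

COMPONENTS = ("completeness", "lack_of_false_information",
              "evidence", "appropriateness", "relevance")


def overall_clear(rater_a, rater_b):
    """Average each component across two raters, then average the components."""
    per_component = [mean((rater_a[c], rater_b[c])) for c in COMPONENTS]
    return mean(per_component)


def clear_label(score):
    """Map an overall CLEAR score (1 to 5) to its descriptive category."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("CLEAR scores range from 1 to 5")
    if score < 1.80:
        return "poor"
    if score < 2.60:
        return "below average"
    if score < 3.40:
        return "average"
    if score < 4.20:
        return "above average"
    return "excellent"


# Hypothetical ratings for a single response (illustrative values only).
rater_a = dict.fromkeys(COMPONENTS, 4)
rater_b = {**dict.fromkeys(COMPONENTS, 4), "completeness": 5}
score = overall_clear(rater_a, rater_b)
print(clear_label(score))  # above average
```

In this example the component means are 4.5, 4, 4, 4, and 4, so the overall score falls in the 3.40–4.19 band.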

To assess the consistency of evaluation between the two raters, we used the Intraclass Correlation Coefficient (ICC), average measures with two-way mixed effects, to quantify inter-rater agreement. The inter-rater reliability analysis was conducted on a set of 120 responses, evenly split between English and Arabic, with 60 responses per language. Disagreements between the two raters were not resolved through post hoc discussions after the evaluations. This was a deliberate methodological choice to maintain the objectivity of the initial independent assessments.
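
For readers unfamiliar with the statistic, the following sketch computes a two-way mixed-effects, average-measures ICC from a responses-by-raters table. The text does not state whether the consistency or the absolute-agreement definition was used; this sketch assumes consistency, often written ICC(3,k), computed as (MS_rows − MS_error) / MS_rows. The function name and the example ratings are hypothetical.

```python
def icc_consistency_average(ratings):
    """ICC(3,k): two-way mixed effects, consistency, average of k raters.

    `ratings` is a list of rows, one row per rated response,
    each row holding one score per rater.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 responses and 2 raters, no gaps")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no variance between responses; ICC is undefined")
    return (ms_rows - ms_error) / ms_rows


# Hypothetical scores from two raters for five responses.
example = [(5, 4), (4, 4), (3, 2), (5, 5), (2, 3)]
print(round(icc_consistency_average(example), 3))
```

Under the consistency definition, a rater who scores every response exactly one point higher than the other still yields an ICC of 1, because systematic offsets between raters are removed; the absolute-agreement definition would penalize that offset.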

Results

Overall performance of each AI model in English vs. Arabic

Based on the average CLEAR scores, performance in English varied across the four models: Bard performed best (mean CLEAR: 4.6 ± 0.68), followed by Bing (mean CLEAR: 4.37 ± 0.59), ChatGPT-4 (mean CLEAR: 4.36 ± 0.76), and ChatGPT-3.5 (mean CLEAR: 4.15 ± 0.68; P = .012, K-W). In Arabic, the same ranking was observed; nevertheless, the differences lacked statistical significance (mean CLEAR: 4.39 ± 0.89 for Bard, 4.21 ± 0.72 for Bing, 4.13 ± 0.97 for ChatGPT-4, and 3.81 ± 0.68 for ChatGPT-3.5; P = .082, K-W).

Consistently superior performance of the four AI models tested was noted for English queries as opposed to the Arabic content (Table 1). However, statistically significant differences were observed only with ChatGPT-3.5 and Bard. Based on the descriptive assignments of the CLEAR scores, the English content of the four AI models was described as “excellent”, while the performance of both ChatGPT models in Arabic was “above average”, as opposed to the “excellent” performance of Bing and Bard in Arabic.

Table 1 The performance of the four AI models tested in English and Arabic stratified per average CLEAR scores

Performance of each AI model stratified per CLEAR components

In stratified analysis of AI model performance across the five CLEAR components, English content consistently scored higher in 19 out of 20 comparisons (95%). The exception was Bing’s superior relevance score in Arabic compared to English. Statistically significant differences were observed with ChatGPT-3.5 and Bard. Specifically, ChatGPT-3.5 exhibited superior performance in completeness and relevance in English as opposed to Arabic content, while both ChatGPT-3.5 and Bard showed higher accuracy (lack of false information) in English. Additionally, ChatGPT-3.5 and ChatGPT-4 content in English outperformed the Arabic content in appropriateness (Table 1).

Upon evaluation of the CLEAR tool across English and Arabic, it was observed that inter-rater agreement varied across different CLEAR components (Table 2). English evaluations demonstrated lower ICC values, particularly in Completeness and Relevance, which suggests a considerable disagreement between the two raters. This contrast was less pronounced in CLEAR components assessing factual accuracy, such as Lack of False Information (Table 2). For Arabic evaluations, the agreements between the two raters were consistently higher across all CLEAR components, with less pronounced differences in raters’ assessments (Table 2).

Table 2 Intraclass correlation coefficients (ICC) for English and Arabic evaluations by the two expert raters stratified per CLEAR components

Performance of each AI model stratified per infectious disease topic

Out of the 20 comparisons across the 2 languages for the four AI models, higher average CLEAR scores were observed across all infectious disease topics in English content, with the exception of better performance in Arabic for the influenza queries in Bing and Bard (Fig. 1).

Fig. 1

Heat maps of the four artificial intelligence models’ performance in English (blue) and Arabic (red) based on infectious disease queries. Assessment was based on the average CLEAR scores. COVID-19: Coronavirus disease 2019; TB: Tuberculosis; HIV/AIDS: Human immunodeficiency virus/acquired immunodeficiency syndrome

In English, Bard topped the performance in HIV/AIDS, malaria, TB, and COVID-19, while ChatGPT-3.5 topped the performance in influenza. The lowest performance in English was seen with ChatGPT-3.5 content for HIV/AIDS and COVID-19, with Bing content for malaria and TB, and with Bard content for influenza (Fig. 2A).

Fig. 2

Box plots of the four artificial intelligence models’ performance in English (A) and Arabic (B) showing variability in CLEAR scores. Assessment was based on the average CLEAR scores. COVID-19: Coronavirus disease 2019; TB: Tuberculosis; HIV/AIDS: Human immunodeficiency virus/acquired immunodeficiency syndrome

In Arabic, Bard also topped the performance in four topics (TB, COVID-19, influenza, and malaria, the last jointly with ChatGPT-4), while the best performance for HIV/AIDS was observed for Bing. The lowest performance per topic in Arabic was seen with ChatGPT-3.5 content for HIV/AIDS, malaria, and COVID-19, with Bing content for TB, and with ChatGPT-4 content for influenza (Fig. 2B).

Descriptive labeling of the performance of each AI model in English vs. Arabic

Compiled together, as shown in Fig. 3, the overall performance of the four models in English was “excellent” with a mean CLEAR score of 4.6 ± 0.4, while in Arabic it was “above average” with a mean CLEAR score of 4.1 ± 0.82 (P = .002, M-W).

Fig. 3

Error bars of the four artificial intelligence models’ performance compiled together and showing the five CLEAR components and the overall CLEAR scores stratified per language. SE: Standard error of the mean; C: Completeness; L: Lack of false information; E: Evidence support; A: Appropriateness; R: Relevance

Discussion

In this study, we investigated one crucial aspect of the utility of generative AI models in the acquisition of health information, namely the hypothesis that a language disparity exists in generative AI model performance. Specifically, the study was conducted in the context of infectious diseases, which represent a significant global health burden. Such an investigation appears timely and relevant as generative AI models are increasingly accessed by lay individuals for health information [21, 72]. Concerns have emerged regarding the potential of generative AI models to produce harmful or misleading content, with recurring calls for ethical guidance, benchmarking, and human oversight [73,74,75,76].

The key finding in this study was the overall lower performance of the tested AI models in Arabic compared to English. In this study, the overall Arabic performance of generative AI models in the context of infectious disease queries could be labeled as “above average” as opposed to “excellent” performance in English. Additionally, the differences in performance across the two languages showed statistical significance in ChatGPT-3.5 and Bard. Another important observation was the uniformly excellent performance of the four generative AI models in English. This consistency highlights the effectiveness of these models in the English language in the context of infectious disease queries. Additionally, a consistent pattern where the four AI models exhibited superior performance in English extended across all the five tested infectious disease topics. However, a notable variability in performance in Arabic was evident, particularly in handling topics related to HIV/AIDS, TB, and COVID-19.

The disparity in generative AI model performance across languages may be attributed to the varying qualities of the AI training datasets [77]. Prior research that sought to characterize such disparity in generative AI model performance across languages remains limited with variable results despite its timeliness and significance [78,79,80,81,82,83]. This includes even fewer studies that compared the AI content generated for the same queries in multiple languages [84].

Several studies assessed AI model performance in non-English languages with variable results despite the overall trend of subpar performance in non-English languages. For example, Taira et al. tested ChatGPT performance in the Japanese National Nursing Examination in the Japanese language in five consecutive years [85]. Despite approaching the passing threshold in four years and passing the 2019 exam, the results indicated the relative weakness of ChatGPT in Japanese [85]. Nevertheless, attributing this result to language limitations alone is challenging, given the superior performance of ChatGPT-4 in the Japanese language compared to medical residents in the Japanese General Medicine In-Training Examination, as reported by Watari et al. [86]. That study also exposed ChatGPT-4 limitations in test aspects requiring empathy, professionalism, and contextual understanding [86]. Conversely, another recent study highlighted ChatGPT-4’s capabilities in exhibiting human-like behavior, being helpful, and demonstrating empathy, which suggests variability in AI performance based on the nature of the task required [87]. These contrasting findings highlight the need for further studies to explore the emotional intelligence aspects of generative AI models.

In a study by Guigue et al., ChatGPT limitations in French were evident, with only one-third of questions correctly answered in a French medical school entrance examination, mirroring its performance in an obstetrics and gynecology exam [88]. Additionally, ChatGPT performed worse than students on family medicine questions in the Dutch language [89]. Conversely, in the Polish Medical Final Examination, ChatGPT demonstrated similar effectiveness in both English and Polish, with a marginally higher accuracy in English for ChatGPT-3.5 [90]. In Portuguese, ChatGPT-4 displayed satisfactory results in the 2022 Brazilian National Examination for Medical Degree Revalidation [91].

In the context of the Arabic language and in line with our findings, Samaan et al. showed less accurate performance of ChatGPT in Arabic compared to English in cirrhosis-related questions [92]. In a non-medical context, Banimelhem and Amayreh showed that ChatGPT’s performance as an English-to-Arabic machine translation tool was suboptimal [93]. In a comprehensive study, Khondaker et al. revealed that smaller, Arabic-fine-tuned models consistently outperformed ChatGPT, indicating significant room for improvement in multilingual capabilities, particularly in Arabic dialects [94]. In the current study, our results suggested that the pattern of lower performance in Arabic extends to all tested AI models despite lacking significance in Bing and Bard.

The use of the CLEAR tool in this study was crucial for pinpointing specific areas for improvement in each language. Specifically, the study findings revealed that in both GPT-3.5 and GPT-4 models, the appropriateness in Arabic lagged behind English. This highlights key areas for enhancement in Arabic, such as the need to reduce ambiguity in the generated content and to organize the content more effectively. Additionally, accuracy issues observed in ChatGPT-3.5 and Bard highlighted the need for content verification, particularly in health-related queries, as well as the necessity of acknowledging the potential for inaccuracies in these models (e.g., through clear flagging of potential inaccuracies within the generated responses). The enhanced performance of ChatGPT-4 in both English and Arabic, relative to its predecessor GPT-3.5, has been demonstrated in previous research, which highlights the rapid and significant advancements in generative AI models [95]. These improvements are attributed to refined training algorithms and larger, more diverse datasets, which enable the AI models to generate more accurate and contextually appropriate responses [52].

In light of this study’s findings, several recommendations for subsequent research could be outlined to enhance the applicability of generative AI models in healthcare. First, AI developers are recommended to integrate cultural and linguistic diversity aspects into generative AI models, especially for AI algorithms aimed at generating health-related content. Addressing the linguistic disparities revealed in this study is important to enhance equitable access to health information across diverse languages and cultural contexts. Second, further research is needed to confirm whether the observed discrepancies in generative AI models’ performance extend beyond the English and Arabic languages examined in this study. Thus, expanding the scope of research to include a wider array of languages and dialects can improve the collective understanding of linguistic biases inherent in generative AI models [96]. This is particularly relevant for languages underrepresented in the current AI training datasets.

Moreover, the ethical and cultural consequences of deploying AI in healthcare necessitate rigorous scrutiny [97]. Real-world implementation studies of generative AI models in healthcare across different linguistic regions could shed light on the practical limitations of implementing generative AI tools in patient care [27, 98]. Incorporating feedback from non-English speakers into the AI development process can help to identify unique user needs and preferences, which would help to guide the development of more accessible and user-friendly AI algorithms.

Finally, the establishment of rigorous standards and guidelines for the development and assessment of multilingual AI models in healthcare is important [99]. Such standards can help to ensure that generative AI tools reach the required level of quality prior to deployment in healthcare. A collaborative effort among AI developers, researchers, and healthcare professionals is essential to ensure the applicability of generative AI models in disseminating accurate healthcare information tailored to different cultural and geographic settings [16, 27].

Lastly, it is important to interpret the results of the study in light of several limitations. First, the limited number of queries tested on each model, albeit sufficient to reveal potential disparities, might limit the generalizability of the findings. Future studies can benefit from incorporating a larger and more diverse set of queries to further validate and refine the findings of this study. Second, the assignment of CLEAR scores may vary if assessed by different raters. To mitigate potential measurement bias, this study employed key answers derived from credible sources as an objective benchmark before CLEAR scoring of the AI-generated content. Third, the study did not account for the various Arabic dialects, focusing only on Standard Arabic. Future research could expand on this particular issue in light of the previous evidence showing potential variability in dialectal performance [94, 100]. Fourth, in this study, we adhered to pre-established consensus key responses to maintain objectivity upon expert assessment of the AI-generated content. However, this approach limited our ability to capture the dynamic consensus that could emerge from direct interaction between the raters. The observed discrepancies in expert raters’ agreement, especially in the evaluations of AI-generated content in English, suggest that linguistic complexity and the subjective nature of certain CLEAR components impacted the consistency of assessments. The notably low agreement on the Relevance of content in English might reflect broader issues in interpreting relevance across different medical contexts, where the raters may hold divergent views based on their backgrounds and expertise. Future studies could benefit from refined guidelines for these components, potentially incorporating more structured and detailed criteria to aid raters in achieving higher consistency.
Fifth, in translating the queries from English to Arabic, the study employed a simplified practical approach where two bilingual authors independently translated the queries. This expedient method did not follow the rigorous, standardized procedures recommended for cross-cultural healthcare research, such as those outlined by Sousa and Rojjanasrirat [101]. Consequently, this might have introduced variations in the semantic equivalence of the queries across languages, potentially affecting the reliability and validity of the responses. Future research should consider implementing a more structured translation methodology, including validation by a panel of linguistic and subject matter experts. Finally, future studies can benefit from including a broader range of queries involving not only infectious disease topics to achieve a more comprehensive understanding of AI performance in diverse health and linguistic contexts. Addressing these limitations in future studies can help to advance the collective understanding of multilingual generative AI applications and to enhance the generative AI tools’ reliability and equity in global healthcare settings.

Conclusions

This study demonstrated the language discrepancy in generative AI models’ performance. Specifically, a generally inferior performance of the tested generative AI models in Arabic was observed compared to English, despite being rated “above average”. These findings highlight the language-based performance gaps in commonly used generative AI chatbots. This suggests the need for enhancements in AI performance in Arabic. Nevertheless, further research is needed across various health topics and utilizing different languages to discern this pattern. To achieve equitable global health standards, it is important to consider the cultural and linguistic diversity in generative AI model fine-tuning for widespread applicability.