Introduction

The arsenal of therapeutic options available to patients with inflammatory rheumatic diseases (IRD) has increased substantially over the last decades. The effectiveness of these therapeutics, however, largely depends on the time between symptom onset and initiation of therapy [1]. Despite various efforts [2], this diagnostic and resulting therapeutic delay has not been significantly reduced [2, 3]. Up to 60% of new referrals to rheumatologists do not end up with a diagnosis of an inflammatory rheumatic disease [4, 5]. On the contrary, due to a decreasing number of rheumatologists and an ageing population, this delay is expected to increase even further in the near future [6]. Additionally, illegible and incomplete paper-based referral forms further complicate the already non-standardized, subjective triage of rheumatology referrals.

Digital symptom assessment tools, such as symptom checkers (SC), raise great hopes of shortening the time to a final diagnosis [7,8,9,10,11,12,13]. One of the most promising tools currently available is the artificial intelligence (AI)-based Ada, which has already been used for more than 15 million health assessments in 130 countries [14]. In a case-vignette-based comparison with general practitioners (GPs) and other SC, Ada showed the greatest coverage of diagnoses (99%) and the highest diagnostic accuracy (71%), although it remained inferior to the GPs' diagnostic accuracy (82%) [15]. The physician version of Ada was able to significantly reduce the time to diagnosis for rare rheumatic diseases [16], and, importantly, the majority of rheumatic patients who had used Ada would recommend it to other patients [5, 7]. Additionally, patients who have previously experienced diagnostic errors are more likely to use symptom checkers [17].

Regarding the diagnostic accuracy of SC, Powley et al. showed that only 4 out of 21 patients with immune-mediated arthritis were given a top diagnosis of rheumatoid arthritis or psoriatic arthritis [18]. Only 19.4% of individuals using an online self-referral screening system for axial spondyloarthritis were actually diagnosed with the disease by rheumatologists [19]. In the first randomized controlled trial of its kind in rheumatology, we recently demonstrated the low diagnostic accuracy of Ada regarding correct IRD detection (sensitivity: 43%; specificity: 64%) [5]. In this trial, the diagnostic accuracy of Ada, which is based solely on the patient's medical history, was compared with the final physician diagnosis based on medical history, laboratory results, imaging results and physical examination. Ehrenstein et al. previously showed that, based solely on medical history, even experienced rheumatologists could correctly detect IRD status in only 14% of newly presenting patients [20]. We hypothesized that the relatively low diagnostic accuracy of Ada and other SC largely reflects the information asymmetry in the previous trials (physicians having access to more information than SC) and that the diagnostic accuracy of SC would not be inferior to that of physicians if both were based on the same information input.

The objective of this study was therefore to compare the diagnostic accuracy of an AI-based symptom checker app (Ada) with that of physicians regarding the presence or absence of an IRD, relying solely on basic health and symptom-related medical history.

Materials and methods

For this purpose, we used data from the interim analysis of the Evaluation of Triage Tools in Rheumatology (bETTeR) study [5].

The bETTeR dataset

bETTeR is an investigator-initiated, multi-center, randomized controlled trial (DRKS00017642) that recruited 600 patients newly presenting to three rheumatology outpatient clinics in Germany [5, 7]. Prior to seeing a rheumatologist, patients completed a structured symptom assessment using Ada and a second tool (Rheport). The final rheumatologists' diagnosis, reported on the discharge summary report, served as the gold standard against which Ada's and Rheport's diagnostic suggestions were compared. Rheumatologists had no restrictions regarding medical history taking, ordering of laboratory markers, physical examination or use of imaging to establish their diagnosis.

In the present study, however, to enable a fairer comparison of the diagnostic performance of Ada and physicians, we reduced this information asymmetry by giving physicians access only to the information that was also available to Ada (basic health data; present, absent and unsure symptoms).

Description of AI-based symptom checker Ada

Ada (www.ada.com) is a free medical app, available in multiple languages, that has been used for more than 15 million health assessments in 130 countries [14]. Similar to a physician-taken medical history, the chatbot starts by inquiring about basic health information and then asks additional questions based on the symptoms entered. Once the symptom assessment is finished, the user receives a structured summary report including basic health data and present, excluded and uncertain symptoms. Furthermore, a top disease suggestion (D1), up to five total disease suggestions (D5), the respective likelihoods and action advice are presented to the user. The app is artificial-intelligence-based, constantly updated, and its disease coverage is not limited to rheumatology [15]. Median app completion time was 7 min [5].

Online survey

An anonymous survey was developed using Google Forms, and eligible rheumatologists in leadership positions were contacted to complete the survey and to invite further eligible colleagues. Participants had to confirm that they (1) were physicians, (2) were fluent in German and (3) had previous work experience in rheumatology care; participants not fulfilling these criteria were excluded. Basic demographic information was collected, including age, sex, resident/consultant status, years of professional work experience and current workplace (university hospital/other hospital/rheumatology practice).

Participants then completed four patient vignettes. Based on the presented basic health data and the present, absent and unsure symptoms (see Fig. 1), participants were required to state whether an inflammatory rheumatic disease was present (yes/no), a top diagnosis (D1), up to two additional diagnostic suggestions (D3), and their perceived confidence in making a correct diagnosis.

Fig. 1 Example of the Ada symptom assessment report excerpt presented to physicians (adapted from the original report and translated to English)

Case vignettes

The sample size was based on the interim results from the bETTeR study [5]. Including all diagnostic suggestions (up to five), Ada correctly classified 89/164 (54%) patients as having or not having an inflammatory rheumatic disease and correctly detected 29/54 IRD patients, corresponding to a sensitivity of 54%. In a study by Ehrenstein et al. [20], rheumatologists had a sensitivity of 73% for the detection of an IRD (55/75 correctly detected). Based on these assumptions, we performed a sample size calculation using McNemar's test for two dependent groups. With a power of 80% and a type 1 error of 5%, n = 113 completed case vignettes were needed to reject the null hypothesis that Ada and rheumatologists have equal diagnostic accuracy regarding IRD classification of the top diagnosis.
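For illustration, such a calculation can be approximated as follows. This is a minimal sketch using Connor's normal-approximation formula for paired proportions, assuming the discordant-pair probabilities are derived under independence; since the exact assumptions behind the published calculation are not reported, the resulting n may differ slightly from 113.

```python
# Sketch of a McNemar sample-size calculation for paired proportions
# (Connor 1987, normal approximation). Discordant-pair probabilities are
# derived under an independence assumption, which is our assumption, not
# necessarily the one used in the published calculation.
from math import ceil, sqrt
from scipy.stats import norm

p_ada, p_phys = 0.54, 0.73    # assumed sensitivities (Ada [5], rheumatologists [20])
alpha, power = 0.05, 0.80

# Discordant cells under independence: Ada correct/physician wrong and vice versa
p10 = p_ada * (1 - p_phys)
p01 = (1 - p_ada) * p_phys
pd, d = p10 + p01, p10 - p01  # total discordance and its imbalance

z_a = norm.ppf(1 - alpha / 2) # two-sided type 1 error
z_b = norm.ppf(power)

n = ceil((z_a * sqrt(pd) + z_b * sqrt(pd - d ** 2)) ** 2 / d ** 2)
print(f"required paired vignettes: {n}")  # ~103 under these assumptions
```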

To reflect a real-world IRD/non-IRD case mix, similar to the interim analysis [5] and a further observational study [4], we chose a mix of 40%/60% IRD/non-IRD patient case vignettes. Additionally, 50% of the vignettes were "difficult" to diagnose. Difficult cases were defined as cases in which the referring physician suspected a diagnosis different from the gold standard diagnosis. The remaining 50% were "easy" to diagnose, with a final gold standard diagnosis matching the diagnosis suspected by the referring physician.

Based on these predefined requirements, a total of 20 clinical patient vignettes (Supplementary Material 1) were randomly chosen from the interim bETTeR dataset. This set of 20 clinical vignettes was divided into five sets of four vignettes each, ensuring that every participant completed four clinical vignettes; a sketch of such a stratified draw follows below.
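A minimal sketch of how such a stratified random draw could look, assuming the 50% difficulty split applies within each IRD stratum and using placeholder vignette identifiers rather than the actual bETTeR case ids:

```python
# Stratified random draw of 20 vignettes (40% IRD, 60% non-IRD, half
# "difficult" within each stratum), then split into 5 sets of 4.
# Candidate pools are placeholders for the bETTeR interim dataset.
import random

random.seed(42)

pools = {
    ("IRD", "difficult"):     [f"ird_diff_{i}" for i in range(25)],
    ("IRD", "easy"):          [f"ird_easy_{i}" for i in range(25)],
    ("non-IRD", "difficult"): [f"non_diff_{i}" for i in range(40)],
    ("non-IRD", "easy"):      [f"non_easy_{i}" for i in range(40)],
}

# Target counts: 8 IRD (4 difficult, 4 easy) and 12 non-IRD (6/6)
targets = {
    ("IRD", "difficult"): 4, ("IRD", "easy"): 4,
    ("non-IRD", "difficult"): 6, ("non-IRD", "easy"): 6,
}

vignettes = [v for stratum, n in targets.items()
             for v in random.sample(pools[stratum], n)]
random.shuffle(vignettes)

# Five sets of four vignettes, one set per group of participants
sets = [vignettes[i:i + 4] for i in range(0, 20, 4)]
print(sets)
```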

Data analysis

Participant demographics were reported using descriptive statistics. All diagnostic suggestions were manually reviewed. If an IRD was among the top three suggestions (D3) or the top five suggestions (Ada D5), respectively, D3 and D5 were classified as IRD-positive (even if non-IRD diagnoses were also among the suggestions). Proportions of correctly classified patients were compared between rheumatologists and Ada using McNemar's test for two dependent groups.
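As a minimal sketch, the paired comparison can be run in Python with statsmodels as shown below. The 2x2 cell counts are illustrative: only their margins match the reported top-diagnosis results (93/132 for Ada, 70/132 for physicians), as the discordant split is not reported, so the resulting p-value need not equal the published one.

```python
# Paired comparison with McNemar's test. The 2x2 table counts vignettes
# by whether Ada and the physician each classified IRD status correctly;
# cell counts are illustrative, not the study data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

table = np.array([
    [60, 33],  # Ada correct:  physician correct / physician wrong
    [10, 29],  # Ada wrong:    physician correct / physician wrong
])

# exact=False uses the chi-square approximation; the test depends only
# on the discordant cells (33 vs 10)
result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.2f}, p={result.pvalue:.4f}")
```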

The relationship between years of work experience (general and in rheumatology) and correctly classifying a patient as having an IRD was assessed using generalized linear mixed models with a random intercept, a binomial distribution and a logit link function.
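A minimal sketch of such a model in Python using statsmodels' variational Bayes mixed GLM is shown below; the column names and synthetic data are hypothetical, the statistical package actually used is not stated, and the random intercept is assumed to be per participant.

```python
# Mixed logistic regression: correct IRD classification (0/1) ~ years of
# experience, with a random intercept per participant (binomial family,
# logit link). Data and column names are synthetic placeholders.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(33), 4),               # 33 physicians x 4 vignettes
    "years_experience": np.repeat(rng.integers(1, 30, 33), 4),
    "correct": rng.integers(0, 2, 132),                       # 1 = IRD status correct
})

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ years_experience",           # fixed effect of interest
    {"participant": "0 + C(participant)"},  # random intercept per physician
    df,
)
fit = model.fit_vb()  # variational Bayes fit
print(fit.summary())  # exp(coefficient) gives the OR per additional year
```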

Results

Participant demographics

A total of 132 vignettes were completed by 33 physicians between September 24, 2021, and October 14, 2021. Table 1 displays the participant demographics. Mean age was 39 years (range 27–57, standard deviation (SD) 8.2), and 15 (46%) participants were female. Twenty-two (67%) were board-certified specialists. An equal number of participants worked in a rheumatology practice and at a university hospital (both n = 16, 49%). Mean professional experience and experience in rheumatology care were 12 (SD 7.4) and 8.8 (SD 7.1) years, respectively.

Table 1 Participant demographics

Comparison of diagnostic accuracy

Correct classification as inflammatory rheumatic disease

According to the top diagnosis, Ada classified IRD status (IRD/non-IRD) correctly significantly more often than physicians (93/132, 70% vs 70/132, 53%; p = 0.002), and numerically more often according to the top three diagnoses listed (78/132, 59% vs 66/132, 50%; p = 0.011). Regarding the top diagnosis, this corresponded to a sensitivity and specificity of 71% and 60% for Ada, compared with 64% and 47% for physicians (see Table 2). Figure 2 depicts the proportion of correctly identified IRD status for Ada and physicians by number of included diagnoses and case difficulty, according to the IRD status of the gold standard diagnosis.
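For reference, the metrics reported in Table 2 follow from the standard 2x2 confusion matrix, as in the following minimal sketch; the cell counts are illustrative placeholders, not the study data.

```python
# How the Table 2 metrics derive from a 2x2 confusion matrix; the counts
# below are illustrative, not the study data.
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # correctly detected IRD
        "specificity": tn / (tn + fp),  # correctly ruled-out IRD
        "ppv":         tp / (tp + fp),  # precision of an IRD call
        "npv":         tn / (tn + fn),
    }

print(diagnostic_metrics(tp=38, fp=32, fn=15, tn=47))
```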

Table 2 Accuracy, sensitivity, specificity, positive and negative predictive value of Ada and physicians for correct classification of inflammatory rheumatic diseases
Fig. 2 Percentage of correctly classified IRD status by diagnosis rank, vignette difficulty and IRD status

Work experience was not related to correctly detecting IRD among the top three diagnoses for rheumatologists (odds ratio (OR) per year of work experience 1.01; 95% CI 0.94–1.06), nor were years of experience working in rheumatology (OR 0.99; 95% CI 0.93–1.06). The mean self-perceived probability of a correct diagnosis was 60% for case vignettes in which the rheumatologists were able to detect the correct IRD status within the top three diagnoses and 55% for the case vignettes in which they were not.

Correct final diagnosis

Ada listed the correct diagnosis as the top diagnosis more often than physicians (71/132, 54% vs 42/132, 32%; p < 0.001), as well as among the top three diagnoses (78/132, 59% vs 55/132, 42%; p < 0.001). Supplementary Fig. 1 lists the most common top diagnosis suggested by participants per case. Figure 3 depicts the percentage of correctly classified patients for Ada and physicians by number of considered diagnoses and case difficulty, using the final diagnosis as the gold standard. Probabilities for correct top diagnoses of physicians and Ada were mostly meaningfully higher than those for incorrect diagnoses, although Ada reported higher probabilities for incorrect diagnoses in difficult cases (see Fig. 4).

Fig. 3 Percentage of correct exact diagnoses by diagnosis rank, vignette difficulty and IRD status

Fig. 4 Probabilities of diagnosis. The bars show the interquartile range. "Correct" and "incorrect" refer to the top diagnosis compared with the actual diagnosis

Work experience was not related to suggesting the correct diagnosis among the top three for rheumatologists (OR per year of work experience 0.98; 95% CI 0.93–1.03), nor were years of experience working in rheumatology (OR 0.97; 95% CI 0.93–1.03).

The mean self-perceived probability of a correct diagnosis was 61% for case vignettes in which the rheumatologists were able to detect the correct diagnosis among the top three diagnoses and 55% for the case vignettes in which they were not.

Discussion

In this study, we compared the diagnostic accuracy of physicians with clinical experience in rheumatology to that of Ada, an AI-based symptom checker, in a situation of diagnostic uncertainty, i.e., relying solely on basic health and symptom-related medical history. This situation reflects the current onboarding process to rheumatology specialist care and the growing necessity to triage patients with IRD from those with non-inflammatory symptoms. Rheumatologists often have access to only limited information (no imaging results, no laboratory parameters) when triaging referrals, which precludes standardized, objective decisions and results in non-transparent and potentially incorrect triage. Digital referral forms are rarely used [2], often adding the problem of poorly readable handwritten information.

In contrast to our hypothesis, we did not merely show non-inferiority but, to the best of our knowledge, for the first time a significant superiority of a symptom checker over physicians regarding both correct IRD detection (70% vs 53%, p = 0.002) and the actual diagnosis (54% vs 32%, p < 0.001). This superiority of Ada was independent of case difficulty and IRD status.

In line with the results by Ehrenstein et al. [20], we demonstrated the high diagnostic uncertainty of physicians when deprived of information beyond the medical history, resulting in low diagnostic accuracy. Additionally, we were able to show that physicians and Ada are mostly able to correctly assess the likelihood of a correct diagnosis (Fig. 4). Interestingly, Ada reported a higher probability for incorrect diagnoses in difficult cases.

Our results highlight the potential of supporting digital diagnostic tools and the need for a maximum of available patient information to enable adequate triaging of rheumatic patients. Electronically available patient information would reduce data redundancy and increase the readability and completeness of data.

We think that, similarly to increasing the diagnostic accuracy of rheumatologists [20], an essential step to improve the diagnostic accuracy of symptom checkers in rheumatology would be to include laboratory parameters (e.g., elevated CRP, presence of auto-antibodies) and imaging results (e.g., presence of sacroiliitis for axial spondyloarthritis). To improve triage decisions, a symptom-based checklist of mandatory additional information could be made available to referring physicians. Routine measurement of the level of diagnostic (un)certainty could help to standardize symptom-based test-ordering decisions and continuously improve the triage service [21].

Surprisingly, we also found that the diagnostic accuracy of physicians did not increase with years of clinical experience (in rheumatology). In contrast, in a previous study with medical students, we showed that years of medical studies were the most important factor for a correct diagnosis and more helpful than using Ada for diagnostic support [22]. This could be due to the fact that in the present study rheumatologists only had access to Ada's summary report and could not actively interact with the patient. Additionally, that study [22] showed that the probability stated by Ada for an incorrect diagnostic suggestion is often higher than for a correct diagnostic suggestion, in line with the results for difficult cases in the present study.

This study has several limitations. Although vignettes were carefully selected to include cases of varying difficulty and a representative sample of IRD cases, the sample size remains limited and further studies are needed. Importantly, previous studies indicated that the diagnostic accuracy of Ada is highly user- and disease-dependent [22, 23]. Furthermore, Ada had the advantage of interacting with patients, whereas physicians only had access to Ada's summary reports and were not able to interact with patients or ask additional questions. To address these limitations, we are currently prospectively assessing Ada's diagnostic accuracy when used by patients, compared with physicians limited to medical history taking (with no access to Ada's results). The power calculation and the inclusion of physicians with varying levels of experience in rheumatology care and from different working sites strengthen the results of this study.

Conclusion

Limited to basic health and symptom-related medical history, the diagnostic accuracy of physicians was lower than that of an AI-based symptom checker, highlighting the importance of access to complete and sufficient information, and the potential of digital support, for accurate triage and diagnostic decisions in rheumatology.