Key Summary Points

Rheumatoid arthritis (RA) is among the most common rheumatologic diseases.

Precision medicine with the aid of artificial intelligence (AI) is becoming more common each day.

Numerous machine learning and deep learning algorithms exist that could assist physicians in every step of RA care, including primary prevention, diagnosis, treatment, and rehabilitation.

Nonetheless, many challenges exist in the path of expanding AI-guided precision medicine, and especially its application in RA, which could and should be overcome through multi-disciplinary scientific effort.


Artificial intelligence (AI) is defined as "the capability of a machine to imitate intelligent human behavior" [1]. Technologies are now advancing faster than ever, with capabilities that were unimaginable in the past. Machines can now perform many tasks not only as well as humans but, in many instances, better. AI is being used in various scientific fields, and medicine is no exception [2]. Researchers in almost all healthcare sectors and specialties are now studying potential applications of AI, ranging from image processing in pathology [3] and radiology [4], precision medicine, and drug discovery [5] to making estimations and predictions in public health [6]. Machine learning (ML) is a branch of AI in which this intelligence is acquired through practice, similar to how a human learns skills. ML improved significantly in the early 2010s with the introduction of deep learning (DL) [7], which stacks multiple learning layers on top of one another [8].

Rheumatoid arthritis (RA) is the second most prevalent autoimmune disease, with an estimated global prevalence of nearly 20 million cases as of 2019 [9, 10]. The disease is characterized by destructive joint changes starting in the small joints of extremities and may continue to involve larger joints if left untreated. Rheumatoid arthritis is diagnosed clinically, and the lack of well-established diagnostic criteria [11] or a gold standard test makes the diagnosis challenging. Several classification methods have been proposed to distinguish RA from other autoimmune diseases and also stratify patients based on their disease characteristics [11]. Currently, the 2010 American College of Rheumatology/European League Against Rheumatism (ACR/EULAR) classification system is the most commonly used criteria for RA diagnosis and classification [12]. Treatment of RA aims to reduce inflammation and joint destruction. Initial therapies include non-steroidal anti-inflammatory drugs (NSAIDs) and corticosteroids, followed by disease-modifying anti-rheumatic drugs (DMARDs) [13]. Methotrexate (MTX) is the initial DMARD choice, although it may be substituted or accompanied by other treatments if indicated [13].

The medicine we know today is a result of experiments and, more precisely, data analysis. Therefore, utilizing the vast amount of the currently available data in the most efficient way is of great value. As evaluating all these data is virtually impossible for humans, AI helps us achieve this goal by incorporating machine-like speed and human-like comprehension. Almost all available data could be used by AI systems: laboratory findings, omics data, medical images, electronic health records (EHRs), data derived from sensors and wearable technologies, clinical features, demographic data, etc. (Fig. 1). The results obtained from these inputs could provide us with useful insights into various aspects of a disease, such as its pathophysiology and epidemiologic features. They could also assist researchers in discovering novel diagnostic methods and biomarkers, leading to quicker and more accurate diagnoses. Moreover, given the invaluable benefits of precision medicine [14], AI algorithms are able to tailor medical services and treatments for each patient according to their unique biological profile (e.g., genomics) and disease status.

Fig. 1 The variety of input data sources for artificial intelligence (AI) models. CT computed tomography, MRI magnetic resonance imaging, US ultrasound

Given the emerging role of AI in diagnosis, monitoring, and management of autoimmune rheumatologic diseases, including RA, a thorough understanding of the achievements that have been obtained so far in the field and the existing knowledge gaps is critical to facilitate their incorporation into clinical practice and delineate the path for future studies. In this study, after reviewing the basic concepts of AI, we provide an updated comprehensive summary of the advances and applications of AI in RA clinical practice and research. Furthermore, we point out areas with a paucity of literature and challenges that have to be addressed and provide future directions for researchers on this topic.


We conducted an online search of PubMed in March 2022 using the following keywords: "rheumatoid arthritis" AND ("artificial intelligence" OR "machine learning" OR "machine intelligence" OR "computational intelligence" OR "deep learning" OR "neural network*" OR "convolutional network*" OR "Bayesian learning" OR "random forest" OR "reinforcement learning" OR "hierarchical learning" OR "computer vision"). No publication date or study type limit was applied to the search. We also searched the reference lists of the retrieved studies to identify additional potentially relevant studies. Study selection was independently performed by two reviewers (SM and AN). This study was conducted in accordance with the ethical principles of the Declaration of Helsinki of 1964 and its later amendments. It is based on previously conducted studies and does not contain any new studies with human participants or animals performed by any of the authors.
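For readers who wish to reproduce or update the search, the boolean expression above can be assembled programmatically. The following is a minimal sketch in Python; the helper name `build_query` is hypothetical, and the snippet only constructs the query string (it does not contact PubMed).

```python
# Terms copied from the search strategy described in the text.
DISEASE_TERM = "rheumatoid arthritis"
AI_TERMS = [
    "artificial intelligence",
    "machine learning",
    "machine intelligence",
    "computational intelligence",
    "deep learning",
    "neural network*",
    "convolutional network*",
    "Bayesian learning",
    "random forest",
    "reinforcement learning",
    "hierarchical learning",
    "computer vision",
]


def build_query(disease, ai_terms):
    """Quote each term, join the AI terms with OR, and AND them with the disease term."""
    quoted = ['"' + term + '"' for term in ai_terms]
    return '"' + disease + '" AND (' + " OR ".join(quoted) + ")"


query = build_query(DISEASE_TERM, AI_TERMS)
print(query)
```

Keeping the terms in a list makes it straightforward to add or remove keywords when the search is rerun at a later date.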

Artificial Intelligence, Machine Learning, and Deep Learning

Artificial intelligence is a domain of computer science referring to a wide variety of interdisciplinary approaches aimed at enhancing machine capabilities. Machine learning is a subdiscipline of AI comprising techniques for complex problem solving that automatically learn the patterns of interaction between variables without explicit programming [15]. Compared to traditional statistical models that are hypothesis-driven and aim to identify relationships between outcomes and datapoints, ML approaches learn from the data, and their goal is to make accurate predictions with less focus on inference. Deep learning is a subset of ML that identifies patterns in data using a layered structure of artificial neural networks (Fig. 2). In the past decade, due to the enhancement of computational power and availability of massive datasets, DL has been at the forefront of image analysis, genomic analysis, and drug discovery [16]. Compared to conventional ML approaches (e.g., logistic regression, support vector machine (SVM), and random forest), DL models can perform more complex tasks; however, they require more training data and longer training times. Moreover, DL models are able to process high-dimensional data, such as medical images and EHRs [17]. Table 1 depicts the fundamental concepts in the most commonly used ML algorithms and neural networks.

Fig. 2 Evolution of artificial intelligence, machine learning, and deep learning

Table 1 Fundamental concepts in the most commonly used artificial intelligence algorithms

The process in which an ML algorithm learns to produce the desired outcome is called "training". Machine learning approaches are commonly categorized into three broad classes based on their training method, namely supervised, unsupervised, and reinforcement learning [18]. In supervised learning, models are trained to predict future values by learning patterns from known input and output data. Random forest, SVM, neural networks, and natural language processing (NLP) models are some of the most popular supervised approaches (Table 1). Natural language processing models aim to analyze text and speech by inferring the words and can be utilized in EHR analysis [19]. In contrast to supervised learning, in unsupervised learning, the goal is not assigning the correct label, but inferring underlying patterns and relationships within the input (e.g., finding clusters within the data by reducing data dimensionality) [15]. In reinforcement learning, the model learns to achieve a specific goal by interacting with its environment through trial and error, demonstration, or a hybrid approach. In healthcare, reinforcement learning is commonly used in models applied in robotic surgery [19].
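The difference between supervised and unsupervised training can be illustrated with a toy example. The sketch below is not taken from any of the cited studies: the biomarker values are hypothetical, and the two methods (a nearest-centroid classifier and a two-cluster k-means) were chosen only because they fit in a few lines of standard-library Python.

```python
from statistics import mean

# Hypothetical one-dimensional biomarker values (arbitrary units).
values = [1.0, 1.2, 0.8, 1.1, 3.9, 4.2, 4.0, 3.8]
labels = ["control", "control", "control", "control", "RA", "RA", "RA", "RA"]


def fit_centroids(xs, ys):
    """Supervised: learn one mean value per known label."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def predict(centroids, x):
    """Assign the label whose learned mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(xs, iterations=10):
    """Unsupervised: split the values into two clusters without using labels."""
    low, high = min(xs), max(xs)  # initial cluster centres
    for _ in range(iterations):
        cluster_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        cluster_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(cluster_low), mean(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)


centroids = fit_centroids(values, labels)
print(predict(centroids, 3.5))  # label predicted for a new, unseen value
clusters = two_means(values)
print(clusters)                 # groups found without any labels
```

The supervised model needs the `labels` list to learn from, whereas the clustering step sees only `values` and recovers the two groups from their structure.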

Understanding the fundamental concepts of AI familiarizes physicians with the potential applications of AI-based models in their clinical practice and helps them identify robust models applicable in practice. Several guidelines have been developed to ensure production of reliable models. Multiple items should be considered when assessing the robustness of an algorithm, including the size of the dataset used to train the model (as more training data generally results in a more precise model), external validation of the model, significance of the clinical problem addressed by the model, performance of the model compared to other algorithms or clinician performance, and availability of the utilized algorithm on public repositories, which can enable independent validation of the performance and reproducibility of the model [17, 20,21,22,23,24].

Artificial Intelligence in RA

Assessment of RA Development Risk

Currently, the most commonly used method for detecting pre-clinical RA in individuals is by measuring autoantibodies such as anti-citrullinated protein antibodies (ACPAs) or rheumatoid factor (RF), which could be present even years before the symptomatic disease [25]. However, they have a poor positive predictive value [26]. Hence, a reliable predictor of future RA development is yet to be found, and artificial intelligence could assist in this regard. O'Neil et al. [25] designed regression models with serum proteome as input to identify patients who are likely to eventually develop RA (i.e., progressors) among first-degree relatives of those with confirmed disease (i.e., at-risk population). Among ACPA-negative cases, least absolute shrinkage and selection operator (LASSO) regression recognized progressors using 17 proteins with an accuracy of 100%. However, another model for ACPA-positive individuals was less accurate (accuracy = 86.9%). Among all at-risk individuals, a third model was developed using 23 proteins as variables which demonstrated 91.2% accuracy (area under the curve (AUC) = 0.93) in the validation set in identifying progressors.
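The poor positive predictive value (PPV) of a screening marker in a low-prevalence population follows directly from Bayes' rule. The sketch below uses hypothetical sensitivity and specificity values chosen purely for illustration; they are not estimates for ACPA or RF.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: probability of disease given a positive test."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical test characteristics (assumptions, not published values).
SENSITIVITY, SPECIFICITY = 0.70, 0.95

for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.2f}")
```

With the same test characteristics, the PPV rises as the disease becomes more common in the tested group, which is one reason why studies often restrict screening to at-risk populations such as first-degree relatives.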

Multiple studies have attempted to identify single-nucleotide polymorphisms (SNPs) associated with RA development risk and the epistatic relationships among them. Kruppa et al. [27] used a random-jungle model and identified a 496-SNP panel closely associated with RA (AUC = 0.89). Negi and colleagues [28] also investigated SNPs and found that four SNPs were significantly associated with the disease, with maximum and minimum odds ratios (OR) being 1.42 and 0.86, respectively. One gene in which polymorphisms are associated with RA is PTPN22 [29, 30]. Briggs et al. [31] identified epistatic relations between PTPN22 and several SNPs that could augment the effect of PTPN22 on susceptibility to RA. Epistatic relationships were also probed by González-Recio et al. [32], who evaluated interactions between human leukocyte antigen (HLA) and non-HLA genes using Bayesian LASSO regression.

Jin et al. demonstrated that some eye diseases are associated with RA development in patients aged 50 and above [33]. In their study, cataract and other non-glaucoma eye diseases significantly increased the risk of developing RA, after adjusting for multiple other covariates (ORs = 1.33 and 1.43, respectively).
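For orientation, an odds ratio can be computed from a two-by-two table of exposure and outcome. The sketch below shows the unadjusted calculation with an approximate 95% confidence interval on the log-odds scale; the counts are hypothetical, and the adjusted ORs reported in the studies above would instead come from multivariable models.

```python
import math


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Unadjusted odds ratio from a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


def wald_interval(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval computed on the log-odds scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * standard_error), math.exp(log_or + z * standard_error)


# Hypothetical counts: exposed cases, exposed controls, unexposed cases, unexposed controls.
a, b, c, d = 40, 60, 30, 70
or_value = odds_ratio(a, b, c, d)
lower, upper = wald_interval(a, b, c, d)
print(f"OR={or_value:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An OR above 1 suggests higher odds of disease among the exposed, but an interval that includes 1 means the association is not statistically significant at the chosen level.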

Table 2 summarizes studies incorporating ML for the assessment of RA development risk [25, 27, 28, 31,32,33,34,35,36].

Table 2 Studies incorporating AI for the assessment of RA development risk

Diagnosis/Early Diagnosis

Early diagnosis of RA is of paramount importance as early interventions in the disease course can impede inflammatory destruction of the joints and lead to better outcomes [37].

According to the ACR/EULAR 2010 RA classification criteria, RF, ACPAs (often tested as anti-cyclic citrullinated peptide (anti-CCP) antibodies), erythrocyte sedimentation rate (ESR), and C-reactive protein (CRP) can be used as biomarkers for diagnosis of RA [38]. Nevertheless, RF and ACPA lack optimal sensitivity [39], while ESR and CRP have limited specificity. The absence of an optimal biomarker with high sensitivity and specificity necessitates the development of novel biomarker panels for early identification of RA [40]. Analysis of omics data, i.e., genomics, transcriptomics, proteomics, metabolomics, lipidomics, glycomics, or metagenomics, using ML approaches enables simultaneous assessment of the association of numerous biomolecules with RA [41, 42]. Incorporating omics data into medical decision-making has several benefits. They are easily acquired from body fluids and are objectively interpreted. Furthermore, their breadth provides a vast amount of information. Their limitations, such as greater complexity and cost, must also be kept in mind.

Moreover, imaging findings, e.g., evidence of synovitis, in combination with clinical data and data derived from sensors, play a critical role in diagnosis, monitoring, and management of RA. Improved data analysis using AI can facilitate early detection of the disease and more efficient use of human resources [38, 43]. Herein, we summarize the applications of ML approaches in the diagnosis of RA using omics, imaging, clinical, and sensor data.

Using Omics Data in the Diagnosis of RA

Several studies developed panels of multiple coding or non-coding ribonucleic acids (RNAs) within the serum or plasma to establish an accurate RA diagnosis using ML approaches. In a recent study, Liu and colleagues assessed gene expression profiles of peripheral blood cells and identified 52 differentially expressed genes in patients with RA. Further protein–protein interaction analysis identified nine hub genes with crucial roles in the development of RA, which are fundamental in immune regulation, namely CFL1, COTL1, ACTG1, PFN1, LCP1, LCK, HLA-E, FYN, and HLA-DRA. The logistic regression and random forest models showed an AUC ≥ 0.97 for the panel of these nine messenger RNAs (mRNAs) in distinguishing RA from healthy samples [44]. In another investigation of gene expression profiles, Pratt et al. showed that a 12-gene transcriptional pattern in peripheral blood cluster of differentiation (CD)4+ T cells could predict the development of RA in patients with undifferentiated arthritis during a median follow-up of 28 months. While the autoantibody showed a higher sensitivity in the ACPA-positive patients, the newly developed expression signature had a higher sensitivity and specificity in seronegative patients. Notably, the expression of most of these genes was induced by interleukin (IL)-6-mediated STAT3 upregulation. In seronegative patients, the combination of the 12-gene risk metric with the Leiden prediction rule (AUC = 0.84) outperformed the Leiden prediction rule alone (AUC = 0.78), a classic tool for predicting RA progression from undifferentiated arthritis, highlighting the clinical significance of these biomarkers [45, 46]. Lastly, non-coding RNAs have recently garnered considerable research attention as diagnostic biomarkers in RA [47]. Ormseth and colleagues used LASSO variable selection with logistic regression to develop a panel of microRNAs (miRNAs) differentiating patients with RA from controls, which resulted in the selection of miR-22-3p, miR-24-3p, miR-96-5p, miR-134-5p, miR-140-3p, and miR-627-5p, all of which were upregulated in patients with RA. The miRNA panel showed an AUC of approximately 0.8 in discriminating patients with RA (seropositive or seronegative) from controls. However, the panel might be a nonspecific signature of autoimmune disease, as it could not differentiate RA from systemic lupus erythematosus [48].
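Because the AUC is the main performance measure quoted for these panels, it may help to recall what it represents: the probability that a randomly chosen patient receives a higher model score than a randomly chosen control. A minimal sketch follows, using hypothetical scores rather than data from any cited study.

```python
def auc(positive_scores, negative_scores):
    """Fraction of positive/negative pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model outputs for patients with RA and for controls.
ra_scores = [0.9, 0.8, 0.7, 0.4]
control_scores = [0.6, 0.3, 0.2, 0.1]

print(auc(ra_scores, control_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a panel with an AUC near 0.8 ranks roughly four out of five patient-control pairs correctly.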

Multiple investigations employed proteomic approaches to discover circulating diagnostic biomarkers using mass spectrometry. In such studies, the sample sizes are commonly relatively small, whereas each sample includes a large number of input variables. This atypical data pattern makes decision tree-based algorithms suitable for analysis of the data as they can handle the disproportionate high dimensionality of the input data compared to the number of samples [49]. In such settings, Geurts and colleagues showed that the boosted decision tree outperformed other ML approaches, including SVM and k-nearest neighbors (kNN) [49]. Using this method, several patterns of protein peaks were proposed to differentiate patients with RA from controls and patients with other autoimmune diseases with high sensitivity and specificity [49,50,51]. The association of the positivity of the serum for the proteomic analysis and intensity of the peaks with levels of anti-CCP antibody highlights the potential role of the patterns of protein peaks in early diagnosis of RA [51]. However, the lack of absolute protein quantification or protein identification is a limitation of these studies, which needs to be addressed by detecting the protein species represented by the peaks on the spectra [50].
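To illustrate why tree-based methods cope with many variables and few samples, the sketch below fits a single decision stump, the simplest building block of a boosted tree ensemble. It is not the boosted model used in the cited studies, and the peak intensities are hypothetical; the point is only that each split inspects one variable at a time, so uninformative peaks are simply ignored.

```python
def best_stump(samples, labels):
    """Search every feature and threshold for the split with the fewest errors.

    The stump predicts class 1 when the feature value exceeds the threshold.
    Returns (feature_index, threshold, number_of_errors).
    """
    best = None
    for j in range(len(samples[0])):
        distinct = sorted(set(row[j] for row in samples))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2.0
            errors = sum((row[j] > threshold) != bool(y) for row, y in zip(samples, labels))
            if best is None or errors < best[2]:
                best = (j, threshold, errors)
    return best


# Hypothetical spectra: four samples, six peak intensities each (more features than samples).
samples = [
    [0.2, 5.1, 1.0, 3.3, 0.9, 2.0],  # control
    [0.3, 4.8, 1.1, 3.1, 1.1, 7.5],  # control
    [0.1, 5.0, 4.0, 3.2, 1.0, 2.2],  # RA
    [0.4, 4.9, 4.2, 3.4, 0.8, 7.1],  # RA
]
labels = [0, 0, 1, 1]

feature, threshold, errors = best_stump(samples, labels)
print(feature, round(threshold, 2), errors)
```

In this toy data only the third peak separates the groups, and the stump selects it. With so few samples, such a split can easily arise by chance, which is why external validation remains essential.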

Several other diagnostic models have been developed using omics data derived from serum, particularly inflammatory and oxidative stress markers. Analysis of circulatory levels of 38 cytokines using an artificial neural network (ANN) resulted in a model with a sensitivity and specificity of 100% in differentiating patients with RA from controls and patients with osteoarthritis (OA). Nevertheless, the ANN is a black box providing limited information for further clinical inference. Therefore, Heard and colleagues utilized a single decision tree to identify the cytokines leading the program to its output. These cytokines included CD40L, transforming growth factor (TGF)-α, epidermal growth factor (EGF), interferon (IFN)-γ, eotaxin, macrophage inflammatory protein (MIP)-1β, tumor necrosis factor (TNF)-α, IL-1α, granulocyte colony-stimulating factor (G-CSF), fractalkine, growth-regulated oncogene (GRO), and vascular endothelial growth factor (VEGF), in descending order of importance for classification of RA, OA, and controls. Of the mentioned cytokines, eotaxin, G-CSF, IL-1α, TGF-α, and TNF-α levels were not statistically different between the groups when analyzed using conventional statistics. This finding highlights the necessity of applying ML algorithms in addition to conventional statistical methods for development of optimal diagnostic panels [52]. 4-hydroxy-2-nonenal (HNE) is another marker that induces inflammation in various diseases, including RA (with elevated circulatory levels in patients with RA). A recent study investigated the diagnostic value of autoantibodies against unmodified and HNE-modified peptides in detecting RA in Taiwanese women. The model identified three isotypes of anti-HNE-modified peptides discriminative between RA and controls [53].

Machine learning approaches using metabolomics and glycomics have also shown promising results in the diagnosis of RA. Ahmed and colleagues assessed the diagnostic value of damaged proteins of the joints, including oxidized, nitrated, and glycated proteins and oxidation, nitration, and glycation free adducts released in the circulation, by investigating plasma, serum, and synovial samples. Their algorithm, which featured levels of ten damaged amino acids in plasma, hydroxyproline, and anti-CCP antibody status, successfully differentiated early RA from controls and patients with other arthritis. Notably, the levels of damaged amino acids were higher in patients with advanced disease than in those with early-stage disease [54]. Chocholova et al. trained ML-based diagnostic models using glycomics data, with comparable diagnostic accuracy between ANN and LASSO regression in seropositive patients. Nevertheless, ANN outperformed LASSO regression in detecting seronegative patients in their study [55].

In addition to the circulatory biomarkers, major advancements have been accomplished in diagnosis and patient stratification by assessment of synovial tissue [56]. Long et al. found a 16-gene profile expressed in the synovial samples differentiating patients with RA and OA using supervised ML approaches. This can be particularly useful in seronegative and elderly patients having an inflammatory presentation of OA [57]. Correspondingly, Yeo and colleagues found a panel of ten most informative chemokine genes discriminating patients with established RA from uninflamed controls using ML methods. As shown by their study, synovial biomarkers can assist in the early identification of patients developing RA as well. They found that mRNA levels of chemokine (C-X-C motif) ligand (CXCL)4 and CXCL7 can accurately distinguish early RA from resolving arthritis with higher levels in early RA compared to longer established RA or controls [58].

Furthermore, even within RA patients, ML algorithms can facilitate patient stratification. Orange et al. identified three patterns of synovial gene expression using a clustering algorithm, including a high inflammatory subtype with extensive infiltration of leukocytes, a low inflammatory subtype specified by enrichment in pathways mediated by TGF-β, glycoproteins, and neuronal genes, and a mixed subtype. Subsequently, they developed a model predicting the synovial subtype according to the histological features. Notably, in the high inflammatory subgroup, the severity of pain significantly correlated with the CRP levels. Therefore, they concluded that pain mechanisms might be variable in patients with different synovial subtypes. This finding can result in potential clinical application for patient treatment stratification for pain management [59].

In addition to the above-mentioned omics data, the human microbiome has recently drawn immense research attention. Dysbiosis can be associated with various diseases, including RA. Machine learning-based approaches analyzing metagenomic data are optimal for exploiting the large biological datasets created by the evolving microbiome research [60]. Wu and colleagues used a logistic regression prediction algorithm to improve multiclass classification between patients with RA, type 2 diabetes mellitus, liver cirrhosis, and controls. While no biomarker was specific to type 2 diabetes mellitus and RA, their model had a favorable diagnostic performance with an AUC near 0.95, highlighting the value of microbiome biomarkers in disease diagnostics, especially disease screening, within a large-scale population [61]. However, in a recently published meta-analysis, Volkova and colleagues found specific features in the gut microbiome distinguishing RA from healthy controls and other autoimmune diseases using random forest algorithms. They found that increased levels of Clostridiaceae Clostridium and Lachnospiraceae and reduced abundance of Erysipelotrichaceae were the most distinctive features in RA compared to other autoimmune diseases [62]. In addition to the gut microbiome, assessment of the oral microbiome using ML approaches may also provide promising diagnostic biomarkers [63].

Table 3 illustrates studies incorporating ML for diagnosis of RA using omics data [44, 45, 48,49,50,51,52,53,54,55, 57,58,59, 61, 62, 64, 65].

Table 3 Studies incorporating AI for diagnosis of RA using omics data

Using Imaging Data in the Diagnosis of RA

Radiological findings are critical in the diagnosis and staging of RA [66]. Conventional radiography is a commonly available and widely used modality. Multiple models, such as convolutional neural networks (CNNs), have been developed to diagnose RA from hand X-ray data [67, 68], with accuracy approaching 95% [67]. Compared with conventional radiography and computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound are superior in detecting early soft tissue changes [66]. The characteristic imaging features of RA are synovitis, bone erosions, bone marrow edema, joint space narrowing, joint effusion, and subcortical cysts. Late imaging findings may include subluxation or luxation, scar formation, fibrosis, and bony ankylosis [66]. To the best of our knowledge, AI-based models have been exploited in the detection of synovitis [69,70,71], bone erosions [72, 73], bone marrow edema [74], and joint space narrowing [75]. However, we did not find investigations on other features, such as subcortical cysts, joint effusion, or late imaging findings.

Machine learning-based algorithms, both supervised and unsupervised, have been developed to detect and quantify synovitis using MRI [71, 76]. Computer-aided diagnostic approaches have been highly consistent with manual synovitis quantification in dynamic contrast-enhanced (DCE) MRI, while significantly reducing the time the observer spends reading the image [76, 77]. We did not find any DL-based study assessing synovitis on wrist MRI. Moreover, only a few studies were designed to classify and quantify synovitis using ultrasound images [70, 78, 79]. In a recent investigation, Wu and colleagues developed a DL-based model assessing the severity of RA by classifying synovial proliferation captured by ultrasound [78].

Several studies used images obtained from different modalities to create models detecting and grading bone lesions. Most studies utilized hand X-ray images to identify erosions [73, 80]. A recent study showed that severity scores acquired from a DL-based model analyzing hand X-ray images could be comparable to the scoring of a human assessor [81]. Artificial intelligence-based models also performed well in detecting joint space narrowing in RA on plain X-rays [75, 80]. However, conventional radiography may underestimate the number and size of erosions because of its projectional nature [72]. Therefore, utilizing CT images for automatic detection and quantification of bone erosions can facilitate a more accurate assessment of disease activity [72, 82]. Moreover, clustering methods have been useful in detecting and quantifying bone marrow edema, a prominent feature in RA, on wrist MRI [74].

Other than conventional radiography, CT, ultrasound, and MRI, molecular imaging can also play a key role in diagnosis and management of patients with RA [83]. Nevertheless, we did not find any AI-based investigation of enhancing or analyzing molecular imaging data in RA. In addition to the radiologic modalities, reliable diagnostic models have been developed using hand photographs [84] or a combination of thermal and RGB hand images, demographic data, and hand gripping force [85]. Notably, given the accessibility of acquiring the required data, such algorithms can be used as screening tools for RA [85].

Table 4 provides a summary of the ML and DL studies that used imaging data as input to diagnose patients with RA.

Table 4 Studies incorporating AI for diagnosis of RA using imaging data

Using Clinical and Sensor Data for Diagnosis of RA

Several models have been developed for the diagnosis of RA using clinical data (Table 5) [86,87,88]. Singh and colleagues showed that a fuzzy inference system could have an acceptable diagnostic performance when fed with data on clinical symptoms [87]. In a novel approach, Fukae et al. converted clinical information to two-dimensional array images and used CNN (AlexNet) to distinguish patients with RA. The results of their algorithm showed a favorable agreement with the diagnosis made by three rheumatologists [88].

Table 5 Studies incorporating AI for diagnosis of RA using clinical or sensor data

Sensor data, which are rich datasets for disease diagnosis and monitoring, are acquired using technologies such as wearable devices, thermography sensors, and image sensors [89,90,91]. In a recent study, ML algorithms using features extracted from lymphocyte images generated by an electronic image sensor were highly accurate for RA classification, with accuracy rates as high as 97.5%. Notably, electronic image sensors convert optical images into electronic data [90]. Furthermore, thermography is a noninvasive method used to assess joint inflammation in RA [92]. Bardhan et al. developed a two-stage classification algorithm that correctly labeled nearly three-fourths of the knee thermograms (stage one was detection of arthritis-affected knees, and stage two was detection of knees affected by RA) [91].

Phenotype Identification Using EHRs

In the context of EHRs, a "phenotype" is a clinical condition or characteristic that can be obtained via an automated method from an EHR system or clinical data repository using a specific group of data elements and logical expressions. Electronic health records contain a comprehensive pool of data, which can be widely used in clinical and translational research. Nevertheless, due to the large amount of data, manual review and extraction can be extremely time-consuming and inefficient. Both rule-based and ML (supervised or unsupervised) models have been used to identify disease status using EHRs. Phenotype identification algorithms usually combine various sources of information, e.g., billing codes, laboratory data, medication exposures, and NLP, to make accurate predictions [93, 94].
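A rule-based phenotype definition of this kind can be sketched as a simple logical expression over the data elements. The rule below is hypothetical and is not a validated algorithm from any cited study; it only illustrates how billing codes (ICD-10 categories M05 and M06 denote RA), medication exposure, laboratory results, and an NLP-derived flag might be combined. The function name, threshold, and drug list are assumptions.

```python
RA_CODE_PREFIXES = ("M05", "M06")  # ICD-10 categories for rheumatoid arthritis
DMARDS = {"methotrexate", "leflunomide", "sulfasalazine", "hydroxychloroquine"}


def is_probable_ra(billing_codes, medications, anti_ccp_positive=False,
                   note_mentions_ra=False, min_codes=2):
    """Hypothetical rule: repeated RA codes plus at least one supporting element."""
    n_ra_codes = sum(code.startswith(RA_CODE_PREFIXES) for code in billing_codes)
    on_dmard = any(drug.lower() in DMARDS for drug in medications)
    supporting = on_dmard or anti_ccp_positive or note_mentions_ra
    return bool(n_ra_codes >= min_codes and supporting)


# Example records (fabricated for illustration only).
print(is_probable_ra(["M05.79", "M06.9"], ["Methotrexate"]))
print(is_probable_ra(["M06.9"], ["methotrexate"]))
```

Such deterministic rules are transparent but rigid, which is why the studies discussed in this section compare them against ML models that learn the weighting of each data element from annotated records.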

Several models have been developed to identify patients with RA efficiently from EHRs using NLP and ML (Table 6) [95,96,97,98,99,100,101,102,103,104,105,106,107,108,109]. Support vector machine is one of the most commonly used algorithms for phenotype identification. In 2010, Carroll and colleagues developed an SVM model with a favorable performance (AUC > 0.90) in predicting RA disease status using naïve and refined data (i.e., naïve data curated to only include RA-related items). Notably, the SVM model had higher patient identification precision than a deterministic model [108]. Importantly, given the changes in EHR systems, addition of novel DMARDs, and updates of the ICD codes, the validity of such phenotype identification algorithms should be routinely investigated with contemporary data. A recent assessment of the performance of Carroll et al.'s model using 2017 data showed that even though the diagnostic codes and medications have changed since 2010, the model still performed robustly and outperformed rule-based algorithms. Nevertheless, updating the model using ICD-10 codes resulted in a slight improvement in the sensitivity of the model [100]. In a recent study, Maarseveen et al. found that among naïve Bayes, SVM, gradient boosting, random forest, decision tree, neural networks, and a random classifier, SVM performed best in disease identification using EHRs [99]. They showed that the performance of the proposed model was similar to a manual chart review using the 1987 and 2010 RA classification criteria [110].

Table 6 Studies incorporating AI for phenotyping RA using EHR

Several other supervised ML models have been developed for phenotype identification. Zhou and colleagues applied random forests algorithm and proposed a model identifying the most informative predictors of RA status using a large pool of data from patients in primary and secondary care settings, with an overall accuracy of 92.3%, which was comparable with methods derived from expert clinical opinion [105].

Not only can ML models facilitate disease status prediction, but they also could aid in stratification of patients. For instance, Lin et al. developed a classification algorithm to predict cases with MTX-induced liver toxicity. They found that incorporating temporality, i.e., the temporal relation between the presence of liver toxicity events and receiving MTX, can improve the performance of the model [106].

In a novel approach, Cai et al. developed a supervised model to facilitate participant selection for clinical trials by providing an alternative solution for the costly and time-consuming process of eligibility screening and chart review. They combined random forest and logistic LASSO regression to produce a model identifying potentially eligible patients from EHRs for an RA clinical trial. Compared with two rule-based systems, the AI algorithm had a better positive predictive value than one and a better sensitivity than the other, thereby striking a balance between flagging too many patients for manual review and excluding too many eligible ones [95].
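The trade-off between positive predictive value and sensitivity can be made concrete with a short sketch. The counts below are invented purely for illustration and are not taken from Cai et al.; `screening_metrics` is a hypothetical helper name.

```python
def screening_metrics(tp, fp, fn):
    """Return (sensitivity, positive predictive value) for an eligibility screen.

    tp: eligible patients flagged for manual review
    fp: ineligible patients flagged for manual review
    fn: eligible patients the screen missed
    """
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return sensitivity, ppv


# Hypothetical counts, for illustration only.
# A strict rule flags few charts but misses many eligible patients.
strict_rule = screening_metrics(tp=40, fp=10, fn=60)
# A loose rule misses few eligible patients but floods reviewers with charts.
loose_rule = screening_metrics(tp=95, fp=405, fn=5)

print("strict rule: sensitivity=%.2f, PPV=%.2f" % strict_rule)
print("loose rule:  sensitivity=%.2f, PPV=%.2f" % loose_rule)
```

In this toy setting, a model that sits between the two rules on both metrics reduces the manual review workload without discarding most eligible patients.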

Requirement of a large number of labeled data for training the supervised models is a major challenge in their application for phenotype identification. The quantity of needed annotated samples can be reduced by using semi-supervised and unsupervised models [101, 102]. Semi-supervised models usually use a small-sized labeled dataset and also a large-sized unlabeled dataset to classify data. Few semi-supervised models have been created for phenotype identification using EHRs. Gronsbell and colleagues developed a semi-supervised model that was validated with real data from patients with RA and multiple sclerosis (MS) with a performance comparable to the supervised methods [104]. Moreover, Chen et al. combined SVM and active learning, a form of semi-supervised learning method, and developed a model that outperformed passive learning and reduced the number of the required annotated samples by approximately two-thirds [107]. PheNorm is an unsupervised phenotyping algorithm that has been validated using four phenotypes, namely coronary artery disease, RA, Crohn's disease, and ulcerative colitis, with an accuracy comparable to that of supervised models [102]. Lastly, Gronsbell et al. developed a two-step model, with the first step being an unsupervised clustering method followed by a regularized regression as the second step using unlabeled observations to identify the most informative features from text fields available in the entire EHR. Their model showed a favorable performance (AUC = 0.93) with improved efficiency by reducing the number of labels required [103].
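As a rough illustration of how active learning can reduce annotation effort, the sketch below implements uncertainty sampling, one common query strategy in which the records the current model is least sure about are sent for chart review first. This is an assumption made for illustration and not necessarily the strategy used by Chen et al.; the record IDs, probabilities, and the function name `select_for_annotation` are hypothetical.

```python
def select_for_annotation(probabilities, budget):
    """Pick the unlabeled records whose predicted probability is closest to 0.5.

    probabilities: dict mapping a record ID to the current model's predicted
                   probability of RA (hypothetical values here).
    budget: number of charts that can be manually reviewed in this round.
    """
    ranked = sorted(probabilities, key=lambda rid: abs(probabilities[rid] - 0.5))
    return ranked[:budget]


# Hypothetical pool of unlabeled records with model outputs.
pool = {"pt01": 0.97, "pt02": 0.52, "pt03": 0.08, "pt04": 0.41, "pt05": 0.66}
to_review = select_for_annotation(pool, budget=2)
print(to_review)
```

After the selected charts are labeled, the model would be retrained and the selection repeated, so that each round of annotation targets the most informative records rather than a random sample.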

Importantly, the potential of EHRs can be further unraveled by enhancing the performance of the models through developing more complex networks incorporating DL and ANN [111, 112]. Algorithms with high performance can ultimately supersede ICD billing codes, which have the limitation of considerable error rates due to inconsistent terminology [113].

Predicting Treatment Response

Methotrexate is generally the initial DMARD choice for RA. If MTX fails to suppress the disease (which is the case in half of MTX monotherapy patients [114]), the treatment is stepped up, and other anti-inflammatory drugs are administered, which are usually more expensive [115]. However, treatment failure still persists in some patients on second- or third-line medications, which can only be overcome by trial and error. Hence, a precision medicine treatment approach (also known as personalized or individualized medicine) based on each patient's biological profile could reduce treatment non-response and its consequences for both the patient and the healthcare system. The data used for choosing the proper treatment plan for a patient could range from simple variables, such as sex and age, to complex data, such as proteomics and transcriptomics.

Patients' demographic and clinical information are generally easily accessible. Such availability of vast amounts of input can result in accurate precision medicine algorithms. Machine learning algorithms have been shown to be able to predict response to MTX with AUCs as high as 0.84 using demographic and clinical data, such as past medical history and laboratory measures [116, 117]. Patients who do not respond to initial treatment should be stepped-up to more powerful medications. Morid et al. [118] evaluated multiple supervised and semi-supervised ML techniques to find the most accurate one to forecast a need for treatment step-up within 1 year among 120,237 patients. One-class SVM showed the best performance with a sensitivity and specificity of 89% and 83%, respectively. Despite the step-up therapy and trying several regimens, response failure persists in some patients (i.e., difficult-to-treat patients) [119]. An extreme gradient boosting algorithm [119] was able to identify these patients with a comparatively high accuracy (AUC = 0.73, sensitivity = 79%, specificity = 50%).

Omics are valuable input sources for predicting treatment response and vary greatly between patients owing to differences in genetic makeup and in the molecular basis of the disease. Artacho et al. created a random forest model that could identify MTX responders using gut microbiome data with an AUC of 0.84 [114]. When only patients with high (≥ 80%) or low (≤ 20%) chances of response were taken into account, the AUC of the algorithm increased to 0.94. The algorithm did not select pharmacogenetic predictors when they were provided as input, suggesting a close relationship between gut microbiota and treatment response [114]. In another study, Plant et al. [120] incorporated transcriptomics and were able to predict MTX response among patients in early treatment stages with an AUC of 0.78. Not all studies yielded such favorable results, and AUCs for predicting MTX response reached as low as 0.61 [115].
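The effect of restricting evaluation to confidently classified patients can be sketched as follows. The labels and predicted probabilities are invented for illustration and do not come from any cited study; `auc` and `confident_subset` are hypothetical helper names. The AUC is computed directly from its rank interpretation, i.e., the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder.

```python
def auc(labels, scores):
    """AUC as the fraction of (responder, non-responder) pairs that are
    ranked correctly; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confident_subset(labels, scores, low=0.2, high=0.8):
    """Keep only patients whose predicted probability is <= low or >= high."""
    kept = [(y, s) for y, s in zip(labels, scores) if s <= low or s >= high]
    return [y for y, _ in kept], [s for _, s in kept]


# Hypothetical data: 1 = responder, 0 = non-responder.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.45, 0.60, 0.55, 0.40, 0.15, 0.05]

auc_all = auc(labels, scores)
auc_confident = auc(*confident_subset(labels, scores))
print(auc_all, auc_confident)
```

In this toy example the AUC rises when intermediate predictions are excluded, but half of the patients are left without a prediction, which is the practical cost of reporting performance on a confident subset only.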

Utilizing omics data seems more beneficial in predicting response to second- or third-line biological DMARDs (bDMARDs) than MTX [121, 122]. For instance, an SVM algorithm recognized patients responding to infliximab with an AUC of 0.92 using genomics data [122]. Some studies fed clinical data (e.g., lab results and disease activity measurements) in addition to omics, to their algorithms [123,124,125] and produced treatment response prediction AUCs as high as 0.83 [126], although the results were fairly heterogeneous.

Imaging data can also be employed in models predicting response to treatment. Kato et al. [127] developed a scoring system based on severity of synovitis, tenosynovitis, and enthesitis on ultrasound images in patients with RA and spondyloarthritis, assessing treatment response. An unsupervised random forest, in addition to a uniform manifold approximation and projection (UMAP) algorithm, was implemented, which divided patients into two clusters with significantly different responses to treatment as measured by the American College of Rheumatology 20, 50, and 70 (ACR20/50/70) criteria.

However, several shortcomings need to be acknowledged in studies applying AI to predict response to treatment. The variety of evaluation methods in determining treatment response makes the comparison of the results between different studies difficult and inaccurate. The EULAR criteria [128] were the most commonly used measure of response; they take disease activity scores, ESR, and patient's global assessment into account (several variations exist). However, some studies used other definitions for treatment responsiveness, such as the continuation of MTX administration [117] and dose adjustments [129]. Furthermore, most studies are performed on MTX, and few have evaluated treatment outcomes using other RA treatments, especially non-biological DMARDs. Identifying patients for whom non-biological DMARDs are safe and effective substitutes using AI algorithms can be immensely helpful considering the higher cost of bDMARDs and their unavailability to many patients [130].
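To show how a response label is typically derived before it is fed to a model, the sketch below encodes the widely cited DAS28-based form of the EULAR response criteria. Since several variations exist and the reviewed studies did not all use the same definition, the thresholds here should be read as an assumption about one common variant; the function name `eular_response` is hypothetical.

```python
def eular_response(das28_baseline, das28_followup):
    """Classify response as 'good', 'moderate', or 'none' from two DAS28 values.

    Assumed thresholds (commonly cited DAS28-based EULAR criteria):
      improvement > 1.2 and follow-up DAS28 <= 3.2          -> good
      improvement > 1.2 and follow-up DAS28 > 3.2            -> moderate
      0.6 < improvement <= 1.2 and follow-up DAS28 <= 5.1    -> moderate
      0.6 < improvement <= 1.2 and follow-up DAS28 > 5.1     -> none
      improvement <= 0.6                                     -> none
    """
    improvement = das28_baseline - das28_followup
    if improvement <= 0.6:
        return "none"
    if improvement > 1.2:
        return "good" if das28_followup <= 3.2 else "moderate"
    # Remaining case: improvement is above 0.6 but not above 1.2.
    return "moderate" if das28_followup <= 5.1 else "none"


# Hypothetical patients (baseline DAS28, follow-up DAS28).
for baseline, followup in [(5.8, 3.0), (5.8, 4.4), (6.0, 5.3), (4.0, 3.8)]:
    print(baseline, followup, eular_response(baseline, followup))
```

A study that instead defines response as MTX continuation or dose adjustment would assign different labels to the same patients, which is why performance figures across studies are hard to compare.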

Table 7 lists studies incorporating ML for predicting treatment response in RA [114,115,116,117,118,119,120,121,122,123,124,125,126,127, 129, 131,132,133,134].

Table 7 Studies incorporating AI for assessment of treatment response in RA

Monitoring Disease Course and Predicting Prognosis

Measuring disease activity is crucial in choosing the optimal treatment plan, determining response to therapy, and prognosis. Moreover, predicting disease severity early on could assist in timely administration of the most suitable medications. Disease activity score in 28 joints (DAS28) is one of the most utilized severity measures of RA [135, 136]. This index could be calculated based on various inflammatory markers, including ESR or CRP [137]. An adaptive deep neural network [137] was able to outperform non-DL methods in predicting DAS28-ESR from demographic and clinical data with an AUC of 0.73 (categorical prediction) and mean standard error of 0.9 (numerical prediction). However, the attempt by Rychkov et al. [138] to predict DAS28 using omics data yielded unsatisfactory results, and their novel RA score showed only a weak (r = 0.33) correlation with DAS28. The clinical disease activity index (CDAI) [139] is another scoring system that only uses clinical data and can be calculated more rapidly than DAS28. Norgeot et al. developed a model using neural networks with a remarkable AUC of 0.91 in predicting disease activity according to the CDAI [140].
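For readers less familiar with these indices, the following sketch shows how the two prediction targets are computed using the standard published formulas for DAS28-ESR and CDAI. The patient values are hypothetical, and the function and parameter names are chosen here for illustration only.

```python
import math


def das28_esr(tender28, swollen28, esr_mm_h, patient_global_mm):
    """DAS28-ESR from 28-joint tender and swollen counts, ESR (mm/h),
    and patient global health on a 0-100 mm visual analogue scale."""
    return (0.56 * math.sqrt(tender28)
            + 0.28 * math.sqrt(swollen28)
            + 0.70 * math.log(esr_mm_h)
            + 0.014 * patient_global_mm)


def cdai(tender28, swollen28, patient_global_cm, provider_global_cm):
    """CDAI: simple sum of joint counts and two 0-10 cm global assessments.
    No laboratory value is needed, which is why it is faster to obtain."""
    return tender28 + swollen28 + patient_global_cm + provider_global_cm


# Hypothetical patient.
example_das28 = das28_esr(tender28=4, swollen28=4, esr_mm_h=20, patient_global_mm=50)
example_cdai = cdai(tender28=4, swollen28=4, patient_global_cm=5, provider_global_cm=4)
print(round(example_das28, 2), example_cdai)
```

The logarithm of ESR in DAS28-ESR is the only laboratory-dependent term; CDAI drops it entirely, so models predicting CDAI rely purely on clinical inputs.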

Predicting risk of needing treatment step up to tocilizumab in patients who do not respond to initial therapy is another example of applications of AI in monitoring disease course in RA. A logistic regression model [141] showed that higher age and remission CDAI were the most important risk and protective factors for tocilizumab monotherapy, respectively (OR = 1.04 and 0.17, respectively) when excluding other treatments as variables. For any tocilizumab use (either monotherapy or in combination), the highest and lowest ORs belonged to the number of comorbidities (OR = 1.16) and remission CDAI (OR = 0.20) (excluding other treatments as factors).
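An odds ratio close to 1 can still matter when the predictor spans many units, which the sketch below illustrates. It assumes, without confirmation from the source, that the age odds ratio of 1.04 is expressed per year of age, and it uses a hypothetical baseline probability of 10%; all function names are illustrative.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)


def compounded_odds_ratio(odds_ratio_per_unit, units):
    """Odds ratio for a difference of several units of a continuous predictor."""
    return odds_ratio_per_unit ** units


def apply_odds_ratio(baseline_probability, odds_ratio):
    """Convert a baseline probability to odds, scale it, and convert back."""
    odds = baseline_probability / (1 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Assumption: OR = 1.04 applies per year of age (not stated in the source).
age_or_10_years = compounded_odds_ratio(1.04, 10)
# Hypothetical 10% baseline probability, scaled by the remission-CDAI OR of 0.17.
p_with_remission = apply_odds_ratio(0.10, 0.17)
print(round(age_or_10_years, 2), round(p_with_remission, 3))
```

Under these assumptions, a 10-year age difference corresponds to an odds ratio of roughly 1.48, whereas remission by CDAI lowers a 10% baseline probability to below 2%.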

Rheumatoid arthritis is associated with a wide range of comorbidities, particularly cardiovascular, atherosclerotic, musculoskeletal, and neurological diseases [142,143,144]. Preventing these complications requires timely identification of patients at risk. Carotid ultrasound is a non-invasive and efficient modality to assess atherosclerotic plaques. ML and DL algorithms enable enhanced cardiovascular risk stratification in patients with RA by analyzing these images [145]. Machine learning algorithms developed by Wei et al. using demographic, clinical, and laboratory data as input performed satisfactorily in predicting the incidence of coronary heart disease (CHD) in patients with RA (AUC = 0.79, accuracy = 76%). Their logistic regression model outperformed a conventional cardiovascular disease (CVD) risk score, i.e., the Framingham Risk Score [146]. However, another investigation found a statistically comparable AUC for predicting stroke using a complex logistic regression model fed with laboratory data compared to the Framingham Risk Model [147]. Remarkably, in a recent investigation, ML classifiers outperformed the classical cardiovascular disease risk score when they were fed with cardiovascular risk factors, including conventional risk factors, laboratory-based blood biomarkers, and ultrasound images [148].

Musculoskeletal complications are one of the other major comorbidities in patients with RA. Risk factors for bone loss in patients with RA were identified by Hu et al. [149] using conventional logistic regression, LASSO regression, and random forest methods. The highest and lowest OR belonged to age for femoral neck bone loss (OR = 1.17) and TNF inhibitor use in the past year for lumbar spine bone loss (OR = 0.27). Other affecting factors included body mass index (BMI) and serum vitamin D levels.

Wearable and portable devices can play a substantial role in monitoring disease activity as well. Many of the devices used in today's medicine have become portable, such as pulse oximeters and cardiac Holter monitors. Newer wearable devices can measure a wide variety of indicators and have the capacity to be programmed to produce the most helpful outputs. The most common use of wearable sensors is probably tracking physical activity [150], which in recent years has been finding its way into medicine [151, 152]. Patients with RA may experience flares throughout their disease course, which will most likely hinder their physical activity due to the acute inflammation [153, 154]. Furthermore, flares are associated with disease progression and worse outcomes [155], even in those with low disease activity [156]. Hence, keeping an accurate track of flares could greatly improve patient care. Gossec et al. [157] developed a naïve Bayes model that utilized physical activity input from a watch to detect flares (as reported by the patients themselves). Their algorithm showed 95.7% sensitivity and 96.7% specificity for detecting flares, suggesting wearable sensors as potentially reliable devices for monitoring flares.

Table 8 summarizes studies implementing ML and DL for monitoring disease course and predicting prognosis [79, 137, 138, 140, 141, 146, 147, 149, 157,158,159,160,161,162,163,164,165,166].

Table 8 Studies incorporating AI for assessment of disease course and prognosis in RA

Drug Discovery

Rheumatic diseases are generally chronic in nature and require long-term treatment. Hence, developing novel drugs that are well tolerated and effective is of utmost importance. Drug discovery is an expensive process [167]; thus, it is necessary to make the involved procedures as efficient as possible. Many pharmaceutical projects fail due to incorrect target selection [168], which is an inevitable consequence of hypothesis-driven testing. Zhao and colleagues [169] addressed this issue by creating ML models that proposed potential treatments by inspecting expression profiles of patients being treated with a drug already proven to be effective and presenting targets that, if targeted, result in similar expression profiles. Their results for finding candidate targets for RA using random forest and gradient boosting machine algorithms showed significant concordance with an external database listing potential. Such investigations shift research flow from assumption-based and hypothesis-derived studies to studies based on known and proven data, which was not possible until recently due to challenges in handling the colossal amount of available information.

Basic Science Research

Similar to many other rheumatic diseases, not all aspects of the pathways involved in RA pathogenesis are known [133], mainly due to the complexity and sheer number of the factors involved. Machine learning algorithms are specifically designed to handle such conditions. For instance, two recent studies [170, 171] have pointed toward the possible role of gut microbiota in RA pathogenesis. Devaprasad and colleagues [172] acquired the immunome of 316 samples with immune-mediated inflammatory diseases, which were used to identify disease-related genes and cells of 12 inflammatory conditions, including RA. Their non-negative matrix factorization algorithm identified two main clusters of patients with different sets of cells and genes, further shedding light on immunological pathways involved in RA pathophysiology.


This comprehensive updated study reviewed published investigations incorporating AI, including ML and DL related to RA, the second most prevalent autoimmune disease. Artificial intelligence models are used to assess RA development risk, diagnose RA using omics, imaging, clinical, and sensor data, detect RA patients within EHR, predict treatment response, monitor disease course, determine prognosis, discover novel drugs, and enhance basic science research (Fig. 3). We showed that a growing body of evidence supports the potential role of AI in revolutionizing screening, diagnosis, and management of patients with RA. However, the proposed models may vary significantly in their performance and reliability. Notably, since every decision made in the healthcare setting may have dire and irreversible consequences, considering the limitations of AI and the challenges of its implementation in healthcare is immensely important.

Fig. 3 The role of artificial intelligence (AI) in enhanced diagnosis and management of rheumatoid arthritis (RA)

In 2020, Stafford and colleagues systematically reviewed the available literature on AI applications in autoimmune diseases [113]. After MS, RA had the highest number of dedicated manuscripts (41 and 32, respectively), followed by inflammatory bowel disease (30) and type 1 diabetes (17). Although fewer in number, RA studies investigated more types of outcomes than MS studies, utilized more data sources and AI methods, and had a higher median sample size (338 versus 99). In fact, RA had the widest range of input data sources among all autoimmune diseases, indicating the vast potential of AI application in the field. Furthermore, AI-based precision medicine approaches could be especially effective in RA due to the diversity in treatment options and disease phenotypes.

Challenges and Limitations of Implementing AI

Multiple technical challenges hinder applying AI models in patient care. The need for large and accurately labeled data is a major issue in training supervised models. Importantly, small training datasets can result in over-fitted models. Creating large and high-quality open-access databases can aid in tackling this challenge. The presence of such datasets also facilitates performance comparison between different models, whereas the variability of test datasets across studies does not usually allow for accurate comparisons [173, 174]. The Osteoarthritis Initiative study is an example of such datasets, which has been used to train and test dozens of AI models to improve diagnosis and prediction of pain progression and outcome in osteoarthritis [175,176,177].

Moreover, the clinical applicability of AI models cannot necessarily be represented by the accuracy of the model. In many cases, the accuracy measures reported in a scientific paper may represent the performance of the model in a small dataset from a specific population instead of providing generalizable results to other populations [178]. The variation between the input datasets is a limiting factor in the clinical implementation of AI models [179]. Datasets obtained from different healthcare environments may vary in data acquisition method, coding, and patient population. As a result, the model might perform differently when applied to datasets different from the training input. External validation can show the effect of input data variation on the performance of the model. However, in most of the studies included in this review (approximately 70%), validation using an independent external dataset was not performed.

The AI models are technically prone to several other challenges as well. These models use any signal that helps them achieve the highest performance. However, these signals may include unknown confounders, incorporation of which in the model may damage the generalizability of the model. For instance, a model designed to detect hip fractures used confounding features, including the scanner model and "priority" marks on scans, to classify the input data [180]. Moreover, data manipulation (adversarial attack) can have damaging effects on the performance of the AI model. Adversarial examples are inputs with small changes made to fool the model intentionally [181, 182].
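The idea of an adversarial example can be illustrated with a deliberately simple linear classifier. The weights, bias, and input below are invented for illustration and do not correspond to any model discussed in this review; the perturbation rule (shifting each feature by a small amount against the sign of its weight) is a minimal sketch in the spirit of gradient-sign attacks.

```python
def score(weights, bias, x):
    """Linear decision score; a positive value means the positive class."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def adversarial_shift(weights, x, epsilon):
    """Shift every feature by epsilon in the direction that lowers the score.

    For a linear model the score drops by epsilon times the sum of the
    absolute weights, however small each individual change looks.
    """
    def sign(w):
        return (w > 0) - (w < 0)

    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical toy model and input.
weights = [0.8, -0.5, 0.3]
bias = -0.1
original = [0.4, 0.3, 0.2]

perturbed = adversarial_shift(weights, original, epsilon=0.1)
original_score = score(weights, bias, original)
perturbed_score = score(weights, bias, perturbed)
print(original_score > 0, perturbed_score > 0)
```

Each feature changes by only 0.1, yet the predicted class flips because the small changes all push the score in the same direction; the same mechanism applies, on a much larger scale, to imaging models with thousands of input pixels.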

The retrospective study design in most investigations in this field can also limit the real-world application of AI models. While historically labeled data are the most commonly used resources for training and testing AI models, the true additional value of AI algorithms in the diagnosis and management of patients can be best captured by trials with a prospective design. Nevertheless, only a few prospective studies have been conducted on the real-world applications of AI in the medical field [183], and research related to RA is not an exception. As an example of prospective trials, a multi-center randomized controlled trial was performed to compare the accuracy of an AI algorithm with senior consultants in diagnosing childhood cataracts and choosing optimal treatment options [184].

In addition to the mentioned challenges, in many cases, particularly for neural networks, it is very difficult to convey the intuitive notions driving the conclusion of the model. These models that are too complicated for a straightforward interpretation of the factors involved in the decision making are also referred to as the "black box". The opaque rationale behind decisions made by the model can cause ethical and social challenges. Such models may fail in engendering user trust as transparency is a fundamental factor in gaining credence. Additionally, not understanding the rationale behind the decisions and the potential sources of error may increase the chances of inaccuracy in the decisions made by the model, especially in new datasets obtained in a different setting. Notably, given that healthcare is a high-stakes field, it is critical to minimize the margin of error as much as possible [185, 186].

Algorithmic bias is another ethical challenge raised by the use of AI. In 2019, Panch et al. defined algorithmic bias as when the application of an AI model aggravates existing inequities in society, such as racial and sexual discrimination [187]. For instance, a recent paper showed that one of the commonly used algorithms in healthcare is racially biased, considering the same risk score for White patients and Black patients while the Black patients are considerably sicker. They found that the underlying cause of this bias is that the algorithm predicts healthcare costs instead of disease severity. Due to the discrimination in access to care, as less money is spent on the care of Black patients compared to White patients, the model generates biased results [188]. In another example, under-representation of skin cancer images from patients with darker skin can result in less accurate results for patients of color as the model has not been trained on a sufficient number of observations representing these populations [173, 189].

The intention behind the development of AI algorithms should also be acknowledged as one of the potential ethical challenges of implementing AI in healthcare. Given the growing importance of quality measures, private-sector developers may be inclined to create algorithms suggesting clinical decisions that improve quality metrics without necessarily enhancing quality of care [190]. An example of such behavior has been observed in the car industry, where software was used to reduce emissions only during regulatory testing [191]. Additionally, AI algorithms might be designed to benefit their developers or buyers by suggesting certain drugs, tests, or devices to increase profit, while the clinicians using the algorithm may not be aware of such biases [190].

Future Directions

Our study shed light on eight recommendations for future investigations. Notably, these directions can be used in studies related to other autoimmune musculoskeletal disorders as well. (1) Adherence to guidelines ensuring good conduct is critical in AI studies. The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [22] and the guideline released by the National Health Service (NHS) for "good practice for digital and data-driven health technologies" [192] are examples of such recommendations. (2) Open communication of the complete source codes is indispensable for verifying the reproducibility of the results by testing them on external datasets. Nevertheless, among studies reviewed in this paper, only a few provided open-access codes [97,98,99, 102, 103, 133, 140, 157, 158]. (3) It is vital that AI studies conduct external validation as it is a key component in assessing performance of a model in the real-world setting. However, among studies included in this review, almost half of the studies did not have an independent external dataset to validate the model. (4) As an AI model can be only as good as the data used to train it, future investigations need to ensure using high-quality data in large quantities. This can be achieved by creating large-scale multimodal datasets containing data on demographic, clinical, laboratory, genomic, imaging, and lifestyle features of the patients. (5) Future studies require consideration of the potential risk of algorithm bias during model development, and they should include sufficient data points representing minorities to reduce the risk of bias. (6) AI algorithms can be further used to assess extra-articular involvement, such as skin and ocular manifestations, in patients with RA. (7) Furthermore, currently, most investigations have compared the performance of AI algorithms with human experts. However, evaluating the performance of the collaboration of AI algorithms and human experts versus human experts alone would provide more realistic and applicable results [174]. (8) Lastly, real-world and wide application of AI algorithms would heavily rely on the design of prospective trials, ideally multi-center and randomized, assessing the performance of these models. Of note, our study paved the way for future reviews focusing on applications of AI in other high-burden autoimmune and inflammatory rheumatological and musculoskeletal diseases, such as MS and systemic lupus erythematosus.


Artificial intelligence (AI) can facilitate screening, diagnosis, monitoring, risk assessment, prognosis determination, achieving optimal treatment outcome, and de novo drug discovery for patients with rheumatoid arthritis, as well as broadening the knowledge of the disease pathophysiology by enhancing basic science research. Incorporating these machine and/or deep learning algorithms into real-world settings would be a key step in the progress of AI in medicine. Future investigations are required to ensure development of reliable and generalizable algorithms while they carefully look for any potential source of bias or misconduct.