Introduction

Ankylosing spondylitis (AS) is a chronic inflammatory disease characterized by progressive spinal stiffness and fusion; in its initial stages, the disease primarily involves the sacroiliac joints [1,2,3]. In the USA, the prevalence of AS is estimated to range from 0.2 to 0.5% [4]. Although AS is thought to affect approximately 350,000 people in the USA, it is underdiagnosed, and a correct diagnosis can take approximately 14 years [5]. A number of factors likely contribute to the delayed diagnosis of AS, including the high prevalence of mechanical back pain in the population, the gradual onset of the disease, and the lack of symptoms or biomarkers unique to AS [6]. Additionally, in the USA, the majority of patients experiencing the onset of low back pain visit general practitioners, orthopedists, or chiropractors. The ability of these providers to accurately diagnose AS is unknown, and there are no clear guidelines for referring patients with suspected AS to a rheumatologist [6, 7]. A lack of diagnostic criteria may further delay identification of patients with AS: the modified New York AS classification criteria become useful for diagnosis only in later stages of the disease, when the patient has experienced symptoms for many years with possible loss of function [8]. Delayed diagnosis and treatment contribute to the economic, physical, and psychological burdens on patients and their caregivers [6,7,8,9].

Multiple studies have identified factors that play a role in delaying diagnosis and developed recommendations for improving referrals [6, 10, 11]; however, results were mixed, with many studies limited by the use of relatively small databases based on electronic health records at a small number of healthcare institutions. Small sample sizes can be overcome through use of a large administrative claims database, although earlier studies using databases were hampered by the application of limited filtering and selection criteria [12, 13]. Predictive systems driven by machine learning techniques, which evolve based on empirical data, are ideal tools for recognizing the patterns obscured by the volume of claims in such a database. Machine learning models can efficiently mine patient information—including diagnoses, medication and treatment plans, and laboratory and test results—from these large databases and provide valuable insight into disease management [14, 15].

We used mutual information (MI) [16] to identify features that differentiated patients with AS from a control population, with the aim of better understanding the journey of patients with AS prior to receiving an AS diagnosis. Because patients with AS are frequently misdiagnosed in routine clinical practice [17], we hypothesized that examining claims data from this period might help distinguish true cases of AS. We used machine learning algorithms to identify features that differentiate patients with AS from a control population; these features were then used to predict an AS diagnosis. The overall objective of this analysis was to develop and refine a predictive mathematical model for AS, using features observed in the medical and claims history of patients with and without a diagnosis of AS, to aid in the earlier identification of the disease.

Materials and methods

Data source

This retrospective cohort study used administrative claims data from > 182 million patients from January 2006 to February 2018 in the Truven Health MarketScan® Commercial and Medicare Supplemental databases. The MarketScan databases represent one of the largest collections of deidentified US patient data available for healthcare research, with 23.9 and 1.9 million covered lives in the Commercial and Medicare data sets, respectively, in 2016. These databases provide the opportunity to simultaneously sample from various sources (e.g., employers, states, and health plans) and feature representation from > 350 unique carriers.

Patient data consisted of medical and pharmacy claims history, including diagnoses received (represented by International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes, e.g., 720.0 for AS), procedures undergone (represented by Current Procedural Terminology codes), and prescription drugs received as specified by the Therapeutic Detail Code and Therapeutic Class variables from Truven’s RED BOOK. Additionally, Clinical Classifications Software was used to categorize services and procedures into clinically meaningful diagnosis and procedure categories [18].

This study was conducted in accordance with the Guidelines for Good Pharmacoepidemiology Practices of the International Society for Pharmacoepidemiology, the Strengthening the Reporting of Observational Studies in Epidemiology guidelines, and the ethical principles in the Declaration of Helsinki. All database records are fully compliant with US patient confidentiality requirements, including the Health Insurance Portability and Accountability Act of 1996. As this study is a retrospective analysis of deidentified claims data in compliance with US patient confidentiality requirements, and did not involve the collection, use, or transmittal of individually identifiable data, institutional review board approval was not required to conduct this study.

Segment 1

Patient data were initially collected between January 2006 and September 2015, the period designated as Segment 1. In Segment 1, a cohort of 6325 patients with AS was identified; these patients had received ≥ 2 ICD-9-CM diagnosis codes of AS given by rheumatologists ≥ 30 days apart and had ≥ 12 months of continuous enrollment in the Truven databases prior to receiving their first AS diagnosis. In addition, 14,832,350 controls in Segment 1 were identified. Controls were matched by age and sex to patients with AS; additionally, each control had no history of AS and had been enrolled ≥ 1 year prior to the diagnosis date of the matched patient with AS. The “index date” of a control was defined as the diagnosis date of the patient with AS to whom they were matched. The entire cohort of patients with AS and 49,198 of the controls were used as the training set to build and train the initial predictive model, Model A, to predict a diagnosis of AS. Applying Model A to the portion of the overall control population not used for training revealed that many patients were erroneously predicted to have AS (false positives). Therefore, a second machine learning model, Model B, was trained using the same set of 6325 patients with AS but with 50,000 of these false positive patients from the overall control population as controls.
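As an illustration, the following is a minimal sketch of the case-selection rule under stated assumptions: a claims table with hypothetical columns patient_id, claim_date (datetime), icd9_code, and provider_specialty. These column names, and the "RHEU" specialty label, are placeholders rather than the actual MarketScan schema.

```python
# Minimal sketch of the Segment 1 case-selection rule (illustrative schema).
import pandas as pd

AS_ICD9 = "720.0"  # ICD-9-CM code for ankylosing spondylitis

def select_as_cases(claims: pd.DataFrame) -> pd.Index:
    """Return IDs of patients with >= 2 rheumatologist-given AS codes
    >= 30 days apart (the 12-month continuous-enrollment criterion
    would be checked separately against enrollment records)."""
    as_claims = claims[
        (claims["icd9_code"] == AS_ICD9)
        & (claims["provider_specialty"] == "RHEU")  # assumed specialty label
    ]
    # If the earliest and latest AS claims are >= 30 days apart, the
    # patient necessarily has two qualifying codes >= 30 days apart.
    spans = as_claims.groupby("patient_id")["claim_date"].agg(["min", "max"])
    mask = (spans["max"] - spans["min"]).dt.days >= 30
    return spans.index[mask.to_numpy()]
```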

Segment 2

The time frame of October 2015 to February 2018, or Segment 2, was the observation window used to ascertain whether patients predicted to have AS in Segment 1 (patients scoring above our model prediction threshold of 0.5; more details follow) subsequently received an ICD-10-CM AS diagnosis code (M45.x or M08.1). A summary of the characteristics of patients and controls in Segments 1 and 2 is shown in Table 1.

Table 1 Summary of the characteristics of the cohorts in our study

Development and refinement of predictive models (Models A and B)

Figure 1a and b present a high-level overview of the development, validation, and application of the AS-predictive risk models.

Fig. 1

Overview of model development, where (a) Model A/B was first trained using a subset of patients in Segment 1 and subsequently used to predict an AS diagnosis among the remaining patient population in Segment 1, and (b) patients predicted as likely to have AS were tracked into Segment 2 to determine whether they eventually were diagnosed with AS. AS ankylosing spondylitis, Dx diagnoses, Proc procedures, Rx prescriptions

Model A

Mutual information [16], a predictor selection technique, was applied to the claims data of patients with AS and the matched control population to identify positive and negative predictors of AS. Model A was first trained on these predictors using the Segment 1 cohort of 6325 patients with AS and 49,198 matched controls. Please see the Supplementary Appendix for technical details on model development. Model A was subsequently tested on all controls in Segment 1 who were not used for training.
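The following is a minimal sketch of MI-based predictor ranking, assuming a binary feature matrix X with one column per diagnosis, procedure, or prescription code and a label vector y (1 = AS case, 0 = control); the feature names are placeholders for the study's actual code dictionary.

```python
# Minimal sketch of mutual-information predictor selection (illustrative).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_predictors(X, y, names, k=20):
    """Rank claims codes by mutual information with the AS label."""
    mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    top = np.argsort(mi)[::-1][:k]  # indices of the k highest-MI codes
    return [(names[i], mi[i]) for i in top]
```

Note that MI alone is unsigned; whether a high-MI code acts as a positive or negative predictor can be determined by comparing its prevalence between cases and controls.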

Model B

To improve the accuracy of the machine learning model, a second model (Model B) was trained using the same set of 6325 patients with AS, as well as a control set of 50,000 false positives—i.e., controls whom Model A had erroneously predicted as having AS. Because these controls were especially challenging for the machine learning model to classify correctly, using them to train Model B provided a better sense of how the model would perform in difficult cases. Model B thus served to distinguish true patients with AS from controls whose claims histories resemble those of patients with AS.
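This retraining step is a form of hard-negative mining. A minimal sketch under stated assumptions follows: model_a is an already-fitted scikit-learn classifier, X_as holds the AS cases, and X_controls holds the overall control population as a NumPy matrix; the gradient-boosting model family is an assumption, as the paper does not specify the underlying algorithm.

```python
# Minimal sketch of the Model B retraining step (hard-negative mining).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # model family assumed

def train_model_b(model_a, X_as, X_controls, n_hard=50_000, seed=0):
    rng = np.random.default_rng(seed)
    # Controls scoring > 0.5 under Model A are the "false positives."
    scores = model_a.predict_proba(X_controls)[:, 1]
    hard = np.flatnonzero(scores > 0.5)
    hard = rng.choice(hard, size=min(n_hard, hard.size), replace=False)
    # Retrain on the same AS cases plus the hard-to-classify controls.
    X_train = np.vstack([X_as, X_controls[hard]])
    y_train = np.concatenate([np.ones(len(X_as)), np.zeros(len(hard))])
    return GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
```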

Model A/B

To evaluate the likelihood that a patient has AS, the models were applied sequentially: Model A was first used to eliminate patients who likely did not have AS—i.e., to select patients who scored above the model cutoff threshold of 0.5—and the remaining patients were then processed by Model B to further remove false positives and produce a final risk score. The serial application of Models A and B, our machine learning model, is henceforth referred to as Model A/B.
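A minimal sketch of this cascade, assuming fitted scikit-learn-style classifiers, a NumPy feature matrix X, and the paper's 0.5 cutoff (assigning a score of 0 to screened-out patients is an illustrative convention, not the paper's stated choice):

```python
# Minimal sketch of the serial Model A/B cascade.
import numpy as np

def score_model_ab(model_a, model_b, X, threshold=0.5):
    """Screen with Model A, then rescore the survivors with Model B."""
    final = np.zeros(len(X))  # screened-out patients keep a score of 0
    survivors = model_a.predict_proba(X)[:, 1] > threshold
    if survivors.any():
        final[survivors] = model_b.predict_proba(X[survivors])[:, 1]
    return final
```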

Evaluation of the optimized predictive models (Model A/B)

Patients predicted to have AS in Segment 1 by Model A/B were followed up through Segment 2 to see whether they truly had AS, i.e., received an ICD-10-CM AS diagnosis code (M45.x or M08.1). The overall performance of the model was measured by its diagnostic accuracy. Metrics of diagnostic accuracy include estimates of sensitivity, the potential to recognize true patients with AS; specificity, the ability to identify patients without AS; positive predictive value (PPV), the proportion of patients predicted to have AS who truly have AS; and area under the receiver operating characteristic curve (AUC) [19], a global measure of diagnostic accuracy that estimates the overall ability to discriminate between 2 groups. An AUC of 0.5 is equivalent to a coin toss, and an AUC of 1 represents a perfect model [19]. AUC is commonly used in the machine learning community to evaluate classifier performance [19].
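These metrics can be computed directly from the model scores and the observed Segment 2 outcomes. A minimal sketch, assuming y_true marks patients who received an ICD-10-CM AS code in Segment 2 and y_score is the Model A/B risk score:

```python
# Minimal sketch of the diagnostic-accuracy metrics.
import numpy as np
from sklearn.metrics import roc_auc_score

def diagnostic_accuracy(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "sensitivity": tp / (tp + fn),          # true AS patients recognized
        "specificity": tn / (tn + fp),          # non-AS patients identified
        "ppv": tp / (tp + fp),                  # predicted AS who truly have AS
        "auc": roc_auc_score(y_true, y_score),  # threshold-free discrimination
    }
```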

Lastly, to assess the operability of Model A/B, we created a simpler linear regression model that uses the same claims codes as the machine learning model.

Comparative performance of Model A/B

The performance of Model A/B was also evaluated against that of a clinical model and an “all-AS” model. The clinical model was built to predict a diagnosis of AS based on the spondyloarthritis features described in the Assessment of SpondyloArthritis international Society classification criteria (age < 45 years; presence of back and joint pain; presence of ≥ 1 clinical feature of psoriasis, uveitis, inflammatory bowel disease, and/or enthesitis; and ≥ 1 nonsteroidal anti-inflammatory drug treatment) [20]. The all-AS model was built to predict that all patients have AS. These predictive models were applied to the same patients identified and tracked from Segment 1 to Segment 2 using Model A/B.
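For comparison with the learned models, here is a minimal sketch of the rule-based clinical model, assuming one row per patient with hypothetical boolean and numeric feature columns (the column names are illustrative stand-ins for the ASAS-derived features listed above):

```python
# Minimal sketch of the ASAS-style clinical model (illustrative columns).
import pandas as pd

def clinical_model_predict(patients: pd.DataFrame) -> pd.Series:
    """Flag patients meeting the rule-based criteria as predicted AS."""
    # Count of spondyloarthritis clinical features present per patient.
    spa_features = patients[["psoriasis", "uveitis", "ibd", "enthesitis"]].sum(axis=1)
    return (
        (patients["age"] < 45)
        & patients["back_and_joint_pain"]
        & (spa_features >= 1)
        & (patients["nsaid_prescriptions"] >= 1)
    )
```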

Results

Development and refinement of predictive risk models (Models A and B)

As summarized in Table 1, a total of 6325 patients with AS were identified in the MarketScan databases based on the patient selection criteria of ≥ 2 ICD-9-CM AS diagnosis codes given by a rheumatologist, ≥ 30 days apart, as well as continuous enrollment in the Truven databases for 12 months prior to the first AS code. Concurrently, 14,832,350 demographically matched participants without any ICD-9-CM AS code served as controls (“Segment 1 overall control population”). An overview of the diagnosis, procedure, and prescription codes used in Models A and B from 0 to 12 months before diagnosis is shown in Fig. S1a and b, respectively.

Model A was trained using patients with AS and 49,198 controls selected from the Segment 1 overall control population (AUC = 0.81; Fig. S2a). Please see the Supplementary Appendix for technical details on model development. Model A then scored all patients in the Segment 1 overall control population, erroneously predicting a subset of them as having AS. To further improve the diagnostic accuracy of our machine learning model, Model B was built using, as a new control population, 50,000 random patients from the Segment 1 overall control population who scored > 0.5 in Model A (i.e., who were erroneously identified as patients with AS and hence were more challenging to classify; AUC = 0.79; Fig. S2b).

Evaluation of Model A/B

Model A/B evaluated 228,471 patients in Segment 1 without any history of AS who were followed in Segment 2. Based on a 0.5 model score prediction threshold, the model predicted that 1923 patients in Segment 1 would develop AS in Segment 2. Of these 1923 patients, 120 had received ≥ 1 ICD-10-CM AS diagnosis code, yielding an AUC of 0.629 and a PPV of 6.24% (120/1923 patients). The sensitivity of Model A/B was 9.66% (120/1242 patients), as there were a total of 1242 patients with ≥ 1 ICD-10-CM AS diagnosis code in Segment 2, of whom only 120 were predicted by the model to have AS (Table 2). Figure 2 shows the relationship between PPV and sensitivity as the model cutoff threshold changes: when the threshold was increased, the PPV of the model also increased, but at the cost of reduced sensitivity. Importantly, the sensitivity of the model decreased with increasing time to AS diagnosis (Fig. S3). For example, patients 5 months away from diagnosis could be predicted to receive an AS diagnosis with 12% sensitivity, whereas those 25 months away could be predicted with only 5% sensitivity. This decrease in sensitivity is expected, as patients who are farther from diagnosis may be less symptomatic and hence present a weaker signal for the model to detect.
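The trade-off shown in Fig. 2 can be reproduced by sweeping the cutoff, as in this minimal sketch (y_true and y_score as in the metrics sketch above; the threshold grid is illustrative):

```python
# Minimal sketch of the PPV/sensitivity threshold sweep behind Fig. 2.
import numpy as np

def ppv_sensitivity_curve(y_true, y_score, thresholds=np.linspace(0.05, 0.95, 19)):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    curve = []
    for t in thresholds:
        pred = y_score > t
        if pred.sum() == 0:
            continue  # no patients flagged at this cutoff; PPV undefined
        tp = np.sum(pred & (y_true == 1))
        curve.append((t, tp / pred.sum(), tp / y_true.sum()))
    return curve  # (threshold, PPV, sensitivity) triples
```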

Table 2 Measures of diagnostic accuracy of various predictive models
Fig. 2

Comparison of sensitivity and PPV measures across all model thresholds. Sensitivity and PPV are a function of the model cutoff threshold. If the threshold is increased, fewer patients are predicted to have AS, therefore decreasing sensitivity and increasing PPV. At high PPVs, Model A/B outperformed the linear regression and clinical models. AS ankylosing spondylitis, PPV positive predictive value

The linear regression model, which is a simplified version of Model A/B (machine learning model), demonstrated higher sensitivity and AUC than Model A/B, although its PPV was orders of magnitude lower. This can be seen in Fig. 2, where the superior performance of Model A/B is apparent at high PPVs.

The diagnostic accuracy of the all-AS and clinical models was also evaluated, along with that of Model A/B in Segment 2. The all-AS model demonstrated a sensitivity of 100%, a PPV of 0.54%, and an AUC of 0.5, indicating no predictive value (Table 2). The clinical model, based on features more likely to be associated with a diagnosis of AS, yielded a PPV of 1.29% and sensitivity of 18.84% (Table 2).

Discussion

The need for earlier referral of patients with possible AS to a rheumatologist has long been recognized [11]. Several strategies for the timely referral of patients with potential AS to rheumatologists are established in Europe [21,22,23,24,25,26]. Identifying an appropriate population of patients for screening and referral may help avoid overwhelming rheumatologists with high volumes of patients with back symptoms.

Regarding time to diagnosis, some patients receive a diagnosis of AS within a year [7], while for others it may take up to 10 years [7, 27]. Educating providers in nonrheumatology settings on how to identify critical, and perhaps elusive, elements for quicker AS diagnosis is challenging. Our exploratory study of a large US administrative claims database was novel in that we used a machine learning approach to describe and analyze sequences of health events (e.g., diagnoses, procedures, treatments) in patients with AS and to identify predictors of an eventual AS diagnosis. More specifically, our objective was to determine the data elements that had statistical relevance to a future AS diagnosis. For example, uveitis and back pain are more common in patients with AS than in the general population, whereas a finger injury may be equally common in patients with and without AS. Additionally, patients with AS may tend to visit their provider more frequently as they attempt to obtain a diagnosis for their disease. In this example, uveitis, back pain, and frequent doctor visits would have more relevance to an AS diagnosis than a finger injury. Here, we developed, trained, and refined a series of predictive risk models to identify specific features that differentiated patients with AS from controls, and we prospectively applied the models to a population of patients in Segment 2 to predict a diagnosis of AS. We then compared the performance of Model A/B with that of traditional statistical and clinically based models; the latter two models lack the power to measure and apply these predictors to determine a diagnosis of AS.

Studies within the USA to identify patients with early AS have been hampered by small sample sizes and/or by applying limited filtering and selection criteria to larger databases [6, 7]. The explosive growth in collected data over the past few decades has stimulated the development of sophisticated computational tools designed not only to tabulate and sort through vast databases but also to “learn” about and extract insights from those data [12, 13]. Such machine learning software can identify subtle patterns and connections that older analytical systems were not designed to detect. This capability has already been implemented in criminal investigation, business, and the military [13, 28,29,30,31]. In healthcare, predictive models have been used to detect patients in claims databases with depression, cataracts, rheumatoid arthritis, and type 2 diabetes mellitus [32,33,34,35]. Google recently used machine learning software to identify tumors on mammograms [36] and detect diabetic retinopathy in retinal fundus photographs [37]. The value of machine learning in healthcare is evident in the timely processing of large volumes of data beyond human capability, and the insights generated can be used to aid clinical decision-making. Our study applies this approach directly to patients with AS and, to our knowledge, is the first such study attempted in a patient population with AS.

Although Model A/B has a low PPV, it is important to note that its PPV is still higher than that of both the clinical and linear regression models. The linear regression model has a higher AUC and sensitivity than our machine learning model, but it also has a higher false positive rate. There are no fixed standards of diagnostic accuracy; unlike sensitivity or specificity, metrics such as PPV depend on the prevalence of the disease [38]. Maximizing the PPV of any machine learning model comes with the trade-off of decreased sensitivity. We attempted to improve on this trade-off by challenging the algorithm with “AS mimics,” i.e., patients with AS-like characteristics but no diagnosis of AS. At the same time, we acknowledge that certain predictors used by our model, or by machine learning models in general, may not be clinically rational; our model should not completely replace the medical or scientific judgment of trained professionals. The use of machine learning algorithms in combination with targeted clinical evaluations may be more effective in obtaining an early AS diagnosis. As a next step, our model needs to be applied to another large database to predict a diagnosis of AS, with confirmation of the diagnosis by a rheumatologist as an added verification step.

There were limitations to our study. As with any analysis of claims-based data, patient diagnoses may have been coded incorrectly by healthcare providers. The diagnosis of AS in our study was dependent on ICD-9-CM codes, which were originally developed for research purposes; however, in the USA, ICD-9-CM codes are often used for billing and may not reflect an accurate diagnosis. In this study, there was no independent confirmation of patient diagnoses. Additionally, age was not included as a feature in our study; as the age of onset of inflammatory back pain is typically < 45 years in patients with AS, our predictive models may perform better in a younger population.

In summary, we developed and refined predictive models for AS diagnosis in this analysis of a US administrative claims database. As Model A/B continues to be refined and optimized, its ability to correctly distinguish patients with AS from the general population will improve. The various iterations of the AS-predictive models underscore the importance of balancing big data analytics with real-world clinical observations. Despite their low PPV, these predictive models may prove vital for a timely diagnosis of AS. Further validation of these models needs to be performed in a separate commercially insured patient population database, where a diagnosis of AS can be verified by an in-person medical assessment.