Key Summary Points

Why carry out this study?

There is a large demand for early real-world evidence (RWE); however, real-world data (RWD) for newly launched drugs take time to accumulate and depend on a number of factors, including drug uptake in the market.

Linkage of RCT data to administrative claims can provide an opportunity to study the clinical trial population in greater detail, which can also serve as an early and timely assessment of real-world performance of newly launched molecules.

What was learned from the study?

The probabilistic linkage methodology presented in this study demonstrates the feasibility of linking patients from clinical trials to real-world data, which can provide a means to assess additional information not usually captured/collected within a clinical trial.

Digital Features

This article is published with digital features, including a summary slide, to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.14230274.

Introduction

Rheumatoid arthritis (RA) is a chronic, systemic inflammatory disease characterized by progressive joint damage that can lead to loss of physical function and disability [1,2,3,4,5]. Multiple treatment options are available for RA, including conventional synthetic disease-modifying antirheumatic drugs (csDMARDs), biologic DMARDs (bDMARDs), which include tumor necrosis factor-alpha (TNF) inhibitors and non-TNF bDMARDs, and Janus kinase (JAK) inhibitors, the newest class of RA therapy [2, 5,6,7]. Baricitinib is an oral selective inhibitor of JAK1 and JAK2 and has been shown to improve RA signs and symptoms in phase II/III randomized controlled trials (RCTs) in adult patients with moderately to severely active RA despite treatment with TNF inhibitors (RA-BEACON trial), csDMARDs (RA-BUILD trial) and methotrexate (MTX; RA-BEAM trial), and in csDMARD-naive patients (RA-BEGIN trial) [8]. Patients who completed these trials were eligible to participate in a long-term extension study (i.e., the RA-BEYOND trial) [9]. Patients enrolled in the RA-BEYOND study were asked for consent to have their RCT data linked to administrative claims data.

There is a large demand for early real-world evidence (RWE); however, real-world data (RWD) for newly launched drugs take time to accumulate and depend on a number of factors, including drug uptake in the market. This delay limits the timely assessment of real-world outcomes. To assess the feasibility of mitigating this limitation, we conducted an experiment in which US patients enrolled in the RA-BEYOND long-term extension study consented to have their RCT data probabilistically linked to health plan claims. Although Protected Health Information (PHI) provides an option for direct linkage to RWD sources, patient privacy concerns may contribute to low patient-participation rates. In addition, many RWD sources place restrictions on the re-identification of patients, precluding the use of direct identifiers. As such, a probabilistic linkage method was selected to match RCT patients in the respective datasets. Probabilistic linkage incorporates indirect identifiers, assigning a higher priority to variables on which two records are less likely to agree by chance.

Administrative data from insurance claims databases can be an important resource for clinical studies, particularly when linked to clinically rich and detailed RCT data. To date, several studies have linked RCT data to administrative claims data with linkage rates ranging from 55.6 to 90.8% [10,11,12]. Strom et al. [11] assessed the accuracy of identifying cardiovascular events through billing claims in place of clinical event adjudication in structural heart disease trials by linking clinical trial data to inpatient Medicare claims. Their results indicated that claims could be used to trigger evaluation of neurological events, potentially improving the efficiency of the evaluation of techniques and devices designed to reduce such events. Brennan et al. [12] assessed the accuracy and completeness of claims- versus site-based follow-up with clinical event committee adjudication of cardiovascular outcomes and found similar randomized treatment effects using the two methods, also suggesting that claims data can be used to support clinical research by leveraging routinely collected data.

Linkage of RCT data to administrative claims can provide an opportunity to study the clinical trial population in greater detail, which can also serve as an early and timely assessment of real-world performance of newly launched molecules. This study aimed to assess the feasibility of probabilistic linkage of RCT data to US claims data using a sample of patients from baricitinib clinical trials.

Methods

Data Sources

This retrospective cohort study utilized IQVIA’s Patient Centric Medical Claims (Dx) Database, IQVIA’s Longitudinal Prescription Claims (LRx) Database and Lilly’s baricitinib RCT data from the sample of patients that consented to the linkage of their de-identified RCT data to de-identified insurance claims.

The Dx database is derived from professional fee claims using the CMS-1500 billing form. It provides patient-level diagnoses and procedures for visits to US office-based physicians, ambulatory, and general health care sites. It includes claims paid by commercial insurance, Medicare, and Medicaid, and those paid in cash. The Dx data provide patient-level information including the following key analysis variables: age, gender, three-digit ZIP code, geography, International Classification of Diseases (ICD) diagnostic codes, Current Procedural Terminology (CPT) procedure codes, Healthcare Common Procedure Coding System (HCPCS) products, date of service, location of care, reported cost of service, payer type (e.g., third party, Medicare), and rendering practitioner. The Dx database is composed of approximately 1 billion outpatient medical claims per year submitted by over 860,000 practitioners in the US. All data are rendered non-identified to protect patient privacy.

The LRx database consists of pharmacy claims for dispensed prescriptions collected from pharmacies. The database covers approximately 90% of all dispensed prescriptions from US retail pharmacies and over 1.4 billion prescriptions per year, representing claims from all payer types, including self-pay/cash, Medicare, Medicaid, and other third-party payers. The LRx database captures information starting in 2001 on adjudicated dispensed prescriptions sourced from retail, mail, long-term care, and specialty pharmacies. The database provides the following patient-level information: prescriber, payer type, products, age, gender, three-digit ZIP code, prescription fill date, quantity dispensed, days of supply, number of refills, and cost information. All data are rendered non-identified to protect patient privacy.

RA-BEYOND is a clinical trial that was designed to investigate the long-term safety and any side effects of baricitinib in participants who completed a previous baricitinib rheumatoid arthritis study, providing additional treatment with baricitinib for 7 years [9].

The study protocol and patient consent form were reviewed and approved by an Institutional Review Board (IRB). Investigators participating in RA-BEYOND were informed of the purpose of this study and trained before they approached patients for informed consent. The study protocol was approved by Quorum Review Institutional Review Board (file number 28020). The study was conducted in agreement with requirements of registered clinical trials (ClinicalTrials.gov identifier NCT01885078) and with the Declaration of Helsinki.

Patients in the RA-BEYOND trial were invited to participate in the current study during an RA-BEYOND trial visit. All participants in the current study consented to have their de-identified RCT data linked to de-identified insurance claims. De-identified patient information of the consented patient cohort was transferred from Lilly to IQVIA (which has access to de-identified insurance claims data) using a secure file transfer protocol (SFTP). Data were stored on secure servers, and only necessary members of the study team had access to the data.

Study Design

Patients who completed the RA-BEACON, RA-BUILD, RA-BEAM, or RA-BEGIN trial were eligible to participate in the RA-BEYOND trial. All patients who participated in the RA-BEYOND trial and consented to have their de-identified RCT data linked to de-identified insurance claims were included in the analysis. Exclusion criteria were not applied a priori. Patients were initially matched on age, gender, and three-digit ZIP code of the RCT-participating provider. To further improve matching, a point scoring system was implemented where ≥ 3 points constituted a sufficient match (Fig. 1). The scoring system described below was developed based on the available RCT data:

  1. Investigator/Provider match using National Provider Identifier (NPI) (1 point)

  2. Prior DMARD exposure. DMARDs of interest included auranofin, azathioprine, cyclophosphamide, cyclosporine, gold sodium thiomalate, hydroxychloroquine, leflunomide, methotrexate, minocycline, mycophenolate, penicillamine, sulfasalazine, abatacept, adalimumab, certolizumab pegol, etanercept, golimumab, infliximab, rituximab, sarilumab, tocilizumab, and tofacitinib.

    1. No prior DMARDs used (1 point)

    2. Number of prior DMARDs used (1 point for one drug matched, 2 points for two drugs matched, etc.)

  3. Prior non-DMARD exposure. Non-DMARDs were defined as any drug that is not a DMARD of interest

    1. No prior non-DMARDs used (1 point)

    2. Number of prior non-DMARDs used (1 point for one drug matched, 2 points for two drugs matched, etc.)

  4. RCT informed consent date matched (± 30 days) to a Dx or LRx claim with the same date of service (1 point)

  5. Clinical trial code (i.e., ICD-9-CM code V70.7; ICD-10-CM code Z00.6) observed on a Dx claim at any time from the informed consent date to last RCT visit date (1 point)

  6. Visit dates

    1. At least three visit dates matched (1 point)

    2. At least 75% visit dates matched (1 additional point)

Fig. 1 Matching point system
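The initial match and the point system described above can be sketched in Python as follows. This is only an illustrative sketch, not the study's implementation (the analyses were performed in SAS). The class names, field names, and function names are hypothetical. Three interpretations are assumptions made for the example: the "no prior drugs" point is awarded when neither source shows any drug of that type, the 75% visit-date point is awarded only on top of the three-date point, and each RCT patient has at least one recorded visit date.

```python
from dataclasses import dataclass
from datetime import date, timedelta

SUFFICIENT_MATCH = 3  # threshold stated in the text (>= 3 points)


@dataclass(frozen=True)
class RctPatient:  # hypothetical structure for a de-identified RCT record
    age: int
    gender: str
    provider_zip3: str
    npi: str
    dmards: frozenset
    non_dmards: frozenset
    consent_date: date
    visit_dates: frozenset  # all RCT visit dates, including the last visit


@dataclass(frozen=True)
class ClaimsPatient:  # hypothetical structure for a Dx/LRx patient
    age: int
    gender: str
    zip3: str
    npis: frozenset              # providers seen on Dx/LRx claims
    dmards: frozenset
    non_dmards: frozenset
    service_dates: frozenset     # dates of service on Dx or LRx claims
    trial_code_dates: frozenset  # Dx claims carrying V70.7 or Z00.6


def passes_blocking(rct, claim):
    """Initial required match on age, gender, and three-digit ZIP code."""
    return (rct.age, rct.gender, rct.provider_zip3) == (claim.age, claim.gender, claim.zip3)


def drug_points(rct_drugs, claim_drugs):
    """Assumption: 1 point if both sources show no drug, else 1 per shared drug."""
    if not rct_drugs and not claim_drugs:
        return 1
    return len(rct_drugs & claim_drugs)


def score_pair(rct, claim):
    """Return the per-variable points for one RCT-claims candidate pair."""
    window = timedelta(days=30)
    last_visit = max(rct.visit_dates)
    shared_visits = rct.visit_dates & claim.service_dates
    visit_points = 0
    if len(shared_visits) >= 3:
        visit_points = 1
        if len(shared_visits) >= 0.75 * len(rct.visit_dates):
            visit_points += 1  # assumption: only awarded on top of the first point
    return {
        "npi": int(rct.npi in claim.npis),
        "dmard": drug_points(rct.dmards, claim.dmards),
        "non_dmard": drug_points(rct.non_dmards, claim.non_dmards),
        "consent": int(any(abs(d - rct.consent_date) <= window
                           for d in claim.service_dates)),
        "rct_code": int(any(rct.consent_date <= d <= last_visit
                            for d in claim.trial_code_dates)),
        "visits": visit_points,
    }


def is_sufficient_match(rct, claim):
    """A pair is sufficient if it passes blocking and totals >= 3 points."""
    return passes_blocking(rct, claim) and sum(score_pair(rct, claim).values()) >= SUFFICIENT_MATCH
```

Returning the points per variable, rather than only the total, keeps the information needed later to compare several candidate matches for the same RCT patient.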

To reduce a “one to many” match ratio to a 1:1 match ratio, a variable hierarchy (i.e., NPI > DMARD use > visits > non-DMARD use > RCT code) was applied to the baricitinib patients with 2–3 matches. The hierarchy order was based on the sensitivity and specificity of each variable. For example, NPI was ranked as the most important variable in the hierarchy because NPI is the most specific variable, followed by DMARD use, visits, and non-DMARD use. The RCT code variable was ranked as the least important variable in the hierarchy because it is the least specific (all RCTs are included under the same ICD codes) and not particularly sensitive due to underreporting of RCT participation in claims data.
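One way such a hierarchy could be applied is sketched below. It assumes that candidates are compared on each variable's points in hierarchy order, with the first variable that separates them deciding the match. The text does not specify the exact tie-breaking rule, so this lexicographic comparison, the function name, and the claims patient IDs are hypothetical.

```python
# Hierarchy order given in the text: NPI > DMARD use > visits > non-DMARD use > RCT code
HIERARCHY = ("npi", "dmard", "visits", "non_dmard", "rct_code")


def resolve_one_to_many(candidates):
    """Pick a single claims patient from several sufficient matches.

    `candidates` maps a (hypothetical) claims patient ID to its per-variable
    points. Returns the ID of the best candidate, or None when there are no
    candidates or the top candidates tie on every hierarchy variable.
    """
    def key(item):
        _, points = item
        return tuple(points.get(var, 0) for var in HIERARCHY)

    ranked = sorted(candidates.items(), key=key, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and key(ranked[0]) == key(ranked[1]):
        return None  # still one-to-many; cannot be reduced to 1:1
    return ranked[0][0]


candidates = {
    "claims_A": {"npi": 1, "dmard": 1, "visits": 0, "non_dmard": 2, "rct_code": 0},
    "claims_B": {"npi": 1, "dmard": 2, "visits": 0, "non_dmard": 0, "rct_code": 1},
}
best = resolve_one_to_many(candidates)  # "claims_B": tied on NPI, more DMARDs matched
```

Under this rule, a higher-ranked variable always outweighs any number of points on lower-ranked variables, which differs from simply comparing total points.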

Statistical Analyses

Patient demographic and clinical characteristics were reported as descriptive statistics. Categorical variables were reported as percentages. Continuous variables were reported as means with standard deviations. Age distributions were compared using a t test and gender distributions using a chi-square test, with an alpha level of 0.05 to assess statistical significance. All statistical analyses were performed using SAS® version 9.4 (SAS Institute Inc., Cary, NC, USA).

Results

A total of 245 patients from 49 RA-BEYOND participating US sites were eligible for this study. A total of 78 (31.8%) of the eligible patients consented to participate in the study. Mean (SD) age of the total study sample (N = 78) was 54.8 (10.9) years and 69.2% were female. Of these 78 consented patients, 69 (88%) were successfully matched on age, gender, and three-digit ZIP code of the provider (Table 1). Among the 268 sufficiently matched pairs (i.e., total matching points ≥ 3; Table 2), 212 pairs remained after applying RCT DMARD inclusion/exclusion criteria. Using the point scoring system to further improve matching ratios, 44 consented RCT patients had at least one sufficient match (Table 3) as follows: 23 of 44 (52.3%) patients matched at a ratio of one RCT patient to one Dx/LRx patient (no duplication), 11 (25.0%) at a ratio of 1:2, seven (15.9%) at a ratio of 1:3 and three (6.8%) at a ratio of 1:4 or greater. Two RCT patients matched to the same Dx/LRx patient and both pairs were subsequently excluded. The variable hierarchy (i.e., NPI > DMARD use > visits > non-DMARD use > RCT code) was applied to the 18 baricitinib patients with 2–3 Dx/LRx matches. The matching ratio of three of these RCT patients could not be reduced to 1:1. Overall, 38 of 78 (48.7%) baricitinib patients were successfully matched 1:1 to patients from the claims database.

Table 1 Probabilistic linkage details
Table 2 Distribution of total points among RCT-Dx/LRx patient pairs
Table 3 RCT-Dx/LRx matching ratios

The mean (SD) age of the 38 successfully matched patients was 57.5 (9.8) years and 81.6% were female. Age and gender comparisons revealed significant differences between the final matched sample (n = 38) and the initial sample of consented patients (n = 78): the final matched sample was significantly older (57.5 vs. 54.8 years, p = 0.0318) and included a significantly higher proportion of females (81.6 vs. 69.2%, p = 0.0213).
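The patient counts reported in this section can be retraced with simple arithmetic, as in the short Python check below. All counts are taken from the text; the variable names are illustrative. The derivation of the final 38 assumes that the three patients matched at 1:4 or greater were not carried forward, which the text implies but does not state explicitly.

```python
eligible = 245
consented = 78
blocked = 69  # matched on age, gender, and three-digit ZIP code

# RCT patients with at least one sufficient match, by matching ratio
by_ratio = {"1:1": 23, "1:2": 11, "1:3": 7, "1:4+": 3}

with_sufficient_match = sum(by_ratio.values())         # 44
hierarchy_applied = by_ratio["1:2"] + by_ratio["1:3"]  # 18 patients with 2-3 matches
unresolved = 3                                         # could not be reduced to 1:1
final_matched = by_ratio["1:1"] + hierarchy_applied - unresolved  # 38


def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * numerator / denominator, 1)


consent_rate = pct(consented, eligible)                    # 31.8
sufficient_rate = pct(with_sufficient_match, consented)    # 56.4
linkage_rate = pct(final_matched, consented)               # 48.7
```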

Discussion

Long-term follow-up is often an essential component of clinical trial research in chronic diseases, including RA, as both the benefits and the adverse effects of treatment may take years to emerge in the real world. The linkage of clinical trial data to administrative claims data leverages existing data to create a unique linked claims-clinical database that pairs rich clinical information with long-term medical and prescription data, increasing the depth and breadth of research capabilities (including assessment of early real-world evidence of treatment outcomes).

This study explored a method of probabilistic linking of RA RCT patients to claims data in the real world. A total of 44 of the 78 (56.4%) consented patients had at least one sufficient match using the point scoring system; 23 of 78 (29.5%) patients matched at a ratio of one RCT patient to one Dx/LRx patient, 11 of 78 (14.1%) at a ratio of 1:2, seven of 78 (9.0%) at a ratio of 1:3 and three of 78 (3.8%) at a ratio of 1:4 or greater. To further improve matching, a variable hierarchy was applied to the 18 RCT patients with 2–3 matches. Overall, 38/78 (48.7%) baricitinib patients from the RCT data were successfully matched 1:1 to patients from claims data, demonstrating the feasibility of linkage, to a moderate degree, of RCT data to claims data. Probabilistic data linkage has also been demonstrated to inform research in other disease areas (e.g., cancer, chronic obstructive pulmonary disease, stroke, pediatric cardiovascular medicine) [10,11,12,13,14,15,16]. Hay et al. successfully demonstrated the feasibility of linking RCT data to administrative data in Canada with a linkage rate of 90.8% [10]. Another study by Strom et al. achieved a match rate of 79.8% by linking clinical trial data to Medicare inpatient discharge data [11]. Brennan et al. linked Medicare claims data to the clinical data from seven randomized cardiovascular trials and achieved a linkage rate of 55.6% [12].

Compared to previous studies using similar probabilistic linking approaches without relying on personal identifiers, our study achieved a slightly lower linkage rate (48.7%). One of the potential reasons for the observed difference could be related to the number of matching variables that were utilized. For example, in the study by Hay et al., probabilistic data linkage was based on birth date, patient initials, gender, diagnosis site, diagnosis date, and histology. The only matching variable the current study had in common with the Hay study was gender. In addition to the three required variables used in the current study (age, gender, and three-digit ZIP code of the RCT provider), another six variables of interest were selected for further matching according to a point scoring system in order to increase the matching accuracy, though this may have resulted in a relatively lower linkage rate. The difference in the source of the claims data may have been another driving factor in the observed differences in linkage rates. The administrative data used in most of the literature included inpatient claims. In the current study, however, inpatient data variables were not available, potentially yielding a lower linkage rate, as hospitalizations and their associated dates may be more specific than outpatient visits and prescriptions.

Results of the current study provide a framework for future research to obtain RWE through linkage of RCT patients’ non-identified information to real-world data (e.g., non-identified insurance claims databases). This framework may provide an opportunity to assess additional information from treated patients not collected within a clinical trial such as healthcare resource utilization and associated costs. These types of additional information allow for early RWE to inform future studies on the treatment’s associated outcomes (e.g., comparative cost-effectiveness, healthcare cost-offsets).

This study had several limitations and challenges. Obtaining informed consent from patients after the trial initiation limited the number of participants in this study and may have led to a sample that was not representative of the original RCT population. Differences in mean age and gender distribution were observed between the final matched sample and the initial sample of consented patients though these differences may be explained by the small sample sizes. Limited familiarity of clinical investigators with RWE studies contributed to challenges in enrolling greater numbers of patients for this particular study. This challenge highlights the importance of having a robust plan for patient consent/enrollment at the start of a clinical trial in case this type of subsequent research is of interest. The results of this study are limited further due to the inability to validate the matches. Validation was not feasible as PHI was not collected in this study (to preserve patient privacy). Unlike deterministic linkage methods, mismatches are possible with probabilistic linkage methods. For example, two different patients could be linked (i.e., a false-positive link) resulting in incorrectly reported outcomes, or two records from the same patient may not be linked (i.e., a false-negative link), resulting in potential missing information. Minimizing mismatches depends on the quality and unique identification of the information used in the linkage process. Lastly, unlike claims databases that are “closed” (i.e., with the exception of claims paid for in cash, all of the claims are captured for each patient from one data collection source), “open-source” claims databases (i.e., databases with multiple data collection sources), such as those used in the current study, often do not contain information related to patient enrollment into pharmacy and medical benefits limiting the ability to ensure complete capture of a patient’s claim history.

This study yielded a few recommendations for future research. Expanding the originator trial consent to include the possibility of this type of subsequent research during the trial design is recommended to accelerate the generation of RWE of newly launched molecules. Additionally, if linkage studies were incorporated into the original clinical study protocol, inclusion of comparative patient groups (i.e., data flags to indicate placebo or active treatment) could provide an opportunity to investigate the real-world impact of the treatment. The development of algorithms to validate the probabilistic linkage is needed. In addition, utilizing administrative claims data with patient enrollment information would help to minimize the limitation of incomplete patient claim histories.

Conclusions

The probabilistic linkage methodology presented in this study demonstrates the feasibility, at a moderate rate, of linking patients from clinical trials to real-world data, which can provide a means to assess additional information not usually captured/collected within a clinical trial (e.g., healthcare resource utilization and associated costs).