Background

Juvenile idiopathic arthritis (JIA) and uveitis can cause disability and increased comorbidity risk into adulthood if diagnosed late or treated ineffectively [1]. Patients’ day-to-day experiences are varied and cannot be fully captured in clinical trial settings; while real-world data can help answer many research questions, substantial resources and time are needed to collect these. New datasets derived from existing data have many benefits: maximising the availability of larger sample sizes that may not be feasibly collected individually, improving the generalisability and validity of research, and providing multi-disciplinary and multi-centre collaborative opportunities [2].

CLUSTER [3], a UK Research and Innovation Medical Research Council/Versus Arthritis funded multidisciplinary consortium, aims to improve personalised treatments and predict disease outcomes for JIA and JIA-uveitis through bringing together knowledge, studies, and data. It builds on the work of the MRC-funded CHART consortium (Childhood Arthritis Response to Treatment), which explored how to bring together clinical and biological data from 4 UK observational JIA research cohort studies to create a larger unified dataset for analysis of predictors of treatment response. CLUSTER aims to create a large-scale JIA data resource by harmonising existing data collected in clinical trials and real-world JIA cohort studies. Maximising CLUSTER’s clinical and biological data by successfully harmonising multiple datasets is integral to producing robust analyses with maximal power in this rare disease, and facilitates the goals of defining distinct strata across disease and treatment sub-groups.

As heterogeneous datasets are often collected autonomously, for specific analytical objectives and not in coordination, as well as being nuanced (requiring prior knowledge of data capture methods and coding), a key challenge is in managing and combining disparate, non-standardised datasets. Different systems, data structures, and cultural barriers, such as apprehension to share data and restrictions related to ethical, legal and consent-procedures, are also very common [2, 4].

There are many ways to bring together disparate datasets for analysis, such as data pooling or federated data analyses with subsequent meta-analyses of results: all approaches requiring the critical step of data harmonisation. Local data laws and study specific governance may dictate to what extent data pooling and linkage can occur; but where pooling is planned, knowledge of the potential of duplicated subjects across datasets is also key. Data linkage, where data are combined from two or more sources of data with the objective of consolidating facts concerning an individual or event that are not available in any separate record [5], will also enhance the final dataset, although substantial data cleaning, wrangling, and computational resources may be required.

Objective

In 2021, data from 4 JIA datasets under the CLUSTER umbrella were successfully pooled and made available for analyses. Here, we describe and evaluate the current data harmonisation processes derived as part of CHART and CLUSTER, and highlight how this enables the research and data sharing goals.

Methods

Community

CLUSTER is a multi-disciplinary consortium made up of clinicians and researchers in JIA, uveitis, molecular science, epidemiology, bioinformatics, and data science. Through collaborating with the CLUSTER Champions (our patient and public involvement group), our independent international Scientific Advisory Board, and our industry partners, we involve and are able to capture the needs of all relevant stakeholders in our work to improve treatment outcomes for children and young people with JIA and/or JIA uveitis.

Source data

CHART was the starting point for this ambitious initiative, which required the identification and extraction of data related to JIA treatment and treatment response. Study-specific metadata, including inclusion criteria and data captured in four existing national JIA studies (Table 1; Childhood Arthritis Prospective Study (CAPS) [6], Childhood Arthritis Response to Medication Study (CHARMS) [7], Biologics for Children with Rheumatic Diseases Study (BCRD), BSPAR-Etanercept registry (BSPAR-Et) [8]) was reviewed and brought together into a single, pooled common data model (CDM). These studies were chosen as it was known that they all captured details of JIA disease characteristics, treatment exposures and treatment response, as well as biologic data. These 4 key studies are summarised in Table 1. CLUSTER expands beyond the four CHART studies and also aims to bring in data from other UK studies including JIA-Pathogenesis Study, UK JIA Genetics Consortium (UKJIAGC)) and two clinical trials of new treatments for JIA-Uveitis (SYCAMORE [9] & APTITUDE [10]).

Table 1 Summary of key CLUSTER studies

Data harmonisation

Establishing the Common Data Model

Key data items to include in the CDM were initially determined through a combination of a literature review, biological plausibility, data and metadata items and by reviewing individual data items and their harmonisation potential. It was deemed important to review the data available on the agreed JIA Core Outcome Variables (COV) [11] in order to calculate established JIA disease activity measures (such as the Juvenile Arthritis Disease Activity Score (JADAS) [12] and the American College of Rheumatology (ACR) Paediatric response criteria [11]). A core and feasible treatment CDM, including common coding, based on clinical measures was defined and agreed.

Agreeing the datasets’ purpose

The initial 4 JIA datasets had permissions for data sharing and therefore pooled datasets would be created to facilitate analyses. It was decided that initial distinct datasets of children starting methotrexate (MTX) and tumour necrosis factor inhibitors (TNFi), the 2 most common systemic treatments for JIA, would be created. Each dataset was mapped and transformed to the CDM prior to pooling. Differing levels of granularity were accounted for by combining data using the most inclusive definition. Details on CDM items available in each study are detailed in Table 2, and Supplementary Fig. 1 gives an overview of the data flow and clinical dataset creation process in CLUSTER.

Table 2 Availability of CDM elements across the CLUSTER consortium studies

Our source datasets were all longitudinal and facilitated time point analyses for the study of treatment response. Optimal time points were defined based on individual study designs and the collection time points of data items; baseline (drug start) and 6 months were selected as the best fit to the clinical research question and the data available. As each study allowed flexibility around data collection to fit around routine hospital visits, a window of 3 months before drug start for baseline and 3–12 months for follow-up were defined to capture data that was collected around the chosen time points; if there were multiple visits in this 3–12 month period, the visit closest to 6 months was selected. Some studies collected data more frequently and if a patient had multiple entries of data within the time point window, their closest measurement to baseline/6 months was used.

Data harmonisation hurdles

Some key variables of interest were not available in all studies (e.g. pain visual analog scale (VAS), antinuclear antibodies (ANA) and human leukocyte antigen (HLA) B27), either at the study level or missing at the individual level. Where this occurred, data was recorded as missing and plans to use either complete case analyses or various imputation methods (where appropriate), such as multiple imputation by chained equations (MICE), would be used. Each drug-specific dataset had the same patient inclusion/exclusion criteria applied (diagnosis of JIA, classified by International League Against Rheumatism (ILAR) subtype; treatment-naïve to MTX/TNFi (dataset dependent); MTX/TNFi continued for at least 3 months after starting; at least one COV with no missing data). These were as broad as possible to give flexibility in deciding dataset specificity based on analysis requirements (e.g. reducing time point parameters for timepoint 2, i.e. selecting COV data from 4–8 months after treatment start instead of 3–12 months).

Identifying duplicate participants

Given that there was significant overlap in the time period for data collection and location of study sites across the 4 studies, there was a high possibility of the same participants taking part in more than one study. Conducting data linkage in a well-organised secure environment is essential, and using unique identifiers as linkage criteria (deterministic linkage [13]) can reduce the probability of incorrectly linking individuals. However, limitations such as identifier accuracy can result in missed matches and unseen bias [14].

An approach to identify these subjects, and a strategy for combining individual-level data, was needed. All studies captured either the UK National Health Service (NHS) (England, Wales, Northern Ireland) or Community Health Index (CHI) number (Scotland), a number unique to each person in the UK, assigned at birth or at point of immigration/registration with the NHS. The NHS/CHI number is a unique identifier which falls under protection from UK General Data Protection Regulation (GDPR) and therefore can facilitate data transfer and the identification of duplicates but could not be shared with third parties.

We used OpenPseudonymiser [15], an open source software, hosted in secure research data storage settings at the University of Manchester and University College London which are both certified to NHS Data Security & Protection Toolkit standards. OpenPseudonymiser processes CSV files and pseudonymises identifiable fields such as NHS numbers. It checks the validity of the NHS number and encrypts it to produce a string of output characters, known as the digest. A salt file means the digest output is unique to CLUSTER – any project also using OpenPseudonymiser on the same list of NHS numbers will not produce the same pseudonymised output, unless they use our salt file. The digest can then be used to link data across our cohort studies, as the same NHS number with the same encryption will produce the same digest.

Dealing with duplicate participants

Each participant was assigned a unique CLUSTER identifier based on their digest; this digest is unique for each individual and identifies which children participated in multiple studies. These matches were also confirmed additionally against existing known duplicate lists (created from internal NHS number and/or genetic comparisons), where any identified errors in NHS numbers, sample labels, and duplicate pairs were corrected.

Not all duplicate records resulted in duplicate data, as children could enter the various studies at different stages of their disease (e.g. at disease onset to CAPS, at start of etanercept to BSPAR-ETN). Where duplicate records of the same child containing common data across the same treatment were identified, we agreed on a broad set of hierarchical rules to decide which records to keep (e.g. if a child had records in CHARMS and CAPS, we kept their CHARMS record as CHARMS was set up to evaluate treatment response and overall the data had lower missingness across the core variables), as shown in Supplementary Fig. 2. We considered merging data cross-study if an individual had duplicate records with missing data points; this could have flaws which impact on analysis, such as treating data from different dates/studies during the treatment pathway as though from the same time point/study, and would involve making individual-level decisions on which data to keep for hundreds of individuals. We decided to take a pragmatic approach and keep/remove duplicate data on an individual-level, not a variable-level.

Data platform

CLUSTER facilitates data access to consortium members and external researchers using the open source tranSMART data warehouse platform [16] (Supplementary Fig. 3). TranSMART presents clinical and biological data in an integrated and easily accessible web interface which facilitates data exploration, cohort identification, and complex queries for hypothesis generation and validation. The final CLUSTER treatment datasets will be stored in tranSMART and a data access policy is in place.

Results

Data harmonisation

Overall, a total of 7013 records (from 5435 individuals) were identified across the 4 studies; 2882 records (41%, corresponding to 1304 individuals) represented the same child across the 4 studies: 197 individuals had multiple treatment records within 1 study, 961 across 2 studies, 142 in 3, and 4 children had records in all 4 studies. The crossover of duplicate records is shown in Fig. 1.

Fig. 1
figure 1

Distribution of individuals with more than one record at any timepoint (“duplicates”) across CLUSTER studies

Two harmonised, pooled clinical datasets have been created: MTX starters (n = 2889) and TNFi starters (n = 2401), after removing 250 MTX and 605 TNFi duplicate records respectively; 1018 patients are included in both datasets having started both treatments at different, consecutive timepoints. Both harmonised datasets are accessible on the tranSMART platform with accompanying data and metadata documentation.

Table 3 describes key clinical characteristics in the CLUSTER MTX and TNFi datasets compared to the source cohort studies. Key characteristics are broadly similar across the studies. Where there are more noticeable differences, these can potentially be explained through our hierarchical choice on which data to keep out of duplicate records and the nature of each CLUSTER dataset (i.e. each dataset wants to capture patients who have taken MTX/TNFi for the first time).

Table 3 Key clinical characteristics across the CLUSTER harmonised datasets and the original cohort studies

Data missingness

Although combining datasets increased the sample size considerably, it did not significantly improve levels of missingness, despite attempting to choose the record with the lowest missingness for children with duplicate records (Fig. 2). This likely reflects the fact that these real-world data were extracted from the same medical records and deposited into parallel studies with a high level of existing overlap in their case report forms. Some missingness was expected, particularly in variables that are not routinely collected in clinical visits, and in many cases, higher levels of missingness related to differences in the individual study design (e.g., ANA and HLA-B27 not captured in all studies).

Fig. 2
figure 2

Percentage of missingness across key variables in the CLUSTER MTX and TNF datasets

MTX Methotrexate, TNF Tumour necrosis factor inhibitors, T1 Timepoint 1 (closest values to baseline; allowed -3 months to drug start), T2 Timepoint 2 (closest value to 6 months after drug start; allowed 3–12 months after drug start), JIA Juvenile idiopathic arthritis, RF Rheumatoid factor, HLA B27 Human leukocyte antigen B27, ANA Anti-nuclear antibody, CHAQ Childhood health assessment questionnaire, ESR Erythrocyte sedimentation rate, CRP C-reactive protein, VAS Visual analogue scale

Data crossover

A significant number of these participants also have stored biological samples – the percentages of children who gave biological samples in each original study are shown in Table 1. CLUSTER is currently conducting several analyses which include biological data – by using the CLUSTER ID to identify children included in multiple analyses, we can see the extent of data overlaps within CLUSTER. For example, our ongoing genome-wide association study (GWAS) on JIA patients who started first-line MTX includes 44% of patients in the MTX dataset, and 19% of those in the TNFi dataset, and our ongoing uveitis HLA-B27 fine-mapping study includes 48% and 33% of the MTX and TNFi cohorts, respectively.

Discussion

Successes

Data from over 5400 individual patients with JIA were harmonised to create prospective detailed JIA treatment datasets at a scale rarely seen – the highest number of participants in one of the contributing cohort studies is around 2000 across both MTX and etanercept (BSPAR-Et), compared to 2899 in this MTX and 2401 in this TNFi dataset. Many of these studies continue to recruit patients and collect further data; by logging the processes to create the existing dataset, it can be updated at intervals to expand it further. This is invaluable to progressing meaningful JIA research, particularly into personalised treatments and disease outcomes. Integration has added depth, enables big-data approaches such as machine learning, and highlights inconsistencies that would not be apparent in the individual datasets.

Encrypting and matching duplicates led to improvements in identifying erroneous NHS numbers and biological sample labels in the original studies. Using encrypted NHS numbers and pseudonymised study IDs maximised data usage through pooling individual treatment data from multiple sources and time points to create a more complete picture of a patient’s treatment pathway. This process can also bring in further data as it is generated or discovered. With a common unique identifier facilitating data pooling, larger datasets can now be anonymised and shared with external collaborators and third parties.

Building CLUSTER into a multi-disciplinary community was key in achieving our goals, particularly the early involvement of informatics and data science professionals. These datasets also provide the opportunity to expand our community and link with established consortia such as IMID-BIO-UK [17] to facilitate cross-disease comparison. The additional inclusion of public datasets from the Gene Expression Omnibus (GEO) repository will allow cross-comparison and confirmation in external datasets.

Challenges

Whilst the duplicate patient identification process appears to be accurate, mismatches were identified. As some of these resulted from inaccurate recording of the NHS number in the original study database, it is possible to unknowingly miss duplicate pairs. It is also possible to miss duplicate pairs if a patient is missing an NHS number in one study, though this is a rare occurrence.

Harmonising data is a laborious process as each study is nuanced and significant data cleaning is needed to account for this. Losing specificity impacts detail available for analysis and broad duplicate removal rules could be disadvantageous. For example, where duplicate records existed, we kept CHARMS data over other studies, but automatically lost some pain VAS outcome measures as these are not collected in CHARMS, and the records were retained at the person level and not at the variable level. The impact of this may be an area for future data science research as the impact on our prediction studies has not been fully realised. We also lost granularity, e.g. ethnicity had to be coded in the final CLUSTER dataset as Caucasian/Non-Caucasian as that was the least granular classification across all studies, but much more detailed ethnicity information is available in some studies.

When creating harmonised datasets from existing observational studies, missing data are expected. Our aim was to maximise dataset sizes by avoiding limiting to complete cases only; something that would only be needed for some comprehensive measures of JIA disease activity change. Including those with some missing data retains statistical power and reduces potential biases. However, this could mean that established and validated JIA disease scores cannot be used in some circumstances if missing data are high; though this issue would also exist in the source data. If we choose to apply imputation methods, we can use all available data and make unbiased estimates of expected values, thereby providing more validity than ad hoc approaches to missing data while preserving our sample sizes and power. Imputation methods could also facilitate the inclusion of certain variables within larger analyses that were not collected at all in the source data.

Conclusion

Data pooling and harmonisation are important tools for research, enabling the development of larger, richer datasets which contain detailed treatment response data across patients’ treatment pathways. CLUSTER has succeeded in integrating large, complex JIA datasets and provides a useful reference to similar future projects. Agreeing a framework pre-integration was essential – focusing on a specific, well-defined research question for each dataset meant they were manageable and tailored to their intended use, whilst easily enabling adjustments. Additionally, CLUSTER’s collaborative process was pivotal as data integration on this scale requires a committed, knowledgeable, and diverse community.

However, there are many challenges to consider: time/costs, false linkage, loss of detail, the introduction of errors, systematic biases, and missingness. It is important these limitations are recognised to avoid misinterpretation of findings. Transparent and consistent reporting and appraisal of linked datasets can assist in improving future data collection, coding practices and linkage processes. This again highlights the importance of standardised data collection in the clinical setting.

Ongoing and future studies in JIA should focus on FAIR (findable, accessible, interoperable, reusable) principles [18] to ensure data utility in research outside of initial study plans. One potential solution is to use a consensus-agreed core outcome dataset, which is then widely implemented in clinical care, captured in electronic patient records that are compatible with fast, efficient data download (with appropriate consent for research) such as the one created by CAPTURE-JIA [19].