CPRD Transformation to the OMOP CDM
For the transformation, we used the CPRD version that contained data collected through July 29, 2013 and began by designing an extraction, transformation and loading process.
Table 1 provides the CDM table names, descriptions and CPRD source data tables for all CDM tables that had equivalent data in CPRD. We sought to populate all these CDM fields with the appropriate CPRD data. Not all patients from the CPRD raw data were included in the CPRD CDM; only those who met the CPRD-provided definition of a valid patient for research purposes were included (met the acceptability criterion and had observation time in the database). Of 15,000,986 patients in the raw CPRD data, 11,342,669 (75.6 %) met the definition for inclusion in the CPRD CDM. Additionally, by convention, data outside the patient’s valid observation period are not converted to the CDM; 23 % of drug exposures, 35 % of conditions, 27 % of procedures and 16.7 % of observations fell outside the patient’s valid observation period and were not included in the CPRD CDM. The overwhelming majority of CPRD data outside valid observation time are medical history data, that is, data recorded before the patient joined the practice or before the practice data were classified as ‘up-to-standard’. One notable omission of CPRD source data from the CPRD CDM is referral information such as specialty and urgency (referral conditions, procedures and observations and their event dates were captured). In the following, we report the mapping difficulties, required imputation and structural differences between the two data sources that we encountered.
Multilex to RxNorm Mappings
RxNorm, a US-based normalized naming system for generic and branded drugs, is the standard drug lexicon used in the CDM. CPRD uses Multilex codes to identify medications. All content (e.g. conditions, procedures) in the OMOP CDM is referred to by concepts. The OMOP Standard Vocabularies are used to understand and make use of these concepts. For the CPRD CDM conversion, we assigned an RxNorm concept to each Multilex source code using mappings from version 4.3 of the OMOP Standard Vocabularies. These mappings used the Multilex code components that identify the ingredient, strength and form of each drug and then constructed a mapping to equivalent RxNorm components. A full mapping to an RxNorm product was established if all components could be mapped and a product in RxNorm existed with the same combination of ingredient, strength and form. Where a product available in the UK was not available in the US (because of strength or formulation differences) or an ingredient approved by the European Medicines Agency (EMEA) had not been approved by the US Food and Drug Administration (FDA), a new Multilex concept with all the attributes of an RxNorm concept (ingredient, strength and form) was created for inclusion in the OMOP standard dictionaries.
Validation efforts for drug exposure mappings included generating the proportion of database records in the CPRD CDM drug exposure file mapped to RxNorm concepts and the proportion of terms (Multilex codes that exist in the raw CPRD data) mapped to RxNorm concepts. In addition, the 100 most frequently occurring therapies in CPRD were reviewed for mapping completeness and accuracy, and the 100 most frequent unmapped therapies were evaluated to determine whether mappings were in fact possible for these high-frequency codes. To test whether CDM information loss may occur through avenues other than mapping losses, drug prevalences for all database years were compared between the CPRD raw data and the CPRD CDM. We wrote a SQL program to estimate prevalence against the raw schema, and independently wrote a different SQL program to estimate prevalence against the CDM. Because Multilex to RxNorm mappings are 1:1 in only 13 % of cases, we applied the mappings to the Multilex codes in the CPRD raw data that occurred during valid patient observation time. For the raw data prevalences, we included only acceptable patients with valid observation time, and drug exposure dates had to fall within the patient’s observation period.
Ingredient Information for Drug Products
In the OMOP Standard Vocabularies, RxNorm clinical drug (drug product) concepts that contain strength and formulation information have relationships with ingredient concepts. This allows drug exposure data to be aggregated by ingredient to create drug eras, each of which can be described as an inferred period of continuous exposure to a certain ingredient, derived with a 30-day persistence window (the duration allowed between subsequent drug records). It is important to note that the OMOP CDM applies this standard convention for deriving drug eras consistently across all databases. However, if a specific analysis use case requires a different set of assumptions for inferring continuous episodes of exposure, the CDM can accommodate this through the drug exposure table. If a clinical drug to ingredient relationship was not provided in RxNorm, then that drug was not included in the drug era file, which contains drug eras as described above. To assess the impact of this on the CPRD to CDM conversion, we evaluated the percentage of CPRD CDM drug exposure clinical drug records that had no ingredient relationship, and the proportion of drug exposures affected.
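The era-building convention described above can be illustrated with a minimal sketch. It assumes exposures for a single patient and a single ingredient are already available as (start, end) date pairs, and that a gap of exactly 30 days still falls within the persistence window; the function name and the input layout are hypothetical and do not represent the OMOP implementation.

```python
from datetime import date

# Gap allowed between consecutive drug records, per the 30-day convention in the text.
PERSISTENCE_WINDOW_DAYS = 30


def build_drug_eras(exposures, window_days=PERSISTENCE_WINDOW_DAYS):
    """Collapse (start, end) exposures for one patient and one ingredient into eras.

    Assumption: a gap of `window_days` or fewer days joins two records.
    """
    eras = []
    for start, end in sorted(exposures):
        if eras and (start - eras[-1][1]).days <= window_days:
            # Within the persistence window: extend the current era.
            eras[-1] = (eras[-1][0], max(eras[-1][1], end))
        else:
            # Gap too long (or first record): open a new era.
            eras.append((start, end))
    return eras


example_exposures = [
    (date(2000, 1, 1), date(2000, 1, 28)),
    (date(2000, 2, 10), date(2000, 3, 9)),   # 13-day gap: same era
    (date(2000, 6, 1), date(2000, 6, 28)),   # 84-day gap: new era
]
example_eras = build_drug_eras(example_exposures)
```

Passing a different `window_days` value corresponds to the alternative persistence assumptions that the text says can be applied directly to the drug exposure table.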
Drug Exposure Duration Imputation
In the CPRD data, prescription duration is not a required field, and only 7 % of drug exposures are recorded with a duration value. Drug quantity is recorded more consistently (99.3 % of all drug exposures have a valid quantity value) and a CPRD-derived numeric daily dose field is provided for drug exposures. However, 26 % of numeric daily dose values are invalid, primarily for prescriptions with instructions to take ‘as needed’ or ‘as directed’ with no clear daily dose indicated. In addition, medications such as inhalers may have the number of containers recorded in the quantity field, which will not yield a valid duration when divided by the numeric daily dose. Thus, we imputed drug duration for all exposures with invalid duration values of 0 days (93 % of data) or >365 days (0.0004 % of data). We performed the imputation stepwise:
If the CPRD duration field was invalid, we used the most common valid duration in the data for that combination of product, numeric daily dose, quantity, and number of packs given.
If such a combination did not produce a valid duration value in the data, then the most common valid duration in the data for the product only was used.
Last, if there were no valid durations in the data for a particular product, we set the duration to 1 day.
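The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the conversion: records are represented as dictionaries with hypothetical keys (`product`, `ndd`, `qty`, `packs`, `duration`), a duration is treated as valid when it lies between 1 and 365 days, and ties for the most common duration are broken arbitrarily.

```python
from collections import Counter


def is_valid(duration):
    # Durations of 0 or more than 365 days are treated as invalid, as in the text.
    return 1 <= duration <= 365


def most_common_valid(durations):
    valid = [d for d in durations if is_valid(d)]
    return Counter(valid).most_common(1)[0][0] if valid else None


def build_lookups(records):
    """Precompute the most common valid duration per combination and per product."""
    by_combo, by_product = {}, {}
    for r in records:
        combo = (r["product"], r["ndd"], r["qty"], r["packs"])
        by_combo.setdefault(combo, []).append(r["duration"])
        by_product.setdefault(r["product"], []).append(r["duration"])
    combo_mode = {k: most_common_valid(v) for k, v in by_combo.items()}
    product_mode = {k: most_common_valid(v) for k, v in by_product.items()}
    return combo_mode, product_mode


def impute_duration(record, combo_mode, product_mode):
    if is_valid(record["duration"]):
        return record["duration"]            # recorded value is usable
    combo = (record["product"], record["ndd"], record["qty"], record["packs"])
    return (
        combo_mode.get(combo)                # step 1: same product, dose, quantity, packs
        or product_mode.get(record["product"])  # step 2: same product only
        or 1                                 # step 3: default of 1 day
    )


sample = [
    {"product": "A", "ndd": 1.0, "qty": 28, "packs": 1, "duration": 28},
    {"product": "A", "ndd": 1.0, "qty": 28, "packs": 1, "duration": 28},
    {"product": "A", "ndd": 1.0, "qty": 28, "packs": 1, "duration": 0},
    {"product": "A", "ndd": 2.0, "qty": 56, "packs": 1, "duration": 0},
    {"product": "B", "ndd": 0.0, "qty": 1, "packs": 1, "duration": 0},
]
combo_mode, product_mode = build_lookups(sample)
imputed = [impute_duration(r, combo_mode, product_mode) for r in sample]
```

Precomputing both lookups once keeps the per-record step to two dictionary reads, which matters when the same fallback chain is applied to every drug exposure in the database.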
For validation purposes, we identified and examined problematic imputations after filtering out credible records with durations of 28 or 30, those with absolute difference no greater than 5 between quantity/numeric daily dose and imputed duration, and numeric daily dose = 0 (implies duration will be difficult to assess). We also examined separately database records with a valid numeric daily dose (>0) to calculate proportions of database records with imputed duration equivalent to quantity/numeric daily dose and proportions of database records with absolute difference no greater than 5 between quantity/numeric daily dose and imputed duration.
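One way to express the filter for problematic imputations is sketched below. The function name, argument names and the order in which the rules are checked are assumptions made for illustration; only the three exclusion rules themselves come from the description above.

```python
def needs_review(imputed_duration, quantity, numeric_daily_dose, tolerance=5):
    """Return True if an imputed duration should be examined as potentially problematic."""
    if imputed_duration in (28, 30):
        return False  # credible, commonly prescribed durations
    if numeric_daily_dose == 0:
        return False  # no usable daily dose, so the duration is difficult to assess
    expected = quantity / numeric_daily_dose
    # Flag only when the imputed value differs from quantity / daily dose by more than 5.
    return abs(expected - imputed_duration) > tolerance


flags = [
    needs_review(28, 100, 1),  # credible duration
    needs_review(14, 56, 0),   # numeric daily dose of 0
    needs_review(10, 28, 2),   # expected 14, difference 4
    needs_review(7, 56, 1),    # expected 56, difference 49
]
```

Checking for a zero daily dose before dividing avoids a division error on exactly the records the text excludes.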
CDM Domain Classification Efforts
The Read dictionary version 2 is a coded thesaurus of clinical terms, in use in the UK National Health Service (NHS) to capture all aspects of patient care, including diagnoses, symptoms, findings, procedures, laboratory tests and care administration. This contrasts with coding systems in US claims databases that typically provide separate dictionaries for diagnoses and procedures and/or a way to distinguish between the two. In addition, US claims databases generally place codes for procedures and diagnoses in separate fields while Read codes are placed in one field with no domain information provided from the data structure or the Read code itself. Therefore, a domain classification effort for all Read codes was necessary to partition Read code records into the appropriate condition, procedure and observation CDM domains.
A method making use of the hierarchical nature of the Read dictionary was devised to perform this partition. Read codes consist of five hierarchical levels, with each higher level functioning as the ‘parent’ of the next lower ‘child’ level. The first level contains the Read chapter, which provides a crude indication of domain (e.g. Read chapters A–Z usually indicate conditions, whereas chapter 7 indicates procedures). Though there were multiple domain types within chapters, the first four levels could be used to identify domains systematically. Therefore, all Read codes with the same values in the first four levels were reviewed manually by a clinician and classified to the same CDM domain in the OMOP Standard Vocabularies. To validate this method, the 100 most frequently occurring conditions, procedures and observations in CPRD were reviewed for domain classification accuracy.
Read to SNOMED-CT Mappings
In the CDM, the systematized nomenclature of medicine-clinical terms (SNOMED-CT) is the standard lexicon for conditions, procedures and observations. It provides a collection of medical terms with codes for anatomy, diseases, findings, procedures and other domains. For this CPRD CDM transformation, we applied Read to SNOMED-CT mappings provided by the NHS.
We validated this approach by generating the proportion of database records mapped to SNOMED-CT concepts in the CPRD CDM and the proportion of terms (Read codes found in the CPRD raw data) mapped to SNOMED-CT concepts for the condition occurrence, procedure occurrence and observation files. We also reviewed the 100 most frequently occurring conditions, procedures and observations for mapping completeness and accuracy. The 100 most frequent unmapped conditions, procedures and observations were evaluated to determine whether mappings were in fact feasible for these high-frequency codes.
Information loss for conditions was also assessed by comparing condition prevalences for all database years in the CPRD raw data and the CPRD CDM. This was accomplished with a SQL program that estimated prevalence against the raw schema, and an independently written second SQL program that estimated prevalence against the CDM. We examined condition Read codes that occurred during valid patient observation time. To estimate prevalence, we included patients who had an indicator flag for being an ‘acceptable’ patient and had valid observation time. We considered all condition occurrences where the condition dates fell within the patient’s valid observation period and compared the Read code-based prevalence from the raw source with the SNOMED-CT-based prevalence from the CDM. Read codes were analyzed in this manner separately in three groups: (1) those that had a 1:1 mapping with SNOMED-CT concepts; (2) Read codes with the same text description apart from ‘NOS’ (not otherwise specified), which were grouped in the raw source; and (3) conditions with more than one Read code, for which the Read-to-SNOMED-CT mappings were applied to the CPRD raw data to produce a condition prevalence estimate.
CPRD Lifestyle and Clinical Data
Valuable patient lifestyle information, such as smoking status and body mass index, and clinical measurements, such as blood pressure and laboratory results, are provided in the CPRD data. Because lifestyle and clinical information are potential confounders in observational studies, and laboratory results may be useful for assessing disease status, it was important to include them in the CPRD CDM. CPRD raw lifestyle/clinical data are housed in two tables; within these two tables each data category (e.g. smoking) has a varying number of data elements (e.g. status, cigarettes per day, cigars per day) and these data elements are associated with varying lookups. We created an algorithm to process all data elements in the same manner despite the unusual format described above. Custom source codes were constructed from the data category and data element information and mapped to the Logical Observation Identifiers Names and Codes (LOINC) dictionary concepts (e.g. source code of ‘4–2’ was assigned a source code description of ‘Lifestyle Smoking Cigarettes per day’) for the implementation.
The algorithms were validated by examining patients with a representative mix of data element types in the raw data against the same patients in the resulting CPRD CDM and by having a second programmer independently code and execute the algorithm and confirm that the results agreed.
Replication Study Methods
A replication of a previously published study by Schlienger et al. was performed using both our instance of the raw CPRD data and the transformed CPRD CDM to compare the results; agreement would serve to further validate the accuracy of our CPRD CDM transformation. Because of changes to the data since the original study was published, we did not expect the results of our raw data study to agree perfectly with those reported in the original paper.
Raw Data Analysis
Cases in the Schlienger et al. study had an incident AMI between January 1, 1992 and October 31, 1997. Each patient’s observation period began at the later of the date the patient’s current period of registration with the practice began and the date the practice was deemed to be of research quality, and ended at the earliest of the date the patient transferred out, the date of last collection of practice data and the patient’s date of death; incident AMI diagnoses had to fall within the patient’s observation period. Patients were required to be aged ≤75 years at the date of their AMI (the index date), to have an observation period that began at least 3 years before that date, and not to have had one of the following diagnoses between the start of their observation period and 60 days before their index date: AMI, angina pectoris, unexplained chest pain, cardiac arrhythmias, congestive heart failure, stroke, intermittent claudication, venous thromboembolism, chronic renal disease, hypertension, hyperlipidemia, diabetes mellitus, or connective-tissue disorder.
Read code lists for the condition classes referenced above were created using relationships available within the OMOP Standard Vocabularies; all Read code and Multilex code lists for the raw data study mentioned herein are provided in Online Resource 1. Generally, we used the OMOP Standard Vocabularies rather than the CPRD-provided dictionaries to generate source code sets for the raw data study, because the former allow relationships between clinical concepts to be leveraged so that source codes with different terminologies for the same clinical concept can be identified. Standard string searches, the alternative, require a priori knowledge of all possible terminologies. Because some of the condition classes used for the prior history exclusion were broad, higher-level SNOMED-CT classification concepts or MedDRA (Medical Dictionary for Regulatory Activities) High-level Group or High-level Term concepts were used to extract Read source codes from the OMOP Standard Vocabularies. For example, to gather all cardiovascular disease Read codes for the prior history exclusion, the SNOMED-CT term ‘Cardiovascular disease’ was used. All cardiovascular disease concepts hierarchically ‘below’ this concept were identified and Read source codes were generated from these concepts. A manual review of these codes was performed to make sure unwanted clinical concepts were not included.
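The observation period rule can be sketched as below. The function and argument names are hypothetical, and the sketch assumes that a missing transfer-out or death date is represented as `None` while the last collection date is always present.

```python
from datetime import date


def observation_period(registration_start, research_quality_date,
                       transfer_out, last_collection, death):
    """Return (start, end) of the valid observation period, or None if it is empty."""
    # Start: the later of registration start and the research-quality date.
    start = max(registration_start, research_quality_date)
    # End: the earliest of the available end-defining dates (None means not applicable).
    end = min(d for d in (transfer_out, last_collection, death) if d is not None)
    return (start, end) if start <= end else None


def within_period(event_date, period):
    return period is not None and period[0] <= event_date <= period[1]


period = observation_period(
    registration_start=date(1990, 5, 1),
    research_quality_date=date(1991, 1, 1),
    transfer_out=None,
    last_collection=date(1999, 12, 31),
    death=date(1998, 3, 15),
)
ami_in_period = within_period(date(1995, 6, 1), period)
```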
As the original analysis specified, four controls were chosen per case and matched on index date, year of birth, gender, physician practice attended and total observed time prior to the index date. The same exclusions applied to the cases were applied to the controls. As an additional sensitivity analysis, we required controls to exhibit visit activity up to 1 year prior to the index date in addition to the original matching criteria.
NSAID exposures included the following ingredients: acemetacin, diclofenac, diflunisal, etodolac, fenbufen, fenoprofen, flurbiprofen, ibuprofen, indomethacin, ketoprofen, mefenamic acid, nabumetone, naproxen, piroxicam, sulindac, tenoxicam and tiaprofenic acid. Code lists were generated with the OMOP Standard Vocabularies using the NSAID ingredient concepts above to identify applicable Multilex codes. String searches in the Multilex dictionary by ingredient were also conducted to identify any Multilex codes that may have been missed, for example because they were not mapped in the OMOP Standard Vocabularies. The NSAID exposures were required to start prior to the patient’s index date and within the patient’s valid observation period. Patients were defined as ‘current users’ if the supply of their last NSAID prescription prior to the index date ended at or after the index date, ‘recent users’ if their supply ended 1–29 days before the index date, ‘past users’ if their supply ended 30 or more days prior to the index date and ‘non-users’ if they had no NSAID records prior to the index date. To classify patients in this way, it was necessary to calculate the duration of each NSAID drug exposure. We used the same duration imputation for the CDM analysis and for the raw data analysis to facilitate comparison. Patients were also classified according to the number of NSAID prescriptions (a proxy for total duration of NSAID therapy) during the patient’s valid observation period. ‘Current users’ were also classified by ingredient.
Potential confounders, namely body mass index (BMI), smoking status, current aspirin use and long-term hormone replacement therapy (HRT), were assessed manually in the original study from patient profiles. In our analysis, we extracted this information programmatically. BMI (categories: <25, 25–29.9, ≥30 and Unknown) and smoking status (categories: Non, Current, Ex and Unknown) were obtained from the CPRD lifestyle and clinical measurements data in the patient’s observation period prior to the index date. Aspirin Multilex codes were generated from the OMOP Standard Vocabularies with an ingredient search. The duration of each aspirin exposure was calculated using the algorithm described above for the CPRD CDM and NSAID exposures. HRT codes were generated from the CPRD drug data dictionary with a BNF (British National Formulary) chapter search (06.04.01.01: Oestrogens and HRT). If the patient had 10 or more HRT prescriptions, she was considered to have been on long-term HRT. A conditional logistic regression model was run for the matched case–control sets to assess AMI risk with the different NSAID categories, adjusted for BMI, smoking, current aspirin use and long-term HRT, and odds ratios (OR) were reported with 95 % confidence intervals.
We then created a replica of the raw data analysis using the CDM data, including the same case–control analysis to evaluate the risk of first-time AMI associated with NSAID exposure. All source code lists used in the raw data analysis were converted to OMOP concepts using the OMOP Standard Vocabularies; all concept code lists mentioned herein used for the CDM analysis are provided in Online Resource 2. The CDM drug era aggregate file was searched using the NSAID ingredient concepts listed above in the ‘NSAID exposures’ section of the raw data study. NSAID Multilex codes that were identified via CPRD dictionary string searches for the raw data analysis but had no mappings and/or valid relationships to ingredients could not be included here, because drug exposures in CDM drug eras are collapsed by drug ingredient. The number of NSAID prescriptions per patient was calculated as the total number of prescriptions used to create all NSAID drug eras for the patient. The HRT code list used in the CPRD raw data analysis was converted to ingredients using the OMOP Standard Vocabularies and the drug era file was used to determine HRT exposures. Aspirin exposures were identified using the drug era file and the concept for aspirin.
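The exposure categories defined above can be written as a small function. This is a sketch: the function name is hypothetical, and it assumes that the end of supply for the last NSAID prescription before the index date has already been derived from the prescription start date and its (imputed) duration, with `None` indicating that the patient has no NSAID records before the index date.

```python
from datetime import date


def classify_nsaid_user(index_date, last_supply_end):
    """Classify NSAID exposure status relative to the index date."""
    if last_supply_end is None:
        return "non-user"            # no NSAID records prior to the index date
    days_since_end = (index_date - last_supply_end).days
    if days_since_end <= 0:
        return "current"             # supply ended at or after the index date
    if days_since_end <= 29:
        return "recent"              # supply ended 1-29 days before the index date
    return "past"                    # supply ended 30 or more days before the index date


index = date(1995, 6, 30)
statuses = [
    classify_nsaid_user(index, date(1995, 7, 5)),
    classify_nsaid_user(index, date(1995, 6, 30)),
    classify_nsaid_user(index, date(1995, 6, 1)),
    classify_nsaid_user(index, date(1995, 5, 31)),
    classify_nsaid_user(index, None),
]
```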
Patient observation periods were calculated in the CDM with the same algorithm used for the raw data study. Data are not included in the OMOP CDM by definition if they do not occur during the patient’s valid observation period. The condition occurrence, procedure occurrence and observation files were searched for conditions instead of the condition era file because the distinction between condition, procedure and observation can often be blurred in the Read dictionary. A procedural record example of this phenomenon is ‘Diab mellit insulin–glucose infus acute myocardial infarct’ (Read code 889A.00), which was used to identify a prior history of diabetes and AMI for the study exclusion criteria. BMI and smoking data were extracted from the observation file using LOINC BMI and smoking concepts.