Background

Historically, prospective randomized clinical trials have served as the “gold standard” for evidence generation in oncology. Given that only a small percentage of cancer patients take part in clinical research studies [1], there is increasing interest in leveraging the data contained in administrative and clinical databases for patients treated outside of clinical trials, as these data can provide guidance for treatment decisions. Such real-world data have the potential to be more representative of patients in routine practice, given that clinical trials tend to enroll highly selected patients who are younger and have fewer comorbidities. Furthermore, real-world data can supplement the results of prospective clinical trials in settings where accrual is difficult due to uncommon clinical or genomic selection criteria. Recently, the Twenty-first Century Cures Act [2] and United States Food and Drug Administration 2018 Goals [3] both highlighted the imperative to understand how real-world data can be optimally used to improve health.

The data contained in electronic health records (EHRs) afford an important opportunity to test hypotheses regarding patterns of care and outcomes in a broadly representative sample of cancer patients. EHR data are captured at the point of care during routine practice and may not require third-party primary data collection. However, the impact of real-world data is dependent upon the reliability of specific data elements, their completeness, and the ability to ensure and trace their provenance [4]. Thus, the promise of EHR data can only be realized if each data point is carefully assessed and validated before clinical conclusions are drawn.

As an example of the research application of EHR data, we sought to validate, in a real-world setting, recent clinical findings from a group of prospective clinical trials in patients with metastatic colorectal cancer (mCRC). Historically, the clinical development of systemic therapies for mCRC has not distinguished patients based on the location of the tumor within the bowel. However, recent analyses have suggested that the anatomical side of the colon from which a tumor arises is a prognostic and predictive indicator of survival [5,6,7,8,9]. These studies have indicated that CRCs arising from the left or right side of the colon differ significantly in their clinical characteristics and gene expression profiles [10,11,12,13], with right-sided tumors being associated with a worse prognosis [14,15,16]. Therapeutic outcomes may also differ by tumor side, with several analyses reporting differences in benefit with epidermal growth factor receptor and vascular endothelial growth factor antibodies in left- vs right-sided mCRC tumors [5, 6]. These findings led to a recent international expert panel recommendation that primary tumor location be included as an essential data element in the design and reporting of colon cancer clinical trials [17]. As an initial step in seeking to replicate these findings in a real-world population, we undertook a formal analysis of the ability to obtain information about tumor sidedness from billing codes (International Classification of Disease [ICD] 9/10) in EHRs. The overall goal of this study was to determine the feasibility of using structured diagnostic codes to determine tumor location for patients with mCRC. The formal validation approach described herein may be broadly applied to other clinical contexts where data points from EHRs are being considered for use in outcomes research.

Methods

Data source

This validation study was conducted using the nationwide Flatiron Health database, a longitudinal, demographically and geographically diverse database derived from de-identified EHR data. The Flatiron Health database includes data from over 265 cancer clinics, comprising both community and academic oncology practices and representing more than 2 million US cancer patients available for analysis. The de-identified patient-level data in the EHRs include structured data (e.g. billing codes, laboratory measurements, visits, and prescribed drugs) and unstructured data curated via technology-enabled chart abstraction from physicians' notes and other unstructured documents (e.g. physician progress notes, pathology reports).

Patient selection

From the broader Flatiron Health EHR-derived database, a cohort of mCRC patients was created. Patients were selected based on an ICD code for colon or rectal cancer (153.x, 154.x, C18x, C19x, C20x, or C21x), at least two clinic visits in the Flatiron network on or after January 1, 2013, and clinical documentation of mCRC. Patients lacking relevant unstructured documents in the Flatiron Health database for abstraction were excluded. Of 9403 patients with confirmed metastatic colon cancer, a random sample of 200 patients who met the above criteria was included in this study. The random sample was drawn using a random number generator with a specified seed so that the list of patients would be reproducible. As the current analysis focused on side of colon, patients with a confirmed diagnosis of metastatic rectal cancer were excluded from the validation study.
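The seeded sampling step can be sketched as follows. This is an illustrative Python sketch only; the seed value and data structures are hypothetical, as the study does not report its implementation.

```python
import random

def sample_cohort(patient_ids, n=200, seed=12345):
    """Draw a reproducible random sample of patient IDs.

    The seed (12345 here) is illustrative; the study used a fixed,
    unreported seed so the sampled list could be regenerated exactly.
    """
    rng = random.Random(seed)                # dedicated generator, no global state
    return rng.sample(sorted(patient_ids), n)  # sort first so input order is irrelevant

cohort = sample_cohort(range(9403), n=200)
# The same seed always yields the same 200 patients.
assert cohort == sample_cohort(range(9403), n=200)
```

Seeding a dedicated `random.Random` instance, rather than the module-level generator, keeps the sample reproducible even if other code consumes random numbers elsewhere.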

Identification of tumor location

ICD codes were compared with location identified through human abstraction of unstructured data to establish the quality of ICD-defined tumor location. For both ICD-defined and abstracted tumor location variables, tumors were classified as left side (splenic flexure, descending colon, sigmoid colon, rectosigmoid junction), right side (cecum, ascending colon, hepatic flexure), or transverse (transverse colon).

Identification of tumor location based upon structured data

Data captured in the Flatiron Health EHR-derived database include ICD, 9th and 10th revisions (ICD9 and ICD10; see Table 5 in Appendix) for diagnoses [18]. Whereas some codes can differentiate CRC tumor origin (i.e. ICD9 153.1/ICD10 C18.4: Malignant neoplasm of transverse colon, ICD9 153.7/ICD10 C18.5: Malignant neoplasm of splenic flexure), there is also an unspecified code (ICD9 153.9/ICD10 C18.9: Malignant neoplasm of the colon, unspecified site) that can be used by physicians.

ICD9/10 codes were available from the diagnosis table in the EHR database and were used to classify patients. The full list of codes and categories used is given in Table 5 in the Appendix. The ICD code dated closest to the initial diagnosis date was used to assign side of colon, with the following considerations: if a patient had multiple ICD codes indicating different sides on the same date, and this date was closest to the diagnosis date, the patient was categorized as having CRC in multiple sites of the colon. If one of the codes was an unspecified code, it was dropped and the specific code was used to classify the patient (e.g. "Left colon, Unspecified colon" became "Left colon"). For patients with no abstracted initial diagnosis date, the first relevant ICD code was selected.
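The classification rules above can be expressed as a short decision procedure. The following Python sketch is illustrative only: the `SIDE_BY_CODE` mapping is an abbreviated ICD-9 subset (the full list is in Table 5 of the Appendix), and the function and data shapes are hypothetical, not the study's actual code.

```python
from datetime import date

# Abbreviated, illustrative ICD-9 mapping; see Table 5 in the Appendix for the full list.
SIDE_BY_CODE = {
    "153.4": "right",        # cecum
    "153.6": "right",        # ascending colon
    "153.0": "right",        # hepatic flexure
    "153.1": "transverse",   # transverse colon
    "153.7": "left",         # splenic flexure
    "153.2": "left",         # descending colon
    "153.3": "left",         # sigmoid colon
    "153.9": "unspecified",  # malignant neoplasm of colon, unspecified site
}

def classify_side(coded_events, diagnosis_date=None):
    """Assign side of colon from a patient's (date, ICD code) pairs.

    Uses the code date closest to the initial diagnosis date; falls back
    to the earliest code when no abstracted diagnosis date exists.
    """
    if diagnosis_date is None:
        chosen_date = min(d for d, _ in coded_events)
    else:
        chosen_date = min((d for d, _ in coded_events),
                          key=lambda d: abs((d - diagnosis_date).days))
    sides = {SIDE_BY_CODE[c] for d, c in coded_events if d == chosen_date}
    # Drop "unspecified" when a specific code is present on the same date.
    if len(sides) > 1:
        sides.discard("unspecified")
    # Distinct specific sides on the same date -> multiple sites of the colon.
    if len(sides) > 1:
        return "multiple"
    return sides.pop()
```

For example, a patient coded 153.9 (unspecified) and 153.3 (sigmoid) on the same date would be classified as left-sided, per the rule that a specific code supersedes an unspecified one.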

Identification of tumor location based on chart abstraction

Centrally trained abstractors reviewed all relevant unstructured documents in each patient's EHR, including pathology reports, physician notes, and surgical notes, to identify evidence of the side of colon. To classify a patient, abstractors looked for terms such as "left colon" or "right colon," as well as the specific sites within the colon described in Table 5 in the Appendix.

Statistical methods

Patient characteristics were summarized using counts and percentages for categorical variables, and medians and interquartile ranges for continuous variables, for the full mCRC dataset (9403 patients) and the 200 randomly selected participants in our validation study. Concordance between structured ICD codes and abstracted diagnosis was determined via observed percent agreement and Cohen’s kappa coefficient (κ). The concordance analysis assumed no gold standard. Accuracy of ICD codes was determined by calculating the sensitivity, specificity, positive and negative predictive values, and corresponding 95% confidence intervals, using the abstracted data as the gold standard. “Unspecified colon side” in the unstructured data was treated as “No” for all of these analyses.
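The agreement and accuracy metrics can be computed from paired (ICD, abstracted) labels as sketched below. This is a minimal illustrative implementation; the function names and example data are hypothetical, and the study's actual analysis code is not reported.

```python
def cohen_kappa(pairs):
    """Cohen's kappa from (icd_label, abstracted_label) pairs:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(pairs)
    labels = {lab for pair in pairs for lab in pair}
    p_obs = sum(a == b for a, b in pairs) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    p_exp = sum(
        (sum(a == lab for a, _ in pairs) / n) *
        (sum(b == lab for _, b in pairs) / n)
        for lab in labels
    )
    return (p_obs - p_exp) / (1 - p_exp)

def accuracy_vs_gold(pairs, label):
    """Per-label sensitivity, specificity, PPV, and NPV of the ICD label,
    treating the abstracted label as the gold standard."""
    tp = sum(a == label and b == label for a, b in pairs)
    fp = sum(a == label and b != label for a, b in pairs)
    fn = sum(a != label and b == label for a, b in pairs)
    tn = sum(a != label and b != label for a, b in pairs)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

In this framing, each side of colon is evaluated one-vs-rest, which is why Table 4 reports a separate sensitivity/specificity row per tumor location.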

Results

Baseline characteristics for patients in this study (N = 200) were similar to those of patients in the full mCRC dataset for all variables examined (Table 1). Half of the validation study patients were male (50%); more than half were aged 65 or older (59%) and had stage IV mCRC at initial diagnosis (54%). An additional 28% had stage III CRC at initial diagnosis. Site-specific ICD codes were available for 5940 (63%) patients in the parent cohort (Table 2).

Table 1 Patient characteristics of full EDM registry patients and 200 randomly selected study patients
Table 2 Comparison of patient and clinical characteristics based on presence of specific ICD codes

When patients with unspecified ICD codes were excluded from the analysis, the distribution of side of colon using ICD9/10 codes was similar to the distribution observed using the abstracted tumor site. Of the 200 study patients, 50% had a left-sided tumor, 34% had a right-sided tumor, and 6% had a transverse tumor, based on abstracted data (Table 1 and Table 3). Approximately 4% (n = 8) of patients were considered to have rectal cancer based on ICD codes; however, through chart abstraction these patients had a confirmed diagnosis of colon cancer. Thus, this discrepancy represents misclassification of these patients based on ICD codes alone.

Table 3 Sampled patients with side identified by ICD code or by abstraction

When all 200 study patients were considered, concordance was moderate between the structured (ICD) data and the unstructured (abstracted) data, with an observed agreement of 0.58 (κ = 0.41). When patients who were classified as unspecified or rectal in the structured data were removed, the observed agreement was 0.84 (κ = 0.79). Seventy-six (38%) patients were classified as "unspecified" using ICD codes, and 63 of these (83%) had the side identified through abstraction. As shown in Table 4, specificity of structured data for tumor location was high, ranging from 92 to 98%. Sensitivity, negative predictive value, and positive predictive value were lower, ranging from 49–63%, 72–97%, and 64–92%, respectively. When patients with non-specific side of colon ICD codes were removed, sensitivity improved to ~ 80% for all tumor locations. Similar estimates were observed when stratified by stage at initial diagnosis (Stage I-III vs. Stage IV) (Additional file 1: Tables S1–S4).

Table 4 Accuracy of ICD codesa in sampled patients

In an effort to identify potential biases in the likelihood that ICD coding for tumor location was present, we compared the clinical characteristics of patients who had specific diagnosis codes with those who did not. There were no differences in age, stage, sex, or treatment distributions between these two cohorts (Table 2). A gradual increase in the use of specific ICD codes was observed over time, from 57% of patients diagnosed in 2011 to 74% of patients diagnosed in 2016. A higher proportion of non-specific ICD codes was seen at academic centers compared with community centers; however, the number of academic sites was small relative to community centers.

Discussion

This study demonstrates that billing codes are a highly reliable indicator of tumor location when a specific location code is entered in the EHR. For a sizable minority of mCRC patients, only non-specific colon cancer ICD codes are captured in the EHR; thus, structured data for these patients do not indicate the tumor's side of colon. In these cases, chart abstraction can increase the completeness of the tumor location variable. If studies are restricted to patients with specific ICD codes, minimal bias would likely be introduced, as patients with and without specific ICD codes were similar with respect to demographic and clinical characteristics.

A few limitations of this study exist. Although chart abstraction was considered the gold standard, it is subject to errors introduced by abstractors mis-reporting information or by inaccurate information being recorded in the unstructured parts of the EHR. However, chart abstraction is the accepted gold standard for validation studies of administrative claims and other databases, such as EHRs. Additionally, billing codes are collected for the purposes of reimbursement, not research. Thus, a bias may exist if reimbursement incentives differ based on tumor site. Furthermore, there may be variation in how billing codes are assigned and recorded at the centers in the Flatiron network; however, we did not observe any systematic differences based on centers, with the exception of a higher proportion of patients without specific codes being treated at academic centers. Further studies are needed to validate whether these results are representative of a wider range of data sources, including sources from outside of the US where billing coding practices may differ.

Our analysis demonstrates that ICD codes adequately characterize side of colon for use in studying outcomes for left- versus right-sided colon tumors following specific therapies. However, certain other research questions, e.g. characterizing very small populations such as BRAF-mutant mCRC patients by variables including primary tumor site, may require a side of colon variable with greater completeness of specific side of colon data. The high specificity of structured data suggests that, in some situations, augmentation of ICD codes with chart-abstracted data may be targeted to only those patients with non-specific CRC ICD codes. For other situations, such as creation of a matched cohort with tumor side as a covariate, abstracting tumor side for all patients in a cohort may be warranted to optimize the quality of the variable.

Conclusions

Overall, these analyses demonstrate the rigor necessary to characterize an EHR-based variable in terms of reliability and completeness, before engaging in formal testing of clinical hypotheses that could be practice-changing. Such methodological assessments are necessary before conducting large-scale research using variables generated from EHRs.