Key Points

Patient characteristics and treatment patterns (regimen choice, progression to subsequent therapy, and therapy duration) were comparable between the health insurance claims cohort and the electronic health record (EHR) cohort of patients with metastatic non-small-cell lung cancer (NSCLC) despite differences in data sources.

The findings suggest that real-world data (claims and EHRs) can reliably provide up-to-date insight into current treatment modalities in patients with NSCLC.

1 Introduction

Lung cancer is the second most common type of cancer for both men and women and has been a leading cause of cancer-related deaths for many years, both globally and in the USA, regardless of sex [1]. In the USA, new lung cancer cases in 2021 were estimated at 235,760, representing roughly 12.4% of all new cancer cases [2]. As the leading cause of cancer-related death in the USA, lung cancer represents 21.7% of all cancer deaths, with an estimated 5-year relative survival probability of 21.7% in the USA based on the 2011–2017 estimates. Over half of patients newly diagnosed with lung cancer present with metastases at the time of initial diagnosis [3, 4]. The 5-year survival probability of patients with metastatic lung cancer is estimated at 6.3% [2].

The prognosis and clinical management of lung cancer differs by histology and tumor characteristics. Broadly, lung cancer can be classified as non-small-cell lung cancer (NSCLC), which accounts for approximately 85% of all lung cancer cases, and small-cell lung cancer (SCLC), which accounts for the remaining 15% [5]. Within NSCLC, the most common histological type is adenocarcinoma, which accounts for roughly half of all NSCLC, followed by squamous cell carcinoma, representing approximately one-quarter of all NSCLC cases [3].

Improved understanding of the molecular pathogenesis of lung cancer has led to the development of specific targeted treatment, which has dramatically improved treatment outcomes for individuals presenting with specific oncogene mutations in their tumor [6, 7, 8]. NSCLC is heterogeneous in its molecular complexities, and multiple targetable oncogene mutations have been identified, for example, epidermal growth factor receptor mutations, present in approximately 15% of NSCLC adenocarcinomas [6]; anaplastic lymphoma kinase (ALK) gene rearrangements, present in 3–5% of patients with NSCLC [9]; and programmed death ligand-1 (PD-L1) with tumor proportion score ≥ 50%, present in approximately 23–28% of NSCLCs [10, 11].

With new treatment options as a contributor, age-adjusted death rates of patients with lung cancer have been steadily declining in recent years [2]. However, the prognosis for metastatic lung cancer remains poor. It is important to examine how new evidence from clinical trials translates into practice for patients with metastatic NSCLC.

As the NSCLC clinical landscape is changing rapidly, this study examined the utility of two different sets of real-world data (RWD) to provide insights into current clinical practice in the USA. This study used both health insurance claims data and US nationwide electronic health records (EHRs). In general, health insurance claims data are records of utilized healthcare services and products for reimbursement by health insurance; EHRs are patient care records from healthcare providers within a healthcare system. This study aimed to examine patient characteristics and treatment patterns among adult patients newly diagnosed with metastatic NSCLC through two RWD sources: health insurance claims data and US nationwide EHRs.

2 Methods

2.1 Study Design and Data Sources

This was a retrospective observational cohort study using the Optum Clinformatics® Data Mart (CDM) database and Optum EHRs. Optum CDM is a de-identified administrative health claims database from commercial and Medicare Advantage health plans, geographically representing members from all 50 states in the USA. The insurance claims data include member eligibility and all medical (outpatient, emergency department, inpatient) and pharmacy claims submitted for reimbursement on behalf of health plan members. Optum’s longitudinal EHR repository is derived from dozens of healthcare provider organizations in the USA, which include more than 700 hospitals and 7000 clinics. Optum® EHRs include demographics, medications prescribed and administered, immunizations, allergies, laboratory results (including microbiology), vital signs and other observable measurements, clinical and inpatient stay administrative data, and coded diagnoses and procedures. Optum used a generalized natural language processing (NLP) system to extract and organize concepts from free text into semistructured data fields. Optum’s NLP system was developed using vocabularies from the Unified Medical Language System, which includes multiple medical dictionaries. The performance of the NLP system is verified by a team of medical terminologists and clinicians via manual review of sample EHR notes. Both data sources contain de-identified and anonymized patient records. Data management follows statistical de-identification rules and customer data use agreements that adhere to Health Insurance Portability and Accountability Act requirements. This study was exempt from institutional review board approval as it was a secondary database analysis.

2.2 Study Cohorts

Because NSCLC-specific International Classification of Diseases, ninth or tenth edition (ICD-9, ICD-10) codes are lacking, we adapted a treatment-based algorithm published by Duh et al. [12] and validated by Turner et al. [13] to identify patients with NSCLC. All chemotherapeutic agents and regimens defined in this study follow the 2015 American Cancer Society and 2021 National Comprehensive Cancer Network guidelines. Patients with any record of SCLC-specific treatments were excluded. The remaining patients were considered to have NSCLC. In detail, patients who met the following criteria were included in this study: (1) lung cancer diagnosis (ICD-9: 162.xx; ICD-10: C34.xx) between 1 January 2017 and 31 December 2019 and metastasis diagnosis (ICD-9: 196.xx-198.xx; ICD-10: C77.xx-C79.xx) after lung cancer diagnosis; (2) no prior cancer/metastasis diagnosis (except non-melanoma skin cancer) (ICD-9: 140.xx-209.3x, except 173.xx; ICD-10: C00.xx-C96.xx, except C44.xx) before the earliest metastasis diagnosis; (3) at least one lung cancer treatment after the earliest metastasis diagnosis (the earliest treatment date was defined as the index date; see Table 6 [Appendix] for the lung cancer drugs included in the study, adapted from Turner et al. [13] with inclusion of newly approved drugs and relevant HCPCS codes); (4) continuous health plan enrollment or active EHR records for at least 6 months before and at least 1 month after the systemic treatment initiation (patients who were deceased within 1 month from systemic treatment initiation were kept in the cohorts); (5) age ≥ 18 years and known sex at the time of systemic treatment initiation; (6) no record of receiving SCLC-specific treatments (topotecan, cyclophosphamide, doxorubicin, vincristine, temozolomide, ifosfamide, bendamustine at any time, or platinum + topoisomerase inhibitor combination [cisplatin/carboplatin + etoposide/irinotecan] as the first-line regimen).
Patients were followed from the earliest systemic treatment date until a censoring event (i.e., end of continuous health plan enrollment or active EHR, death, or end of data availability [30 September 2020]). In the EHRs, NLP tables extracted from the clinical notes were further used to identify patients with metastatic/advanced NSCLC. For metastatic stage, any attribute terms including ‘metasta’ or ‘advanc’ were used. SCLC-specific terms were used to exclude patients with SCLC. Figure 1 provides further description of the study design.
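The diagnosis-code ranges in criterion (1) and the NLP staging rule can be illustrated as simple prefix and substring checks. The following is a minimal sketch only, not the study's implementation (the analyses were performed in SAS); the function names and the assumption that codes arrive as plain strings with a known ICD version are hypothetical.

```python
import re

def is_lung_cancer(code: str, icd_version: int) -> bool:
    """True for ICD-9 162.xx or ICD-10 C34.xx (criterion 1)."""
    code = code.strip().upper()
    return code.startswith("162") if icd_version == 9 else code.startswith("C34")

def is_metastasis(code: str, icd_version: int) -> bool:
    """True for ICD-9 196.xx-198.xx or ICD-10 C77.xx-C79.xx (criterion 1)."""
    prefix = code.strip().upper()[:3]
    if icd_version == 9:
        return prefix in {"196", "197", "198"}
    return prefix in {"C77", "C78", "C79"}

def nlp_term_suggests_metastatic(term: str) -> bool:
    """True if an NLP attribute term contains 'metasta' or 'advanc'.

    Case-insensitive matching is an assumption of this sketch.
    """
    return re.search(r"metasta|advanc", term, flags=re.IGNORECASE) is not None
```

Matching on word stems rather than full words lets a single rule cover variants such as "metastatic", "metastasis", and "advanced".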

Fig. 1 Study design

2.3 Study Variables

Demographic characteristics included patient age at the index date (18–54, 55–64, 65–74, and ≥ 75 years), sex (female, male), geographic region (midwest, northeast, south, and west), and Charlson Comorbidity Index (CCI) based on comorbidities [15, 16] during the 6 months before the index date. Because this cohort required a diagnosis of metastatic NSCLC, any malignancy-related CCI scoring was not included. Additional patient characteristics were explored within the EHR cohorts through NLP tables, such as tumor histology types and smoking status if such records were available. Within the claims data, when capturing treatments from prescription claims, prescriptions with American Hospital Formulary Service class other than antineoplastic agents were excluded. With EHRs, when capturing treatments from administered medication records, for medications available with multiple routes of administration, non-chemotherapeutic uses (e.g., non-intravenous bevacizumab) were excluded. If multiple prescribed medication records within the same drug class (e.g., ALK inhibitors) were captured within 28 days, the subsequent prescription records were examined to determine the appropriate regimen.

First line of therapy (LoT1) was defined as all systemic therapy a patient received during the first 28 days from the earliest treatment date. If a patient initiated a different drug or had a > 90-day gap after the LoT1, the initiation date of that new drug or any drug after the gap was defined as the initiation date of the second LoT (LoT2), and the LoT2 regimen was defined as all systemic therapy drugs the patient received within the 28-day window from the newly established initiation date. LoT duration was defined as number of days from the LoT initiation date to the last treatment date within that LoT + 28 days or to a censoring event.
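The line-of-therapy rules above can be sketched as a small routine. This is an illustrative reading of the definitions, not the study's code: the record format (date, drug name), the function name `assign_lines`, and the choice to treat the 28-day window as inclusive of day 28 are all assumptions, and duration is taken as the earlier of "last treatment date + 28 days" and the censoring date.

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # regimen-defining window and duration extension
GAP_DAYS = 90     # a longer treatment gap starts a new line

def assign_lines(records, censor_date):
    """Group (date, drug) treatment records into lines of therapy."""
    records = sorted(records)
    lines, i, n = [], 0, len(records)
    while i < n:
        start = records[i][0]
        window_end = start + timedelta(days=WINDOW_DAYS)
        # Regimen: every drug received within the window from line start.
        regimen = {drug for d, drug in records[i:] if d <= window_end}
        last, j = start, i
        while j < n:
            d, drug = records[j]
            # A drug outside the regimen or a > 90-day gap starts a new line.
            if drug not in regimen or (d - last).days > GAP_DAYS:
                break
            last = d
            j += 1
        end = min(last + timedelta(days=WINDOW_DAYS), censor_date)
        lines.append({
            "line": len(lines) + 1,
            "start": start,
            "regimen": sorted(regimen),
            "duration_days": (end - start).days,
        })
        i = j
    return lines

# Hypothetical patient: three chemotherapy cycles, then a switch.
example = [
    (date(2018, 1, 1), "carboplatin"), (date(2018, 1, 1), "paclitaxel"),
    (date(2018, 1, 22), "carboplatin"), (date(2018, 1, 22), "paclitaxel"),
    (date(2018, 2, 12), "carboplatin"), (date(2018, 2, 12), "paclitaxel"),
    (date(2018, 4, 2), "nivolumab"), (date(2018, 4, 16), "nivolumab"),
]
lots = assign_lines(example, censor_date=date(2018, 12, 31))
```

For this hypothetical patient, the routine yields carboplatin + paclitaxel as LoT1 (70 days) and nivolumab as LoT2 (42 days).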

2.4 Statistical Analysis

Descriptive statistics, including frequencies and percentages for categorical variables and mean ± standard deviation (SD) for continuous variables, were used to describe the characteristics of the patients. The top ten regimens (frequencies and percentages) for the first three LoTs were reported. Differences in characteristics and treatment regimens between the claims cohort and the EHR cohort were assessed using standardized difference scores [25]. Thresholds of 0.2, 0.5, and 0.8 suggested small, medium, and large effect sizes, respectively, between the two cohorts [25]. All statistical analyses were performed using SAS Studio version 3.6 (SAS Institute, Cary, NC, USA). Sankey diagrams were constructed to examine treatment pathways/trajectories using R Studio version 3.6.0 (R Studio, Boston, MA, USA).
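For readers unfamiliar with standardized differences, the sketch below shows one commonly used formulation for a continuous variable and for a binary proportion. It is an assumption that this matches the exact formula of [25]; multi-category variables such as regimen distributions require a multivariate extension that is not shown. The function names, the "below small" label, and the input values are hypothetical.

```python
from math import sqrt

def std_diff_means(mean1, sd1, mean2, sd2):
    """Difference in means divided by the pooled standard deviation."""
    return (mean1 - mean2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)

def std_diff_proportions(p1, p2):
    """Difference in proportions divided by the pooled binomial SD."""
    return (p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)

def effect_size_label(d):
    """Map |d| onto the 0.2 / 0.5 / 0.8 thresholds used in the text."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "below small"

# Hypothetical cohorts: mean ages 70 and 67, both with SD 10.
d_age = std_diff_means(70, 10, 67, 10)
label = effect_size_label(d_age)
```

Unlike a p value, this measure does not grow with sample size, which makes it suitable for comparing two large cohorts.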

3 Results

3.1 Patient Characteristics

Table 1 shows the cohort attrition for both the Optum claims and the Optum EHRs. The Optum claims cohort included a total of 7917 patients with NSCLC, 4010 (51%) of whom were female. Approximately 77% of the patients were aged ≥ 65 years, with a mean ± SD age of 70 ± 9 years at the time of systemic treatment initiation and a mean follow-up period of 373 days. The mean ± SD CCI score was 2.3 ± 1.9. The EHR cohort included 7087 patients, of whom 3549 (50%) were female. Approximately 58% of the patients were aged ≥ 65 years, with a mean ± SD age of 67 ± 10 years at the time of systemic treatment initiation and a mean follow-up period of 362 days. The mean ± SD CCI score was 1.6 ± 1.7. In both claims and EHR cohorts, the most represented geographical regions were midwest and south; in the claims cohort, 25% of the patients were from the midwest and 46% were from the south, whereas in the EHR cohort, 54% were from the midwest and 16% were from the south. Detailed patient characteristics of the two cohorts are presented in Table 2.

Table 1 Cohort attrition
Table 2 Patient characteristics in the claims cohort and the electronic health records cohort

Within the EHRs, further patient characteristics were explored through NLP-derived terms. Of the 7087 patients, 6132 (87%) had clinical notes available to generate NLP-derived terms. Approximately 67 and 16% of the EHR cohort presented with non-squamous and squamous NSCLC, respectively. Such information was not available for the other 16% of patients. At the time of treatment initiation, roughly 24% of the patients were current smokers, 63% had a history of smoking, and approximately 13% had never smoked.

3.2 Treatment Patterns

Platinum-containing (carboplatin/cisplatin) chemotherapies were the most common LoT1 type, with 3374 (43%) and 2662 (38%) patients in the claims and EHR cohorts, respectively, receiving these treatments. Mean LoT1 duration was 146 ± 162 and 147 ± 173 days, with a median duration of 91 and 84 days in the claims and EHR cohorts, respectively. In both claims and EHR cohorts, the five most common LoT1 were the same: carboplatin + paclitaxel, pembrolizumab, carboplatin + pemetrexed + pembrolizumab, carboplatin + pemetrexed, and nivolumab. The standardized difference of the LoT1 distribution between the two cohorts was 0.25. Detailed LoT1 distribution by cohorts is presented in Table 3.

Table 3 The ten most common first lines of therapy in the claims cohort and the electronic health records cohort

In the claims cohort, 2580 (33%) patients died after initiating their LoT1, and 2965 (37%) went on to their second line of therapy (LoT2), with a mean duration of 157 ± 157 days and a median of 98 days. In the EHR cohort, 2573 (36%) died after their LoT1, and 2229 (33%) went on to an LoT2, with a mean duration of 158 ± 158 days and a median of 99 days. The five most common LoT2 were also the same in both claims and EHR cohorts: durvalumab, nivolumab, pembrolizumab, carboplatin + pembrolizumab + pemetrexed, and carboplatin + pemetrexed. The standardized difference of LoT2 distribution between the two cohorts was 0.20. Immune checkpoint inhibitor monotherapy was the most common LoT2 type, with 1449 (49%) and 956 (43%) in the claims and EHR cohorts, respectively. Detailed LoT2 distribution by cohort is presented in Table 4.

Table 4 The ten most common second lines of therapy in the claims cohort and the electronic health records cohort

Subsequently, 837 of the 2965 patients (28%) and 580 of the 2229 patients (26%) went on to receive a third LoT (LoT3), with a mean and median treatment duration of 121 ± 124 and 78 days, and 138 ± 153 and 85 days in the claims and EHR cohorts, respectively. The five most common LoT3 were docetaxel, gemcitabine, nivolumab, pembrolizumab, and docetaxel + ramucirumab in both cohorts. In LoT3, monotherapy regimens were predominant. A new combination regimen (docetaxel + ramucirumab) was observed as one of the common LoT3. The standardized difference of the LoT3 distribution between the two cohorts was 0.38. See Table 5 for details.

Table 5 The ten most common third lines of therapy in the claims cohort and the electronic health records cohort

Figure 2 shows Sankey diagrams of metastatic NSCLC treatments by LoT. From both the Optum claims and the Optum EHR cohorts, the main pathways from LoT1 to PD-L1 agents as LoT2 were through platinum-based doublet chemotherapies: carboplatin/cisplatin + paclitaxel and carboplatin/cisplatin + pemetrexed.

Fig. 2 Sankey diagram of metastatic non-small-cell lung cancer treatments by line of therapy in (a) the claims cohort (N = 7917) and (b) the electronic health records cohort (N = 7087)

Over treatment initiation years, platinum-containing chemotherapy as LoT1 declined. From 2017 to 2019, its use was 52%, 46%, and 37% of LoT1 in the claims cohort and 46%, 39%, and 31% of LoT1 in the EHR cohort. Immune checkpoint inhibitor monotherapy (e.g., pembrolizumab, nivolumab) as LoT1 was relatively stable over index years, with 22 and 25% in the claims and EHR cohorts, respectively.

4 Discussion

This study explored two US nationwide healthcare utilization data sources to examine the current treatment landscapes for patients with metastatic NSCLC. The claims cohort was slightly older than the EHR cohort (mean age 70 vs. 67 years; standardized difference 0.35). Ages of both cohorts were within the range of the US lung cancer statistics. In NSCLC, treatment decisions are primarily guided by stage, tumor characteristics (e.g., histology, tumor molecular testing), and the patient’s performance status [17]. As mean and median ages of both cohorts aligned with general US statistics for NSCLC (mostly diagnosed at age 65–74 years, with median age at diagnosis of 71 years [2]), we did not consider that differences in age distribution would significantly affect treatment choices. A similar rationale was applied to region, another demographic variable that differed between the cohorts. The southern regions were overrepresented in the claims data, and the midwest regions were overrepresented in the EHR data when compared with US statistics [2].

The claims cohort was slightly sicker than the EHR cohort (mean CCI 2.3 vs. 1.6; standardized difference 0.36). Although the Optum EHRs are not limited to oncologists, they still may not include a complete comorbidity profile for patients with NSCLC as they are an open data source. However, both cohorts were comparable in sex, race, treatment initiation year, and follow-up period and showed similar treatment patterns in terms of choice of regimen, progression to subsequent therapy, and duration of therapy. This implies relatively uniform NSCLC treatment approaches regardless of age and region in the USA. However, the literature suggests that age may influence whether patients receive treatments [18, 19]. We discuss this aspect further later in this section.

Different data features between the two data sources should be noted. First, the claims data provided more comprehensive healthcare resource utilization information for each patient, whereas the EHR data provided more comprehensive clinical information available within the institution. EHRs provide the patient’s clinical records regardless of health insurance coverage. They also provide more detailed and comprehensive clinical information by capturing data from clinical notes and laboratory results.

In terms of treatment patterns, the claims data provide more definite treatment patterns, as a record of processed claims indicates the actual medication received. On the other hand, EHRs provide administered and prescribed medication records. Prescribed medication records do not provide any information on the patient’s actual drug acquisition and/or drug utilization. Prescription abandonment (i.e., patients not picking up prescribed medications at pharmacies) is a known issue in the USA [20]. This is a significant disadvantage of EHRs when examining treatment patterns. However, it may not have a significant impact on this study for two reasons. First, cancer drugs are a protected drug class for prescription drug benefits under the Medicare program. Given that the majority of the cohort was aged ≥ 65 years (77% in claims, 58% in EHRs), the eligible age for Medicare, we assumed that the drug coverage by health plans did not play a significant role in determining the treatment choices in this study. Second, most NSCLC treatments are administered via infusion. Within-hospital or clinic uses are well captured through administered medication records from EHRs. Additionally, the similar uptake of oral chemotherapeutic agents as LoT1 in the claims and EHR cohorts (e.g., osimertinib in 4% and 5%, respectively) further supported this position. The Sankey diagrams of treatment pathways in the two cohorts suggested similar regimen choices by line of therapy as well as similar general subsequent regimen trajectories. This suggests that general treatment approaches in NSCLC are relatively standardized in the USA, and new clinical evidence/guideline recommendations are rapidly adopted into practice.

One strength of our study is that we examined the most up-to-date treatment patterns in patients with NSCLC. In the USA, the first immune checkpoint inhibitor was approved for NSCLC treatment in late 2015, and some of the targeted therapy drugs were approved within the last 5 years. Newer treatment options not only change which regimens are commonly used but may also influence the overall characteristics of patients with metastatic NSCLC who initiate treatment.

Traditionally, platinum-containing chemotherapy (i.e., without combination with newer treatment options such as immune checkpoint inhibitors or targeted therapy) was a standard of care in patients with metastatic NSCLC. In this study, platinum-containing chemotherapy regimens were the most common LoT1. However, their use declined each year, as noted in Sect. 3.2. It is worth noting that LoT1 regimens consisting of platinum-containing chemotherapy in combination with newer treatment options increased between 2017 and 2019, with 13%, 20%, and 28% of LoT1 in the claims cohort and 9%, 18%, and 26% of LoT1 in the EHR cohort.

We explored the potential additional utility of EHRs. In the EHR cohort, NLP terms captured approximately 12% more patients with metastatic-stage disease than ICD codes alone. After patients receiving SCLC treatments were excluded, a further 3% of patients were excluded because NLP terms indicated they had SCLC. This may indicate that the NSCLC cohort algorithm adapted from the literature [13, 14] with updated treatments reliably captured the NSCLC cohort through RWD. After the cohort was defined, we explored smoking status and histology, information that is useful for NSCLC management and a potential advantage of EHRs.

We were unable to estimate the overlap of patients between the two study cohorts as we could only analyze de-identified patient data. According to Optum, about 26% of the patients with an ICD code for lung cancer in the EHRs can be linked with the Optum claims. In both cohorts, the criterion that excluded the most patients was the requirement for systemic treatment after the metastasis date, which removed roughly 50 and 70% of patients from the prior step in claims and EHRs, respectively. David et al. [17] found that 22–26% of patients with advanced/metastatic NSCLC received no treatment in 1998–2012. Kehl et al. [18] reported that approximately one-half of elderly patients (aged ≥ 66 years) received a systemic treatment within 1 year of the metastatic NSCLC diagnosis between 2012 and 2015. Although the estimates varied by study population, studies generally identified that certain patient demographics/characteristics were associated with not receiving any systemic treatment, including older age [18, 19], low socioeconomic status [18, 24], and poorer prognosis [19, 24]. Given that this study’s cohorts were patients with metastatic disease and therefore a worse prognosis, and considering the mean ages of the cohorts, it is conceivable that a relatively large proportion of patients did not receive treatments. Excluded patients would also have included patients who were lost to follow-up before treatment initiation because of discontinued health plan enrollment or no patient activity found in the EHR system. The larger number of patients excluded in the EHR cohort might be influenced by insurance status and the possibility of receiving treatment elsewhere after diagnosis. At the step that excluded SCLC-specific treatments in the attrition table, both cohorts lost roughly 18–20% of the eligible total of patients with metastatic lung cancer, which seems reasonable as roughly 15% of lung cancers are SCLC [5].

A few limitations need to be noted before interpreting the study findings. First, both data sources were primarily collected for non-research uses. Therefore, the databases may be subject to coding errors and missing values. Second, the claims cohort only included commercially insured and Medicare Advantage patients under one payer, and Medicare Advantage patients represent a smaller proportion of all Medicare enrollees. The EHR cohort may overrepresent the practice patterns of large healthcare systems as > 80% of the cohort patients were from an integrated delivery system. Third, the NLP tables in EHRs may be subject to missingness and potential extraction errors. Finally, as there were no ICD codes for NSCLC, and we used treatments as a proxy, this study only included patients with NSCLC receiving treatment.

However, the similarities in patient demographics and treatment patterns between the two cohorts suggest that this study can provide relevant insights into current clinical practices in NSCLC treatments despite the mentioned limitations. Further studies are needed to explore the comparison between claims and EHRs to understand treatment patterns in other therapeutic areas.

5 Conclusion

This study used two different RWD sources (health insurance claims and EHRs) to capture patients with metastatic NSCLC and their demographic and clinical characteristics and treatment patterns. In both cohorts, commonly used medication regimens were comparable across LoT1, LoT2, and LoT3. For LoT1, traditional chemotherapy was most common, followed by immune checkpoint inhibitor monotherapy (e.g., pembrolizumab, nivolumab). With subsequent lines of therapy, new treatment modalities were commonly adopted in patients with metastatic NSCLC, especially immune checkpoint inhibitor monotherapy. Despite differences in data features, the findings suggest that the two cohorts utilizing RWD provided up-to-date insight into the current treatment modalities in patients with metastatic NSCLC.