1 Introduction

Real-world data (RWD) represent data collected from diverse areas of routine clinical practice and are considered complementary to randomized controlled trials [1]. RWD originate from electronic health records (EHRs), health insurance claims data, registries (disease and product) and pragmatic clinical trials [2]. In Japan, most RWD studies are based on administrative databases, including insurance claims and diagnosis procedure combination (DPC) data. Research evaluating drug or treatment outcomes is not feasible with health insurance claims data or DPC data in Japan; such evaluations currently rely on chart review at a limited number of facilities and incur high costs. On the other hand, there is growing research interest in using EHR databases for generating RWD.

Simply put, EHRs are electronic clinical information systems used and maintained by healthcare systems to collect, store and present longitudinal electronic data collected during the delivery of healthcare [3,4,5]. EHR databases contain a wider range of variables recorded during medical examination than administrative databases. EHRs capture large arrays of progressive medical information from patients at different time points in the disease history and across different clinical care systems. EHRs contain various types of patient-level structured data, such as demographics, diagnoses, medications, vital signs and laboratory data [3]. Utilizing structured data for RWD collection is convenient; however, EHRs also contain considerable amounts of unstructured data, such as progress notes, which remain a significant challenge in using EHRs for RWD collection [6]. The unstructured data may contain key patient information absent from the structured data. Indeed, most clinical outcomes are unavailable from structured data but available from unstructured data [7]. To evaluate unstructured EHR data, researchers are required to manually review a patient’s chart or to use advanced technologies, such as artificial intelligence based on natural language processing (NLP). NLP has the potential to synthesize standardized text strings from unstructured notes containing unprotected health information. However, NLP methods are still evolving and are not widely accessible owing to the associated high costs and the operating expertise required.

Recently, a limited number of oncology studies in the USA have evaluated clinical outcomes, such as tumor response and disease progression, from unstructured radiology reports using EHR databases [8,9,10]. We aim to use the method developed here to evaluate the comparative effectiveness of treatments using EHR databases, which could generate evidence in a more timely manner than primary data collection [11, 12]. This is expected to support decisions on treatment selection in real-world settings. To achieve this objective, an NLP method is required to handle the routinely collected large-scale EHR data. Some studies have specifically demonstrated the use of NLP in assessing clinical outcomes in patients with cancer using EHRs [13, 14]. However, there have been challenges pertaining to the use of non-English languages, such as word segmentation and functional expressions in the Japanese language [15, 16]. We believe there is a dearth of data reporting oncology outcomes using unstructured Japanese-language data from EHRs. In this descriptive study, we aimed to generate initial methods for extracting clinical outcomes of response to anticancer drugs, such as tumor response and treatment discontinuation, in patients with cancer, utilizing the Japanese-language EHR of University of Miyazaki Hospital. The hospital is one of the major hospitals among the ~106 hospitals participating in the Japanese EHR under the “Millennial Medical Record Project” [17]. We also applied NLP to morphologically extract key terms describing the treatment response in unstructured Japanese text data, so that the NLP method allows researchers to generate evidence using a large-scale EHR database.

2 Methods

2.1 Study design and patient population

This was a retrospective study of patients with cancer using databases of medical records and administrative data of University of Miyazaki Hospital in Japan. The study included patients ≥ 18 years of age with a diagnosis of lung cancer or breast cancer who visited or were admitted to the University of Miyazaki Hospital between April 2018 and September 2020, who had received anticancer therapy, and who had radiology reports and progress notes during the treatment. We selected lung cancer patients with longer follow-ups for evaluating the treatment response. However, we also applied our method to breast cancer patients to conduct a preliminary examination of any remarkable differences in the use of key terms between lung and breast cancer.

Patients who requested the suspension of personal information use, those who were prescribed unapproved drugs, and those who were considered ineligible for inclusion by the principal investigator were excluded from the study.

2.2 Study outcomes

The main study outcomes were the line of treatment of anticancer drug therapy and the real-world response to treatment, classified as objective response (OR, i.e., complete or partial response), stable disease (SD) or progressive disease (PD). We also aimed to identify key terms used to describe treatment responses in the medical records.

Criteria for evaluating response to drug treatment, including treatment effectiveness and the decision to continue or discontinue treatment in clinical practice, were defined based on the basic principles of the RECIST criteria [18]. Adjudication of treatment response was performed by two evaluators (experienced physicians) using clinicians’ progress notes, radiology reports and pathological reports according to the following criteria of response to treatment: OR – any shrinkage of tumor compared to baseline observed in radiology images; PD – any progression compared to baseline or treatment discontinuation due to insufficient efficacy or intolerance; SD – the outcome when neither OR nor PD was observed (Online Resource, Supplementary Methods). Baseline was considered to be the start date ± 1 month of each treatment line. Radiology reports included CT, MRI and PET-CT scans, whereas simple X-ray interpretation results were included in progress notes.

Among all eligible patients, 15 patients with lung cancer who had longer follow-ups were selected for adjudication by two evaluators as a training set (Fig. 1). The two evaluators reviewed all the progress notes, radiology reports and pathological reports for the 15 patients, adjudicated treatment response and identified the texts that formed the grounds for the adjudication. From the texts identified by the evaluators, key terms for the treatment response were extracted and broken down/selected by morphological analysis and were then used for generating the NLP rules. These NLP rules were applied to 70 patients with lung cancer who were not assessed by the evaluators and to 30 patients with breast cancer.

Fig. 1

Flowchart for morphological analysis

Pts, patients; NLP, natural language processing; OR, objective response; SD, stable disease; PD, progression of disease

An algorithm was defined for identifying lines of treatment because medical records do not include this information. The initial set of drugs administered at the first dosing after diagnosis was regarded as first-line treatment. A set of drugs administered at the earliest subsequent dosing that differed from those at the first dosing was regarded as the second or later line. If the same drug was used in multiple treatment lines, the drug treatment records of the patient were individually reviewed. Lines of treatment generated by this algorithm for patients in the training set were also confirmed by two evaluators.
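The line-of-treatment rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's implementation: the function names, the input format (one entry per dosing date with the set of drugs given on that date) and the drug names in the example are all assumptions, and the sketch assumes that every change in the set of drugs starts a new line.

```python
from datetime import date


def assign_treatment_lines(dosings):
    """Assign line numbers from dosing records (hypothetical input format).

    dosings: list of (dosing_date, set_of_drug_names) tuples.
    A new line starts whenever the set of drugs differs from the
    set given at the previous dosing (assumption of this sketch).
    """
    lines = []
    current = None
    for day, drugs in sorted(dosings, key=lambda d: d[0]):
        regimen = frozenset(drugs)
        if regimen != current:
            lines.append({"line": len(lines) + 1, "start": day, "regimen": regimen})
            current = regimen
    return lines


def needs_manual_review(lines):
    """Flag patients in whom the same drug appears in more than one line."""
    seen = set()
    for entry in lines:
        if seen & entry["regimen"]:
            return True
        seen |= entry["regimen"]
    return False


# Made-up example records, not study data.
example = [
    (date(2019, 1, 10), {"carboplatin", "paclitaxel"}),
    (date(2019, 2, 1), {"carboplatin", "paclitaxel"}),
    (date(2019, 6, 3), {"docetaxel"}),
]
example_lines = assign_treatment_lines(example)
for entry in example_lines:
    print(entry["line"], entry["start"], sorted(entry["regimen"]))
print("manual review needed:", needs_manual_review(example_lines))
```

In practice, regimens whose component drugs are given on different days would need to be grouped before such a rule is applied; that grouping is outside the scope of this sketch.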

2.3 Data analysis

The anonymized patient information was analyzed within the intranet of University of Miyazaki Hospital. Drug treatments were summarized by line of treatment according to the algorithm described above. Best responses in each treatment line, as adjudicated by two evaluators, were summarized, and inter-evaluator agreement was evaluated by calculating the kappa coefficient. The data sources (progress notes, radiology reports and pathological reports) that contributed to the adjudication were also summarized. Sensitivity and specificity of key terms extracted from the text data by morphological analysis associated with each treatment response (OR, SD, PD) were calculated. The key terms for treatment response were extracted from the texts identified by the evaluators and were broken down into parts of speech, which were further classified into negative/positive context by morphological analysis. For the morphological analysis, Text Mining Studio software (NTT DATA Mathematical Systems Inc.) was used [19]. In addition to the software’s standard dictionary, a dictionary specialized for treatment responses in patients with cancer was created from Japanese medical terminologies provided by the Medical Information System Development Center, a general incorporated association [20], the Japanese translation of the RECIST Guideline version 1.1, and other sources [18, 21]. Frequency, proportion and the 95% confidence intervals of key terms were calculated for patients who were not assessed by evaluators. Missing values for each variable were summarized but were not counted for calculating summary statistics.
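As an illustration of the agreement statistic, the following sketch computes an unweighted Cohen's kappa for two evaluators' best-response labels. The text does not state whether a weighted or unweighted kappa was used, so the unweighted form is an assumption here; the function name and the example ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Made-up best-response labels for six treatment lines.
evaluator_1 = ["OR", "OR", "SD", "PD", "SD", "PD"]
evaluator_2 = ["OR", "SD", "SD", "PD", "SD", "OR"]
kappa = cohen_kappa(evaluator_1, evaluator_2)
print(round(kappa, 2))
```

For these made-up labels, observed agreement is 4/6 and chance agreement is 12/36, so kappa is 0.5.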

3 Results

3.1 Patient disposition and demographics

Between April 2018 and September 2020, of 83,894 patients with EHRs who were admitted to the University of Miyazaki Hospital, 1,930 patients were diagnosed with lung cancer or breast cancer. Among eligible patients, the pre-specified sample of 115 patients was selected as described in the methods. A total of 15 patients with lung cancer with longer follow-ups were selected for adjudication, and the other 100 patients were not assessed by evaluators (70 patients with lung cancer and 30 patients with breast cancer). Mean age of the patients was 67 years and the majority were females (67%). Since this study included patients with breast cancer, the percentage of female patients was higher in the overall cohort; however, the majority of patients with lung cancer assessed by the evaluators were males (10/15, 66.7%). Most patients (64/115, 55.7%) were treated for stage 3/4 cancer and 62.6% (72/115) were hospitalized at the time of the analysis due to primary cancer. Recurrence of primary cancer was seen in 1.7% (2/115) of patients; metastasis of primary cancer in 28.7% (33/115); and multiple primary cancer in 7.8% (9/115) of patients (Table 1).

Table 1 Patient demographics

3.2 Pharmacotherapy for lung cancer

Of 85 patients with lung cancer, all patients (100%) received first-line therapy and 27% of patients received second-line therapy according to the algorithm (Fig. 2a). Among the patients with breast cancer, 100% of patients received first-line therapy and 60% received second-line therapy. We could identify a single drug or multiple drug combinations for each treatment line. Carboplatin + paclitaxel, pembrolizumab + carboplatin + pemetrexed sodium, cisplatin + docetaxel, and pembrolizumab + carboplatin + paclitaxel were some of the drug combinations used as first-line agents in patients with lung cancer (Fig. 2b). Cyclophosphamide + doxorubicin and docetaxel + pertuzumab were the drug combinations used in first-line treatment for patients with breast cancer (Fig. 2c). Treatment lines generated by the algorithm were confirmed by two evaluators and were thus approved.

Fig. 2

Pharmacological treatment for patients with lung cancer (n = 85) and breast cancer (n = 30). (a) Patients receiving pharmacological treatment for lung cancer as per the line of treatment (n = 85). (b) Common drug regimen in first-line (blue bars), second-line (red bars) and third-line (black bars) treatment for patients with lung cancer (n = 85). (c) Common drug regimen in first-line (blue bars), second-line (red bars) and third-line (black bars) treatment for patients with breast cancer (n = 30). The drug or combination of drugs for which the frequency is > 3 is presented here. aGenetic combination. CPA, cyclophosphamide; FU, fluorouracil; BVZ, bevacizumab; CBP, carboplatin; PTXL, paclitaxel; ATZ, atezolizumab; PS, pemetrexed sodium

3.3 Adjudication results

Two evaluators adjudicated treatment response from 2,039 records in clinicians’ progress notes, 131 records in radiology reports and 60 records in pathological reports of 15 patients with lung cancer. Among these, 182 progress-note records and 47 radiology-report records were adjudicated for OR, SD, or PD; the other records were not related to treatment response (Online Resource, Supplementary Table 1). Among the 28 treatment lines used in the 15 patients, the best response to each treatment was identified. For the best response, clinicians’ progress notes were the most common primary source data for treatment response assessment (60.7%), followed by radiology reports (28.6%) (Fig. 3). A kappa coefficient of 0.59 was obtained for the agreement between the two evaluators in adjudicating the best response (Table 2).

Fig. 3

Source data for adjudication of the best effect for tumor assessment

Table 2 Inter-evaluator agreement in best effects of tumor assessment based on evaluator’s review for patients with lung cancer

3.4 Key terms extracted by evaluators for adjudication on response to treatment

Key terms were extracted by evaluators after adjudication of text data from 182 records in progress notes and 47 records in radiology reports. Among these, OR had 72 records, SD had 71 records and PD had 48 records in progress notes; the corresponding numbers in radiology reports were 26, 20, and 16. In total, 610 key terms were extracted from progress notes and 555 from radiology reports. Key terms were translated from Japanese to English. For treatment response of OR, “reduction/shrink” was the most common key term (69%) used to describe tumor response, with the highest sensitivity and specificity in the progress notes (Fig. 4a). However, it was less frequent in radiology reports (Online Resource, Supplementary Fig. 1a). “(No) remarkable change/(no) aggravation” was the most common key term (51–44%) used to describe stable disease in progress notes (Fig. 4b). Pertinent key terms used in radiology reports were “(no) lesion” and “(no) change” (Online Resource, Supplementary Fig. 1b). Key terms used to describe progressive disease were “(limited) effect” and “enlargement/grow” in progress notes (Fig. 4c). However, it was difficult to detect characteristic words in radiology reports (Online Resource, Supplementary Fig. 1). “Reduction/shrink”, “effect” and “improvement” were mostly used in a positive context in reporting tumor response in progress notes and radiology reports, whereas “remarkable change” was used in a negative context in progress notes and radiology reports (Table 3). Key terms identified in the 15 patients were also confirmed in progress notes and radiology reports in the larger cohort of patients with lung cancer not assessed by evaluators and also in patients with breast cancer (Online Resource, Supplementary Tables 2 and 3). No remarkable differences were found in the larger cohort of patients with lung cancer.
However, in the progress notes of patients with breast cancer, there was little use of lung-related terms such as “Infiltration shadow” and “Pleural effusion”, which were identified as key terms in patients with lung cancer. These terms were, however, used in the radiology reports of patients with breast cancer.
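The sensitivity and specificity of a key term for a given response category can be illustrated with a short calculation. This sketch assumes one plausible definition, which the text does not spell out: a record counts as a true positive when it was adjudicated to the target category and contains the term. The function name, the record format and the example records are hypothetical and are not study data.

```python
def term_sensitivity_specificity(records, term, target):
    """Sensitivity and specificity of `term` for response category `target`.

    records: list of (adjudicated_label, set_of_key_terms) tuples.
    """
    tp = fn = fp = tn = 0
    for label, terms in records:
        present = term in terms
        if label == target:
            if present:
                tp += 1  # target record that contains the term
            else:
                fn += 1  # target record that lacks the term
        else:
            if present:
                fp += 1  # non-target record that contains the term
            else:
                tn += 1  # non-target record that lacks the term
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up adjudicated records with English stand-ins for the key terms.
example_records = [
    ("OR", {"reduction"}),
    ("OR", {"reduction", "improvement"}),
    ("OR", {"effect"}),
    ("SD", {"no change"}),
    ("SD", {"reduction"}),
    ("PD", {"enlargement"}),
]
sens, spec = term_sensitivity_specificity(example_records, "reduction", "OR")
print(round(sens, 3), round(spec, 3))
```

In this made-up example, “reduction” appears in two of three OR records and in one of three non-OR records, giving a sensitivity and a specificity of 2/3 each.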

Fig. 4

Frequency, sensitivity and specificity of top 15 key terms for the adjudication extracted by the evaluators in progress notes (a) Treatment responses (n = 72 units: records) (b) Stable disease (n = 71 units: records) (c) Progression of disease (n = 48 units: records) (d) Sensitivity and specificity of key terms for treatment responses (e) Sensitivity and specificity of key terms for stable disease (f) Sensitivity and specificity of key terms for progression of disease

Table 3 Positive and negative context of key terms in progress notes and radiology reports

3.5 Mortality assessment

Throughout the study period, 3 deaths were recorded and all of them occurred in the hospital (2 deaths in patients with lung cancer and 1 in a patient with breast cancer) (Online Resource, Supplementary Table 4).

4 Discussion

Our study demonstrated that evaluation of treatment response to each line of treatment was feasible in patients with lung cancer using Japanese EHR data, such as progress notes and radiology reports written in Japanese. We identified key terms with high specificity and sensitivity for assessing treatment response, and we could also determine their use in positive/negative context.

In a study by Griffith et al., owing to missing data and lack of clarity in radiology reports, RECIST could not adequately assess progression of non-small cell lung cancer in EHR-derived data [8]. Our study aimed to assess response to drug treatment in clinical practice by taking into consideration the principles of the RECIST criteria, including the decision to continue or discontinue treatment. It was feasible to evaluate the treatment response using the criteria defined in this study, which were based on retrospective review of radiology reports and progress notes. This approach of assessing the real-world response is increasingly used for comparative effectiveness research on outcomes such as overall survival and progression-free survival, particularly in the US, but has not yet been used in Japan [9, 10, 14]. Our study provides the results of research evaluating such an approach using Japanese EHRs.

A limited number of studies have evaluated the feasibility of developing NLP tools for identifying textual sources for oncological outcomes in medical records regarding pharmacotherapy and tumor response [13, 14]. In these studies, NLP was employed to extract clinically relevant oncologic endpoints from unstructured EHR data [13, 14]. In this study, we utilized NLP tools specialized for the Japanese language for assessing response to drug treatment in patients with cancer. As in many other countries where non-English languages are used, applying NLP tools poses challenges, such as word segmentation and functional expressions in the Japanese language [15, 16]. This could be due to spelling variants and orthographical variants in the Japanese language. To tackle these challenges in our study, a dictionary specialized for treatment responses in patients with cancer was created on top of the standard dictionary, which led to high sensitivity and specificity of key terms related to treatment response. We also generated rules for specifying positive/negative expressions for key terms broken down by morphological analysis, which was required for assessing response.
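A rule for positive/negative context can be pictured with the following toy sketch. It is not the study's rule set: the study used Japanese morphological analysis with Text Mining Studio, whereas this sketch uses pre-tokenized English stand-ins, a hypothetical list of negation cues and a fixed look-back window, all of which are assumptions made for illustration.

```python
# Hypothetical negation cues; a real rule set would come from the
# morphological analysis of the Japanese text.
NEGATION_CUES = {"no", "not", "without"}


def term_polarity(tokens, key_term, window=2):
    """Classify the context of `key_term` as 'positive' or 'negative'.

    tokens: list of word tokens (already segmented).
    Returns None when the key term does not occur. Only the first
    occurrence is inspected, looking back `window` tokens for a cue.
    """
    lowered = [t.lower() for t in tokens]
    if key_term not in lowered:
        return None
    i = lowered.index(key_term)
    context = lowered[max(0, i - window):i]
    return "negative" if NEGATION_CUES & set(context) else "positive"


print(term_polarity(["No", "remarkable", "change", "in", "the", "tumor"], "change"))
print(term_polarity(["Tumor", "shows", "reduction"], "reduction"))
```

In Japanese, negation typically follows the word it modifies, so a rule of this kind would look at the tokens after the key term rather than before it.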

Our study also investigated an algorithm for identifying lines of treatment of anticancer drugs. Patients with cancer generally receive several treatment lines, and some of the regimens consist of multiple drugs with varying timing of drug administration. Information about treatment lines or treatment regimens is critical for ensuring drug effectiveness and safety in studies using large databases. Real-world treatment patterns are among the most frequently researched questions using administrative databases or EHRs. Despite the availability of prescription records/data, these databases have limited information regarding treatment line. However, some studies have reported methods to identify the treatment regimen using these databases, and the algorithm used for identifying them in our study is simple and feasible for large databases [22].

Our study has some inherent limitations. Because the evaluators adjudicated only a small number of patients, as the purpose of this study was to assess feasibility, we could not cross-validate the developed NLP rules. In addition, patients who had longer follow-up of anticancer drug therapy were selected for adjudication to evaluate diverse expressions for treatment responses of OR, SD, and PD, which might introduce selection bias; e.g., patients who have early treatment failure due to tumor flare or intolerance might not be represented among those 15 patients. Therefore, further studies are required to validate the methodologies developed in our study. One of the important limitations of the Japanese EHR is the lack of linkage among hospitals and clinics in the Japanese medical practice environment, resulting in the inability to analyze survival time/death outcomes. Moreover, treatment and imaging tests outside the hospital were not considered. Information about death and death date (confirmed date) is available in the city government database but could not be accessed for analysis purposes due to the personal information protection act. However, the “Act on Anonymized Medical Data That Are Meant to Contribute to Research and Development in the Medical Field” (Next Generation Medical Infrastructure Law) may address the current issue of data accessibility, and the aforementioned EHR “Millennial Medical Record Project” is the first organization certified under this law. Since the positive/negative context of key terms is important for adjudication of treatment response and for increasing the accuracy, we plan to investigate the utilization of AI/machine learning for the same. Transparency about the process of collecting the unstructured data for outcomes is also important [23]. Although the method involved double review by two independent reviewers to minimize errors in identifying key terms, AI might identify other key terms because of the large number of records to be reviewed.
The source of EHR data was limited to a single university hospital in this study. Future studies using EHRs from several other hospitals across Japan would add diversity to the patient population.

This is the first real-world EHR database study that assesses response to anticancer treatment using Japanese EHRs, mainly progress notes and radiology reports. Findings from our study may form the basis for future comparative efficacy research using real-world oncology outcomes in Japan. Within the limitations of our study, results showed that Japanese key terms for treatment response were identifiable by NLP. However, future studies using large-scale EHR databases from several other hospitals in Japan are required to substantiate our results.