Introduction

The growing use of electronic medical records has resulted in vast stores of clinical information around the world that represent valuable resources for research and for improving healthcare outcomes. However, the unstructured free-text format in which such electronic data are often stored poses significant challenges to expedient data retrieval. The inherent variability in the content of unstructured reports may result in the loss of information such as tumor progression status. Further, even if pertinent information is present in the report, the complexities of human language render such reports less amenable to simple automated data retrieval.

Natural language processing (NLP) is the field concerned with computational methods for processing human language. It has served as a principal method for information extraction (IE), which aims to convert information residing in natural language into a structured format. Advances in both NLP and IE have allowed rapid data retrieval from electronic databases with accuracy comparable to human experts.1,2 In the clinical domain, while NLP has proven utility3 in detecting the presence of disease from unstructured reports,4–15 it has not been evaluated as a tool for determining progression of disease.

The broad objective of this study was to determine if information regarding tumor progression could be accurately retrieved from unstructured follow-up magnetic resonance imaging (MRI) brain reports using NLP. Specifically, we first aimed to assess if the reports contained sufficient information for classification of tumor status. We next aimed to develop an NLP-based data extraction tool to detect changes in tumor status. Finally, we assessed if the NLP algorithm could retrieve information regarding tumor status from the unstructured reports with similar accuracy as an expert human interpreter.

Materials and Methods

Ethics Approval

The study protocol was approved by the Mayo Clinic Institutional Review Board.

Sample Selection

Consecutive MRI reports in the Mayo Clinic (Rochester, MN) radiology report database from 1 Jan 2000 up to 1 Jan 2008 were screened against the following inclusion criteria:

  1. Format: The report must be an unstructured free-text MRI brain examination report.

  2. Indication: The MRI examination must be done for brain tumor evaluation. For our study, a “brain tumor” was defined as any of the following: brain tumor, brain cancer, glioma, meningioma, glioblastoma, astrocytoma, ependymoma, oligodendroglioma, brain lymphoma, brain metastases, and pituitary tumor.

  3. Condition: The report must make reference to a suitable prior imaging study such as an earlier computed tomography or MRI brain examination.

MRI reports at our institution do not routinely have separate “observations/findings” and “impression/conclusion” sections. Instead, the reporting style is deliberately succinct, often consisting of key findings incorporated into an expanded impression.

Evaluating Information in Reports by Manual Classification

The selected reports were reviewed and annotated by a radiologist (LTC) according to consensus classification guidelines (Table 1) developed by the two authors (LTC and BJE) regarding disease status, magnitude of change, and the significance of change according to the classification scheme indicated in Figure 1. No additional clinical information, apart from data within the radiology report, was provided. Report annotation was performed using an open-source biomedical ontology editor (Protégé ver 3.2.1, Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, CA, USA) and a general-purpose text annotation plug-in tool (Knowtator ver 1.7.4, Center for Computational Pharmacology, University of Colorado Health Sciences Center, CO, USA). Ten percent of reports were randomly selected for blinded repeat annotation by the same radiologist 4 months after the initial annotation to evaluate intra-annotator agreement.

Fig. 1 Classification scheme for radiology reports.

Table 1 Study Consensus Guidelines for Manual Classification of Reports

Developing an NLP-Based Data Extraction Tool

The reports were stratified by tumor status and divided into training (70%) and testing (30%) sets. Stop words (e.g., “if,” “the,” “by”), which carry little lexical meaning, were removed. The retained content words underwent stemming with the Porter stemming algorithm16 to reduce inflected variants to their stems (e.g., conversion of the word “reduction” to “reduce”).
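This preprocessing step is not tied to a particular toolkit; the minimal sketch below assumes NLTK's stop word list and Porter stemmer, with a simple regular-expression tokenizer standing in for whatever tokenization was actually used.

```python
# Minimal preprocessing sketch (hypothetical; assumes NLTK with the "stopwords" corpus downloaded).
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))   # e.g., "if", "the", "by"
STEMMER = PorterStemmer()                      # Porter stemming algorithm

def preprocess(report_text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem the remaining content words."""
    tokens = re.findall(r"[a-z]+", report_text.lower())
    return [STEMMER.stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("There has been interval reduction in the size of the enhancing mass."))
```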

The NLP-based data extraction tool built for the task of discovering tumor status, magnitude of change, and significance of change combined statistical and rule-based methods (Fig. 2). The discovery of tumor status was cast as a classification task extending the support vector machines (SVMs) method,17 while that for magnitude and significance was approached as a pattern-matching task. A simplified illustration of how an example report would be processed and analyzed by the NLP-based data extraction tool is given in Figure 3.

Fig. 2 Development of NLP-based data extraction tool.

Fig. 3 Simplified illustration of processing and analysis of an example report by the NLP-based data extraction tool.

Discovering Tumor Status

A radiology report (document) could discuss multiple tumors and include additional information not directly related to tumor status. Therefore, the initial step in tumor status discovery was to identify narrative sections (hereafter referred to as topic discourse units) that contained information describing a single tumor.

Discourse Processing and Subdocument Creation

The most important clue for the identification of topic discourse units was the description of status change. Based on the manual annotation outputs, vocabulary lists for tumor status (e.g., phrases indicating progression) and the tumor status subject (e.g., “mass,” “abnormal signal”) were compiled. Each document was then split into several subdocuments based on occurrences of pairs of subject and status words from the two vocabulary lists, with each pair restricted to a maximum span of two adjacent sentences. Portions of the report containing no such pairs were divided evenly at sentence boundaries and attached to the nearest subdocument. If a document described several tumors, each tumor description formed a separate topic discourse unit.
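As a rough illustration of this segmentation step, the sketch below splits a report into subdocuments around subject–status pairs found within a two-sentence span; the vocabulary lists, sentence splitter, and attachment rule are simplified placeholders rather than the actual lists compiled from the annotations.

```python
# Illustrative subdocument segmentation (hypothetical vocabularies and sentence splitter).
import re

STATUS_WORDS = {"increased", "decreased", "progression", "regression", "stable", "larger", "smaller"}
SUBJECT_WORDS = {"mass", "lesion", "tumor", "abnormal signal", "enhancement"}

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

def make_subdocuments(report_text: str) -> list[str]:
    """One subdocument per subject-status pair found within two adjacent sentences;
    sentences without a pair are attached to the nearest anchor sentence."""
    sentences = split_sentences(report_text)
    anchors = []
    for i in range(len(sentences)):
        span = " ".join(sentences[i:i + 2]).lower()
        if any(w in span for w in SUBJECT_WORDS) and any(w in span for w in STATUS_WORDS):
            anchors.append(i)
    if not anchors:
        return [report_text]
    groups = {a: [] for a in anchors}
    for i, sentence in enumerate(sentences):
        nearest = min(anchors, key=lambda a: abs(a - i))
        groups[nearest].append(sentence)
    return [" ".join(groups[a]) for a in anchors]
```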

Feature Extraction

Two types of features were extracted from each discourse unit. The first was a bag-of-words feature, which reduced each document to a collection of words, disregarding grammar and word order. Each subdocument was represented as a bag of word stems in a vector space. The words in the bag were derived from sentences that had at least one tumor status or tumor status subject manually annotated. The resulting vector was a series of binary values, with 1 indicating the presence of a word stem in the subdocument and 0 its absence.
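A minimal sketch of this representation, with illustrative names, might look as follows; in practice the vocabulary would be restricted to stems from the manually annotated sentences.

```python
# Binary bag-of-words vectors over a stemmed vocabulary (illustrative sketch).
def build_vocabulary(training_subdocs: list[list[str]]) -> list[str]:
    """Collect the set of word stems observed in the training subdocuments."""
    return sorted({stem for subdoc in training_subdocs for stem in subdoc})

def to_binary_vector(subdoc_stems: list[str], vocabulary: list[str]) -> list[int]:
    """1 if the stem occurs anywhere in the subdocument, 0 otherwise."""
    present = set(subdoc_stems)
    return [1 if stem in present else 0 for stem in vocabulary]

vocab = build_vocabulary([["mass", "increas", "size"], ["lesion", "stabl"]])
print(to_binary_vector(["mass", "stabl"], vocab))   # [0, 0, 1, 0, 1]
```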

The second feature, negation, was extracted using the NegEx algorithm.18 Negation was common in the reports, and it was critical to distinguish between positive and negative mentions. For example, in the phrase “there was no significant growth,” “significant growth” is negated by the word “no.” The NegEx algorithm locates anchor words and scans a window on both sides of each anchor for negation markers; if a negation-stopping word occurs before the window is exhausted, scanning stops there. The tumor status words served as the anchors fed into the NegEx algorithm, with a window of six adjacent words. A tumor status was assigned a final value of negated if a negation word fell within the window and no intervening negation-stopping word was present.
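The sketch below illustrates the windowed negation check in simplified form; the trigger and stopper lists here are short placeholders, whereas the published NegEx algorithm uses much richer phrase lists.

```python
# Simplified NegEx-style negation check (hypothetical trigger/stopper lists).
NEGATION_WORDS = {"no", "not", "without", "absent"}
NEGATION_STOPPERS = {"but", "however", "although"}

def is_negated(tokens: list[str], anchor_index: int, window: int = 6) -> bool:
    """Scan up to `window` tokens on each side of the anchor (a tumor status word);
    scanning stops early if a negation-stopping word is encountered."""
    for direction in (-1, 1):
        for offset in range(1, window + 1):
            j = anchor_index + direction * offset
            if j < 0 or j >= len(tokens):
                break
            if tokens[j] in NEGATION_STOPPERS:
                break
            if tokens[j] in NEGATION_WORDS:
                return True
    return False

tokens = "there was no significant growth".split()
print(is_negated(tokens, tokens.index("growth")))   # True: "no" lies within the window
```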

SVM Training

SVMs17 are a supervised machine learning technique for data classification. Given training vectors $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ is the class label, SVMs locate a hyperplane $\mathbf{w} \cdot \mathbf{x} - b = 0$ that maximizes the separation between the two classes. We used SVMs to build a classifier to discover tumor status. The feature vectors were constructed as described above, and a four-way SVM classifier with categories for regression, stable, progression, and irrelevant was trained. The LIBSVM19 toolkit was used to extend SVM to multi-category classification and to enable probabilistic predictions.
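As a sketch of the training step, scikit-learn's SVC (which wraps LIBSVM) can stand in for the LIBSVM toolkit; the feature vectors and labels below are toy values, not study data.

```python
# Four-way probabilistic SVM classifier over subdocument feature vectors (toy example).
from sklearn.svm import SVC

# Toy binary feature vectors (bag-of-words plus negation flags) and subdocument labels.
X_train = [
    [1, 0, 1, 0], [1, 0, 1, 1], [1, 1, 1, 0],   # progression
    [0, 1, 0, 1], [0, 1, 0, 0], [0, 1, 1, 1],   # stable
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0],   # regression
    [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],   # irrelevant
]
y_train = (["progression"] * 3 + ["stable"] * 3 +
           ["regression"] * 3 + ["irrelevant"] * 3)

clf = SVC(kernel="linear", probability=True)    # probability=True enables class probabilities
clf.fit(X_train, y_train)

# Per-subdocument class probabilities, used later for document-level aggregation.
probs = clf.predict_proba([[1, 0, 1, 1]])[0]
print(dict(zip(clf.classes_, probs)))
```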

Final Tumor Status Assignment

Tumor statuses of subdocuments were assumed to be independent. The final document-level status probabilities were derived from both the subdocument-level probabilities (generated from the SVM toolkit) and the status assignment rules (Table 1). The probabilities were calculated as follows:

$$P(\text{irrelevant}) = \prod_{i \in \{\text{all subdocs}\}} P(i = \text{irrelevant})$$
(1)
$$\begin{aligned} P(\text{irrelevant} \cup \text{stable}) &= \prod_{i \in \{\text{all subdocs}\}} P(i = \text{irrelevant} \cup \text{stable}) \\ &= \prod_{i \in \{\text{all subdocs}\}} \left( P(i = \text{irrelevant}) + P(i = \text{stable}) \right) \end{aligned}$$
(2)
$$P(\text{irrelevant} \cup \text{stable} \cup \text{regression}) = \prod_{i \in \{\text{all subdocs}\}} \left( P(i = \text{irrelevant}) + P(i = \text{stable}) + P(i = \text{regression}) \right).$$
(3)

Therefore,

$$P(\text{stable}) = P(\text{irrelevant} \cup \text{stable}) - P(\text{irrelevant})$$
(4)
$$P(\text{regression}) = P(\text{irrelevant} \cup \text{stable} \cup \text{regression}) - P(\text{irrelevant} \cup \text{stable})$$
(5)
$$P(\text{progression}) = 1 - P(\text{irrelevant} \cup \text{stable} \cup \text{regression}).$$
(6)

The final document-level prediction was whichever of the stable, regression, and progression labels had the highest probability.
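A direct translation of Eqs. 1–6 into code might look like the following sketch, where each subdocument's probabilities (e.g., the output of predict_proba above) are combined into a single document-level label.

```python
# Document-level status aggregation following Eqs. 1-6 (sketch).
import math

def document_status(subdoc_probs: list[dict[str, float]]) -> str:
    """Each dict maps 'irrelevant', 'stable', 'regression', 'progression' to a probability."""
    p_irr = math.prod(p["irrelevant"] for p in subdoc_probs)                      # Eq. 1
    p_irr_stab = math.prod(p["irrelevant"] + p["stable"] for p in subdoc_probs)   # Eq. 2
    p_irr_stab_reg = math.prod(
        p["irrelevant"] + p["stable"] + p["regression"] for p in subdoc_probs)    # Eq. 3
    doc_probs = {
        "stable": p_irr_stab - p_irr,                                             # Eq. 4
        "regression": p_irr_stab_reg - p_irr_stab,                                # Eq. 5
        "progression": 1.0 - p_irr_stab_reg,                                      # Eq. 6
    }
    return max(doc_probs, key=doc_probs.get)

print(document_status([
    {"irrelevant": 0.1, "stable": 0.2, "regression": 0.1, "progression": 0.6},
    {"irrelevant": 0.6, "stable": 0.3, "regression": 0.05, "progression": 0.05},
]))   # "progression" in this example
```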

Discovering Magnitude and Significance

Unlike tumor status descriptions, magnitude and significance had deterministic indicator patterns. This meant that, apart from negation, there was little variation in classification values attributable to interaction between indicator patterns and other words in the same sentence. For example, the indicator pattern “compatible with” always indicated a probable value for significance, while “tumor cannot be excluded” always indicated an uncertain significance. Thus, these two classifiers were developed using pattern matching rather than the bag-of-words technique adopted for status classification. Each subdocument in a report was matched against a set of magnitude or significance indicators, taking word order into account, and assigned a subdocument label according to the matching indicator. Subdocuments lacking a magnitude indicator were assigned a default value of moderate. Subdocuments lacking a significance indicator were assigned the same label as the next subdocument containing one; if no significance indicator was present, an irrelevant significance was assigned.

For each possible configuration of subdocument tumor statuses in a report, the corresponding document-level magnitude/significance label was derived according to the classification guideline rules (Table 1), and the probability of that configuration was calculated. Document-level probabilities were then derived by summing over configurations with the same label assignment. For example, the probability of a report being assigned a final magnitude label of mild was computed as follows:

$$P(\text{mild}) = \sum_{i \in \{\text{all mild configs}\}} \; \prod_{j \in \{\text{all subdocs}\}} P(j).$$
(7)

The magnitude and significance probabilities were computed similarly, and the final prediction was the label with the highest probability, excluding irrelevant.
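A sketch of the configuration sum in Eq. 7 is given below; label_for_configuration is a hypothetical stand-in for the Table 1 rules that map a configuration of subdocument labels to a document-level label.

```python
# Document-level label probability via summation over subdocument configurations (Eq. 7 sketch).
from itertools import product

LABELS = ["irrelevant", "mild", "moderate", "marked"]   # illustrative magnitude labels

def document_label_probability(subdoc_probs, label_for_configuration, target_label):
    """Sum, over every configuration mapping to target_label, of the product of
    the corresponding subdocument probabilities."""
    total = 0.0
    for config in product(LABELS, repeat=len(subdoc_probs)):
        if label_for_configuration(config) != target_label:
            continue
        prob = 1.0
        for label, p in zip(config, subdoc_probs):
            prob *= p[label]
        total += prob
    return total

# Example: a toy rule that takes the most severe non-irrelevant label in the report.
severity = {"irrelevant": 0, "mild": 1, "moderate": 2, "marked": 3}
worst_label = lambda config: max(config, key=severity.get)
subdocs = [{"irrelevant": 0.5, "mild": 0.3, "moderate": 0.1, "marked": 0.1},
           {"irrelevant": 0.2, "mild": 0.6, "moderate": 0.1, "marked": 0.1}]
print(document_label_probability(subdocs, worst_label, "mild"))
```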

Comparing Human and NLP Classification Outcomes

The sensitivity, specificity, positive predictive value, and negative predictive value of the NLP classification outputs were calculated using the human expert classification as the gold standard. The statistical software used was JMP® 7.0 (SAS Institute Inc., Cary, NC, USA), which generated the main descriptive statistics, including kappa and Bowker values. Weighted kappa values were obtained using GraphPad QuickCalcs (GraphPad Software Inc., La Jolla, CA, USA). F-measures were calculated using Protégé (ver 3.2.1, Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, CA, USA).
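The same metrics can be reproduced with standard open-source tooling; the sketch below uses scikit-learn in place of JMP and GraphPad and operates on toy labels rather than study data.

```python
# Evaluation metrics sketch (toy labels; scikit-learn used in place of JMP/GraphPad).
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

y_true = ["stable", "progression", "stable", "regression", "progression", "stable"]
y_pred = ["stable", "progression", "progression", "regression", "stable", "stable"]

# One-vs-rest confusion matrix for a single category, e.g., "progression".
tn, fp, fn, tp = confusion_matrix(
    [label == "progression" for label in y_true],
    [label == "progression" for label in y_pred]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value

kappa = cohen_kappa_score(y_true, y_pred)            # agreement beyond chance
macro_f = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F-measures
micro_f = f1_score(y_true, y_pred, average="micro")  # F-measure pooled over all decisions
print(sensitivity, specificity, ppv, npv, kappa, macro_f, micro_f)
```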

Results

A total of 778 MRI brain reports for 238 patients met the inclusion criteria, with characteristics summarized in Table 2. The reports were prepared by 33 staff radiologists and had an average report length of 109 words (median 95; range 18–447). The number of reports per patient ranged from 1 to 22 (mean 3.3). Incidental findings were observed in almost half of the reports. These incidental findings were non-tumor-related observations such as sinusitis, vascular abnormalities, ischemic changes, normal variants, and incidental benign tumors unrelated to the neoplasm of interest.

Table 2 Characteristics of Reports in the Manual Classification and NLP Development Groups

Information in Unstructured Reports (Manual Classification)

Out of 778 reports, six (0.8%) were unclassifiable despite having suitable comparison scans mentioned in the report (Table 2). Though these reports contained tumor descriptions, it was not possible to discern from the report text whether the findings constituted progression, regression, or no change, even after review by two radiologists (LTC and BJE). One unclassifiable document reported residual postoperative changes that hindered determination of tumor status. The unclassifiable reports had a greater mean report length than the classifiable reports, but this difference was not statistically significant. There was also no significant difference in the prevalence of incidental findings or spelling errors between the classifiable and unclassifiable reports.

In the 772 reports that were classifiable, tumor status was stable in 432 (56.0%), progressed in 235 (30.4%), and regressed in 105 (13.6%; Table 3). Reports could use tumor size, a surrogate indicator, or a combination of both to describe tumor status. Surrogate indicators used were enhancement, signal change, new lesions, recurrent/residual tumor, or general statements on status. Overall, 557 (72.2%) reports utilized a surrogate indicator, while 242 (31.3%) used tumor size to describe status; this included 27 (3.5%) reports in which both size and surrogate indicators were used. The types of status indicators used varied between tumor status categories. In particular, the absence of recurrent/residual neoplasms and general statements about status were usually used to describe stability, while changes in status (regression or progression) tended to be reported through descriptions of size and mass effect. As expected, the mention of new lesions was also associated with progression.

Table 3 Summary of Indicators Used for Determination of Tumor Status

The magnitude of regression was classifiable in 60.0% (41.0% mild, 5.7% moderate, 13.3% marked) but unspecified in 40.0% (Fig. 4). Magnitude of progression was classifiable in 45.1% (34.9% mild, 3.8% moderate, 6.4% marked) but unspecified in 54.9%. Reports with status change contained variable degrees of significance (30.0% uncertain, 12.1% possible, 39.4% probable, and 18.5% unspecified). Reports without specific mention of magnitude or significance were subsequently assigned default values according to classification guidelines (Table 1).

Fig. 4 Outcomes of human annotation for classifiable reports.

The 10% of reports (77 reports) randomly selected for blinded repeat annotation by the same radiologist (LTC) yielded high weighted kappa values for all categories (Table 4), indicating a high degree of intra-annotator agreement. There was no significant asymmetry in the discordant classification outcomes (P > 0.80 for all categories).

Table 4 Agreement Results

Comparison of NLP and Human Classification Outcomes

Compared to human classification for the test group (231 reports), NLP performed best for classification of tumor status, with an overall mean sensitivity and specificity of 80.6% and 91.6%, respectively (Fig. 5 and Table 5). Within the status subcategories, the highest NLP sensitivity was seen for classification of stability, while the highest specificity was obtained for classification of regression. The receiver operating characteristic (ROC) curves for NLP tumor status determination gave area under the curve (AUC) values of at least 0.94 (Fig. 6). NLP performance metrics were lower for determination of magnitude and lowest for classification of significance. This trend was mirrored in the kappa values for agreement between NLP and human classification (Table 4). A similar pattern was observed for F-measures20 of NLP compared to human classification. Macro F-measure scores of 0.81, 0.77, and 0.69 were obtained for status, magnitude, and significance, respectively, while micro F-measure scores were 0.86, 0.82, and 0.72, respectively.

Fig. 5 Comparison of NLP and human classification outcomes for reports in the test set.

Fig. 6 Receiver operating characteristic curves for tumor status determination by NLP.

Table 5 Results of NLP Classification for All Categories

Subgroup analysis of NLP performance (Table 6) showed that for classification of tumor status, reports that were correctly classified were significantly shorter and more likely to contain general statements regarding tumor status. For magnitude of change, correctly classified reports had a significantly lower prevalence of surrogate status indicators (enhancement and statements about residual/recurrent neoplasms). For classification of significance, specific mention of tumor size was significantly associated with correct classification outcomes, while the use of surrogate status indicators (enhancement) had the opposite effect.

Table 6 Subgroup Analysis of Report Features Compared to NLP Classification Category Outcomes

Discussion

Our study demonstrates that the vast majority of unstructured radiology reports contain sufficient information to allow classification of tumor status by a human reader, although the linguistic indicators of tumor status varied significantly between reports. A novel NLP-based data extraction tool was developed and demonstrated utility in the classification of reports in terms of tumor status, change magnitude, and change significance. NLP classification outcomes had accuracy comparable to human expert classification, with the best performance seen for the classification of tumor status.

Completeness of Information in Unstructured Radiology Reports

Our findings support the many recognized limitations21 of unstructured free-text radiology reports generated today. Although our study was limited to a specific clinical domain, the determination of brain tumor status, change magnitude, and change significance proved challenging even for a radiologist interpreter. Reports contained typographical errors, varied in length and vocabulary, referred to multiple tumors, and included significant amounts of non-tumor information and incidental findings. We noted that variability in expression and interpretation was greater for the more subjective categories of magnitude and significance. For example, a lesion that is “slightly better appreciated” on a study could refer either to a real change (i.e., mild progression) or to an apparent difference attributable to technical differences between studies (i.e., stability). The challenges of communicating doubt and certainty in radiological reports have been described previously22 and were borne out by the lower kappa values for the magnitude and significance categories. However, notwithstanding these challenges, there was a reasonably high level of reproducibility for human classification, which suggests that this categorization process could be successfully automated.

The case for structured reports and improved terminology in radiology has been made since the 1920s.23–28 A structured report goes beyond standard headings (i.e., indication, findings, conclusion) to include consistent report content and a standard report language such as RadLex.29 It is our view that using structured reports for tumor follow-up studies would result in more consistent and complete radiology reports. For example, though tumor size is advocated as a key measure of status,30,31 the majority of reports in our study did not mention size. While the value of specific measurements is unclear for some tumors,32 standardizing which measurements are routinely included could help reduce confusion and improve communication. Compulsory data fields (including tumor size) customized to the clinical question would help prevent the inadvertent exclusion of such important clinical information, reduce the ambiguity of the terms used, and decrease variability in their interpretation. Furthermore, structured reports with standardized lexicons would be machine-readable, enabling decision support, semantic search, quality control, and rapid data mining to be more easily incorporated into the daily practice of radiology.

Utility of NLP for Tumor Status Determination from Unstructured Reports

Until structured radiology reports are in widespread use, NLP remains a promising option for rapid data retrieval from radiology report databases. A robust NLP classification tool could facilitate research by rapidly identifying specific patient or disease subgroups based on radiological findings. For example, all patients with changes in tumor status could be easily identified and studied for factors that contributed to the status change. In addition, “stable tumors” could be reviewed for changes too subtle for human detection. Such information can be used to improve automated decision support tools such as computer-assisted detection and diagnosis systems. NLP tools could also “screen” reports prior to finalization, prompting radiologists about important findings that may have been inadvertently left out. However, in view of the complexity and variability of language used in unstructured radiology reports, coupled with the existing error rate of our NLP tool, we feel that use should preferably be restricted to research purposes at the present time. The small but inherent misclassification rate could skew the characteristics of any subpopulation of patients retrieved by the NLP tool according to tumor status. Therefore, these limitations and the causes of the errors should be further evaluated before use in a clinical scenario.

Our study showed that NLP was able to classify unstructured free-text neuroradiology reports according to tumor status with good accuracy. The NLP performance metrics were highest for tumor status, intermediate for magnitude, and lowest for significance. This was likely due to the increasing difficulty of determining magnitude and significance, which was also apparent during human classification. Though our NLP tool achieved high sensitivity for classification of stable tumors, we feel that further improvements are required before actual use in a research environment. It is our view that a usable screening tool should have both sensitivity and specificity exceeding 95%. Based on the ROC curve for classification of stable tumors, the current algorithm can only partially meet such criteria with either the combination of 95.4% sensitivity and 72.6% specificity or 79.1% sensitivity and 95.1% specificity.

Error analysis showed that a shorter report length and the presence of a general statement regarding tumor status were significantly associated with correct NLP status classification. This may be due to a reduction of irrelevant information in short reports that could otherwise negatively influence the final classification outcome. A general statement on status was also more likely to be detected by the NLP algorithm. For magnitude and significance classification, the NLP tool performed more poorly on reports that used surrogate markers of status other than size. This could be explained by the more variable vocabulary used to describe changes in surrogate features compared with the more objective descriptions used when reporting changes in size. It is noteworthy that spelling errors, fusion words, and incidental findings were not significantly associated with erroneous NLP classifications. This lends support to the utility of NLP for evaluating free-text medical reports, especially since it may not always be feasible to correct such errors prior to NLP analysis.

Sensitivity of NLP classification was lowest for tumor regression, marked magnitude, and uncertain significance categories. Several factors may have contributed to this. Firstly, fewer reports were available for these categories, resulting in smaller training sets for the NLP algorithm. Secondly, it is possible that greater variability existed in how significance (including uncertainty) was expressed, as suggested by the lowest intra-annotator agreement levels obtained for classification of significance. Thirdly, uncertainty had the lowest priority amongst the significance categories in the classification scheme, meaning that a concurrent detection of uncertain and another significance value (including the default “possible” value) would always be resolved to the other value. Finally, when no explicit marker of significance was available, the discovery of uncertain significance was tagged to a mild magnitude. Therefore, any error in the discovery of mild magnitude would lead to downstream errors in the discovery of uncertain significance.

Several areas for potential improvement of the NLP algorithm were identified during the study.

For delineation of subdocuments in each report, our method relied on the existence of status–subject pairs identified using vocabulary lists, with boundaries demarcated based on proximity. Because subject–status pairs were prevalent, each report tended to be split into many short fragments. Some of these subdocuments described change magnitude or change significance rather than tumor status. This resulted in the creation of irrelevant subdocuments that negatively influenced the subsequent probability contributions from relevant subdocuments. The lower performance of the magnitude and significance classifiers could be partly attributed to this problem. Better identification of relevant subdocuments would limit the impact of irrelevant subdocuments on the final probability derivation and classification. Enhanced parsing, temporal reasoning,33 and co-reference resolution are other areas where improvements in subdocument delineation could be achieved. These tools would enable NLP algorithms to better handle reports with multiple tumors, comparisons with more than one prior study, and lengthy reports with co-references that span multiple sentences.

Temporal reasoning remains a key challenge for automated medical data mining from medical reports.33 The value of medical information is dependent on its temporal context. For example, tumor size in a report takes on greater significance when compared to a previously recorded size. This rate of change in size allows additional inferences regarding tumor aggressiveness or treatment efficacy to be made. However, automated temporal reasoning of unstructured radiology reports presents many challenges. For example, even the apparently simple definition of “time of study” can be difficult. The options include scan acquisition time, scan completion time, exam interpretation time, and report finalization time. Furthermore, there is no uniformly accepted format of representing day, month, and year in medical reports. Even if the format is standardized, time zone differences add an additional challenge, especially in the age of teleradiology where reports may be generated in a different time zone from where the examination was performed. Beyond definitions and representations, radiology reports may also make references to multiple prior studies without mentioning specific dates. This poses further challenges for automated temporal relationship discovery as a tumor may have both progressed and regressed when compared to different prior scans. The difficulty of automated temporal reasoning in radiology reports is also increased by temporal relations which rely on implicit event ordering with indicators such as “as compared to the scan taken last week” or “since admission” where no clear reference time point is given.

Our study has several limitations. Though SVMs have been successfully applied to text classification problems,34 limitations exist for unbalanced datasets in which class sizes differ significantly. Such unbalanced datasets are common in the medical domain, where the important positive instance of a disease or outcome may be rare. In such cases, SVMs may favor the majority (negative) class and will still correctly classify most of the dataset even if the hyperplane is pushed towards the minority (positive) class. This is undesirable because false negative classifications of positive instances are less tolerable than false positive classifications. In our study, we addressed this limitation by searching within a range of values to obtain empirically optimal parameters for the SVM hyperplane. Other methods have been suggested to reduce such errors.35–37 Our findings are also limited to the subpopulation of patients with brain tumors who have follow-up MRI studies. As imaging features and terminology vary between tumor types and imaging modalities, our algorithm may not be applicable to different patient populations. The classification scheme used was formulated specifically for the study and had subjective components. While efforts were made to align it with existing tumor status classification schemes,30,31 the lack of uniformity across reports made complete alignment impossible.
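One common way to carry out such a parameter search, sketched here with scikit-learn on toy data (the actual search ranges and selection criterion used in the study are not reproduced), is a grid search over the margin penalty and class weighting:

```python
# Hypothetical SVM parameter search for an unbalanced dataset (toy illustration).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-dimensional feature vectors, three examples per class.
X = [[1.0, 0.0], [0.9, 0.2], [1.1, 0.1],
     [0.0, 1.0], [0.2, 0.9], [0.1, 1.1]]
y = ["progression"] * 3 + ["stable"] * 3

param_grid = {
    "C": [0.1, 1, 10, 100],               # margin softness
    "class_weight": [None, "balanced"],   # up-weight minority classes if needed
}
search = GridSearchCV(SVC(kernel="linear"), param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_)
```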

Conclusion

Unstructured free-text radiology reports of tumor follow-up MRI brain examinations mostly contained sufficient information for determination of tumor status, though the features used to describe status varied significantly between reports. Almost 1% of reports studied could not be manually classified despite specific reference to prior exams in the report, over two thirds did not specify tumor size, and almost half did not report magnitude of change when either progression or regression was detected. We successfully developed an NLP-based data extraction tool using existing software that was able to determine tumor status from unstructured MRI brain reports with accuracy comparable to a human expert. Our findings show promise for novel application of NLP algorithms in ascertaining disease status from unstructured radiology reports.