Introduction

Since 2015, the federal government has required most healthcare providers in the United States to use electronic medical records (EMR) or face penalties to their Medicaid and Medicare reimbursement levels [1]. Incentives were also provided to create and use electronic health records (EHR), which are more comprehensive, although the two terms are frequently used interchangeably. Many healthcare providers transitioned to EHR well in advance of the deadline, and consequently, many hospitals now have years of patient records stored in a form that is relatively easy to search. In the era of big data, researchers quickly saw the potential of EHR to advance their work in ways that were previously challenging [2,3,4,5,6]. For example, retrospective cohorts can be assembled relatively easily from exposures captured in patient records to study the impact of factors such as smoking or BMI on outcomes for diseases that are commonly treated at hospitals, such as cancer [5]. EHR have also greatly simplified the process of recruiting subjects for prospective studies and clinical trials [6]. However, the EHR was created for healthcare, not research, and the structure of the data leads to challenges in the research setting.

The biggest challenge in the use of EHR comes from their extensive reliance on unstructured data. Healthcare providers typically use freeform notes to capture important information when interacting with patients. Certain types of reports, such as pathology reports, are also entered as free text. Often, some of the most important information is stored in these fields, such as patient descriptions of symptoms or clinicians' observations of relevant signs of disease. Given the volume of data generated by hospitals and other healthcare facilities, collecting data from free text manually is an arduous process. Yet, particularly for clinical trials, this is currently the only available option at many institutions. Given the importance of clinical trials to drug development for a range of conditions, there is a critical need for methods that can automate and improve the efficiency of this process. This has led many researchers to propose applying natural language processing (NLP) techniques to these data [7,8,9,10,11,12]. NLP is not a specific method but rather a collection of approaches for extracting information from language as it is naturally spoken or written. Increasingly, NLP efforts have focused on EHR to enable researchers to capitalize on the valuable information they contain.

Many NLP approaches consist of two main phases: data structuring and machine learning. In the data structuring phase, an algorithm attempts to impose some sort of regular structure on the data. In the machine learning phase, the newly structured data are used to learn some characteristic of the data (e.g. hormone receptor status for breast cancer patients). The challenge of the data structuring phase is to impose structure in a way that allows meaningful information to be extracted. The arrangement of natural language typically carries a great deal of information, but that arrangement also varies considerably from instance to instance. It is therefore difficult to reduce natural language to a structure that works well in most instances.

The field of NLP, as applied to EHR, is still developing, and there is a need for methods designed specifically to capitalize on the context of EHR. Our hypothesis is that methods designed to exploit the nature of free text in the EHR should outperform more general approaches. We note that while some fields may be unstructured, healthcare providers often state things in similar ways; for example, estrogen receptor (ER) status may be given as any of the following: “estrogen receptor positive”, “ER+”, “positive for ER”, “positive staining for ER”, “ER: 30%”, and still other variations. Nevertheless, the vocabulary around ER status is relatively constrained. Additionally, the order of text in a medical record is frequently important, as multiple results might appear near each other in the text (e.g. “positive for ER, negative for PR”). As part of our own work to support research at the University of Kansas Cancer Center using the EHR [13], we are investigating methods for using NLP to extract information from free text fields. In this article, we propose an approach we call Relevant Word Order Vectorization (RWOV). RWOV examines a text for its association with some target class (e.g. ER status). It identifies the terms that most commonly occur in relation to this target and then scores a text by the relative position of these most relevant terms with respect to the target. This differs from other methods in that it considers only the relative order of specific, presumably predictive, terms with respect to a target term, allowing extreme flexibility in the actual distance between those terms.

Methods

All work was performed in accordance with relevant guidelines and regulations under a protocol approved by the University of Kansas Medical Center Institutional Review Board, which included a waiver of informed consent.

Relevant word order vectorization (RWOV)

In this work, we focus on the data structuring part of the NLP problem; thus, we pair our vectorization-based approach with two different machine learning algorithms to compare its performance to that of existing methods. The basis of our proposed method is to determine the structure of the most relevant words as they relate to predicting the class of a text. We therefore call this approach Relevant Word Order Vectorization (RWOV). Pseudocode for the basic approach is shown in Algorithm 1.

Algorithm 1: RWOV pseudocode (presented as a figure).

In Algorithm 1, tokenize refers to breaking the input apart, first into sentences, then into words or symbols. Excluded words refers to common “stop words”, which for our purposes were the following: the, are, of, as, is, and, or, report, pathology. These were deemed not useful for our purposes, although this list could almost certainly be extended and refined. The variable m refers to the number of observations (i.e. pathology reports about individuals).

The idea behind RWOV is simple. RWOV is focused on predicting the class of a term of interest (TOI) from a block of text. Although EHR data are unstructured, healthcare professionals use a relatively concise vocabulary when describing patient characteristics. Therefore, the same terms should occur repeatedly across patient medical records. Furthermore, we propose that only a fraction of these terms carry meaning in relation to a particular term of interest, and we believe that the relative order of these “most relevant words” is important to the meaning of the text. RWOV creates a matrix in which each row represents a subject and each column a word. The most relevant words are those that co-occur most frequently with the TOI; we call these the top words. The value in each cell of the matrix is either 0 or 1/(d+1), where d is the number of top words that occur between the top word represented by the column and the TOI. The sign of the value indicates whether the top word occurs before or after the TOI in the text. The magnitude in each cell therefore falls away naturally in a nonlinear fashion from 1 (as close as possible to the TOI) toward 0 (does not occur in the block of text).

$$T(i,j)=\begin{cases}0, & \text{if term } j \text{ does not occur in block } i\\ \pm\frac{1}{d+1}, & \text{otherwise, where } d \text{ is the number of most relevant terms between term } j \text{ and the TOI,}\\ & \text{and the sign indicates whether term } j \text{ occurs before or after the TOI}\end{cases}$$
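To make the scoring concrete, the following is a minimal Python sketch of the vectorization step. It is an illustration, not the released implementation: names such as tokenize, top_k_words, and rwov_matrix are our own, sentence segmentation is omitted for brevity, and where a top word occurs more than once we keep the occurrence closest to the TOI, which is a simplifying assumption.

```python
import re
from collections import Counter

import numpy as np

STOP_WORDS = {"the", "are", "of", "as", "is", "and", "or", "report", "pathology"}

def tokenize(text):
    """Split a block of text into lowercase word/symbol tokens,
    dropping the excluded stop words (sentence segmentation omitted)."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def top_k_words(blocks, toi, k=30):
    """Select the k words that co-occur most frequently with the TOI."""
    counts = Counter()
    for block in blocks:
        tokens = tokenize(block)
        if toi in tokens:
            counts.update(t for t in tokens if t != toi)
    return [w for w, _ in counts.most_common(k)]

def rwov_matrix(blocks, toi, top_words):
    """Build the m x n matrix T(i, j) defined above: +/- 1/(d + 1), where d
    is the number of top words between top word j and the TOI (the sign
    encodes before/after), or 0 if top word j does not occur in block i."""
    col = {w: j for j, w in enumerate(top_words)}
    T = np.zeros((len(blocks), len(top_words)))
    for i, block in enumerate(blocks):
        tokens = tokenize(block)
        if toi not in tokens:
            continue
        t_pos = tokens.index(toi)  # first occurrence of the TOI (simplification)
        # Positions of all top-word occurrences, in text order.
        occ = [k for k, t in enumerate(tokens) if t in col]
        for k in occ:
            # d = number of top words strictly between this occurrence and the TOI
            d = sum(1 for k2 in occ if min(k, t_pos) < k2 < max(k, t_pos))
            score = 1.0 / (d + 1)
            if score > abs(T[i, col[tokens[k]]]):  # keep the occurrence closest to the TOI
                T[i, col[tokens[k]]] = score if k > t_pos else -score
    return T
```

For example, for the block “positive for er, negative for pr” with TOI “er” and top words ["positive", "negative", "pr"], this sketch scores “positive” as -1 (immediately before the TOI, since “for” is not a top word), “negative” as +1, and “pr” as +0.5 (one top word, “negative”, lies between it and the TOI).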

Data

For this study, we used a straightforward collection of data to evaluate the performance of our NLP approach against several existing approaches. We included three datasets, one for each of the terms of interest used in this study, containing tumor pathology reports of women with breast cancer who sought treatment at the University of Kansas Medical Center over two recent years. Our goal is to identify the status of three important breast cancer biomarkers from the pathology report free text: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). In order to keep the results more interpretable, we limited the datasets to those reports that included a determination of hormone receptor status. The numbers of positive and negative subjects for each hormone receptor are shown in Table 1.

Table 1 Counts of subjects by hormone receptor status.

The data were labelled by two annotators who were trained in reading pathology reports by the clinical research coordinators who normally perform this task. Any discrepancies in annotation were then resolved manually (there were no disagreements due to interpretation).

With respect to Algorithm 1, the specific terms of interest (TOI) used in this study were “er”, “pr”, and “her2”. However, we allow the user to set aliases for each TOI. In this case the aliases were, for “er”: “estrogen” and “estrogen receptor”; for “pr”: “progesterone receptor”; and for “her2”: “her-2” and “her/”.
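A simple way to implement such aliasing is to normalize each alias to the canonical TOI before tokenization. The sketch below, with an illustrative normalize_aliases helper of our own naming, replaces longer aliases first so that, for example, “estrogen receptor” is not partially matched as “estrogen”:

```python
ALIASES = {
    "er": ["estrogen receptor", "estrogen"],
    "pr": ["progesterone receptor"],
    "her2": ["her-2", "her/"],
}

def normalize_aliases(text, toi):
    """Rewrite each alias of the TOI to its canonical form, longest first.
    Plain substring replacement is a simplification; the released code may
    handle word boundaries differently."""
    out = text.lower()
    for alias in sorted(ALIASES[toi], key=len, reverse=True):
        out = out.replace(alias, toi)
    return out

print(normalize_aliases("Estrogen receptor: positive (90%)", "er"))
# -> "er: positive (90%)"
```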

Note that in most cases, numerous biomarkers were listed in a single block of text. For example, a report might list the following: “tumor cells are positive for CK7 and MOC31, negative for CK5/6, TTF-1, CK20, ER and PR. Her2 immunostain is positive. Ki-67 is reported 85%”. In most cases there were between 3 and 10 biomarkers delineated, adding to the challenge of correct classification.

Analysis

Class imbalance is likely to be a factor when assessing performance with these types of data: hormone receptor status is unbalanced in the population, and, depending on the study, researchers may be interested in either class of subjects. Complicating this is the fact that some performance metrics, such as accuracy, can give the impression of good performance even when a method is unable to accurately predict the class of interest. Therefore, we will break down performance by class (training a separate model to predict each class) and provide F1 and AUC as our major performance metrics, given that accuracy can be misleading in these circumstances. We will not attempt to use sophisticated class balancing approaches in this assessment, in order to minimize the number of factors under consideration. F1 is defined as follows:

$$F1=\frac{2\times \frac{TP}{TP+FP}\times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP}+\frac{TP}{TP+FN}}$$

where TP, FP, and FN stand for the number of true positives, false positives, and false negatives, respectively. In other words, F1 is the harmonic mean of the precision and the recall. We will assume that each of our NLP methods can produce a score representing its confidence in the predicted class label of a sample, and, for this work, that a sample can belong to only one of two classes. The AUC is then simply the probability that a random sample truly of one class is scored higher than a random sample from the other class.
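Both metrics are available off the shelf; for instance, with scikit-learn (a toy sketch with made-up labels and scores, not our study data):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Made-up binary labels and classifier confidence scores for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)  # hard labels at a 0.5 cut point

print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))  # P(random positive outranks random negative)
```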

We will compare the performance of our approach to two popular vectorization methods: n-grams [9,11] and word2vec [14], each combined with either of two machine learning algorithms, support vector machines (SVM) and artificial neural networks (NN). For n-grams, we rely on the CountVectorizer module of the scikit-learn library for Python. For word2vec, we used the gensim library for Python and chose the skip-gram model. For SVM, we used the SVC module, and for NN, the MLPClassifier module.

There are two important stages to consider, each with its own associated algorithms: (1) data structuring, or vectorization, and (2) learning and prediction. At both stages, important choices can be made that affect performance. At the vectorization stage, all of these approaches have hyperparameters. For our approach, the only hyperparameter is the number of top (i.e. most relevant) words to model. This was set by training and testing on an independent dataset, and the resulting setting was then used for all the analyses here. This will be the recommended default setting for our method, although performance could likely be improved by using separate training, validation, and test data; however, we wanted to preserve our sample size in this case. For n-grams, we used a number of different settings to determine the effect of considering greater or fewer numbers of words: [1,2], [2,2], [1,3], [2,3], and [3,3], where the numbers represent the ranges of the number of words from which n-grams are built. Furthermore, the vectors were transformed by IDF [15]. The word2vec approach has more hyperparameters, so we searched a number of parameter sets on the independent dataset to determine those with the best performance. We found settings in a relatively common range (size = 100 for the dimensionality of the vectors, window = 6 for the maximum distance between a word and the predicted word, and negative = 5 for the number of negative words sampled).

The neural network structure was determined using a grid search over the n-grams and word2vec methods. Although the differences in performance were minimal overall, the same structure was found to be best for both methods. We therefore also used this structure for our own method, in order to give those methods any possible advantage. This structure contained three hidden layers of 50, 50, and 100 nodes, respectively, using rectified linear units (ReLU) for their activations, followed by a final sigmoid layer for classification. L2 regularization of the network was also used, with the hyperparameter determined separately for each method on the independent dataset. Again, better performance could likely be achieved by tailoring this solution. As noted, we have taken measures to ensure that the comparison is as fair as possible: in all comparisons, exactly the same training and test data were used for all models, and all results are presented as the average over a threefold cross-validation.
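To illustrate the comparison pipelines, the sketch below assembles the n-gram baselines with scikit-learn; TfidfVectorizer combines the CountVectorizer n-gram step with the IDF transform. The texts, labels, and L2 strength (alpha) are placeholders rather than our study data or tuned values, and the word2vec/gensim variant is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for the pathology report texts and binary receptor labels.
texts = ["positive for er", "er is negative", "er: 90%, strongly positive",
         "negative staining for er", "er positive (80%)", "er negative"]
labels = [1, 0, 1, 0, 1, 0]

# n-grams + IDF, then the 50/50/100 ReLU network described above.
nn_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # the [1,2] n-gram setting
    MLPClassifier(hidden_layer_sizes=(50, 50, 100), activation="relu",
                  alpha=1e-4,  # L2 strength; tuned per method in the experiments
                  max_iter=500, random_state=0),
)

# n-grams + IDF with a class-weighted SVM, as in the comparisons.
svm_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SVC(class_weight="balanced"),
)

# Average F1 over a threefold cross-validation on the same splits.
print(cross_val_score(nn_pipeline, texts, labels, cv=3, scoring="f1").mean())
print(cross_val_score(svm_pipeline, texts, labels, cv=3, scoring="f1").mean())
```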

Results

Our results (Table 2) show that RWOV, using an NN classifier, has consistently high accuracy in detecting ER, PR, and HER2 status compared to the other vectorization approaches. Furthermore, this result holds irrespective of which class is being predicted. In terms of class imbalance, the other approaches saw a noticeable decline in performance for the underrepresented class in most cases (i.e. ER−, PR−, and HER2+), particularly in terms of F1 score, despite the fact that class weighting was enabled for SVM. In every case, RWOV had the highest F1 score. This is particularly notable for the HER2+ class, which included only 42 HER2+ subjects.

Table 2 Classification performance based on vectorization method.

For both F1 score and AUC, RWOV-NN (i.e. RWOV using neural networks) had the best performance on every task, with the exception of AUC for PR− classification; even in that case, it was very close (0.95 vs. 0.96 for SVM with n-grams). In some cases, the other methods equaled RWOV's performance at particular settings. Differences in AUC were tested using the pROC package for the R statistical environment, using the method of DeLong et al. [16] (Table 3). In most cases, the differences in AUC were not significant, although for HER2+, and in most comparisons for HER2−, RWOV-NN had a significantly greater AUC than every other approach.

Table 3 P-values for difference in AUC between RWOV-NN and other methods.

We created 95% bootstrap confidence intervals for all AUCs and F1 scores; these are shown in Figs 1 and 2, respectively. For most tasks, RWOV-NN had not only the highest performance but also consistently narrow confidence intervals. ROC curves are shown in Fig. 3.

Figure 1. 95% confidence intervals for AUC across breast cancer subtypes. Our approach is shown on the left, in black. RWOV-NN has consistently high AUC across the tasks.

Figure 2. 95% confidence intervals for F1 across breast cancer subtypes. Our approach is shown on the left, in black. In every case, RWOV-NN has the highest F1 score.

Figure 3. ROC curves for our method and the best of each comparison method. RWOV-NN consistently shows more clinically useful cut points, at which low false positive and high true positive fractions are possible, across the tasks.

From Figs 1 and 2, it can be seen that in every case where one method is significantly superior, it is RWOV that performs best. Figure 3 demonstrates that although the differences in AUC are typically not significant, RWOV-NN typically exhibits the best cut threshold, allowing a high true positive fraction at a low false positive fraction.

The top 30 words for each TOI are shown in Table 4, along with the mean frequency of their occurrence per observation. Note that the words have been stemmed to reduce them to their common roots or parts, often by truncation and in some cases by substitution with an identifier. For example, “tumor” becomes “tum”, and “comma” and “,” both become “comm”.

Table 4 The 30 most frequently occurring words for each TOI, with the mean frequency of occurrence per observation.
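A reduction of this kind can be sketched as truncation plus a small substitution table; the helper below is purely illustrative and not our exact stemming rules:

```python
SUBSTITUTIONS = {",": "comm", "comma": "comm"}  # identifiers for special tokens

def stem(token, length=3):
    """Map a token to a short root by substitution or truncation (illustrative)."""
    return SUBSTITUTIONS.get(token, token[:length])

print([stem(t) for t in ["tumor", ",", "positive"]])  # ['tum', 'comm', 'pos']
```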

Discussion and Conclusions

Electronic health records have enormous promise for facilitating research into improving patient treatment. However, much of the EHR is stored as unstructured data. It is therefore time-consuming to extract data from the EHR to pre-screen patients for clinical trials or to perform feasibility analysis for recruitment, because these records must be examined manually. Additionally, useful observational studies could be performed were it not for this major limitation. Nevertheless, given that the primary purpose of the EHR is to support patient care, it would be inappropriate to change its structure purely to facilitate research. It is therefore imperative to develop methods for structuring and learning from these data that can facilitate these goals.

In this study, we have demonstrated that our method, Relevant Word Order Vectorization (RWOV), combined with a neural network, shows great promise in tackling this challenge. On a relevant use case, identifying the hormone receptor status of breast cancer patients, RWOV showed consistently high accuracy across all three classification tasks. In most cases, it had the highest accuracy of any method examined in this study. Of particular importance, RWOV maintained high accuracy on the most poorly represented classes. This matters because some studies must include patients based on these poorly represented classes (for example, HER2+), and poor accuracy might lead to some subjects being unnecessarily excluded.

The reason for RWOV's performance on these tasks seems clear. It depends on a unique vectorization method that determines the most important words for classifying a particular case, along with their relative location with reference to a term of interest. This relevant word ordering is well suited to natural language processing in electronic health records, where the data are semi-structured due to the repetitive manner in which healthcare providers often enter text. Our algorithm is able to take advantage of this semi-structure to be a more powerful learner. The relatively poor performance of word2vec, which is a well-respected approach, is likely due to the small sample size; it typically depends on larger samples to perform well.

Although RWOV performed at least as well as the other methods on the most common class, it particularly shone in predicting the least common class (ER−, PR−, and HER2+). This was particularly evident in the case of HER2+. For this dataset, there were relatively few positive observations (just 42 of the 229 total), and the class has higher semantic complexity. Although we could not determine a single reason for RWOV's performance edge on this problem, in at least some cases it was able to recognize a class with a more complex set of dependencies. For example, RWOV-NN and RWOV-SVM correctly classified an observation with a snippet of text like the following (the full text is longer): “ER: Positive (85%) PR: Positive (80%) HER2: Equivocal (2+) Ki-67: 23% FISH analysis for HER2/neu was performed and was reportedly negative.” The other methods, with the exception of NN-W2V, were unable to. For HER2, when the initial test is equivocal, an additional test (FISH analysis) is performed, making the classification more complex, especially given the limited sample size. While many approaches focus on finding terms that differentiate texts, RWOV focuses on finding positional interactions of terms, which may give it an edge in these scenarios.

This initial approach, although promising, is only the beginning, and there are many ways our method could likely be improved. The learning algorithm was used off the shelf to provide a proof of concept, but it would likely benefit from more customization, such as tuning the structure of the neural network and applying network optimization methods. We were also limited in the amount of data we could provide, due to the laborious process of hand-labeling the required examples; we therefore think performance could be improved by implementing a semi-supervised approach, in addition to a larger training dataset. We expect that the ideas put forward in this article will stimulate research into other approaches that account for the unique characteristics of EHR. At the same time, RWOV will directly benefit research at the University of Kansas Cancer Center by increasing and improving the data available for projecting the feasibility of studies at this institution. To help other institutions benefit from this approach, RWOV is available for Python in a public code repository at https://github.com/jeffreyat/RWOV.git.