Background

Through lifelong and nationwide Electronic Health Record (EHR) systems, ever larger amounts of patient information will be available at clinicians’ workplaces. Flooding the user with highly granular and partly redundant information becomes especially problematic when patients have chronic diseases, multiple diagnoses and numerous in- and outpatient treatment episodes.

This circumstance hampers a quick overview of the most important facts, with a possibly negative influence on the quality of medical decisions. For a long time, problem lists or diagnosis lists in medical records have been key information sources, because they contain a condensed selection of the most relevant information items, filtered and summarized by physicians.

In the setting in which this study is embedded, i.e. a large Austrian hospital network, the clinical information system displays problem list entries of at most 50 characters. Furthermore, problem lists are, first of all, diagnosis lists, and each coded diagnosis at each hospital stay produces a new problem list entry. Due to the length limitation of list items, most official ICD labels are overwritten by the users, often drastically abbreviated and enriched with additional information such as the time or other context of a diagnosis.

The main objective of the so-called patient-centered QuickView mode, which we have developed and deployed via a web-based front-end of the clinical information system i.s.h.med, is to improve physicians’ access to problem list entries, especially by reducing redundancy. Whereas the ultimate goal of QuickView is a navigable, user-centered overview of a patient’s diseases, medications, procedures and laboratory results, we here limit ourselves to problem-list-like diagnosis lists, most of which are coded with ICD-10. Such lists easily reach a length of hundreds of items for elderly or multi-morbid patients. We intend to provide a topic-based grouping, which can be exploited for navigation and information visualization within QuickView.

Analyzing EHR content with supervised and unsupervised machine learning methods has become a widely used approach to gain insights into clinical information like diagnoses [1] or medications [2,3,4,5], and at the same time it is also a matter of investigation in different academic challenges [6].

Information extraction from unstructured EHR data like clinical narratives is a generally challenging task, due to language-specific idiosyncrasies like short forms (abbreviations [7, 8], acronyms [9, 10]), spelling and typing mistakes, syntactic incompleteness, specialist jargon, negations [11] or non-standardized numeric expressions, to mention just a few [12, 13]. The automatic assignment of ICD diagnosis codes has received special attention in various research projects due to its importance for therapy planning, billing and medical decision support.

Koopman et al. [14] used support vector machines (SVMs) with term- and concept-based features to automatically detect cancer diagnoses and classify them according to ICD-10. An F-measure of 0.70 was reported for detecting the type of cancer. Koopman et al. [15] also automatically classified death certificates with respect to influenza, diabetes, pneumonia and HIV. A supervised approach with SVMs was used for ICD-10 coding, resulting in an F-measure of 0.80. Ning et al. [16] tested a Chinese ICD-10 coding approach on medical narratives. Based on a word-to-word similarity metric, they structured the ICD-10 codes hierarchically and assigned codes to unlabeled documents with an F-measure of 0.91. Chen et al. [17] enhanced the longest common subsequence algorithm for ICD-10 mapping to Chinese clinical narratives, yielding an F-measure of 0.81 for this task. Boytcheva [18] achieved an F-measure of 0.84 using a multi-class SVM with a max-win voting strategy in combination with a text preprocessing module for ICD-10 coding of Bulgarian clinical narratives.

However, features used in a supervised framework are often tied to language-specific patterns, even though more recent deep learning methods reduce the need for use-case-specific feature engineering, e.g. for clinical narrative de-identification [19, 20].

In the following sections we present and evaluate a minimally language-dependent approach to semantic grouping of problem list entries, without the need for manual feature engineering. We refrain from a purely supervised approach and instead use a post-ICD-10 coding methodology, with the side effect that documents to which no code could be assigned are nevertheless grouped into semantically meaningful clusters.

Methods and materials

Intra-patient data set

For intra-patient inspection, we used data from five de-identified nephrology patients, each of them having between 250 and 861 50-character long problem list statements written in German, covering time intervals from 12 to 22 years. A special feature of these code-description pairs is that physicians can overwrite the contents of a 50-character text field originally filled with standardized text generated by an ICD-10 coding plug-in. The list view therefore consists of different standardized and personalized diagnosis entries, the latter often enriched with additional context like time references, procedures, or medications. Additionally, ICD-10 codes with no textual description as well as entries without ICD-10 codes occur. This makes these lists, originally devised as ICD-based diagnosis lists, resemble problem lists, a concept rooted in Anglo-Saxon medical traditions but uncommon in German-speaking clinical communities.

Inter-patient data set

We used the sampling theorem with Chernoff bounds [21, 22] in order to estimate a statistically representative sample size for nephrology patients for the inter-patient inspection:

$$ n \ge \frac{3}{\varepsilon^2}\,\ln\frac{2}{\delta} $$
(1)

With an accuracy of ε = 0.05 and a confidence of 1 − δ = 0.95, 4430 non-identical, ICD-10 coded, de-identified, 50-character long text snippets were chosen as a representative linguistic sample (4430 ≥ n = 4427). The advantage of using the sampling theorem is its independence from the overall initial pool size when estimating the number of samples. By applying this theorem, we claim that a representative syntactic pattern of the sampled corpus, in our case the non-identical short ICD-10 code descriptions, lies with a probability of 95% within ±5% of the overall observations. With this approach to subsample size estimation we covered a significant amount of the linguistic variation in the clinical domain for inter-patient post-ICD-10 encoding. Finally, we merged the five de-identified patients from the intra-patient pool with the 4430 ICD-10 samples.
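
As an illustration, the bound of Eq. 1 can be reproduced with a few lines of Python; the function name below is ours and serves only to make the calculation explicit.

```python
# Minimal sketch reproducing the sample size estimate of Eq. 1
# for epsilon = 0.05 and delta = 0.05 (i.e. 95% confidence).
import math

def chernoff_sample_size(epsilon: float, delta: float) -> int:
    """n >= 3 / epsilon^2 * ln(2 / delta)"""
    return math.ceil(3 / epsilon ** 2 * math.log(2 / delta))

print(chernoff_sample_size(0.05, 0.05))  # -> 4427, matching the reported bound
```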

Problem description

A patient $P_{1..i}$ has a set of diagnosis list items $I_{1..k..l}$, where $I_k = (\text{ICD-10}_k, d_k)$ defines the 50-character long description $d_k$, which we refer to as a document in the following analysis. One fraction, $I_{coded} = I_{1..k}$, is coded and the other, $I_{uncoded} = I_{k+1..l}$, is without codes, with only the text snippets $d_{k+1..l}$ existing. Since an immediate overview of all list items $I_{1..l}$ of a patient $P_i$ is not possible with longer lists, our solution attempts to semantically group them into $n$ sets $C_{1..n}$, so that content navigation through all list items $I_{1..l}$ via $C_{1..n}$ is supported.

For semantically grouping related list items $I_{1..l}$, we make use of the fact that list items in $I_{coded}$ with the same 3-digit ICD-10 code are similar in content. The codes already attached to a document thus form a manual ground truth for judging semantic similarity. On the other hand, content similarity of a subgroup of list items $I_{i..j}$ out of $I_{1..l}$ is given by the string similarity between two list items $(I_1, I_2)$, which can be expressed via a function $f_{sim}(I_1, I_2) = sim = f_{sim}(d_1, d_2)$. Therefore $sim$ is an indicator of content similarity.

In cases where list items have the same ICD-10 code, we clustered them, forming $C_{ICD\text{-}10} = C_{1..i}$ ICD-10 content groups. We then tried to post-assign ICD-10 codes to the uncoded list items $I_{uncoded}$, while those list items that received no post-ICD-10 code could at least be grouped as similar in content via a certain level of $sim$, forming the clusters $C_{sim} = C_{i+1..n}$. We therefore evaluated the correct post-ICD-10 assignment of list items in $C_{ICD\text{-}10}$ and the correct clustering of the content groups $C_{sim}$, where no code could be assigned, based on string similarity.

We aimed to achieve this in one go by using a hierarchical clustering approach in which ICD-10 codes are assigned to non-coded list items and, at the same time, the optimal $sim$ boundary for string-based list item grouping is inferred, with a minimally language-dependent preprocessing strategy. We apply the methodology in an intra-patient and an inter-patient scenario. For inter-patient post-ICD-10 assignment we expected the number of assigned ICD-10 codes to be significantly higher than in the intra-patient scenario, because codes can be assigned by learning from examples of other patients.

Evaluation methodology

We use the metrics Precision = #TPs / (#TPs + #FPs), Recall = #TPs / (#TPs + #FNs) and F-measure = 2 · Precision · Recall / (Precision + Recall) [23] in order to evaluate the accuracy of our topic groups $C_{1..n}$, for the intra-patient and for the inter-patient approach, respectively. True Positive (TP): a topic gets correctly assigned. False Positive (FP): a topic gets incorrectly assigned. False Negative (FN): a topic should have been assigned. True Negative (TN): a topic was correctly not assigned. A topic can be specified via a specific 3-digit ICD-10 code or via a certain content cluster in case no post-ICD-10 code can be assigned.
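
For completeness, these definitions translate directly into code; the helper below is a generic sketch operating on TP/FP/FN counts and is not part of the evaluation pipeline itself.

```python
# Generic sketch of the Precision, Recall and F-measure definitions above,
# computed from true positive, false positive and false negative counts.
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure

print(precision_recall_f(tp=68, fp=12, fn=20))  # illustrative counts only
```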

Data preprocessing

The 50-character text segments were normalized using the following Lucene [24]-based NLP processing chain: a StandardTokenizer for tokenizing the very short narratives; a StandardFilter applying a basic orthographic normalization; a LowerCaseFilter to eliminate all upper-case occurrences; a StopWordFilter erasing a list of defined tokens; and a SnowballFilter (“German2”) for stemming (Fig. 1). Finally, a specific set of characters was removed from the normalized token stream via the regular expression ([\d\.\,\_\:]+). We compensated for the typically German phenomenon of word compounds, e.g. certain domain-specific affixes like “-itis” for inflammation or “-ektomie” for surgical removal, not with a dedicated word decompounding engine but with a character n-gram filter, choosing an initial window size of n = 3. A side effect of character n-gram modeling is that typing errors, commonly found in clinical narratives, have less impact on token dissimilarity in the VSM (Vector Space Model).

Fig. 1 Overview of the main data flow and preprocessing steps until the selected clustering approach has been applied at the end of the overall processing chain
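
The following Python sketch approximates the Lucene-based chain of Fig. 1; NLTK’s German Snowball stemmer and the small stop word set stand in for the SnowballFilter (“German2”) and StopWordFilter actually used, and the example entry is hypothetical.

```python
# Approximate re-implementation of the preprocessing chain: tokenization,
# lower-casing, stop word removal, stemming, character cleanup via the
# regular expression [\d\.\,\_\:]+ and character 3-gram decomposition.
import re
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = {"und", "mit", "bei", "der", "die", "das"}   # illustrative subset
STEMMER = SnowballStemmer("german")                        # stand-in for "German2"
CLEANUP = re.compile(r"[\d\.\,\_\:]+")

def normalize(entry: str, n: int = 3) -> list[str]:
    """Return the character n-grams of a normalized 50-character entry."""
    tokens = re.findall(r"\w+", entry.lower())             # tokenize + lower-case
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    tokens = [CLEANUP.sub("", STEMMER.stem(t)) for t in tokens]  # stem + cleanup
    ngrams = []
    for t in tokens:
        ngrams.extend(t[i:i + n] for i in range(max(1, len(t) - n + 1)))
    return [g for g in ngrams if g]

print(normalize("Akutes Nierenversagen mit Hyperkaliämie 2014"))
```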

Vector space model

We mapped the EHR problem list items into a vector space using the VSM [25, 26], which models a set of documents $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$ as bags of words, where a document $d_j$ defines a point in the $m$-dimensional vector space, forming an $m$-dimensional feature vector. The dimensionality $m$ of the feature space is in our case defined via the unique character 3-gram types $t_1, t_2, \ldots, t_i, \ldots, t_m$ of the preprocessed document collection $D$, and the VSM is therefore described via an $m \times n$ matrix $X$. We applied the term frequency – inverse document frequency (tf-idf) weighting scheme on $X$ and used the cosine similarity between two documents $d_i$ and $d_j$ to obtain the semantic similarity $sim$ between two list items $I_i$ and $I_j$.
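
The sketch below illustrates the same construction with scikit-learn as a stand-in for our pipeline: character 3-gram tf-idf vectors and pairwise cosine similarities over a few hypothetical, already normalized entries (note that scikit-learn stores the transpose, an n × m document-term matrix).

```python
# Character 3-gram vector space model with tf-idf weighting and cosine
# similarity as f_sim; the documents are hypothetical normalized entries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "akut nierenversag",
    "akut nierenversag hyperkali",
    "chron niereninsuffizienz",
]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(docs)   # n x m tf-idf matrix (transpose of the m x n matrix X above)
sim = cosine_similarity(X)           # pairwise f_sim(d_i, d_j)
print(sim.round(2))
```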

Latent semantic analysis

We examined Latent Semantic Analysis (LSA) and different degrees of dimension reduction of the semantic space for their impact on our topic model approach. The mathematical core of LSA [27, 28] is a Singular Value Decomposition (SVD) of the term-document matrix, $X = TSD^T$, yielding the orthonormal matrices $T$ and $D^T$ with the eigenvectors of $XX^T$ and $X^TX$. $T$ defines the term matrix and $D^T$ the document matrix. The square roots of the eigenvalues of $XX^T$ and $X^TX$ are embedded in $S$. The degree of dimensionality reduction can be controlled by eliminating the lowest eigenvalues and their eigenvectors down to a new dimension $k$, resulting in the dimensionality-reduced spaces $T_k$ and $D_k^T$, respectively. The orthonormal semantic spaces $T_k$ and $D_k^T$ can be seen as one kind of distributional semantics and are exploited in various information retrieval and information extraction scenarios.
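
A corresponding sketch of the LSA step uses a truncated SVD of the tf-idf matrix; the number of retained dimensions k is a toy value here, whereas in the experiments reported below it is varied as a percentage of the m character 3-gram dimensions.

```python
# LSA sketch: truncated SVD of the character 3-gram tf-idf matrix, followed
# by cosine similarities in the reduced document space D_k. Data are toy
# examples; TruncatedSVD stands in for the full SVD X = T S D^T.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["akut nierenversag", "akut nierenversag hyperkali",
        "chron niereninsuffizienz", "diabetes mellitus typ"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(docs)

k = 3                                   # toy value; the study varied k as a percentage of m
svd = TruncatedSVD(n_components=k, random_state=0)
D_k = svd.fit_transform(X)              # documents in the reduced semantic space
print(cosine_similarity(D_k).round(2))  # f_sim in the LSA space
```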

Clustering methodology

For content-based grouping into $n$ sets $C_{1..n}$ we applied a clustering approach. First, for all patient-specific documents $d_{1..l}$ (50-character long phrases), including the already ICD-10 coded documents, we applied a hierarchical agglomerative clustering method implemented in the R package fastcluster [29]. In brief, agglomerative clustering works as follows: all documents are initially assigned to their own cluster and then iteratively merged, based on a specific distance metric, until there is just a single cluster. To decide whether two clusters collapse into a single one, we used the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) variant. It computes the distance between two clusters $C_1$ and $C_2$ based on the pairwise average distances between their assigned documents $\mathbf{d}$:

$$ \frac{1}{\left|C_1\right|\left|C_2\right|} \sum_{\mathbf{d}_i \in C_1} \sum_{\mathbf{d}_j \in C_2} \left(1 - f_{sim}\left(\mathbf{d}_i, \mathbf{d}_j\right)\right) $$
(2)

We hypothesize that the string similarity of textual problem list entries (i.e. the documents) correlates with their ICD-10 code assignments; we therefore expect that UPGMA in combination with the chosen cosine distance metric delivers good results. We applied different cut heights to the resulting dendrogram and inferred the cut-off (cut height of the dendrogram) that most accurately reproduced the already coded ICD-10 clustering scheme ($I_{coded}$). A big advantage of UPGMA clustering is that we can directly relate the resulting clusters to the cosine distances between the documents, whereas other algorithms, such as k-means, require a pre-defined parameter k for the number of clusters. Accuracy was estimated by the F-measure for the intra- as well as the inter-patient scenario.
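
The sketch below mirrors this procedure with SciPy as a stand-in for the R fastcluster package used in the study: UPGMA linkage on cosine distances, dendrogram cuts at varying heights, and selection of the cut that best reproduces a hypothetical 3-digit ICD-10 grouping. As a simplification, each cut is scored here with a pairwise F-measure rather than the topic-level F-measure used in the study.

```python
# UPGMA clustering of character 3-gram tf-idf vectors with a cut-off search;
# documents and ICD-10 codes are hypothetical.
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs  = ["akut nierenversag", "akut nierenversag hyperkali",
         "chron niereninsuffizienz", "diabetes mellitus typ 2",
         "diab mell typ 2 insulinpflichtig"]
codes = ["N17", "N17", "N18", "E11", "E11"]          # 3-digit reference (I_coded)

X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(docs).toarray()
dist = pdist(X, metric="cosine")                     # 1 - f_sim(d_i, d_j), cf. Eq. 2
Z = linkage(dist, method="average")                  # UPGMA

def pairwise_f(labels, codes):
    """Do document pairs sharing a 3-digit ICD-10 code also share a cluster?"""
    tp = fp = fn = 0
    for i, j in combinations(range(len(codes)), 2):
        same_cluster, same_code = labels[i] == labels[j], codes[i] == codes[j]
        tp += same_cluster and same_code
        fp += same_cluster and not same_code
        fn += same_code and not same_cluster
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

cuts = [(h, pairwise_f(fcluster(Z, t=h, criterion="distance"), codes))
        for h in np.linspace(0.05, 1.0, 20)]
best_height, best_f = max(cuts, key=lambda c: c[1])
print(f"best cut height {best_height:.2f} with pairwise F = {best_f:.2f}")
```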

In fact, one could also infer an appropriate cut-off with more conservative approaches like the Elbow [30] or Silhouette [31] method to enable a purely unsupervised setting. However, in our framework these methods would separate clusters exclusively based on string similarity, which may not capture the true n-gram variances within the semantic clusters and would, as a consequence, likely produce a high false negative rate.

$$ \text{ICD-10}_c\left(I_{uncoded}\right) = \text{ICD-10}_c\left(\max\left\{f_{sim}\left(\mathbf{d}_l, \mathbf{d}_k\right)\right\},\ I_{coded}\right) $$
(3)

Equation 3 gives a formal description of how the post-coding of ICD-10 codes was executed. Unlabeled documents ($\mathbf{d}_l \in I_{uncoded}$) were coded if and only if they appeared in the same cluster $C$ ($\mathbf{d}_l, \mathbf{d}_k \in C$) together with at least one ICD-10 coded document ($\mathbf{d}_k \in I_{coded}$). In cases where documents with different ICD-10 codes were clustered in the same group, we assigned the label of the document with the smallest cosine distance, transforming the diagnosis into a coded list item.
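
A sketch of this rule is shown below; it assumes cluster labels, partial ICD-10 codes (None for uncoded entries) and the tf-idf matrix from the clustering sketch above, and all names are illustrative.

```python
# Post-ICD-10 coding (Eq. 3): an uncoded document inherits the code of the
# most similar coded document within its cluster; clusters without any coded
# member remain as pure string-similarity groups C_sim.
from sklearn.metrics.pairwise import cosine_similarity

def post_assign(labels, codes, X):
    """Return a copy of `codes` with post-assigned ICD-10 codes where possible."""
    post = list(codes)
    sim = cosine_similarity(X)                        # f_sim(d_i, d_j)
    for i, code in enumerate(codes):
        if code is not None:
            continue                                  # already coded
        peers = [j for j, c in enumerate(codes)       # coded documents in the same cluster
                 if c is not None and labels[j] == labels[i]]
        if peers:
            nearest = max(peers, key=lambda j: sim[i, j])
            post[i] = codes[nearest]                  # inherit the closest peer's code
    return post
```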

Results and discussion

We used a hierarchical clustering approach to semantically cluster EHR problem lists, where semantic similarity was specified by ICD-10 codes and string similarity. The main challenge of this approach is to find the optimal cut-off height of the resulting dendrogram to ensure optimal post-ICD-10 coding and reasonable string clustering at the same time. With the hypothesis that ICD-10 coding correlates with string similarity, we were able to exploit the already coded 50-character texts as a reference for this optimization problem.

Specifically, we inferred a cut-off such that the coded 50-character long diagnosis texts with the same 3-digit ICD-10 code fall into the same group based on string similarity. This is achieved by iteratively applying different cut-off heights and finally choosing the one with the maximum F-measure. For this study we report an averaged intra-patient F-measure of 0.70 at a cut-off height of 0.90 for patients $P_{1..5}$ and an F-measure of 0.47 at a cut-off height of 0.97 for the inter-patient approach. From these first results we conclude that our assumption only holds for a subset of diagnosis lists reflecting an ICD-10 cluster (intra-patient). Re-sampling a fully representative character 3-gram distribution (inter-patient) of the ICD-10-specific diagnosis texts strongly contradicts this assumption, due to the high variances observed within the ICD-10 groups. However, while the obtained cut-off performs poorly in detecting true negatives, it does remarkably well in post-assigning ICD-10 codes.

In an additional investigation, as depicted in Fig. 2, we inspected the influence of transforming the character 3-gram term-document matrix $X$ into its semantic orthogonal document space $D_k^T$ and varied the degree of dimension reduction $k$ at different levels. We observed a maximum F-measure of 0.58 using 40% of the most relevant dimensions for the intra-patient case, and an F-measure of 0.42 with 10% of the most relevant dimensions for the inter-patient case. Thus, mapping the problem into a linearly transformed, reduced semantic space via LSA has so far not improved the performance of our approach.

Fig. 2 Averaged step-wise intra-patient and inter-patient dimension reduction of the semantic document space

Table 1 highlights the results for the intra-patient post-ICD-10 coding at the top and the string clustering results at the bottom. On average, 68% of the non-coded list items were post-assigned with an F-measure of 0.77. The remaining 32%, where no ICD-10 code could be assigned, formed consistent topic clusters with an F-measure of 0.85. We therefore report an overall list item grouping for the intra-patient inspection with an F-measure of 0.81.

Table 1 Intra-patient post-ICD-10 coding and string clustering results

From Table 2 we see that, in the inter-patient setting, almost all non-coded list items get a post-assigned ICD-10 code, with an overall F-measure of 0.87. This result is quite remarkable compared to the literature review, especially considering the suboptimal cut-off we inferred for the inter-patient inspection, which is accompanied by a lower precision compared to the intra-patient results in Table 1. However, the expected recall gain had an overall positive impact on performance as judged by the F-measure.

Table 2 Inter-patient post-ICD-10 coding

The post-ICD-10 coding rate is in fact so high that the portion of list items without a code has no relevant impact on the overall topic groups $C_{1..n}$ supporting the navigation through all list items $I_{1..l}$ via $C_{1..n}$. We therefore report an overall list item grouping for the inter-patient inspection with an F-measure of 0.87, mainly dominated by ICD-10 codes.

Tables 3 and 4 show that the average number of identified topics in the intra-patient case ($\mu_{Intra}$ = 78.4) was lower than in the inter-patient case ($\mu_{Inter}$ = 83.4), and that initial list views, as in the case of Patient 3 with more than 850 entries, can be semantically grouped into fewer than 100 entry points. This is equivalent to a semantic compression rate of up to 89% of the original list item size.

Table 3 Number of the identified intra-patient topics out of the initial disease list items
Table 4 Number of the identified inter-patient topics out of the initial disease list items

Despite the good results of our approach, two major challenges need to be addressed: i) Some textual expressions should be coded with more than one ICD-10 code. For instance, in the case of “Akutes Nierenversagen mit Hyperkaliämie” (acute kidney failure with hyperkalaemia), N17 (acute renal failure) should be assigned to “Akutes Nierenversagen” (acute kidney failure) and E87 (other disorders of fluid, electrolyte and acid-base balance) to “Hyperkaliämie” (hyperkalaemia). So far we have inferred exactly one code per 50-character list entry. ii) Some of the post-assigned codes turn out to be plainly wrong, because post-assignment operates at the quality level of clinical routine documentation.

Conclusions

In this paper we have motivated a hierarchical cluster-based approach with a minimally language-dependent preprocessing strategy for grouping clinical problem lists into distinct, semantically similar clusters in order to support patient-based disease topic navigation. This functionality is planned to be implemented within the QuickView software accessible in a hospital environment.

Our methodology not only post-assigns ICD-10 codes but also builds semantically similar clusters based on string similarity. Applying this method at an intra-patient level implies that possible post-ICD mappings are missed due to the limited patient-focused scope (high false negative rate), while nevertheless achieving a useful clustering of list items where no code could be assigned. For this reason, we extended the scope to an inter-patient examination of the same methodology and motivated a sufficient sample size in order to capture a common linguistic fingerprint. With an acceptable negative impact on precision we were able to boost recall, so that the overall topic modeling of the disease space was reduced to post-ICD-10 codes only. However, the inter-patient cut-off height of the resulting dendrogram is at a very low level, with the result that the inter-cluster variance is no longer optimal with regard to string similarity. As a consequence, a substantial number of list items get an ICD-10 code assigned by accident.

In a further investigation we plan to move away from an F-measure-driven, single optimized cut-off strategy and pursue a strategy in which the ICD-10 cluster-specific variances of our proposed normalized character 3-gram features can be studied more reliably. In this case, a more detailed inspection of the level of character n-gram decomposition could also be carried out. We hypothesize that, while estimating the optimal number of disease clusters based on a between-within variance inspection, already encoded ICD-10 examples can act as proxies for correct post-ICD encoding and may therefore compensate for the precision loss at a high recall level. One avenue would be a more conservative method such as Elbow or Silhouette to infer the best cut-off purely based on string similarity and to dynamically encode potential false negatives in a post-processing step in which each ICD-10 cluster is treated independently, based on its feature pattern space or character n-gram distribution, respectively.