1 Introduction

In clinical routine, much patient-related information is recorded in unstructured or semi-structured text documents and stored in large databases. These documents contain valuable information for clinicians, which can be used, e.g., to improve or support the treatment of long-term patients or clinical studies. Even today, information access is often manual, which is cumbersome and time-consuming. This creates a demand for efficient and easy-to-use tools to access relevant information. Information extraction (IE) can support this process by detecting particular medical concepts and the relations between them to capture the context. Such structured information can be used to improve use cases such as the generation of cohort groups or clinical decision support.

Generally, IE can be addressed in many different ways. If sufficient training instances are available, supervised learning is often the technique of choice, as it directly models expert knowledge. In the context of detecting medical concepts (named entity recognition; NER) and their relations (relation extraction; RE), conditional random fields (CRF) (Lafferty et al. 2001) and support vector machines (SVM) (Joachims 1999) have been popular supervised methods that were frequently used over the last decade. In recent years, neural network based supervised learning has gained popularity (see, e.g., Nguyen and Grishman (2015); Sahu et al. (2016); Zeng et al. (2014)).

In the context of IE from German clinical data, not much work has been done so far. One reason is the unavailability of clinical data resources in German, as discussed in Starlinger et al. (2016). Only a few publications address NER and RE from German clinical data. Hahn et al. (2002) focus on the extraction of medical information from German pathology reports in order to acquire medical domain knowledge semi-automatically, while Bretschneider et al. (2013) present a method to detect sentences which express pathological and non-pathological findings in German radiology reports. Krieger et al. (2014) present first attempts at analyzing German patient records. The authors focus on parsing and RE, namely symptom-body-part and disease-body-part relations. Toepfer et al. (2015) present an ontology-driven information extraction approach which was validated and automatically refined by a domain expert. Their system aims to find objects, attributes and values in German transthoracic echocardiography reports.

In contrast, we focus on detecting medical concepts (also referred to as named entities, NE) and their relations in German nephrology reports. For both tasks, NER and RE, two different learning methods are tested: first a well-established method (CRF for NER, SVM for RE) and then a neural method for comparison. Note that the paper describes ongoing work, both in terms of corpus annotations and classification methods. The goal of this paper is to present first results for our use case and target domain.

2 Data and Methods

This section gives an overview of our corpus annotations and the models we use. Note that, due to the short format of the paper, method descriptions are brief. We refer the reader to the corresponding publications for details.

2.1 Annotated Data

Our annotation schema includes a wide range of different concepts and (binary) relations. The most frequent concepts and relations used in the experiments are listed in Tables 1 and 2, including a brief explanation. The ongoing annotations (corpus generation) include German discharge summaries and clinical notes from a kidney transplant department. An example of our annotations is presented in Fig. 1. Both types of documents are generally written by medical doctors, but differ noticeably. For more details on corpus generation, please see Roller et al. (2016).

Fig. 1. Annotated sentence

For the following experiments 626 clinical notes are used for training and evaluation. Clinical notes are rather short and written during or shortly after a patient visit. Currently, only 267 of those documents contain annotated relations. The overall frequency of named entities and relations is included with the results in the experimental section (see Tables 3 and 4).

Table 1. Annotated concepts
Table 2. Annotated relations

2.2 Machine Learning Methods

NER – Conditional Random Field (CRF). Conditional random fields have been used for many biomedical and clinical named entity recognition tasks, such as gene name recognition (Leaman and Gonzalez 2008), chemical compound recognition (Rocktäschel et al. 2012), or disorder name recognition (Li et al. 2008). One disadvantage of CRFs is that the right selection of features can be crucial to achieving optimal results, and the important features might change for a different domain or language. In this work we are not interested in exhaustive feature engineering. Instead, we re-use an existing feature setup as described by Jiang et al. (2015), who use word-level and part-of-speech information around the target concept. For our experiment we use the CRF++ (footnote 1) implementation.
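As an illustration of what "word-level and part-of-speech information around the target concept" can look like, the following minimal sketch builds a feature dictionary for one token from a symmetric context window. The window size, the feature names, the padding symbol and the example sentence are assumptions made for this sketch; they are not the exact templates of Jiang et al. (2015) or the CRF++ template syntax.

```python
def token_features(tokens, pos_tags, i, window=2):
    """Collect word and POS features in a +/- `window` context around token i."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
            feats[f"pos[{offset}]"] = pos_tags[j]
        else:
            # Positions outside the sentence get a padding symbol (assumption).
            feats[f"word[{offset}]"] = "<PAD>"
            feats[f"pos[{offset}]"] = "<PAD>"
    return feats


# Hypothetical example sentence with STTS-style POS tags.
tokens = ["Schmerzen", "im", "linken", "Knie"]
pos_tags = ["NN", "APPRART", "ADJA", "NN"]
features = token_features(tokens, pos_tags, 3, window=1)
```

Each token of a sentence would be mapped to such a dictionary, and the resulting sequence is what a linear-chain CRF toolkit consumes together with the label sequence.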

NER – Character-level Neural Network (CharNER NN). In addition to the well-established CRF for NER, we also use a neural CRF implementation (footnote 2) as introduced by Kuru et al. (2016). The model uses a character-level bidirectional LSTM with a CRF objective. Character-level inputs reduce the unknown-word problem: the vocabulary is much smaller than a word vocabulary, so features are less sparse, and the model can compensate for words unseen during training. This helps on smaller datasets.
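The effect of character-level inputs on unseen vocabulary can be made concrete with a small sketch. It compares how many test items are unseen at the word level and at the character level; the word lists are invented for illustration and are not taken from the corpus.

```python
def unseen_rate(train_items, test_items):
    """Fraction of test items that never occur in the training items."""
    seen = set(train_items)
    return sum(1 for x in test_items if x not in seen) / len(test_items)


# Hypothetical training and test vocabulary (German compounds).
train_words = ["Nierentransplantation", "Niere", "Schmerzen"]
test_words = ["Nierenschmerzen"]

# Word level: the test compound was never seen as a whole.
word_oov = unseen_rate(train_words, test_words)

# Character level: every character of the test compound occurs in training.
train_chars = [c for w in train_words for c in w]
test_chars = [c for w in test_words for c in w]
char_oov = unseen_rate(train_chars, test_chars)
```

In this toy case the word-level rate is 1.0 while the character-level rate is 0.0, which is the kind of gap that makes character models attractive for compounding-heavy German clinical text.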

RE – Support Vector Machine (SVM). SVMs are often the method of choice in the context of supervised relation extraction (Tikk et al. 2010). Despite their advantages, SVMs also suffer from the issue of optimal feature/kernel selection. Other problems are related to the imbalance of positive and negative instances in training and test data, which can significantly influence the classification results (Weiss and Provost 2001). Again, feature selection is not our focus. For this reason we use Java Simple Relation Extraction (footnote 3) (jSRE) (Giuliano et al. 2006), which uses a shallow linguistic kernel and is based on LibSVM (Chang and Lin 2011). jSRE provides reliable classification results and has been shown to achieve state-of-the-art results for various tasks, such as protein-protein interaction extraction (Tikk et al. 2010), drug-drug interaction extraction (Thomas et al. 2013) and extraction of neuroanatomical connectivity statements (French et al. 2012).

RE – Convolutional Neural Network (CNN). Besides the SVM, we also use a convolutional neural network for relation extraction. We employ a Keras (footnote 4) implementation of the model described by Nguyen and Grishman (2015), using a TensorFlow (footnote 5) backend and a modified activation layer. The architecture consists of four main layers: (a) lookup tables to encode words and argument positions into vectors, (b) a convolutional layer to create n-gram features, (c) a max-pooling layer to select the most relevant features and (d) a sigmoid activation layer for binary classification.
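The four layers can be traced in a deliberately tiny, untrained forward pass. The sketch below is plain Python rather than the Keras model used in the experiments, and everything in it is an assumption for illustration: the function names, the one-dimensional toy "embeddings", the single bigram filter and the example sentence.

```python
import math


def relative_positions(n_tokens, arg_index):
    """Distance of every token to one relation argument."""
    return [i - arg_index for i in range(n_tokens)]


def embed(tokens, arg1, arg2, word_table, pos_table):
    """(a) Look up word and position vectors and concatenate them per token."""
    p1 = relative_positions(len(tokens), arg1)
    p2 = relative_positions(len(tokens), arg2)
    return [word_table[t] + pos_table[a] + pos_table[b]
            for t, a, b in zip(tokens, p1, p2)]


def conv_max_pool(rows, filt):
    """(b) Apply one n-gram filter at every position, (c) keep the maximum."""
    width = len(filt)
    scores = [
        sum(w * x for k in range(width) for w, x in zip(filt[k], rows[start + k]))
        for start in range(len(rows) - width + 1)
    ]
    return max(scores)


def sigmoid(z):
    """(d) Squash the pooled score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Toy lookup tables (assumption): 1-dimensional vectors instead of learned embeddings.
word_table = {"Schmerzen": [1.0], "im": [0.0], "Knie": [1.0]}
pos_table = {d: [float(d)] for d in range(-2, 3)}

tokens = ["Schmerzen", "im", "Knie"]
rows = embed(tokens, arg1=0, arg2=2, word_table=word_table, pos_table=pos_table)

# One bigram filter over the 3-dimensional token vectors (untrained toy weights).
bigram_filter = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pooled = conv_max_pool(rows, bigram_filter)
probability = sigmoid(pooled)
```

A real model uses many filters of several widths, learned embeddings, and a trained output layer on top of the pooled feature vector; the sketch only shows how data flows through the four stages.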

3 Experiment

In this section we carry out named entity recognition and relation extraction on German nephrology reports. Given a sentence (token sequence), the task of NER is to assign the correct named entity label to each token in the test data. Relation extraction takes a sentence including the named entity labels as input and determines for each entity pair whether one of our target relations holds. Both classification tasks are evaluated based on precision, recall and F1-Score. Note that, for space reasons, not all relations of the example in Fig. 1 are used in the experiment.
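For reference, the three evaluation measures can be computed from counts of true positives, false positives and false negatives as in the following sketch. Whether the counts are taken per token or per entity is not specified here, and the example counts are invented.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct predictions, 2 spurious, 8 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts precision is 0.8 and recall is 0.5, so the F1-Score (their harmonic mean) is roughly 0.615, closer to the lower of the two values.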

Table 3. Concept classification results
Table 4. Relation classification results

3.1 Preprocessing

To carry out the experiment, text documents are processed by a sentence splitter, a tokenizer, a stemmer and a part-of-speech (POS) tagger. Sentence splitting and tokenization are needed to split documents into single sentences and single word tokens. We use JPOS (Hellrich et al. 2015) to tag part-of-speech information, since the tool is specialized for German clinical data. POS tags are used for both the CRF and the SVM. Additionally, we stem words for jSRE using the German Snowball stemmer in NLTK. CharNER and CNN do not require additional linguistic features as input.

For both NER and RE, the experiments are carried out multiple times, once for each named entity type and each relation type, for two reasons. First, in the context of named entities, tokens might be assigned multiple labels, which our classifiers cannot handle directly. Second, jSRE does not handle multi-class classification. Hence, we use a one-vs.-rest (OvR) classification to train separate models for each NER/RE type.
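The one-vs.-rest reduction can be sketched as follows: a token that carries several labels is turned into one binary target per type, and each binary sequence is then used to train its own model. The label names and the function name are hypothetical and do not reflect the actual annotation schema or file format.

```python
def one_vs_rest_labels(token_labels, target_type):
    """Binary targets for one type: 1 if the token carries that label, else 0."""
    return [1 if target_type in labels else 0 for labels in token_labels]


# Hypothetical multi-label annotation of a three-token sentence.
token_labels = [{"Medical_Condition"}, set(), {"Body_Part", "Medical_Condition"}]

all_types = sorted(set().union(*token_labels))
binary_tasks = {t: one_vs_rest_labels(token_labels, t) for t in all_types}
```

The third token belongs to two types at once, which a single multi-class sequence labeler could not express, while the two binary tasks represent it without conflict.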

3.2 Named Entity Recognition

Setup. The NER evaluation uses the OvR setup to train a single classifier (CRF or CharNER) per class. The experiment was run as a reduced 10-fold cross-validation on 3 out of 10 stratified dataset splits, since the CharNER model took a very long time to compute, despite using a GPU. Specifically, each split has an \(80\%\) training, a \(10\%\) validation and a \(10\%\) test part. To further save time, we determined CharNER's optimal parameters on only one split's validation part for only one out of eight entity types (body part). Afterwards, we applied these parameters to the other entity types and splits to produce average test scores. Thus, the parameter settings may not be optimal for all entity types. In practice, the CRF trained in hours, compared to days for the Bi-LSTM. Both models were evaluated using the same 3-fold setup for comparison.
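One way to realize such a reduced cross-validation is sketched below. It is an assumption-laden illustration: the function name is hypothetical, the validation part is simply taken to be the fold following the test fold, and shuffling and stratification are omitted for brevity.

```python
def make_splits(n_items, n_folds=10, used_folds=3):
    """Build (train, validation, test) index lists for the first `used_folds` folds."""
    indices = list(range(n_items))
    fold_size = n_items // n_folds
    splits = []
    for k in range(used_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        # Assumption: the fold after the test fold serves as validation part.
        val_start = ((k + 1) % n_folds) * fold_size
        val = indices[val_start:val_start + fold_size]
        held_out = set(test) | set(val)
        train = [i for i in indices if i not in held_out]
        splits.append((train, val, test))
    return splits


# Toy example with 100 documents: each split is 80/10/10.
splits = make_splits(100)
sizes = [(len(tr), len(va), len(te)) for tr, va, te in splits]
```

Scores would then be averaged over the test parts of the three splits, as described above.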

Results. The results for named entity recognition are shown in Table 3. Even though the classifiers are not necessarily optimal (e.g., no feature engineering), the results are promising. All concepts with a frequency above 800 have an F1-Score above 70. Moreover, all concepts can be detected at a high level of precision. Both classifiers produce comparable results, with a better F1-Score for CharNER and higher precision for the CRF.

3.3 Relation Extraction

Setup. Our relation extraction task considers only entity pairs within the same sentence. While positive relation instances can be directly taken from the annotations, negative relation instances (used for training and testing) are generated by creating new (unseen) combinations between entities. The relation extraction experiment is then carried out within a 5-fold cross-validation using NE gold labels as input.
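The generation of candidate instances within one sentence can be sketched as follows. The sketch treats entity pairs as unordered and ignores entity types, which are simplifying assumptions; the entity strings and the function name are invented for illustration.

```python
from itertools import combinations


def candidate_pairs(entities, positive_pairs):
    """Label every entity pair of one sentence as positive (1) or negative (0)."""
    labeled = []
    for a, b in combinations(entities, 2):
        is_positive = (a, b) in positive_pairs or (b, a) in positive_pairs
        labeled.append((a, b, 1 if is_positive else 0))
    return labeled


# Hypothetical sentence with three entities and one annotated relation.
entities = ["Schmerzen", "Knie", "Ibuprofen"]
positive_pairs = {("Schmerzen", "Knie")}
instances = candidate_pairs(entities, positive_pairs)
```

Because every non-annotated combination becomes a negative instance, sentences with many entities produce far more negatives than positives, which is the class imbalance issue mentioned for SVMs in Sect. 2.2.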

Due to the comparatively small size of the dataset, the hyperparameters of the CNN have been slightly modified compared to Nguyen and Grishman (2015). As before, we used one relation type from one fold to find the optimal parameters and then applied those parameters to the other folds and types. This resulted in a reduced position embedding dimensionality (from 50 to 5) compared to the original model. We also used pre-trained German word embeddings (footnote 6).

Results. The relation extraction results are presented in Table 4. Most relations can be detected at an F1-Score of 80. Only the relation is_located shows a surprisingly low precision, which results in a reduced F1-Score. Overall, the results are very promising and leave room for further improvements using improved classification models.

4 Conclusion and Future Work

This work presented first results in the context of detecting various named entities and their relations in German nephrology reports. For each task two different methods have been tested. Despite the preliminary classification methods (no feature engineering, sub-optimal tuning) and the relatively small size of the training and evaluation data, the results are already very encouraging. Generally, the results indicate that the classification of such information is not too complex. However, a more detailed analysis is necessary to support this assumption.

Future work will focus on increasing the corpus size and on extending and improving our classification models (e.g., a more elaborate hyperparameter search and selection of pre-trained embeddings). These models will then be used for further use cases such as general information access to clinical documents and cohort group generation.