A distributable German clinical corpus containing cardiovascular clinical routine doctor’s letters

Richter-Pechanski, Phillip; Wiesenbach, Philipp; Schwab, Dominic M.; Kiriakou, Christina; He, Mingyang; Allers, Michael M.; Tiefenbacher, Anna S.; Kunz, Nicola; Martynova, Anna; Spiller, Noemie; Mierisch, Julian; Borchert, Florian; Schwind, Charlotte; Frey, Norbert; Dieterich, Christoph; Geis, Nicolas A.

doi:10.1038/s41597-023-02128-9

Data Descriptor
Open access
Published: 14 April 2023

Volume 10, article number 207, (2023)

Scientific Data

Phillip Richter-Pechanski ORCID: orcid.org/0000-0003-0121-373X^1,2,3,4,
Philipp Wiesenbach^1,2,4,
Dominic M. Schwab²,
Christina Kiriakou²,
Mingyang He^1,2,
Michael M. Allers¹,
Anna S. Tiefenbacher¹,
Nicola Kunz¹,
Anna Martynova¹,
Noemie Spiller¹,
Julian Mierisch¹,
Florian Borchert ORCID: orcid.org/0000-0003-1079-6500⁵,
Charlotte Schwind^1,2,
Norbert Frey^2,3,4,
Christoph Dieterich^1,2,3,4 &
…
Nicolas A. Geis^2,4

Abstract

We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor’s letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE, (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.

Background & Summary

Despite sustained declines in cardiovascular disease (CVD) mortality in many countries across Europe, CVDs still account for approx. 4.1 million deaths within European Society of Cardiology (ESC) member countries and have remained the most common cause of death within this region (45 and 39% of all deaths in females and males, respectively). Moreover, the prevalence of CVDs across Europe is still high with an estimated 113 million people living with CVD in the 57 ESC member countries, significantly contributing to patient morbidity and hospitalizations¹.

At the same time, in clinical routine, large portions of data like patient anamnesis, cardiovascular risk factors and diagnosis continue to be stored in unstructured form, such as free text in doctor’s letters². The predominantly hypothesis-driven strategies used in cardiovascular research should be complemented by computer-assisted methods. The comprehensive analyses of large clinical datasets using automatic information extraction methods will not only significantly expand the data sources for clinical care and research, but could also improve clinical decision-making and enable progress in personalized medicine².

The rapid development of natural language processing (NLP) over the past 15 years has provided powerful tools for automatic text processing³. A large number of models based on rule-based, statistical and, more recently, neural network methods have been developed and validated for various tasks. While state-of-the-art (SOTA) supervised machine learning models require annotated data for training, all methods require annotated data for evaluation and quality control.

Therefore, shared corpora in the clinical domain are essential to support transparent and reproducible experiments and foster innovation in the field of clinical NLP^2,4,5. In addition to their use for various medical information extraction tasks, they can be used, for example, for (1) training unsupervised or semi-supervised machine learning models, (2) the development of clustering or topic modelling methods⁶, (3) the development of pre-processing methods such as sentence-splitting, tokenization or part-of-speech tagging⁷, (4) domain adaptation of language models through pre-training techniques⁸, (5) the development of data augmentation techniques to overcome data scarcity⁹ or (6) serving as a reference corpus for collaborative cross-site clinical annotation projects.

Progress in clinical NLP can have a direct impact on clinical routine, as it enables clinicians to extract insights from large amounts of clinical textual data, for example (1) to build clinical decision support systems that can be used to identify drug-drug interactions or adverse drug events or (2) to support clinical trial recruitment by identifying patients based on their clinical anamnesis, diagnosis or demographics.

State of research

Although the availability of textual data from the medical domain has increased, there is still an immense need for high-quality and freely accessible clinical text corpora. Particularly strict data protection regulations are a huge challenge and prevent the publication of even small-scale data sets. Distributable English medical corpora in the U.S. must meet the regulations of the HIPAA (Health Insurance Portability and Accountability Act of 1996; https://www.hipaajournal.com/de-identification-protected-health-information/). Its safe harbor section explicitly lists eighteen Protected Health Information (PHI) identifiers (e.g. person names, dates etc.), which need to be removed from a clinical document for it to count as de-identified (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html, accessed 10.09.2022).

Based on HIPAA, several clinical text corpora in English were recently published. The MIMIC III (Medical Information Mart for Intensive Care) corpus¹⁰ is one of the most popular and largest data sets. MIMIC III contains approximately 2 million free text notes from various clinical domains. In addition, annotated clinical corpora are published in the context of shared tasks¹¹, e.g. by the i2b2 foundation (>805,118 tokens, https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp), CLEF (http://clef-ehealth.org/), SemEval (http://alt.qcri.org/semeval2014/) and THYME¹². The availability of these corpora led to a series of developments of information extraction frameworks in clinical settings¹³.

In contrast to the US HIPAA regulation, European data protection regulations, specifically the GDPR (General Data Protection Regulation), do not explicitly define how to de-identify a clinical document when sharing it (for a more detailed comparison between GDPR and HIPAA, see: https://iapp.org/news/a/gdpr-match-up-the-health-insurance-portability-and-accountability-act/, accessed 10.09.2022). However, non-English European corpora have lately been published under the GDPR, e.g. (1) MERLOT, a French clinical routine corpus containing 500 manually de-identified documents including 148,476 tokens¹⁴ or (2) IULA, a Spanish clinical record corpus, containing de-identified and shuffled sentences extracted from 300 clinical reports¹⁵. Unfortunately, to date there are only a handful of German medical corpora publicly available. This led to the development of German medical corpora containing medical guidelines or synthetically generated clinical texts^11,16,17,18,19. The largest German medical corpus is currently GGPONC 2.0, containing approximately 1.8 million tokens, which builds upon clinical guidelines from the oncology domain¹¹.

The only distributable German clinical routine corpus available contains 200 oncological discharge summaries (89,942 tokens, created between 2013–2016) from the university hospitals Berlin (Charité) and Tübingen²⁰. As these data were collected retrospectively, each document needed to be carefully de-identified by hand. Furthermore, all sentences in the corpus were shuffled in order to meet legal requirements of the clinical data protection officers and the ethics committee. Thus, only sentence-level medical information extraction (MIE) is possible, which requires proper sentence segmentation, itself quite challenging in clinical texts. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German corpus containing coherent documents from the cardiovascular clinical routine.

Goals

The goal of CARDIO:DE is to foster NLP research in the German-speaking cardiovascular domain by publishing a freely available corpus containing cardiovascular doctor’s letters from clinical routine. To achieve this goal, we prospectively collected 500 doctor’s letters from the cardiology department at Heidelberg University Hospital. In addition, we publish two manually annotated layers:

1. for medication information extraction and
2. for section classification.

Furthermore, we present results of various baseline classifiers, trained and evaluated on both annotation layers.

Corpus characteristics

CARDIO:DE encompasses 500 cardiovascular doctor’s letters covering a broad clinical spectrum of a tertiary care cardiovascular centre between 2020 and 2021. Our corpus covers 311 in-patient, 172 out-patient and 17 letters of the cardiac emergency room (chest pain unit; CPU). Thus, the included doctor’s letters cover both complex multiple-day hospitalizations as well as brief out-patient presentations. This results in the deployment of a representative collection of clinical documents, covering common doctor’s letter sections (e.g. anamnesis, physical examination, instrumental diagnostics, laboratory results, epicrisis, medication) in varying degrees and details. Figure 1 illustrates an excerpt of a doctor’s letter including common section types. The complete corpus contains 993,143 tokens, with approximately 31,952 unique tokens.

We randomly split our corpus into two parts (similar to Kittner et al., 2021²⁰). CARDIO:DE400 contains 400 documents, 805,617 tokens and 114,348 annotations. CARDIO:DE100 contains 100 documents, 187,526 tokens and 26,784 annotations. Both corpora will be published for scientific research purposes. Scientific research excludes processing the data for marketing purposes. We will only publish the annotations of CARDIO:DE400; the annotations of CARDIO:DE100 will be kept in-house as held-out data for a shared task on various MIE tasks, which we want to organize in the future.

Table 1 shows a quantitative analysis per CARDIO:DE split. Figure 2 illustrates the quantitative analysis as a box plot for the whole CARDIO:DE corpus. In Table 2 we present the 50 most common whitespace-separated tokens in CARDIO:DE, including token counts.

Table 1 Corpus token statistics.

Table 2 Frequent tokens in CARDIO:DE.

We aim to publish cardiovascular doctor’s letters as close as possible to clinical routine documents. To achieve this, CARDIO:DE is based on a prospective study design with patient consent, which enabled us to keep the original document structure of clinical routine doctor’s letters (Fig. 3). While collecting patient consents can be time-consuming and tedious, this procedure ensures that we comply as well as possible with current data protection regulations in Germany. Moreover, similarly to recent corpus distribution projects^10,20 we preserved the information on patient’s age and time/date in the documents. Thus, the corpus can be used for various information extraction tasks on document level in the cardiovascular domain. In a future version of CARDIO:DE, to further increase corpus consistency, all PHI placeholders will be replaced by semantically related surrogates, as proposed in Lohr et al., 2021²¹.

Methods

Ethics declarations

This study has been approved by the ethics committee of the Heidelberg University Hospital (S-498/2020) and has therefore been performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki. All persons gave their informed consent prior to their inclusion in the study. The manuscript does not contain clinical studies or patient data.

Data selection and collection

This study was designed as a monocentric, non-interventional, non-randomized and prospective study, collecting 500 patient consents between 2020 and 2021 in the cardiology department at the Heidelberg University Hospital. One doctor’s letter per patient was included in the CARDIO:DE corpus. Inclusion criteria were as follows: (1) age of at least 18 years; (2) written consent signed by the patient; and (3) a diagnosis of a cardiovascular system disease. Patients were included after the recruiting study assistant had checked the criteria, provided adequate information and obtained a signed CARDIO:DE consent form. By signing, the patient agreed that the next doctor’s letter generated for them would be included in the corpus. No study-specific additional examinations or further measures were performed within the scope of the project; thus the patient was not exposed to any additional risk.

We then exported each document and converted it from binary MS doc via MS docx to a plain text format (LibreOffice 6.1.5.2, https://wiki.documentfoundation.org/Faq/General/150, accessed: 22.08.2022), keeping the paragraph boundaries consistent with those marked by the “¶” symbol in MS Word. We split each document by paragraph and tokenized each document using spaCy (v.3.2.1, language pipeline: de_dep_news_trf)²².

De-Identification

All documents were initially de-identified using a deep learning model trained on manually annotated in-house data²³. The model was trained on a pre-defined set of PHI classes. In a second step, clinical experts manually reviewed the automatically de-identified documents and replaced any PHI tokens the model had missed with appropriate semantic placeholders. To keep the chronological order in the documents, we followed best-practice procedures¹⁰ by shifting all dates by a random number per document. Information about weekdays, time of day and seasons was kept. We also kept the information about patient age in the documents. If the patient was older than 80 years, we followed best-practice approaches¹⁰, by shifting age by a random number larger than 300.
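The per-document date shift can be illustrated with a minimal sketch. The paper does not state how the random offset was drawn; the version below assumes, purely for illustration, that the offset is a non-zero multiple of seven days (so weekdays are unchanged) and is bounded by a hypothetical `max_weeks` parameter (so seasons stay roughly intact). The function names and the `dd.mm.yyyy` pattern are likewise assumptions.

```python
import random
import re
from datetime import datetime, timedelta

# Assumption: dates appear in the common German format dd.mm.yyyy.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def random_offset_days(rng, max_weeks=8):
    """Draw one non-zero offset per document as a whole number of weeks."""
    weeks = rng.choice([w for w in range(-max_weeks, max_weeks + 1) if w != 0])
    return 7 * weeks


def shift_dates(text, offset_days):
    """Shift every recognisable date in the text by the same number of days."""

    def _shift(match):
        try:
            original = datetime.strptime(match.group(0), "%d.%m.%Y")
        except ValueError:
            # Not a valid calendar date (e.g. 31.02.2021): leave it untouched.
            return match.group(0)
        return (original + timedelta(days=offset_days)).strftime("%d.%m.%Y")

    return DATE_PATTERN.sub(_shift, text)


rng = random.Random(42)
offset = random_offset_days(rng)
letter = "Aufnahme am 03.05.2021, Entlassung am 07.05.2021."
shifted_letter = shift_dates(letter, offset)
```

Because the same offset is applied to every date in a letter, intervals between events (here, the four days between admission and discharge) are preserved.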

We added three initial lines to each document containing pseudonymized meta information about: (1) admission date, (2) date of birth and (3) patient age. To further ensure anonymity, we removed outliers in laboratory values of each document, including patient height and weight. To identify outliers, we used a z-score approach²⁴. Finally, we stored each doctor’s letter in our clinical data storage to save the corpus for further data preparation and annotation.
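As a rough illustration of a z-score outlier check, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The threshold of 3.0 and the choice of reference population (here, one list of body weights) are assumptions; the paper only states that a z-score approach²⁴ was used.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]


# Hypothetical body weights in kg; the last value is implausibly high.
weights = [68, 70, 72, 75, 71, 69, 74, 73, 70, 72] * 2 + [250]
outlier_indices = zscore_outliers(weights)
```

Note that in small samples a single extreme value inflates the standard deviation, so a fixed threshold can miss it; robust variants based on the median are a common alternative.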

Data annotation

We created two annotation layers to CARDIO:DE: (1) medication information on token-level and (2) information of section type on paragraph level. We used the annotation tool INCEpTION (v. 22.3) optimized for span annotations, including a monitoring and curation tool²⁵. INCEpTION was installed as a web service in the clinical network to facilitate its use and data access for our annotation task force inside the clinical infrastructure.

Annotation workflow

We used well-established annotation methods^26,27,28,29, including a guideline adaptation process by redundantly annotating documents involving an inter-annotator agreement score (IAA) in an iterative approach (Fig. 4).

After drafting initial guidelines with domain experts, a subset of documents was sampled from the main corpus for redundant annotation by all annotators. After each iteration the annotation master reviewed all annotations and documented all disagreements. To measure annotation quality an IAA score was calculated. During the following review meeting with all annotating participants including the annotation master all disagreements were discussed and a joint solution was defined. If necessary, the guidelines were adapted and a new iteration round was initialized until a pre-defined IAA threshold was met, depending on the annotation task. The annotation master curated all documents of each iteration based on the adapted guidelines. After redundant annotation was completed each annotator was assigned to a distinct subset of the remaining documents. To ensure high annotation quality nevertheless, the annotation master carefully reviewed all single annotated documents in compliance with the final guidelines.

Medication information annotation

Our medication information annotation scheme is based on Uzuner et al., 2010³⁰, and was adapted to the specific structure of our CARDIO:DE data (guidelines available as Supplementary File 1). As we performed a named entity recognition (NER) task, in which many tokens are not assigned a class type, we used the F1-score (harmonic mean of precision and recall) as IAA. We performed three iterations of redundant annotation, containing 15, 15 and 10 documents, with six annotators: four medical informatics master’s students, three of them with clinical experience, and two medical students in their seventh semester (third clinical semester) with clinical routine experience. The entire project lifetime, including preparation, annotation and evaluation, was three months. Approximate annotation time per document was 5–10 min.
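A minimal sketch of a token-level F1-score used as an agreement measure between annotators follows. It treats one annotator as reference and the other as candidate, ignores tokens both leave unannotated, and takes the median over all annotator pairs. The label `O` for unannotated tokens, the function names and the toy annotations are assumptions for illustration, not the authors' evaluation code.

```python
from itertools import combinations
from statistics import median


def token_f1(reference, candidate, outside="O"):
    """Micro-averaged token-level F1 between two label sequences."""
    if len(reference) != len(candidate):
        raise ValueError("both annotators must label the same tokens")
    true_pos = sum(1 for r, c in zip(reference, candidate) if r == c and r != outside)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(1 for c in candidate if c != outside)
    recall = true_pos / sum(1 for r in reference if r != outside)
    return 2 * precision * recall / (precision + recall)


def median_pairwise_f1(annotations):
    """Median F1 over all unordered pairs of annotators."""
    pairs = combinations(sorted(annotations), 2)
    return median(token_f1(annotations[a], annotations[b]) for a, b in pairs)


# Toy example for the tokens "Torasemid 5 mg 1-1-0".
annotations = {
    "A": ["ActiveIng", "Strength", "Strength", "Frequency"],
    "B": ["ActiveIng", "Dosage", "Dosage", "Frequency"],
    "C": ["ActiveIng", "Strength", "Strength", "O"],
}
agreement = median_pairwise_f1(annotations)
```

Swapping reference and candidate exchanges precision and recall, so the F1-score is symmetric and each pair only needs to be scored once.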

Most of the medication information in a doctor’s letter is listed in a separate semi-structured section (e.g. Therapieempfehlung, Medikation bei Aufnahme, …). In addition, we annotated medication information in narrative text sections. For all annotated medications in a doctor’s letter, the patient had to be the experiencer. We neither made any assumptions nor considered longitudinal information from external sources about a patient.

Our annotation objective was to identify a relevant drug (Drug) or active ingredient (ActiveIng) and its relation information (Dosage, Route, Frequency, Duration, Strength, Reason and Form). Moreover, we added a binary attribute (inNarrative) to each Drug/ActiveIng, to mark whether the medication information is in a semi-structured or in a plain text section (Fig. 5).

In this initial corpus version, we did not add entity normalization to the medication information layer.

Section type annotation

Our section type annotation scheme is based on Lohr et al., 2018³¹, but is more coarse-grained and carried out on paragraph-level (guidelines available as Supplementary File 2). To measure the quality of annotations we calculated an IAA using Krippendorff’s alpha. Krippendorff’s alpha is a chance corrected inter-annotator agreement score and can be used for any number of annotators and class labels³². We performed three iterations for redundant annotation with three annotators (two clinical data scientists researching on clinical routine documents at the cardiology department, one research student assistant studying computational linguistics (B.A.) in sixth semester) containing 35, 30 and 20 documents. The project lifetime, including preparation, annotation and evaluation, was two months. Approximate annotation time per document was 3–8 min.
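For readers unfamiliar with the measure, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from the coincidence matrix. It is an illustration under the standard nominal definition, not the implementation used for CARDIO:DE; the function name and input layout (one list of labels per paragraph, `None` for a missing annotation) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` holds one list per annotated unit (e.g. a paragraph) with the
    labels given by the annotators; None marks a missing annotation.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # units with fewer than two labels carry no agreement information
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used
    return 1 - observed / expected


# Two annotators, ten units, binary labels.
first = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
second = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(first, second)])
```

In this toy example the two annotators agree on six of ten units, yet alpha is only about 0.095, because agreement on the dominant label is largely what chance alone would produce.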

We annotated fourteen section types. Nine section types are mapped to HL7 CDA elements (Arztbrief Plus, v. 3.15, https://wiki.hl7.de/index.php?title=IG:Arztbrief_Plus, accessed 06.10.2022). Sections related to diagnosis are not mapped to CDA elements. The CDA standard separates diagnosis sections into Entlassungsdiagnose (discharge diagnosis) and Aufnahmediagnose (admission diagnosis). Neither of them is explicitly part of doctor’s letters in CARDIO:DE. After consultation with cardiologists, we decided to use the most representative heading names in the original doctor’s letters as section class labels. There are typically two section headings related to diagnosis: (1) AktuellDiagnosen: This section contains discharge diagnosis information and is part of most of the letters. (2) Diagnosen: This section type contains admission or discharge diagnosis information. In the original documents in MS Word format, important diagnosis information is commonly set in bold type. This formatting is not preserved in the CARDIO:DE documents. After consultation with physicians, in addition to the CDA section type Befunde, we annotated the section types KUBefunde and EchoBefunde. Both appear frequently in CARDIO:DE letters and are considered relevant for cardiovascular clinical routine and research. Sections and paragraphs which cannot be mapped to one of the thirteen section types listed in Table 3 are annotated with the generic section type Mix.

Table 3 Section types in CARDIO:DE.

CARDIO:DE contains high-quality gold standard annotations for medication information and CDA-compliant section types. Our IAA scores are comparable to those of previously published, similar annotation projects: for medication information in the i2b2 corpus, token-level F1 was 81.6–88%³³; for section types, the median Krippendorff’s alpha was 84.6 for seven classes and 70–84.4 for 11–21 classes³¹. Please note that these IAAs are not completely comparable, due to different pre-processing steps and measurement procedures. Since the corpus is freely available for research purposes, excluding marketing purposes, we encourage the community to improve existing annotations and add new annotation layers to CARDIO:DE. Indeed, we think of CARDIO:DE as a facilitator and driver of collaborative NLP research in the German-speaking cardiovascular community. To support this goal, we also plan to organize shared tasks for various clinical NLP topics, such as concept normalization or negation detection. We will also be releasing new and updated annotation layers.

Data Records

We publish CARDIO:DE for scientific research purposes and follow best-practice approaches of recently published clinical data sets^10,20. Scientific research excludes processing the data for marketing purposes. The corpus contains detailed clinical care data of patients. In this context, CARDIO:DE needs to be used with care and respect¹⁰. We distribute the data via heiDATA (https://heidata.uni-heidelberg.de/). The corpus must be formally requested following instructions on the CARDIO:DE website in the Terms of Use section (https://doi.org/10.11588/data/AFYQDY)³⁴. This procedure also applies to the CARDIO:DE_EXP corpus, which includes additional experimental annotation layers (for details, see the Usage Notes section)³⁴.

CARDIO:DE (cardiode.zip) contains 500 doctor’s letters in plain text and in tsv3 format (WebAnno TSV version 3.3). The corpus is published as a zip file, containing all data files distributed in two folders (CARDIODE400_main, CARDIODE100_heldout). The tsv3 files of the CARDIO:DE400 split contain the untokenized and the tokenized format including annotations for the medication information and the section information layers. The tsv3 files of the distributed CARDIO:DE100 split only contain the untokenized and the tokenized format of each letter (Fig. 6).
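To give an impression of how the tsv3 files might be read, here is a minimal sketch of a WebAnno TSV 3 reader. It relies only on the general layout of that format (header and sentence lines start with `#`, token lines are tab-separated with a sentence-token id, character offsets, the token and one column per annotation feature, and `_` marks an empty value). The layer name in the sample header and the column order are assumptions; the actual layers are declared in the header of each distributed file.

```python
import re

# Hypothetical sample; the layer name and its single feature are assumptions.
SAMPLE = (
    "#FORMAT=WebAnno TSV 3.3\n"
    "#T_SP=webanno.custom.Medication|value\n"
    "\n"
    "\n"
    "#Text=Torasemid 5 mg\n"
    "1-1\t0-9\tTorasemid\tActiveIng\t\n"
    "1-2\t10-11\t5\tStrength[1]\t\n"
    "1-3\t12-14\tmg\tStrength[1]\t\n"
)

# Multi-token annotations carry a disambiguation suffix such as "[1]".
DISAMBIGUATION = re.compile(r"\[\d+\]$")


def read_webanno_tsv(lines):
    """Collect token rows from WebAnno TSV 3 lines, skipping headers and text lines."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        while columns and columns[-1] == "":
            columns.pop()  # token lines usually end with a trailing tab
        begin, end = (int(offset) for offset in columns[1].split("-"))
        layers = [DISAMBIGUATION.sub("", value) for value in columns[3:]]
        tokens.append(
            {"id": columns[0], "begin": begin, "end": end, "text": columns[2], "layers": layers}
        )
    return tokens


tokens = read_webanno_tsv(SAMPLE.splitlines())
```

Dropping the disambiguation suffix is convenient for token-level statistics but loses the information about which tokens form one entity; keep the raw values if entity spans are needed.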

CARDIO:DE_EXP (cardiode_exp.zip) is published with the same folder and file structure as CARDIO:DE. In addition to all annotations of CARDIO:DE, it contains three experimental medication information entity annotations: Reason, Route and Dosage.

Technical Validation

Annotation quality

Medication information

Figure 7 shows the token-level median IAA scores of all annotator combinations per iteration and per medication information class. For detailed IAA scores including standard deviation, see Table 4.

Table 4 IAA Medication information per class.

IAA improved consistently over all iterations for three classes (Drug, Form and Frequency). For the classes Duration, Strength and Reason, IAA increased in the second iteration and decreased slightly in the third. The complex Reason class still achieved a relatively low IAA (0.41) in iteration 3. For the classes Route and ActiveIng, IAA decreased continuously over all iterations. Standard deviation decreased in iteration 3 for all classes but Dosage, which showed the overall lowest IAA scores with a maximum of 0.33 in iteration 3.

Considering the median micro-average F1-score, the token-wise F1-score improved from 0.85 to 0.89 in the second iteration, but fell back to 0.85 in the third. The entity-wise median F1-score decreased from 0.84 to 0.79 in the second iteration, but increased to 0.81 in the third, which is 3 percentage points below the IAA of the first iteration (Table 5).

Table 5 IAA Medication information.

In addition to the IAA for token-level medication information, we calculated micro-average F1-scores to measure the annotation quality of the medication information relations. We could improve IAA in the second iteration by 0.25. In the last iteration IAA decreased to 0.61 (Table 5). As also reported by other publications, the IAA scores for relation annotation were generally lower than for entity annotations^14,28.

Overall, the synchronization of the annotation for medication information was very challenging. While annotation quality of medication information in the structured section of doctor’s letters quickly improved, annotation of more complex medication information samples in plain text remained difficult. Due to time restrictions, we stopped redundant annotations after the third iteration. Moreover, although some IAA scores could not be increased or even decreased, IAA scores of the most frequent classes (Drug, ActiveIng, Strength, Form, Frequency, Duration) achieved a sufficient quality (0.75–0.98).

One reason for the challenging synchronization might be rooted in the different educational backgrounds of the annotators. Generally, medical students and medical informatics students with experience in clinical routine shared higher IAA scores. This was apparent for the Reason class, as this information demanded more profound clinical knowledge. The Dosage class achieved the lowest IAA scores, as Dosage entities were easily confused with Strength entities (e.g. max Zufuhr von 4 IE Insulin/h: 4 IE repeatedly annotated as Dosage; Torasemid 5 mg 1-1-0: 5 mg repeatedly annotated as Dosage). In addition, Strength was occasionally annotated as Dosage in the structured medication sections. Overall, Dosage entities are rare in the corpus; thus, these results are not fully representative.

At the end of our annotation iterations, we had to face another challenge. To increase heterogeneity of the doctor’s letters in each iteration, we did not just randomly select letters from the complete corpus but restricted the sampling for each iteration to a specific time period of patient recruitment. For example, iteration 1 contained the first ten letters from the beginning of the project’s recruiting phase, while iteration 2 contained letters from the middle of this phase. This resulted in a bias, as the patient recruitment was sometimes dominated by in-patients, out-patients or patients from the CPU. The structure of these letters can vary significantly; therefore, the medication information in each iteration batch varied in its notational form. For future annotation projects, we recommend a more balanced distribution of such letters in each iteration in order to improve agreement between different iterations.

Due to the unsatisfactory IAA scores of Reason, Route and Dosage, we excluded these classes from CARDIO:DE. Still, to allow further improvement of annotation quality and further experiments with these classes, we distribute an experimental corpus named CARDIO:DE_EXP including all nine medication information classes, medication relations and section classes³⁴.

The final CARDIO:DE corpus contains in total 24,234 annotated medication information entities (Table 6) with 15,105 medication relations. The most frequent medication information classes are related to ActiveIng/Drug and their relations Frequency and Strength. Corpus statistics of CARDIO:DE_EXP are shown in Table 7. Figure 8 illustrates the most frequently annotated Drug entities in CARDIO:DE.

Table 6 Medication information statistics in CARDIO:DE.

Table 7 Medication information statistics in CARDIO:DE_EXP.

Section annotation

We could increase IAA from a Krippendorff’s alpha score of 0.91 to 0.96 (Table 8).

Table 8 IAA section types.

The final corpus contains in total 116,898 annotated paragraphs with section classes. The most frequent section classes were Labor and Befunde. Befunde is a meta class, containing all kinds of findings, excluding KUBefunde and EchoBefunde. Labor contains laboratory information in a flattened tabular format. The fewest annotations belong to the class Anrede. This typically comprises a single introductory sentence at the top of a doctor’s letter, containing information on the patient and the receiving department (Table 9).

Table 9 Section type statistics in CARDIO:DE.

During redundant section annotation, we faced a decrease in IAA in iteration 2. This was mainly due to a major change in the annotation scheme. Initially we annotated all lines of a section type with a span annotation. This resulted in slow performance of the INCEpTION tool, which made annotations unnecessarily laborious. The annotators decided to continue annotating just the first paragraph of a new section type. A new section type defines the end of the previous section type. The decision to only annotate the first paragraph of a new section type increased annotation speed significantly.
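The convention that a new section label ends the previous section can be turned back into one label per paragraph with a simple forward fill. The sketch below assumes that unlabelled paragraphs are represented as `None` and that paragraphs before the first label fall back to a hypothetical default; neither detail is specified in the paper.

```python
def propagate_section_labels(paragraph_labels, default="Mix"):
    """Assign every paragraph the label of the most recent section start.

    `paragraph_labels` holds a section label for the first paragraph of each
    section and None for all following paragraphs of that section.
    """
    filled = []
    current = default  # assumption: used for paragraphs before the first label
    for label in paragraph_labels:
        if label is not None:
            current = label
        filled.append(current)
    return filled


first_paragraph_only = ["Anrede", None, "Diagnosen", None, None, "EchoBefunde", None]
per_paragraph = propagate_section_labels(first_paragraph_only)
```

One consequence of this scheme is that a missed section start is not visible locally: the previous section is silently extended over the following paragraphs.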

In iteration 2, the annotators missed some section ends because not every line had to be annotated anymore, especially with regard to section classes inside the Befunde class. Annotators frequently failed to mark the end of an EchoBefunde or AllergienUnverträglichkeitenRisiken section, and as a result these sections covered too many paragraphs. After thorough review meetings, we were able to overcome this problem in iteration 3.

Other issues concerned how to define the section type EntlassMedikation. Some doctor’s letters contained two explicit medication sections: section type AufnahmeMedikation at the beginning and section type EntlassMedikation at the end. EntlassMedikation was frequently introduced by the header ‘Aktuelle Medikation’. However, a couple of letters only contained a single medication section at the beginning of the document. Some annotators therefore interpreted these sections as AufnahmeMedikation, even though they were introduced by the header ‘Aktuelle Medikation’. After review meetings and consultations with physicians, we defined this initial medication section with this header as EntlassMedikation in the guidelines.

Usage Notes

Data access

CARDIO:DE (and CARDIO:DE_EXP) must be formally requested via heiDATA following three steps:³⁴

(1) Sending a data request mail to heiDATA (data@uni-heidelberg.de) including a signed DUA form, addressing information about correct data usage and security standards. Under this licence, it is strictly prohibited to identify individuals or to try to contact them or advertise to them.
(2) Including a group description and a project description (for details, see below).
(3) After a positive decision by the CARDIO:DE study director Christoph Dieterich (CARDIO:DE supervisor), the data requestor will receive detailed instructions on how to download the corpus via heiDATA.

Data approval will require at least one week. The data request needs to contain the following information:

(1) A signed DUA, signed by each data user individually.
(2) A group description including the requestor’s (data user’s) name, affiliation, position and email address and website of the institution.
(3) A project description of the research purpose (max. 150 words).
(4) Name, affiliation, position, email address and signature of the responsible person to administer and manage the infrastructure on which CARDIO:DE will be stored.

The data request will be approved if it contains all requested information and is in line with the data usage agreement (DUA). It will be rejected if it contains incomplete or incorrect information or if it violates the regulations of the DUA.

Baseline classifier

We use baseline classifiers to demonstrate what current well-established and freely available machine learning models can achieve out of the box on CARDIO:DE annotations. We trained our classifiers on the annotations of CARDIO:DE400 and assessed their performance on the annotations of CARDIO:DE100. For these models, we performed neither hyperparameter tuning nor architecture optimization. The results published here are intended to give a first impression of possible applications and of how to train MIE models on CARDIO:DE annotations for both tasks. In a future shared task, researchers are invited to improve these baseline results.

Medication information extraction

As an example use case, we evaluated a statistical and a neural machine learning model for medication information extraction. Our statistical model is based on a well-established conditional random field (CRF)³⁵. The CRF was trained on basic linguistic features, including POS tags and context tokens. We used features proposed by Mikhail Korobov (see https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#features); for further details, see Figure S1 in Supplementary File 3. Our neural model is based on a well-documented Hugging Face BERT language model for NER, pre-trained on different publicly available German language corpora (deepset/gbert-base)^36,37.
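To illustrate what token features built from POS tags and context tokens can look like, here is a minimal sketch in the spirit of the sklearn-crfsuite tutorial. The exact feature set used for the baseline is given in Figure S1 of Supplementary File 3; the feature names, the example tokens and the POS tags below are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple, Union

Feature = Union[str, bool, float]
TaggedToken = Tuple[str, str]  # (token, POS tag)


def token2features(sent: List[TaggedToken], i: int) -> Dict[str, Feature]:
    """Build a feature dict for token i of a sentence of (token, POS) pairs."""
    word, pos = sent[i]
    features: Dict[str, Feature] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos,
    }
    # Context tokens: the direct neighbours to the left and right, if any.
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        features["-1:word.lower"] = prev_word.lower()
        features["-1:pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        features["+1:word.lower"] = next_word.lower()
        features["+1:pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent: List[TaggedToken]) -> List[Dict[str, Feature]]:
    """Feature dicts for every token of one sentence."""
    return [token2features(sent, i) for i in range(len(sent))]


# Hypothetical snippet with illustrative POS tags (not taken from the corpus).
sent = [("Ramipril", "NN"), ("5", "CARD"), ("mg", "NN"), ("1-0-0", "CARD")]
feats = sent2features(sent)
print(feats[1])
```

A list of such per-sentence feature lists, paired with per-token label lists, is the input format that the `CRF.fit` method of sklearn-crfsuite accepts.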

In addition, we evaluated the freely available German GGPONC NER classifier (04_ggponc_fine_long), trained on SNOMED CT (https://confluence.ihtsdotools.org/display/DOCSTART/6.+SNOMED+CT+Concept+Model, accessed 08.09.2022) annotations in GGPONC 2.0¹¹. The fine-grained scheme includes the entity type Clinical Drug. For a detailed description of this class, see the GGPONC 2.0 guidelines (https://github.com/hpi-dhc/ggponc_annotation/blob/master/annotation_guide/anno_guide.pdf, accessed 08.09.2022). Clinical Drug describes a pharmaceutical product produced for diagnostic or therapeutic purposes. Because these guidelines do not exactly match our definition of Drug/ActiveIng, we applied two mappings during evaluation: (1) mapping Clinical Drug to our Drug/ActiveIng (short) and (2) mapping it to Drug/ActiveIng/Frequency/Strength (long).

The task was designed as an entity recognition task. Its objective is to label input tokens with one of six medication information classes (ActiveIng, Drug, Duration, Form, Frequency, Strength). Medication information can consist of a single token or a sequence of tokens. Figure 9 shows a tokenized input snippet containing 20 tokens and their assigned medication information classes. We evaluated this task using the F1-score (the harmonic mean of precision and recall) per class and the micro-average F1-score per classifier (Table 10). We evaluated token-wise and without an IOB scheme (short for inside, outside, beginning notation; see https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging), accessed 08.09.2022). Therefore, we removed all “B-” and “I-” substrings from the labels before calculating precision and recall. Token-wise results for precision, recall and F1-score of the GGPONC NER classifier are listed in Table 11.
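The token-wise evaluation described above can be sketched as follows. This is an illustrative reimplementation, not the evaluation script used for Table 10; it assumes that tokens without medication information carry the label "O" and that this label is excluded from the micro-average. The function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F1)


def strip_iob(label: str) -> str:
    """Remove a leading 'B-' or 'I-' so that only the class name remains."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


def prf(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def token_scores(
    gold: List[str], pred: List[str], outside: str = "O"
) -> Tuple[Dict[str, Scores], Scores]:
    """Per-class and micro-average token-wise scores, ignoring IOB prefixes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(map(strip_iob, gold), map(strip_iob, pred)):
        if g == p:
            if g != outside:
                tp[g] += 1
        else:
            if p != outside:
                fp[p] += 1  # predicted class p where it does not belong
            if g != outside:
                fn[g] += 1  # missed a token of gold class g
    classes = sorted(set(tp) | set(fp) | set(fn))
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro


# Hypothetical five-token example (not taken from the corpus).
gold = ["B-Drug", "B-Strength", "I-Strength", "B-Frequency", "O"]
pred = ["B-Drug", "B-Strength", "O", "B-Frequency", "B-Form"]
per_class, micro = token_scores(gold, pred)
print(per_class["Strength"], micro)
```

Because the prefixes are stripped first, a prediction of "I-Strength" for a gold "B-Strength" token counts as correct, which is what distinguishes this token-wise view from an entity-wise evaluation.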

Table 10 Token-wise precision (Pr), recall (Re) and F1-score (F1) results for medication information extraction per class and per model, including entity-wise F1-score in brackets and the micro average F1-score in the last row.

Table 11 Token-wise precision, recall and F1-scores on CARDIO:DE100 for Clinical Drug class of GGPONC NER model.

In terms of both token-wise and entity-wise micro-average F1-score, BERT scores slightly higher than the CRF. In terms of token-wise F1-score, BERT outperforms the CRF on all classes, including the low-frequency class Form, which has the lowest F1-score overall for both models.

The GGPONC NER results show the highest F1-score for the long mapping (0.81), along with balanced precision and recall. The short mapping yields a much lower F1-score (0.21), with precision (0.13) far below recall (0.67). For further evaluation results, see Supplementary File 4.

Given the different domains (oncology vs. cardiology) and document types (guidelines vs. doctor’s letters), the cross-domain GGPONC NER baseline showed impressive results on the CARDIO:DE100 corpus split. The short mapping achieved a recall of 0.67, but a precision of only 0.13. This was mainly due to difficulties in mapping our guideline-based medication information classes to the more generic GGPONC SNOMED CT class Clinical Drug. We frequently observed that GGPONC NER annotates token sequences such as 20 mg (Strength) or 1-0-0 (Frequency) as Clinical Drug, resulting in a large number of false positives. Accordingly, our long mapping, which maps Clinical Drug to the four classes Drug/ActiveIng/Frequency/Strength, achieved both high precision (0.81) and high recall (0.80). These results show the potential of cross-domain models for German MIE. We leave it to future work to systematically evaluate the performance of publicly available German MIE models on already distributable German medical corpora. For a detailed analysis of GGPONC NER, see Supplementary File 4.
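The effect of the two mappings on precision can be reproduced on a toy example. The sketch below collapses gold labels to a binary Clinical Drug decision under either mapping and compares them with a model that tags the strength and frequency tokens as Clinical Drug. The labels are hypothetical and chosen only to mirror the error pattern described above; they are not corpus data, and the helper names are assumptions.

```python
from typing import Iterable, List, Set, Tuple

SHORT_MAPPING: Set[str] = {"Drug", "ActiveIng"}
LONG_MAPPING: Set[str] = {"Drug", "ActiveIng", "Frequency", "Strength"}
TARGET = "Clinical Drug"


def to_clinical_drug(gold: Iterable[str], mapped: Set[str]) -> List[str]:
    """Collapse gold labels to Clinical Drug vs. O under a given mapping."""
    return [TARGET if label in mapped else "O" for label in gold]


def precision_recall(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Token-wise precision and recall for the Clinical Drug class."""
    tp = sum(g == TARGET and p == TARGET for g, p in zip(gold, pred))
    fp = sum(g != TARGET and p == TARGET for g, p in zip(gold, pred))
    fn = sum(g == TARGET and p != TARGET for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical tokens: "Ramipril", "20", "mg", "1-0-0", "weiter"
gold = ["Drug", "Strength", "Strength", "Frequency", "O"]
# A model that also tags strength and frequency tokens as Clinical Drug.
pred = [TARGET, TARGET, TARGET, TARGET, "O"]

short_scores = precision_recall(to_clinical_drug(gold, SHORT_MAPPING), pred)
long_scores = precision_recall(to_clinical_drug(gold, LONG_MAPPING), pred)
print("short:", short_scores)
print("long:", long_scores)
```

In this toy case the recall is identical under both mappings, while the strength and frequency tokens count as false positives only under the short mapping, which lowers its precision.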

Current SOTA results for medication information extraction on English datasets reach an F1-score of up to 0.95³. Bearing in mind the differences in datasets and hyperparameters, a comparable German model for recognizing the combined class ActiveIng/Drug (Clinical Drug) in the GGPONC 2.0 corpus achieves an F-score of 0.91 and thus outperforms our more fine-grained baseline, which distinguishes between Drug (0.81) and ActiveIng (0.86).

Section classification

As in the medication information extraction task, we evaluated a statistical and a neural model for section classification. For the statistical model we opted for a support vector machine (SVM)³⁸. Our neural model is based on a well-documented Hugging Face BERT language model for sequence classification, pre-trained on different publicly available German language corpora (deepset/gbert-base)^36,37. The objective of this task was to assign one of fourteen section types (Anrede, AktuellDiagnosen, Diagnosen, AllergienUnverträglichkeitenRisiken, Anamnese, AufnahmeMedikation, KUBefunde, Befunde, EchoBefunde, Labor, Zusammenfassung, Mix, EntlassMedikation, Abschluss) to each input sample. An input sample consists of a paragraph of text (a paragraph is defined by the MS Word “¶” character) extracted from a doctor’s letter with no further context information. Figure 10 shows a tokenized example of an input sample, containing 48 tokens assigned to the Anrede class. We evaluated this task using the F1-score per class and the macro-average F1-score per classifier (Table 12).
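Unlike the micro-average, the macro-average F1-score weights every section class equally, so rare classes influence the result as much as frequent ones. A minimal sketch of the computation follows; the labels are hypothetical (not corpus data), and averaging over all classes that occur in either the gold or the predicted labels is an assumption.

```python
from typing import Dict, List, Tuple


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1-score for every class seen in the gold or predicted labels."""
    scores: Dict[str, float] = {}
    for cls in sorted(set(gold) | set(pred)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> Tuple[Dict[str, float], float]:
    """Per-class F1-scores and their unweighted mean."""
    scores = per_class_f1(gold, pred)
    return scores, sum(scores.values()) / len(scores)


# Hypothetical paragraph labels for one letter.
gold = ["Anrede", "Befunde", "Befunde",
        "AufnahmeMedikation", "EntlassMedikation", "EntlassMedikation"]
pred = ["Anrede", "Befunde", "Befunde",
        "EntlassMedikation", "EntlassMedikation", "AufnahmeMedikation"]
scores, macro = macro_f1(gold, pred)
print(scores, macro)
```

In this example, confusing the two medication section types lowers the macro-average considerably even though four of six paragraphs are classified correctly.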

Table 12 Precision (Pr), recall (Re) and F1-score (F1) results per class and macro-average F1-score per model for section classification.

In terms of macro-average F1-score, the BERT model outperforms the SVM baseline by 0.02. Per class, BERT achieves a better F1-score in nine section classes, both models achieve the same score in four, and the SVM achieves a higher F1-score for the class EntlassMedikation.

The worst-performing classes of both models are the medication sections. We observe a very low recall for AufnahmeMedikation for the SVM and for EntlassMedikation for the BERT model. The SVM frequently classifies AufnahmeMedikation instances as EntlassMedikation. The BERT model, on the other hand, frequently misclassifies EntlassMedikation instances as AufnahmeMedikation, but to a lesser extent (for the confusion matrices of both models, see Figure S2 and Figure S3 in Supplementary File 3). Because this task was performed as a simple text classification task without further context information, these errors cannot easily be avoided other than by merging these class types or by adding context information to each input sample.

Allowing for differences in datasets, language and annotation guidelines, the final accuracy scores of our SVM and BERT baseline classifiers are comparable to previously published section classification results³¹.

Again, our two showcases on medical information extraction and section classification serve to demonstrate the great potential of this first German corpus from the cardiovascular domain. In future work, more sophisticated document segmentation methods need to be applied^39,40.

Code availability

For information on software packages and versions used for pre-processing, see methods section. No additional custom code was implemented for CARDIO:DE.

References

1. Timmis, A. et al. European Society of Cardiology: cardiovascular disease statistics 2021. Eur Heart J 43, 716–799 (2022).
2. Starlinger, J., Kittner, M., Blankenstein, O. & Leser, U. How to improve information extraction from German medical records. it - Information Technology 59, 171–179 (2017).
3. Hahn, U. & Oleynik, M. Medical Information Extraction in the Age of Deep Learning. Yearb Med Inform 2020, 208–228 (2020).
4. Chapman, W. W. et al. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. JAMIA 18, 540–543 (2011).
5. Lentzen, M. et al. Critical assessment of transformer-based AI models for German clinical notes. JAMIA Open 5, 1–10 (2022).
6. Nagamine, T. et al. Multiscale classification of heart failure phenotypes by unsupervised clustering of unstructured electronic medical record data. Sci Rep 10 (2020).
7. Hellrich, J., Matthies, F., Faessler, E. & Hahn, U. Sharing Models and Tools for Processing German Clinical Texts. Stud Health Technol Inform 210, 734–738 (2015).
8. Lange, L., Adel, H., Strötgen, J. & Klakow, D. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics 38, 3267–3274 (2022).
9. Shorten, C., Khoshgoftaar, T. M. & Furht, B. Text Data Augmentation for Deep Learning. J Big Data 8, 1–34 (2021).
10. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
11. Borchert, F. et al. GGPONC 2.0-The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. in Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, 3650–3660. https://doi.org/10.18653/v1/2020.louhi-1.5 (2022).
12. Styler, W. F. et al. Temporal Annotation in the Clinical Domain. Trans Assoc Comput Linguist 2, 143–154 (2014).
13. Wu, S. et al. Deep learning in clinical natural language processing: a methodical review. JAMIA 27, 457–470 (2020).
14. Campillos, L. et al. A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Lang Resour Eval 52, 571–601 (2018).
15. Marimon, M., Vivaldi, J. & Bel, N. Annotation of negation in the IULA Spanish Clinical Record Corpus. in Proceedings of the Workshop Computational Semantics Beyond Events and Roles, 43–52, https://doi.org/10.18653/v1/W17-1807 (ACL, 2017).
16. Borchert, F. et al. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 38–48, https://doi.org/10.18653/v1/2020.louhi-1.5 (2020).
17. Modersohn, L., Schulz, S., Lohr, C. & Hahn, U. GRASCCO — The First Publicly Shareable, Multiply-Alienated German Clinical Text Corpus. German Medical Data Sciences 2022 – Future Medicine: More Precise, More Integrative, More Sustainable!, 66–72, https://doi.org/10.3233/SHTI220805 (2022).
18. Lohr, C., Buechel, S. & Hahn, U. Sharing Copies of Synthetic Clinical Corpora without Physical Distribution — A Case Study to Get Around IPRs and Privacy Constraints Featuring the German JSYNCC Corpus. in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).
19. Frei, J. & Kramer, F. GERNERMED: An open German medical NER model. Software Impacts 11, 100212 (2022).
20. Kittner, M. et al. Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open 4, 1–9 (2021).
21. Lohr, C., Eder, E. & Hahn, U. Pseudonymization of PHI Items in German Clinical Reports. Public Health and Informatics: Proceedings of MIE 2021, 273–277, https://doi.org/10.3233/SHTI210163 (2021).
22. Honnibal, M. et al. explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes. Zenodo https://doi.org/10.5281/zenodo.5764736 (2019).
23. Richter-Pechanski, P., Amr, A., Katus, H. A. & Dieterich, C. Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports. Stud Health Technol Inform 267, 101–109 (2019).
24. Rousseeuw, P. J. & Hubert, M. Robust statistics for outlier detection. Wiley Interdiscip Rev Data Min Knowl Discov 1, 73–79 (2011).
25. Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R. E. de & Gurevych, I. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 5–9 (2018).
26. Gurulingappa, H. et al. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J Biomed Inform 45, 885–892 (2012).
27. Lohr, C., Modersohn, L., Hellrich, J., Kolditz, T. & Hahn, U. An evolutionary approach to the annotation of discharge summaries. Stud Health Technol Inform 270, 28–32 (2020).
28. Roberts, A. et al. Building a semantically annotated corpus of clinical texts. J Biomed Inform 42, 950–966 (2009).
29. Wilbur, W. J., Rzhetsky, A. & Shatkay, H. New directions in biomedical text annotation: Definitions, guidelines and corpus construction. BMC Bioinformatics 7, 356 (2006).
30. Uzuner, O., Solti, I. & Cadag, E. Extracting medication information from clinical text. J Am Med Inform Assoc 17, 514–518 (2010).
31. Lohr, C. et al. CDA-Compliant Section Annotation of German-Language Discharge Summaries: Guideline Development, Annotation Campaign, Section Classification. AMIA Annu Symp Proc 2018, 770–779 (2018).
32. Krippendorff, K. Content Analysis: An Introduction to Its Methodology vol. 20 (SAGE, 2004).
33. Uzuner, Ö., Solti, I., Xia, F. & Cadag, E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. JAMIA 17, 519–523 (2010).
34. Richter-Pechanski, P. et al. CARDIO:DE. heiData https://doi.org/10.11588/data/AFYQDY (2022).
35. Lafferty, J., McCallum, A. & Pereira, F. C. N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, 282–289 (2001).
36. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the ACL, 4171–4186, https://doi.org/10.18653/V1/N19-1423 (2019).
37. Chan, B., Schweter, S. & Möller, T. German’s Next Language Model. in Proceedings of the 28th International Conference on Computational Linguistics, 6788–6796, https://doi.org/10.18653/v1/2020.coling-main.598 (2020).
38. Cortes, C. & Vapnik, V. Support-Vector Networks. Mach Learn 20, 273–297 (1995).
39. Denny, J. C. et al. Evaluation of a method to identify and categorize section headers in clinical documents. JAMIA 16, 806–815 (2009).
40. Lin, Y. et al. BertGCN: Transductive Text Classification by Combining GNN and BERT. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 1456–1462, https://doi.org/10.18653/V1/2021.FINDINGS-ACL.126 (2021).

Acknowledgements

This work was generously supported by the German Research Foundation under grant DI 1501/14-1, the German Federal Ministry of Research and Education (BMBF) under grant 01ZZ1802B and by the Klaus Tschira Foundation under grant 00.013.2021. For the publication fee we acknowledge financial support by Deutsche Forschungsgemeinschaft within the funding programme “Open Access Publikationskosten” as well as by Heidelberg University. We would like to thank all colleagues from the Department of Internal Medicine III, University Hospital Heidelberg for their valuable and constructive support and input.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
Phillip Richter-Pechanski, Philipp Wiesenbach, Mingyang He, Michael M. Allers, Anna S. Tiefenbacher, Nicola Kunz, Anna Martynova, Noemie Spiller, Julian Mierisch, Charlotte Schwind & Christoph Dieterich
Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab, Christina Kiriakou, Mingyang He, Charlotte Schwind, Norbert Frey, Christoph Dieterich & Nicolas A. Geis
German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
Phillip Richter-Pechanski, Norbert Frey & Christoph Dieterich
Informatics for Life, Heidelberg, DE, Germany
Phillip Richter-Pechanski, Philipp Wiesenbach, Norbert Frey, Christoph Dieterich & Nicolas A. Geis
Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, DE, Germany
Florian Borchert

Contributions

Phillip Richter-Pechanski: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data Curation, Project administration, Writing - Original Draft, Visualization. Philipp Wiesenbach: Methodology, Validation, Data Curation, Writing - Review & Editing. Dominic M. Schwab, Christina Kiriakou: Conceptualization, Resources, Validation. Mingyang He: Validation, Data Curation. Nicola Kunz, Michael M. Allers, Anna S. Tiefenbacher: Data Curation, Validation, Writing - Review & Editing. Julian Mierisch, Anna Martynova, Noemie Spiller: Data Curation, Validation. Florian Borchert: Resources, Investigation, Writing - Review & Editing. Charlotte Schwind: Resources, Project administration. Norbert Frey: Resources, Supervision, Writing - Review & Editing. Christoph Dieterich: Conceptualization, Supervision, Funding acquisition, Writing - Review & Editing. Nicolas A. Geis: Conceptualization, Funding acquisition, Resources, Validation, Supervision, Writing – Review & Editing. Christoph Dieterich (christoph.dieterich@med.uni-heidelberg.de) and Nicolas Geis (nicolas.geis@med.uni-heidelberg.de) are shared last authors.

Corresponding author

Correspondence to Phillip Richter-Pechanski.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Richter-Pechanski, P., Wiesenbach, P., Schwab, D.M. et al. A distributable German clinical corpus containing cardiovascular clinical routine doctor’s letters. Sci Data 10, 207 (2023). https://doi.org/10.1038/s41597-023-02128-9

Received: 06 December 2022
Accepted: 31 March 2023
Published: 14 April 2023
DOI: https://doi.org/10.1038/s41597-023-02128-9
Springer Nature Limited
