Background

Heart failure has become the leading diagnosis at hospital discharge and the most important driver of in-hospital mortality in Germany [1]. These estimates are based on counts provided by hospitals utilizing the German modification of the International Classification of Diseases (ICD)-10. The ICD-10 codes identifying heart failure within the German ICD-10 catalogue are I11.*, I13.0, I13.2, and I50.*. However, estimating the true number of patients suffering from heart failure based on this catalogue is unreliable for several reasons. The heart failure syndrome, especially in its early stages, may go unrecognized or may not be encoded as an explicit diagnosis; further, various financial incentives provided by the German health care system drive the likelihood of a specific ICD code entering a patient’s list of discharge diagnoses. These incentives favor the encoding of diagnoses associated with the most favorable reimbursement profile and may therefore considerably affect the “true number” of patients burdened by heart failure. After hospital discharge, moreover, only the most important diagnoses (e.g., the top three) are reported to and collected by higher-level organizations, e.g., health insurances or statutory registries. Thus, the “prevalence” of a certain condition may be augmented or suppressed depending on its reimbursement profile and the subsequent quality of coding. Furthermore, factors affecting the quality of documentation itself, e.g., staff training [2] and marked changes in workflows (e.g., the change from paper-based records to electronic record systems [3]), have a major influence on disease statistics. Despite these shortcomings, the above-mentioned approach of collecting ICD diagnoses remains the prime source for public health decisions [4, 5].
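The code set named above can be expressed as a simple prefix check. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that codes are stored with the dot separator (e.g., "I50.01") and that subcodes of I13.0 and I13.2 should also count.

```python
from typing import Iterable

# Prefixes for the heart failure codes named in the text: I11.*, I13.0, I13.2, I50.*
# Assumption: codes carry the dot separator and subcodes (e.g., "I13.20") count as well.
HF_ICD_PREFIXES = ("I11.", "I13.0", "I13.2", "I50.")


def is_heart_failure_icd(code: str) -> bool:
    """Return True if a single ICD-10 code belongs to the heart failure set."""
    normalized = code.strip().upper()
    return normalized.startswith(HF_ICD_PREFIXES)


def has_heart_failure_code(codes: Iterable[str]) -> bool:
    """Return True if any discharge code of a case is a heart failure code."""
    return any(is_heart_failure_icd(code) for code in codes)
```

A case with the codes E11.90 and I11.00 would be flagged, whereas a case with I13.10 alone would not.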

Besides the statutory census of disease statistics, attempts have been made towards a more reliable and comprehensive identification of diagnoses from clinical routine data. Most of them, however, based their algorithm on coded diagnoses [6]. For these reasons, a better and earlier recognition of heart failure patients is of utmost importance [7]. Since modern hospitals can provide a wealth of electronic patient-based information, these data may be used to improve or corroborate diagnostic certainty and comprehensiveness.

The objective of the current study was to approximate the “true number” of patients suffering from heart failure at a tertiary care center. Against a physician-based reference standard, we compared the performance of ICD-based diagnosis versus advanced definitions based on algorithms that integrate various sources of the hospital information system. We hypothesized that (a) ICD-based diagnosis may underestimate the true prevalence of heart failure and (b) a catalogue of criteria defining heart failure utilizing various sources of the hospital information system will advance diagnostic accuracy.

Methods

The Würzburg data warehouse

The clinical data warehouse (DWH) implemented at the Würzburg University Hospital provides homogeneous and structured access to pseudonymized data of 100% of the hospital’s patient cases originally stored within various separate information systems (e.g., the central administrative system, the electronic patient chart used on wards, and systems for laboratory values and discharge letters). The only current exceptions are data from psychiatric and child care facilities, which are excluded for data protection reasons. The technical set-up is based on open-source systems and has been described elsewhere [8,9,10]. Data of the DWH can be queried in (a) structured form (e.g., patient demographics, diagnoses as ICD codes, procedures as codes of the German procedure classification “Operationen- und Prozedurenschlüssel” (OPS), and laboratory values); (b) semi-structured form (e.g., echocardiography, cardiac catheterization), and (c) unstructured form (e.g., discharge letters). The most innovative add-on to the DWH is the information extraction and ad hoc text search functionality, which allows the creation of parametrized information from semi- and unstructured reports and the search for any textual item (e.g., searching within discharge letters for text combinations including variants and negations, or extracting numeric parameters from echocardiographic reports) [11].
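To illustrate what extracting a numeric parameter from an echocardiographic report can look like, the following Python sketch pulls an ejection fraction value out of free text. It is not the DWH's extraction component described in [11]; the pattern, the accepted labels, and the function names are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: a label (LVEF, LV-EF, EF, "Ejektionsfraktion", "ejection fraction"),
# an optional ":" or "=", an optional "ca.", then a one- or two-digit number and a percent sign.
_EF_PATTERN = re.compile(
    r"\b(?:LV-?EF|EF|Ejektionsfraktion|ejection\s+fraction)\b"
    r"\s*[:=]?\s*(?:ca\.\s*)?(\d{1,2}(?:[.,]\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_lvef(report_text: str) -> Optional[float]:
    """Return the first ejection fraction (in percent) found in the text, or None."""
    match = _EF_PATTERN.search(report_text)
    if match is None:
        return None
    # Accept a decimal comma as well as a decimal point.
    return float(match.group(1).replace(",", "."))


def lvef_at_most(report_text: str, threshold: float) -> bool:
    """Return True if an extracted ejection fraction is <= threshold; False if none is found."""
    value = extract_lvef(report_text)
    return value is not None and value <= threshold
```

Reports without a recognizable value yield no hit in this sketch, so missing values count as "not detected" rather than as an error.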

Patient selection

The Medical Department I of the Würzburg University Hospital specializes in, but is not limited to, emergency medicine, intensive care, cardiology, pulmonology, nephrology, and endocrinology. For the current analysis, we used all cases of patients treated at the Medical Department I between the years 2000 and 2015 for whom a discharge letter was available.

Reference standard for the definition of heart failure

A sample of consecutive patients treated at the Medical Department I was drawn from the DWH within a randomly selected period (January 1 to January 31, 2009), yielding 1042 cases. These patients were manually checked by a cardiologist with long-standing experience in the care of heart failure patients (GG). Information used by the physician included ICD codes, the discharge letter, and the echocardiographic report (if available). The physician assigned a label (“heart failure: yes/no”) to each case, which was then used as reference standard for subsequent analyses.

Algorithms for automated detection of heart failure

In order to investigate heart failure detection algorithms, 18 subqueries of relevant heart failure-related concepts were defined within the user interface of our DWH; they are presented in the rows of Table 1. Each subquery considers a specific fact and is either a restriction on a numeric DWH parameter (e.g., subquery Echo-EF ≤ 45 represents a left ventricular ejection fraction (LVEF) ≤ 45% captured from echocardiographic reports after information extraction), the existence of an ICD diagnosis (e.g., subquery ICD-Any-HF represents the existence of any heart failure-related ICD code), or a text search within the discharge letter suggesting presence of heart failure (e.g., subquery Text-Left-HF represents the occurrence of a textual synonym for “left ventricular heart failure”). Text searches were specified to account for typing errors, synonyms, and negations [12].
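A text subquery that respects synonyms and negations could look roughly like the following Python sketch. It is a simplified illustration and not the search method of [12]: the synonym list, the negation words, and the five-word window are assumptions, and tolerance for typing errors is not covered.

```python
import re

# Hypothetical synonym list and negation cues (English and German examples).
_TERM = re.compile(r"heart failure|cardiac insufficiency|herzinsuffizienz")
_NEGATION_WORDS = {"no", "not", "without", "kein", "keine", "keinen", "ohne"}


def mentions_heart_failure(letter_text: str, window_tokens: int = 5) -> bool:
    """Return True if a synonym occurs without a negation cue shortly before it."""
    # Split into rough sentences so that a negation does not carry over.
    for sentence in re.split(r"[.!?\n]+", letter_text.lower()):
        for match in _TERM.finditer(sentence):
            # Inspect only the last few words before the matched term.
            preceding = sentence[: match.start()].split()[-window_tokens:]
            negated = any(tok.strip(",;:()") in _NEGATION_WORDS for tok in preceding)
            if not negated:
                return True
    return False
```

With these assumptions, "No signs of heart failure." produces no hit, whereas "Patient admitted with decompensated heart failure." does.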

Table 1 Automated advanced data warehouse interrogation to detect heart failure

The algorithms used to detect patients with heart failure (i.e., MICD, MExpert, APrecision, ASensitivity, AF1) are presented in the right-hand columns of Table 1, each by a selection of subqueries that needed to be combined for the full algorithm. Each hit of any of an algorithm’s subqueries stands for the presence of heart failure. Two of the algorithms were manually specified: MICD indicates an algorithm that solely utilizes ICD codes and MExpert indicates an algorithm (i.e., subqueries used for this DWH interrogation) pre-specified by cardiologists based on clinical experience. The other three algorithms (APrecision, ASensitivity, and AF1) originated from iterative permutation testing utilizing all defined subqueries. They were optimized to yield the most favorable results regarding the chosen measures (i.e., precision, sensitivity, and F1 score; for definitions see “Data analysis”) with regard to the reference standard definition of heart failure described in the previous section. The algorithms were computed utilizing exactly the same data that the physician used to evaluate the reference standard: the discharge letter, the ICD codes, and the echocardiographic report (if available).
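The rule that any hit of an algorithm's subqueries indicates heart failure amounts to a logical OR over the selected subqueries. The Python sketch below illustrates this with the five subqueries that the text later names for AF1; the data layout (a mapping from subquery name to hit flag per case) and the function name are assumptions.

```python
from typing import Iterable, Mapping

# Subqueries named for the AF1 algorithm in the Results and Discussion.
AF1_SUBQUERIES = (
    "Echo-EF ≤ 45",
    "ICD-Any-HF",
    "Text-Heart-Failure",
    "Text-Cardiac-Decompensation",
    "Text-Systolic-Failure",
)


def flags_heart_failure(subqueries: Iterable[str], case_hits: Mapping[str, bool]) -> bool:
    """Return True if at least one of the algorithm's subqueries hits for the case."""
    # Subqueries without an entry for the case are treated as "no hit".
    return any(case_hits.get(name, False) for name in subqueries)
```

A case whose only hit is "Echo-EF ≤ 45" is therefore flagged by AF1, whereas a case whose only hit is "Lab-NT-proBNP ≥ 1000" is not, because that subquery is not part of AF1.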

Data analysis

The data for the current analyses were exclusively taken from the DWH via its graphical user interface and the subqueries defined in Table 1. Analysis was done using the software package R [13]. Presence of heart failure was captured based on data of individual hospital visits of individual patients (i.e., one case); each patient was counted once per year. The proportions of true positive, false positive, true negative, and false negative matches were calculated. Further, precision, sensitivity, and the F1 score were computed to provide integrated measures of the accuracy of the match between automated heart failure detection and the reference standard. Precision (also called positive predictive value) describes the share of algorithmically labeled heart failure patients who indeed have heart failure, out of all algorithmically labeled heart failure patients; e.g., a precision of 100% means that all selected patients truly have heart failure. Sensitivity (also called true positive rate or recall) is the share of patients with heart failure who are correctly labeled by the algorithm, out of all patients with heart failure; e.g., a sensitivity of 100% means that all patients with heart failure are selected. The F1 score is the harmonic mean of precision and sensitivity and is used as the overall accuracy measure in this analysis; e.g., an F1 score of 100% means that exactly the patients who truly have heart failure are selected and an F1 score of 85% would describe the prevalence of heart failure with an estimated error of 15%. Measures of any permutation of the subqueries were computed in R, utilizing single DWH exports of each subquery, to maximize the F1 score and to yield a precision or sensitivity of > 90% while retaining a corresponding sensitivity or precision of > 60%, respectively. Frequencies and percentages were used to present aggregated data across periods under study.
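The measures defined above and the search over subquery combinations can be sketched as follows. The original analysis was done in R; this Python version is only an illustration, and it assumes that "permutation testing" means evaluating every non-empty combination of subqueries (2^18 − 1 = 262,143 combinations for 18 subqueries). All function names and the data layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def precision_sensitivity_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, sensitivity, and F1 score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denominator = 2 * tp + fp + fn
    # Equivalent to the harmonic mean of precision and sensitivity.
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, sensitivity, f1


def evaluate(subset: Sequence[str], hits: Dict[str, List[bool]],
             reference: List[bool]) -> Tuple[float, float, float]:
    """Score the OR-combination of the given subqueries against the reference labels."""
    tp = fp = fn = 0
    for i, truth in enumerate(reference):
        predicted = any(hits[name][i] for name in subset)
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
    return precision_sensitivity_f1(tp, fp, fn)


def best_subset_by_f1(hits: Dict[str, List[bool]],
                      reference: List[bool]) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively search all non-empty subquery combinations for the highest F1 score."""
    names = sorted(hits)
    best: Tuple[Tuple[str, ...], float] = ((), 0.0)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            f1 = evaluate(subset, hits, reference)[2]
            if f1 > best[1]:
                best = (subset, f1)
    return best
```

Replacing the F1 criterion in `best_subset_by_f1` with precision (subject to a minimum sensitivity) or with sensitivity (subject to a minimum precision) would give the analogous searches for the other two optimization targets.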

Results

From 2000 to 2015, 110,742 individual patients were treated at and received a discharge letter from the Department of Medicine I of the Würzburg University Hospital. Of these patients, 71,625 had at least one inpatient visit. After splitting the 16-year period into four 4-year periods (i.e., 2000–2003, 2004–2007, 2008–2011, and 2012–2015), respective counts for all patients (inpatients) were 25,753 (17,941), 32,301 (19,592), 37,300 (21,743), and 42,119 (25,692).

Verification of the heart failure detection algorithm

Table 2 presents the performance characteristics derived from cross-validating the heart failure detection algorithms (defined in Table 1) against the reference standard set, i.e., the 1042 manually labeled inpatients, of whom 222 (21%) were identified by the expert as suffering from heart failure.

Table 2 Performance of automated heart failure detection algorithms versus reference standard

The algorithm that was solely based on ICD codes (MICD) resulted in a good precision of 94%, but a low sensitivity of 50% and an F1 score of 65%. The low sensitivity illustrates the low share of patients with heart failure detected by this algorithm. The missing 6% to a precision of 100% were caused by seven out of the 1042 patients who had a heart failure-related ICD diagnosis but were not labeled as having heart failure in the reference standard. The expert-specified heart failure algorithm (MExpert) improved the detection rate and resulted in a precision of 76%, a sensitivity of 87%, and an F1 score of 81%. Divergent conclusions between the algorithm and the reference standard were found in 89 cases, with 60 patients mistakenly classified as having heart failure and 29 patients mistakenly classified as not having heart failure.
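As a consistency check, the MExpert figures can be reproduced from the reported error counts. The calculation assumes that the 29 missed patients are the false negatives among the 222 reference-positive cases, so that the number of true positives is 222 − 29 = 193.

```latex
\begin{aligned}
\mathrm{TP} &= 222 - 29 = 193, \qquad \mathrm{FP} = 60, \qquad \mathrm{FN} = 29 \\
\text{Precision} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{193}{253} \approx 0.76 \\
\text{Sensitivity} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{193}{222} \approx 0.87 \\
F_1 &= \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} = \frac{386}{475} \approx 0.81
\end{aligned}
```

These values match the reported precision of 76%, sensitivity of 87%, and F1 score of 81%.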

Since the manually defined algorithms resulted in low scores, the algorithm MExpert was refined further. Three algorithms were developed and tested, each optimizing certain aspects of diagnostic accuracy: APrecision aimed to increase the reliability of the classification as a heart failure patient (fewer false positives), ASensitivity aimed to reduce the number of heart failure patients not classified as such (fewer false negatives), and AF1 aimed for an overall improved accuracy of the classification (balanced precision and sensitivity). The algorithm with the highest F1 score (i.e., AF1) resulted in a precision of 89%, a sensitivity of 84%, and an F1 score of 86%. The missing 14% to an F1 score of 100% was caused by 59 false matches that originated from “borderline cases” with limited or unclear textual information, which left more room for interpretation and misclassification for both computer and expert. Some errors were the result of missing data in the DWH, e.g., missing LVEF values or terms that indicate negations of heart failure in the discharge letter.

Prevalence of heart failure

Figure 1 illustrates the annual frequencies of all inpatients of the Department of Medicine I with a discharge letter and the subgroup of patients with heart failure identified by the automated algorithms described in Table 1. Across the entire period, AF1 identified 18,167 unique patients with heart failure. In the year 2000, the count of patients with heart failure started at n = 620 and showed an average annual gain of 9.3% over the entire period. After the year 2012, the annual gain appeared to accelerate from 7.4% before 2012 to 17.1% thereafter. By contrast, the average annual gain of all inpatients was 3.4%. The application of APrecision and ASensitivity resulted in 10,786 and 25,084 patients identified with heart failure, respectively.

Fig. 1

Count of inpatients within the Department of Medicine I in the years 2000–2015. The solid line indicates all patients; each patient is counted once per year. Intermittent lines represent patients with heart failure identified using different automated heart failure detection algorithms: MExpert originates from the variable set pre-specified by the clinical expert; APrecision optimizes count of false positives; ASensitivity optimizes count of false negatives; AF1 optimizes overall accuracy (for details refer to “Methods”)

Several patients were treated multiple times over the years, which resulted in sums of unique patients per year reported in Fig. 1 that were higher than the above-reported sum of unique patients of all years. This included 3115 unique patients with 4583 heart failure-related re-hospitalizations after an initial heart failure-related hospitalization within the entire period. A characterization of inpatients with heart failure identified by the application of AF1 is presented in Table 3 for the four 4-year periods from 2000 to 2015, grouped by age and sex.
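The difference between the sum of yearly counts and the overall count of unique patients follows directly from how patients are deduplicated. The Python sketch below illustrates this with made-up patient identifiers; the data layout (pairs of patient identifier and year) and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def unique_patients_per_year(cases: List[Tuple[str, int]]) -> Dict[int, int]:
    """Count each patient once per year, regardless of the number of visits."""
    patients_by_year: Dict[int, Set[str]] = defaultdict(set)
    for patient_id, year in cases:
        patients_by_year[year].add(patient_id)
    return {year: len(ids) for year, ids in sorted(patients_by_year.items())}


def unique_patients_overall(cases: List[Tuple[str, int]]) -> int:
    """Count each patient once across all years."""
    return len({patient_id for patient_id, _ in cases})


# Hypothetical cases: patient "p1" is treated twice in 2010 and again in 2011.
example_cases = [("p1", 2010), ("p1", 2010), ("p1", 2011), ("p2", 2011)]
yearly = unique_patients_per_year(example_cases)
overall = unique_patients_overall(example_cases)
```

In this toy example the yearly counts sum to three although only two unique patients exist, because "p1" is counted in both years.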

Table 3 Frequencies of all patients with heart failure identified by the AF1 algorithm by age group, gender and the 4-year periods (each with unique patients per time period)

Each search term of the detection algorithm AF1 contributed to the identification of heart failure to a varying degree. In the reference standard, the largest contributions emerged from “Text-Heart-Failure” (59% capture rate), “ICD-Any-HF” (56%), and “Text-Cardiac-Decompensation” (53%). Further contributors were “Echo-EF ≤ 45” (24%) and “Text-Systolic-Failure” (4%). In the case of “ICD-Any-HF”, for example, this means that 44% of all patients with heart failure did not have an ICD code indicative of heart failure. The contribution of the individual search terms to the overall analysis varied substantially over the years, as presented in Fig. 2. This illustration presents the search terms of the first heart failure-related hospitalization per patient and year. Noteworthy is the relatively small contribution of the term LVEF ≤ 45% from echocardiography, although echocardiography was frequently performed: in 58% of patients on the cardiology ward and 29% of patients on other wards of internal medicine.
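A capture rate of this kind can be computed per search term as the share of reference-positive patients for whom the term produced a hit. The Python sketch below assumes this definition, which is inferred from the ICD-Any-HF example above; the data layout and the function name are hypothetical.

```python
from typing import Dict, List


def capture_rates(hits: Dict[str, List[bool]], reference: List[bool]) -> Dict[str, float]:
    """Share of reference-positive cases hit by each search term."""
    positives = [i for i, truth in enumerate(reference) if truth]
    if not positives:
        return {name: 0.0 for name in hits}
    return {
        name: sum(flags[i] for i in positives) / len(positives)
        for name, flags in hits.items()
    }
```

Because several terms can hit for the same patient, the capture rates of the individual terms do not need to add up to 100%.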

Fig. 2

Detection of heart failure in inpatients using different approaches (percentage of all inpatients). The solid line indicates the prevalence detected when applying the automated algorithm AF1 (for details refer to “Methods”). Intermittent lines indicate detection using ICD codes or other information tags that dominantly contributed to the detection of heart failure. Each patient entered analysis only once per year; if patients attended the hospital multiple times, the first case of each patient per year was used

Figure 3 illustrates the contribution of ICD codes to the detection of heart failure in contrast to the additional contribution of other search terms (text/echo terms) over the entire period using the AF1 algorithm. Within the years 2000–2015, the overall share of patients with heart failure identified by ICD codes was 69% of the total sample of patients with heart failure, which means that 31% of patients with heart failure would have remained undetected by ICD codes alone throughout the entire period.

Fig. 3

Detection of heart failure via related ICD codes (dark gray) and the additional detection through other search terms* (light gray), in inpatients with heart failure across the entire sampling period (years 2000–2015). The percentage of patients found via selective ICD code search increased in recent years, which might be explained by the foundation of the Comprehensive Heart Failure Center Würzburg in the year 2010, i.e., a facility devoted to the integration of research and care of patients with heart failure. *Executed via application of the automated algorithm AF1 (for details refer to “Methods”)

Comorbidities and heart failure

We further analyzed whether the comorbidity profile differed between subjects in whom presence of heart failure was identified via ICD codes and subjects in whom heart failure was identified via additional sources of the DWH. Table 4 lists the most frequent comorbidities reported in the 18,167 patients with heart failure detected by the AF1 algorithm in the mutually exclusive subgroups “detected by ICD codes” or “detected by other search terms” (specific DWH interrogation other than ICD code). Reported comorbidities were identified by their respective ICD code. Patients with ICD-coded comorbidities more frequently also had an ICD code for heart failure. Of note, the subgroup identified without ICD codes appeared to have a slightly lower burden of comorbidity.

Table 4 Frequency of comorbidities in inpatients with heart failure, detected by ICD codes and additionally detected via data sources provided by the data warehouse

Discussion

The current analysis sheds new light on the magnitude of underestimation of heart failure prevalence in hospitalized patients. Identifying patients with heart failure from the hospital information system solely based on ICD-coded discharge diagnoses substantially underestimated the “true number” that could be gleaned after adding specific text searches and echocardiographic parameters to the search profile.

We observed a large degree of heart failure underestimation when using ICD codes only: within a single year, the count was up to 55% (average 31%) lower than the “true number” of patients with heart failure. The last years of the analysis showed a trend towards better patient identification. The gap of underestimation became considerably smaller over time, which indicates that coding strategies as well as diagnostic and therapeutic algorithms may indeed affect the “prevalence” of a disease. The narrowing of the detection gap coincided with a marked increase in the absolute frequency of encoding ICD diagnoses for heart failure, starting with the year 2012 for inpatients. Furthermore, the percentage of patients with heart failure increased from about 15% in 2000 to about 35% in 2015. Reasons for such high proportions might be that we only included patients from the Medical Department I (hosting wards for intensive care, cardiology, pulmonology, endocrinology, nephrology), where heart failure is a frequent diagnosis, and that we also identified patients having heart failure as a secondary or tertiary diagnosis. Another explanation for these developments might be that the Comprehensive Heart Failure Center, a facility devoted to the integration of research and care of patients with heart failure, was founded at the Würzburg University Hospital in the year 2010. This spurred numerous structural and research projects involving several hospital departments, led to a higher degree of awareness for the heart failure syndrome, and ultimately might not only have increased the count of patients with heart failure admitted to the hospital, but also improved the coding ratio.

Verifying the diagnosis of heart failure patients based on physician claims or hospital data has been attempted earlier [6, 14,15,16,17,18,19,20,21]. However, most of these studies focused on confirming or refuting the diagnosis of heart failure with the help of experts in subjects pre-identified via several variants of ICD codes, via study inclusion/exclusion criteria, or via manual screening. Consequently, reported identification figures were fairly precise (i.e., yielded high precision). Frolova et al., for example, aimed to verify the ICD-based diagnosis of acute heart failure amongst patients admitted to the hospital with suspected acute heart failure [17] and found a precision of 93% (sensitivity of 76%), leading to an F1 score of 84%. In contrast, we aimed to identify “true heart failure” amongst all-comers, i.e., without an increased pre-test likelihood for the presence of such a diagnosis. As expected, performance of the algorithm relying solely on ICD-based identification (= MICD) was worse in our data set (F1 score 65%). There are few studies focusing on all-comers for detecting heart failure [15, 22, 23], all reporting lower F1 scores (82, 53–67, 80%, respectively) compared to our analysis (86%).

Applying text extraction methods to detect heart failure has rarely been attempted. Meystre et al. [24], for example, focused on the information extraction of a few highly selected parameters (e.g., LVEF value and medication) from texts rather than on an overall detection of heart failure. They utilized a pre-defined data set of heart failure patients in contrast to all-comers and, consequently, obtained high F1 scores of up to 99% for single parameters (e.g., LVEF value). While interesting to demonstrate feasibility, such concepts do not mirror clinical reality. No related work was found utilizing information extraction to detect heart failure in all-comers.

Another major finding of the current study is that readily available information from the hospital information system considerably improves the identification of heart failure patients beyond the traditional identification via ICD codes. The option to enrich the search strategy by clinical variables supporting or denying the presence of heart failure is not new, but a variety of problems may impede its implementation: (1) the information is only selectively documented in clinical routine; (2) the desired information is stored in a non-structured format and appropriate data extraction tools are unavailable or unreliable; (3) the information is stored in a structured format but cannot be accessed for analysis (e.g., because it is stored in dedicated research data bases); (4) the quality of the stored data (structured or non-structured) is unreliable; (5) the information behind variables (meta data) is highly flexible but cannot be connected to the source data; (6) the individual patient and the corresponding cases of a patient (repeat hospitalizations) cannot reliably be discerned. Our approach utilized the hospital’s clinical DWH as described earlier [8,9,10] and integrated the full spectrum of digital information collected per patient in the hospital information system.

The most elaborate part in providing a DWH is the implementation of the data extract–transform–load (ETL) process to transfer data from the information systems to a unified database, which often, but not always, requires consideration of local peculiarities depending on the available information systems. We implemented this process for most of the information stored within our systems, be it structured or unstructured. Our DWH query system utilizes a locally developed add-on [10] to provide text search functionality to DWH systems. This add-on could be instantaneously added as an extension to the often utilized i2b2 DWH system [25] or, with little extra work, to other similar DWH systems. Importantly, these tools were tested and optimized across their repeated utilization for various studies, including data validation against the primary systems after DWH extraction [9].

The combined use of these interfaces and the generation of automated detection algorithms markedly improved the identification of patients with heart failure. We found better, albeit still unsatisfactory, accuracy when employing the algorithm based on “clinical information” alone (i.e., the algorithm MExpert). We therefore tested other, data-optimized algorithms and observed another major improvement of heart failure detection: the algorithm AF1 balanced precision and sensitivity and yielded the overall best results. Importantly, our approach allowed us to adjust and optimize the detection algorithm for different scenarios or use cases. For example, to identify potential study participants via the algorithm (with a study nurse subsequently fine-tuning the results), ASensitivity might yield the best results. For the scenario of a post hoc analysis, AF1 or APrecision might be the preferred solutions. Interestingly, the NT-proBNP queries “Lab-NT-proBNP ≥ 1000” and “Lab-NT-proBNP ≥ 3000” (see Table 1) were not selected by the permutation analysis for any algorithm. This may be explained by collinearity with other terms indicative of heart failure; e.g., for the AF1 algorithm: “Echo-EF ≤ 45”, “ICD-Any-HF”, “Text-Heart-Failure”, “Text-Cardiac-Decompensation”, and “Text-Systolic-Failure”. We also considered using Framingham heart failure signs and symptoms [26] for detection of heart failure (see [27, 28]), either alone or in combination with borderline echocardiographic data, but were unsuccessful in demonstrating superior precision and sensitivity.

Our analyses support the notion that comorbidities of heart failure may also affect coding practices for heart failure. When comparing the presence of common comorbidities with the detection of heart failure via ICD-based versus alternative approaches, the differences were highly significant for almost all conditions. Interestingly, a sizeable proportion of patients with heart failure received an ICD code for the respective comorbidity, but not the ICD code for heart failure itself. This might indicate that heart failure was not the focus of their hospitalization and not a dominant contributor from the reimbursement perspective. From a health policy perspective, this means that many patients with heart failure as a concomitant condition leave the hospital without being reported to statutory data banks as heart failure patients. This not only adds to the detection gap, but also constitutes a major information gap for care providers after hospital discharge, who play a key role in the treatment of heart failure in Germany [29].

Limitations

A limitation of this study is that the reference standard was defined by a single cardiologist with long-standing experience in heart failure instead of multiple experts. The count of true heart failure patients may vary considerably depending on the care setting, the type of catchment area, and numerous other influencing factors. Hence, absolute counts are likely not directly comparable between hospitals. Similarly, the successful implementation of adapted detection algorithms needs to be confirmed before our results may become generalizable to other hospitals, both in Germany and internationally.

Conclusions

Coded discharge diagnoses substantially underestimate the number of heart failure patients compared to the added information available within discharge letters and echocardiographic reports. Therefore, statistics about heart failure solely based on ICD codes might be misleading. The degree of underestimation might vary substantially across case types (inpatients versus outpatients) and over the course of subsequent years. The latter might be influenced by internal factors, e.g., improved coding practices, and/or external factors, e.g., the establishment of specialized centers such as the Comprehensive Heart Failure Center Würzburg.