Introduction

Heart failure (HF) is an end-stage cardiac condition in which the pumping function of the heart is insufficient. HF has been a challenging outcome in epidemiological studies because its subtypes are often impossible to discern from register data alone [1]. In many countries, the International Classification of Diseases (ICD) holds only a single diagnostic code for congestive HF and has no separate codes for HF with reduced ejection fraction (EF; HFrEF), HF with mildly reduced EF (HFmrEF), or HF with preserved EF (HFpEF) [2,3,4].

In this study, we set out to combine register data from the FinnGen database with information mined from electronic health records (EHR) to improve subtyping of register-based HF diagnoses [5]. We assessed the feasibility of mining EHRs for unstructured text mentions of EF values, and whether EHR-mined HF subtypes could be used effectively to discern mortality risk. Information gained from this study could be used to define HF subtypes for future research on the very heterogeneous HF syndrome.

Materials and methods

Study sample

FinnGen is a joint research project aiming to collect the genomic and EHR data of 500,000 Finns from population-based studies and hospital biobanks [5]. The FinnGen register database holds individual-level health information mainly based on ICD-10 coding from nationwide registers, such as the Finnish Hospital Discharge Register (since 1968) and the Causes of Death Register (since 1969). These data enable defining a large number of clinical end points, including HF [4, 6]. The registers do not contain any EHR data.

FinnGen participants’ data in the Auria (Turku, Finland; n = 29,201) and Helsinki (Helsinki, Finland; n = 58,693) hospital biobanks were accessed for this study, with data collected in 2001–2020. Our data mining algorithm identified EF data for 43,405 individuals. Data were available for 35,800 individuals after excluding individuals with missing creatinine (available for n = 40,864) and N-terminal-pro-b-type natriuretic peptide (proBNP, available for n = 9,479) laboratory parameters. ProBNP was only required for HF cases. After removal of fatal cases with missing baseline HF information (n = 534) or missing HF follow-up data (n = 1,283), our study sample consisted of 33,983 participants.

EHR data mining algorithm

To study whether HF subtyping based on EHRs is possible and feasible, we created a rule-based data mining algorithm that uses regular expressions and string matching. First, the EHRs and clinical reports were text mined for all references to EF. Second, proBNP and creatinine values were drawn from structured laboratory data. The mined EHR data were then merged with the register-based FinnGen clinical data using personal identification codes that are unique for each Finnish resident.

The overarching principles of the algorithm are presented in Fig. 1. First, the algorithm searches the EHRs for mentions of “EF” or “ejection fraction”. When these terms are found, the texts are extracted, filtered, and split into sentences. Each sentence is first searched for a two-digit number that could be an EF measurement: two consecutive digits followed by a percent sign or the word ‘percent’. Ranges are also searched for as two two-digit numbers separated by a hyphen and followed by a percent sign. Clinicians also use a wide variety of verbal expressions to describe EF, so if no numbers are present, a word search is triggered. The words describing EF are converted to numbers based on the 2016 European Society of Cardiology (ESC) HF guidelines [2]. That is, we defined “preserved”, “mildly reduced”, and “reduced” ejection fraction as 50%, 45%, and 39%, respectively, to match the ESC definitions. The numeric values for other common worded descriptions of EF were defined based on clinical judgment. In addition, all sentences undergo a simultaneous quality check to exclude dates possibly masquerading as EF readings and EF readings made in the past (e.g., “EF 40% a year ago” is disqualified). A mean EF is calculated if several EF readings are observed on the same date. EF outliers (< 10% or > 90%) are also removed. The code for the algorithm is available online at: https://zenodo.org/record/7900516#.ZFi92S9Z9qs.
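The steps described above can be illustrated with a minimal Python sketch. It is not the published algorithm (which is available at the Zenodo link above); all names and patterns here are hypothetical. Only the three word-to-number values (50%, 45%, 39%) and the outlier limits come from the text. Taking the midpoint of a range and the short list of "past reading" cue words are assumptions made for illustration, and the date check is omitted.

```python
import re
from statistics import mean

# Word-to-number mapping; only these three values are stated in the text.
# "mildly reduced" is listed first so that it is matched before "reduced".
WORD_TO_EF = {
    "mildly reduced": 45.0,
    "preserved": 50.0,
    "reduced": 39.0,
}

EF_MENTION = re.compile(r"\bEF\b|ejection fraction", re.IGNORECASE)
# Two two-digit numbers separated by a hyphen (or en dash), then a percent marker.
EF_RANGE = re.compile(r"\b(\d{2})\s*[-–]\s*(\d{2})\s*(?:%|percent)", re.IGNORECASE)
# One two-digit number followed by a percent marker.
EF_SINGLE = re.compile(r"\b(\d{2})\s*(?:%|percent)", re.IGNORECASE)
# Hypothetical cue words for readings made in the past.
PAST_HINT = re.compile(r"\b(?:ago|previously|earlier)\b", re.IGNORECASE)


def ef_from_sentence(sentence):
    """Return an EF value in percent mined from one sentence, or None."""
    if not EF_MENTION.search(sentence) or PAST_HINT.search(sentence):
        return None
    match = EF_RANGE.search(sentence)
    if match:
        # Assumption: a range is summarised by its midpoint.
        value = (int(match.group(1)) + int(match.group(2))) / 2
    else:
        match = EF_SINGLE.search(sentence)
        if match:
            value = float(match.group(1))
        else:
            # No number found: fall back to the word search.
            lowered = sentence.lower()
            value = next(
                (ef for phrase, ef in WORD_TO_EF.items() if phrase in lowered),
                None,
            )
    # Remove outliers (< 10% or > 90%).
    if value is None or value < 10 or value > 90:
        return None
    return float(value)


def mine_ef(text):
    """Mean EF over all qualifying sentences of one same-date text, or None."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    values = [v for v in map(ef_from_sentence, sentences) if v is not None]
    return mean(values) if values else None
```

For example, `mine_ef("EF 40% a year ago. Now EF 55%.")` ignores the first sentence because of the cue word "ago" and returns the current reading only.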

Fig. 1

The principle of the EF mining algorithm

Abbreviations: EF, ejection fraction; EHR, electronic health records; HFrEF, heart failure with reduced ejection fraction (< 40%); HFmrEF, heart failure with mildly reduced ejection fraction (40–49%); HFpEF, heart failure with preserved ejection fraction (≥ 50%)

HF subgrouping

Using the mined EF and proBNP values, the algorithm categorized the participants into four clinical groups according to the ESC guideline [2]: (1) no HF was defined as normal EF (here defined as ≥ 50%) and normal proBNP levels (≤ 125 ng/l); (2) HFrEF was defined as EF < 40%; (3) HFmrEF was defined as EF 40–49%; (4) HFpEF was defined as EF ≥ 50% and proBNP levels of ≥ 125 ng/l.
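The subgrouping rule can be written as a short function. The following is a sketch under stated assumptions rather than the published code: the function name and the handling of missing proBNP are hypothetical, and because the definitions above place a proBNP value of exactly 125 in both the no HF and HFpEF groups, the sketch arbitrarily assigns it to HFpEF. ProBNP is taken in the same units as in the text.

```python
from typing import Optional


def classify_hf(ef: float, probnp: Optional[float]) -> Optional[str]:
    """Assign an HF subgroup from a mined EF (%) and a proBNP value.

    Returns None when EF is >= 50% but proBNP is missing, because the
    no HF / HFpEF distinction cannot be made (assumption of this sketch).
    """
    if ef < 40:
        return "HFrEF"
    if ef < 50:
        return "HFmrEF"
    if probnp is None:
        return None
    # Assumption: a value of exactly 125 is assigned to HFpEF.
    return "HFpEF" if probnp >= 125 else "no HF"
```

Note that proBNP only enters the decision when EF is at least 50%, which is why false-positive proBNP elevations can affect only the HFpEF group.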

Validation procedures

After data extraction, two validations were undertaken, each in a separate set of 100 randomly selected individuals. First, we examined all the mined instances of EF values for the first 100 individuals, and an internist (M.V.) defined the correct EF value for a specific time point from the EHR data without knowing the algorithm-defined EF. The algorithm-defined EF values were compared against these gold-standard clinician-defined values. Subsequently, the HF subtype was defined for another 100 patients (for whom proBNP values were also available) by the algorithm and by the internist, who was blinded to the results of the algorithm. The misclassified cases were reviewed and the reasons for inaccurate EF readings and subtyping were identified.

Statistical analyses

To test the functionality of the algorithm, Cox proportional hazards models were used to assess the association between HF subtypes and overall mortality, with individuals with no HF as the reference. We adjusted for risk factors that are common in HF and also increase the risk of death: sex, estimated glomerular filtration rate [7], and register-based diagnoses of prevalent hypertension, ischemic heart disease, type 2 diabetes, chronic obstructive pulmonary disease, and renal failure. Age was used as the time scale. The definitions of comorbidities in FinnGen are available online at https://risteys.finngen.fi. Proportional hazards assumptions were assessed by visually inspecting plotted Schoenfeld residuals.
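For orientation, the model described above can be written in the standard Cox form. The notation is ours and is not taken from the original analysis code: a denotes age (the time scale), h_0 the unspecified baseline hazard, the three indicator variables mark the algorithm-defined subtype of individual i (no HF is the reference, with all indicators equal to zero), and z_i collects the adjustment covariates listed above.

```latex
h_i(a) = h_0(a)\,
  \exp\!\left(
    \beta_1\,\mathrm{HFrEF}_i
    + \beta_2\,\mathrm{HFmrEF}_i
    + \beta_3\,\mathrm{HFpEF}_i
    + \boldsymbol{\gamma}^{\top}\mathbf{z}_i
  \right),
\qquad
\mathrm{HR}_k = \exp(\beta_k), \quad k = 1, 2, 3
```

Each reported hazard ratio is thus the hazard of death in one subtype relative to individuals with no HF of the same age and covariate values.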

Results

Study sample and data mining results

The characteristics of the study sample are presented in Table 1. A slight majority of the sample were women (58.1%), and the mean age was 58.7 years (standard deviation 18.2). The most common clinical comorbidities were hypertension (29.7%), type 2 diabetes mellitus (17.2%), and coronary artery disease (15.7%). After dividing the participants into subphenotypes according to the algorithm, 1,162 had HFrEF, 474 had HFmrEF, 2,110 had HFpEF, and 30,237 had no HF.

Table 1 Study sample characteristics

Validation

The clinician and the algorithm arrived at the same EF in 78% of the patients. In 87% of patients, the algorithm-mined EF value was within 5 percentage points of the clinician’s estimate, and in 86% of patients, the algorithm-derived EF value was in the correct HF subtype range. In the 22 cases where the algorithm missed the right EF, the reasons were the inability to find the correct EF value (12 cases) and the calculation of a mean EF from one incorrect and one correct EF value (10 cases). Results and metrics of the HF subtype validation are presented in Table 2. The performance of the algorithm was good in detecting HF in general. However, false positives, all due to proBNP being elevated for a reason other than HF, limited the performance of the algorithm in diagnosing HFpEF.
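The three agreement measures reported above (identical EF, EF within 5 percentage points, and EF in the same subtype range) can be computed as in the following sketch. The function names are hypothetical, and the example pairs in the usage note are invented for illustration; they are not study data.

```python
def ef_band(ef):
    """Map an EF value (%) to its subtype range: < 40, 40-49, or >= 50."""
    if ef < 40:
        return "reduced"
    if ef < 50:
        return "mildly reduced"
    return "preserved"


def agreement(pairs, tolerance=5.0):
    """Agreement between (algorithm EF, clinician EF) pairs as proportions."""
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one pair is required")
    exact = sum(algo == clin for algo, clin in pairs)
    close = sum(abs(algo - clin) <= tolerance for algo, clin in pairs)
    band = sum(ef_band(algo) == ef_band(clin) for algo, clin in pairs)
    return {
        "exact": exact / n,
        "within_tolerance": close / n,
        "same_band": band / n,
    }
```

With the invented pairs `[(55, 55), (40, 45), (35, 45), (60, 30)]`, one of four pairs agrees exactly, two lie within 5 percentage points, and two fall in the same subtype range.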

Table 2 Results of the HF subtype validation and calculated epidemiological measures

Risk of death by EF subtype

The multivariate-adjusted risk of death for a register-based diagnosis of HF, as compared to individuals with no HF, was 2.35-fold (95% CI, 1.90–2.90). For an algorithm-based diagnosis of HF (any subtype), this risk was 2.47-fold (95% CI, 2.00–3.06) (Supplementary Table 1). When analyzing the risk of death for algorithm-based subtypes, the highest HR was observed for HFrEF, 2.63 (95% CI, 1.97–3.50), as expected. The risks of death in the HFmrEF and HFpEF groups were 1.91-fold (95% CI, 1.24–2.95) and 2.28-fold (95% CI, 1.80–2.88), as compared to individuals with no HF according to the algorithm. In the study sample, 3,875 individuals had the gold standard EHR-based diagnosis of HF, in comparison to 3,746, when using the algorithm to define HF. The mean follow-up time was 1.5 (SD 1.2) years.

Discussion

In this study, we generated a data mining algorithm for extracting free-text EF values and laboratory data for improving HF subclassification.

Although the EF provided by the algorithm had 78–86% concordance with clinical assessment, EF was a challenging target for text mining. The greatest challenge for the algorithm was to correctly distinguish the current EF value from previous EF measurements that were often listed in the same unstructured text. However, this limitation was overcome surprisingly well by using mean EF values. The word search and numeric conversion functioned well in general, and descriptive reports did not tend to be a problem. The mining of laboratory values was unproblematic as it was always based on structured data.

The risk of death was similar in both mortality analyses, and significantly lower in the group with HFmrEF compared to those with HFrEF or HFpEF. This finding is in line with a meta-analysis of 12 observational studies with 109,257 HF patients by Lauritsen et al. [8]. The profile of comorbid conditions in our study sample was also similar to that of the meta-analysis. The agreement between our findings and those of Lauritsen et al. provides further support for the validity of our data mining algorithm.

To our knowledge, text mining of EF values has not been attempted previously. In contrast, text mining of several dichotomous disease states has been attempted, such as pregnancy status in a sample of 344 patients [9], the presence of colorectal cancer in a sample of 1,262,671 patient reports and pathology notes [10], systemic lupus erythematosus (SLE) in a sample of 4,607 patients [11], and cardiac implantable device infections in a sample of 19,212 implant procedure patient records [12]. Of these studies, Labrosse [9], Brunekreef [11], and Mull [12] used a string-matching or rule-based text mining algorithm that was very similar to ours and yielded analogous results. The accuracy of SLE detection was very similar to ours: 71% had complete agreement in diagnosis in a validation sample of 100 randomly selected patients [11]. Labrosse et al. [9] manually reviewed all records, and their algorithm was superior in detecting pregnancy (35 of 36) compared to manual EHR assessment (30 of 36). Mull et al. reviewed 232 records of patients with a high risk of implantable device infection [12]. Text mining yielded a low positive predictive value (PPV) of 43.5% for the algorithm, but a very good sensitivity of 94.4%, as in our study for HFpEF. Finally, Xu et al. manually validated a set of 300 patient records for the presence of colorectal cancer [10]. In that study, natural language processing provided a PPV of 84%. The main limiting factor in these studies is the relatively high number of false positives, resulting in low PPVs. As the idea of the algorithm is to read through large volumes of patient data, high accuracy is needed for the mined data to be useful in clinical practice or research.

We conclude that quantitative EF and laboratory data can be efficiently extracted from EHRs and that these data can be used to subtype HF with reasonable accuracy, especially for HFrEF. The more clearly defined the algorithm-derived subtypes are, the more precise the results of future studies that rely on these HF subtype definitions can be expected to be.

Limitations

Our study has certain limitations. The algorithm performed well in capturing the HFrEF and HFmrEF subtypes, but proBNP values elevated for a reason other than HF made it less capable of diagnosing HFpEF. Although the algorithm classified 86% of HF patients under the correct HF subtype, the accuracy of the mined EF values needs to be further improved. HFpEF detection in particular could be improved by incorporating concurrent comorbidity information to better discern the reasons for proBNP elevation. Echocardiographic markers of diastolic dysfunction could also further improve HFpEF diagnosis. Unfortunately, until very recently these markers were not recorded in most clinical echocardiography reports, rendering this approach impossible for now. In addition, individual HF timelines with longitudinal information on the disease pattern could be incorporated to discern various chronic HF subtypes. Finally, machine learning-based natural language processing could possibly improve the extraction of information from free text.

Furthermore, the validations were performed by a single blinded clinician who reviewed only two sets of 100 cases.