Introduction: what is machine learning?

Machine learning (ML) is a subset of artificial intelligence that is defined as the study of computer algorithms that improve automatically through experience. These algorithms are built on sample training data with the goal of making predictions or decisions without being explicitly programmed for this purpose. Machine learning enables computerized analysis of large amounts of data over relatively short periods of time and can be used to address questions in a variety of fields, including medicine. There are several different approaches to ML. These include unsupervised learning, which uses only input data to find groupings among the data points, and supervised learning, which builds models from training data sets containing both inputs and desired outputs.

Application of machine learning to heart failure treatment

Over the past several years, ML approaches have been used in a number of different areas of medicine, including cardiology, and are frequently used to identify correlations in large data sets, as in studies of the genome [1], proteome, and metabolome. Applications currently being explored in heart failure (HF) include analysis of images from tests such as the electrocardiogram, echocardiogram, cardiac magnetic resonance imaging, and other scans that are commonly obtained as part of the evaluation of patients with HF, and identification of subgroups of patients with HF who have common characteristics (pheno-grouping) that might provide novel mechanistic insights or predict response to specific therapies. The ability of ML to identify correlations between descriptive variables in complex, non-linear, multi-dimensional systems is also being investigated as a means of predicting risk of the future development of HF in a population and of identifying risk of events in patients with established disease. In this review, we focus on the use of ML as a means of developing tools for predicting risk in patients with HF.

Use of machine learning in detecting risk

Modern ML algorithms for classification are powerful tools that can be used to build clinical risk scores. These algorithms are “trained” to distinguish (“classify”) two or more cohorts of patients with or without known outcomes, such as death or HF hospitalization, based on a set of clinical covariates. The ML process helps identify unsuspected associations among the covariates fed into the computer which cumulatively relate to the patient’s risk of a future event. Once the algorithm is trained, it can be used to assess the risk of an event occurring in an individual patient. The strength of ML is that it can handle complex non-linear correlations between the covariates that are entered into the computer. Figure 1 [2] provides a simple example. Although the choice of covariates should be guided by clinical and other considerations, a deep understanding of their correlations is not necessary, since many ML algorithms handle them automatically.

Fig. 1

Simple visualization of how a machine learning algorithm learns to distinguish between two populations. Two computer-generated populations, (a) red and (b) blue, are characterized by a pair of covariates (X,Y). A neural network (NN) is trained to distinguish between the two. For each (X,Y) pair, the NN provides a score that can be used in this or a different population to give a red vs. blue probability. In panel (c) the output of the NN is visualized (in a third dimension, color) as a function of (X,Y) from deep red (very likely red) to deep blue (very likely blue). As additional covariates are added, separation of the populations increases.
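
The idea of a score that maps each (X,Y) pair to a red vs. blue probability can be sketched in a few lines. The example below is only an illustration under stated assumptions: it swaps the neural network of Fig. 1 for a much simpler k-nearest-neighbour vote, and the population centers, sizes, and the names make_population and red_score are hypothetical choices made for this sketch.

```python
import math
import random


def make_population(center, n, rng, spread=1.0):
    """Draw n synthetic (X, Y) points scattered around a center."""
    cx, cy = center
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread)) for _ in range(n)]


def red_score(point, red, blue, k=15):
    """Fraction of the k nearest training points that are red (0.0 to 1.0)."""
    labelled = [(math.dist(point, p), 1) for p in red]
    labelled += [(math.dist(point, p), 0) for p in blue]
    nearest = sorted(labelled)[:k]
    return sum(label for _, label in nearest) / k


rng = random.Random(0)  # fixed seed so the sketch is repeatable
red = make_population((-2.0, -2.0), 200, rng)   # assumed center for "red"
blue = make_population((2.0, 2.0), 200, rng)    # assumed center for "blue"

deep_red = red_score((-2.0, -2.0), red, blue)   # inside the red population
deep_blue = red_score((2.0, 2.0), red, blue)    # inside the blue population
boundary = red_score((0.0, 0.0), red, blue)     # between the two populations
```

Evaluating red_score on a grid of (X,Y) values and coloring each cell by the result would give a map analogous to panel (c).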

Challenges in developing a machine learning algorithm

In practice, risk scores are most often designed based on retrospective studies. The data used to generate these scores can originate from a clinical trial, patient registry, or an administrative data base. More recently, use of data from hospitals and healthcare systems has been made possible by the adoption of electronic health record (EHR) systems by institutions around the world. The use of EHR data bases offers several advantages, including availability of large amounts of raw data, better representation of the spectrum of HF than in the other settings, and the ability to update scores as the study population grows or when new tests are integrated into medical practice. However, in addition to the obvious issues of de-identification and privacy inherent in using data from EHRs, several other challenges arise in extracting useful data from such systems. These include variability over time in the data format and content due to evolution of the EHR system; missing, incomplete, or poorly formatted data; confusion with diagnostic and procedure codes; duplications; and mistakes. While much information in a patient’s EHR already exists in numerical form that lends itself to computer evaluation, other patient descriptors and characteristics exist only in narrative form. While non-numeric variables, including images, can be translated into numeric form, this adds complexity because it requires additional steps to capture the information in a format that the algorithm can use [3]. Regardless of how the data are derived, validation of the data extraction procedure is essential. This can be achieved by proactively searching the dataset for nonsensical features, e.g., lab data supposedly taken months after the patient’s death and extremely abnormal clinical values, as well as by randomly selecting a subset of patients and comparing “by hand” the extracted data with the contents of their EHR (chart review).
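
The two validation steps just described, searching for nonsensical features and drawing a random subset for chart review, might look roughly like the following. This is a sketch under assumptions: the record layout, the names LabResult, find_suspect_results, and chart_review_sample, and the plausibility limits are all hypothetical and are not clinical reference ranges.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass
class LabResult:
    patient_id: str
    name: str
    value: float
    drawn_on: date


# Illustrative limits only (assumed for this sketch, not clinical guidance).
PLAUSIBLE_RANGE = {
    "creatinine_mg_dl": (0.1, 30.0),
    "hemoglobin_g_dl": (2.0, 25.0),
}


def find_suspect_results(results, death_dates):
    """Return (result, reason) pairs for extracted values that look nonsensical."""
    suspects = []
    for r in results:
        died = death_dates.get(r.patient_id)
        if died is not None and r.drawn_on > died:
            suspects.append((r, "drawn after recorded death"))
            continue
        low, high = PLAUSIBLE_RANGE.get(r.name, (float("-inf"), float("inf")))
        if not low <= r.value <= high:
            suspects.append((r, "value outside plausible range"))
    return suspects


def chart_review_sample(patient_ids, k, seed=0):
    """Pick a reproducible random subset of patients for manual chart review."""
    unique_ids = sorted(set(patient_ids))
    return random.Random(seed).sample(unique_ids, min(k, len(unique_ids)))


results = [
    LabResult("p1", "creatinine_mg_dl", 1.1, date(2020, 1, 5)),
    LabResult("p1", "creatinine_mg_dl", 1.3, date(2020, 6, 1)),   # after death
    LabResult("p2", "hemoglobin_g_dl", 130.0, date(2020, 2, 2)),  # likely unit error
]
death_dates = {"p1": date(2020, 3, 1)}

suspects = find_suspect_results(results, death_dates)
review_ids = chart_review_sample([r.patient_id for r in results], k=1)
```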

Approaches to missing data

One of the major limitations in constructing risk scores from the EHR is partially missing patient data. Missing data can be handled by “imputation”; e.g., if diastolic blood pressure (BPDIAS) is one of the covariates and BPDIAS is missing for a given patient, the simplest form of imputation is to assign the patient the mean or the median of BPDIAS in the cohort. There are more sophisticated techniques for imputation, but the obvious disadvantage of all of these procedures is that information is lost and correlations are broken. Alternatively, patients with missing data could be excluded from the study; however, this approach raises other concerns that must be addressed. For example, if missing data are not randomly distributed across the cohort, this procedure can introduce a selection bias. In such cases, it is important to investigate possible biases generated by exclusion of patients with missing data and attempt to mitigate them. Special attention needs to be paid to in-patients vs out-patients, if both are included in the cohort, as out-patients are more likely to have issues with time-consistency of the covariates used (see section “Characterizing the State of the Patient at a Prediction Time”).
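
The simplest form of imputation described above, replacing a missing BPDIAS with the cohort median, can be sketched as follows. The row layout (a list of dictionaries with None marking a missing value), the function name impute_median, and the added "_imputed" flag are assumptions made for illustration.

```python
from statistics import median


def impute_median(rows, covariate):
    """Fill missing values of one covariate with the cohort median.

    Returns new rows and leaves the input untouched. An extra flag records
    which values were imputed so they can be treated separately later.
    """
    observed = [r[covariate] for r in rows if r[covariate] is not None]
    fill = median(observed)
    imputed_rows = []
    for r in rows:
        missing = r[covariate] is None
        new_row = dict(r)
        new_row[covariate] = fill if missing else r[covariate]
        new_row[covariate + "_imputed"] = missing
        imputed_rows.append(new_row)
    return imputed_rows


cohort = [
    {"patient_id": "p1", "BPDIAS": 70},
    {"patient_id": "p2", "BPDIAS": None},  # missing measurement
    {"patient_id": "p3", "BPDIAS": 80},
    {"patient_id": "p4", "BPDIAS": 90},
]
filled = impute_median(cohort, "BPDIAS")
```

Keeping the flag alongside the filled value makes it possible to check afterwards whether patients with imputed values behave differently from the rest of the cohort.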

Exclusion of patients with missing data is most important in the training phase of a ML algorithm, where the correlations between the covariates and the outcomes are “learned” by the computer. It is less important, however, when the model that has been generated is applied to calculating risk for a given patient. At this stage, the loss of a single covariate out of a larger number may slightly reduce but does not destroy the predictive power of the risk score. It is possible to provide specific calibrations for scores calculated with an imputed variable.

Characterizing the state of the patient at a prediction time

As the clinical state of a patient evolves over time, it is important that the covariates used in the creation of the score be measured in tight temporal proximity to each other so that the correlations used to collectively capture the state of the patient at prediction time are preserved. If these correlations are lost or “smeared” by mixing covariates collected at various points over an extended time frame, the ability to define a patient’s state at a precise moment is weakened, reducing the ability to relate that state to the outcomes. Clearly, not all covariates are equally time-sensitive, and clinical judgment of the necessary time window is important. For instance, use of left ventricular ejection fraction collected within several weeks or even months from the point in time that risk is being calculated may be acceptable in a patient who has had a stable clinical course, whereas variables such as serum creatinine level or arterial blood gas saturation may need to be collected in much closer temporal proximity to the prediction time point.
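
One way to enforce covariate-specific time windows when assembling a patient’s state is sketched below. The window lengths are placeholders chosen only to mirror the contrast drawn above (a long window for ejection fraction, a short one for creatinine); they, and the names MAX_AGE and snapshot, are assumptions rather than recommendations.

```python
from datetime import datetime, timedelta

# Assumed, illustrative windows: how old a measurement may be at prediction time.
MAX_AGE = {
    "lvef": timedelta(days=90),
    "creatinine": timedelta(days=2),
}


def snapshot(measurements, prediction_time):
    """Latest in-window value per covariate, or None if nothing qualifies.

    measurements is an iterable of (name, taken_at, value) tuples.
    """
    state = {name: None for name in MAX_AGE}
    latest_time = {}
    for name, taken_at, value in measurements:
        if name not in MAX_AGE:
            continue
        if taken_at > prediction_time:
            continue  # never use data from after the prediction time
        if prediction_time - taken_at > MAX_AGE[name]:
            continue  # too old to describe the patient's current state
        if name not in latest_time or taken_at > latest_time[name]:
            latest_time[name] = taken_at
            state[name] = value
    return state


prediction_time = datetime(2021, 3, 10, 8, 0)
measurements = [
    ("lvef", datetime(2021, 1, 20, 9, 0), 35.0),        # 7 weeks old: accepted
    ("creatinine", datetime(2021, 2, 1, 7, 0), 1.9),    # 5 weeks old: rejected
    ("creatinine", datetime(2021, 3, 12, 7, 0), 1.4),   # in the future: rejected
]
state = snapshot(measurements, prediction_time)
```

A covariate left as None by this step then becomes a missing-data question, to be handled by imputation or exclusion.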

Overtraining

In most cases, the size of the cohorts on which risk scores can be developed is limited to the hundreds or the thousands, rather than the millions. This usually limits the complexity of the underlying algorithm that can be safely deployed; e.g., “deep” neural networks typically need large sample sizes to avoid over-fitting (aka over-training) and hence risk irreproducibility. In general, if the number of covariates is too large with respect to the training data set size, the algorithm is at risk of being “over-trained.” An over-trained algorithm is one that has too much freedom to focus on the detailed properties of the particular cohort on which it is trained, so that its performance does not generalize. This leads to inferior performance of the score when it is applied to populations other than the one from which it was derived. There are several techniques to recognize and avoid or minimize over-training [4]. The simplest one is to randomly split the cohort into two statistically independent sets: a “training” cohort, and a “testing” cohort. The ML algorithm is first trained, i.e., developed, on the training cohort only. Then its performance is evaluated on and compared between the training and testing cohorts. Statistically significant differences in performance are a sign of over-training.
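
The training vs. testing comparison can be illustrated with a deliberately extreme case. In the sketch below, which rests on several assumptions, the outcomes are pure noise unrelated to the covariates, and the “model” is a hypothetical memorizer that simply looks up each training patient, so any apparent skill on the training cohort is over-training by construction. The helper names are invented for this example, and a formal significance test of the gap is omitted.

```python
import random


def split_cohort(patients, test_fraction=0.3, seed=0):
    """Randomly split a cohort into independent training and testing sets."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_memorizer(training):
    """Over-trained on purpose: recall seen patients, else guess the majority."""
    table = dict(training)
    events = sum(outcome for _, outcome in training)
    majority = 1 if 2 * events >= len(training) else 0
    return lambda covariates: table.get(covariates, majority)


def accuracy(predict, cohort):
    """Fraction of patients whose outcome is predicted correctly."""
    correct = sum(1 for covariates, outcome in cohort if predict(covariates) == outcome)
    return correct / len(cohort)


# Synthetic cohort: two covariates and an outcome that ignores them entirely.
rng = random.Random(1)
cohort = [((rng.gauss(0, 1), rng.gauss(0, 1)), rng.randint(0, 1)) for _ in range(1000)]

training, testing = split_cohort(cohort)
model = fit_memorizer(training)

train_accuracy = accuracy(model, training)  # perfect recall of memorized patients
test_accuracy = accuracy(model, testing)    # expected to be near chance
gap = train_accuracy - test_accuracy
```

A large gap between the two accuracies is the signature described above; for a well-behaved model the two values should agree within statistical uncertainty.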

Machine learning algorithms

There are many different ML algorithms [5,6,7] that can be used, e.g., neural networks, decision trees, Bayesian methods, and support vector machines, each with several configurable parameters. Choosing among them is not obvious, and some trial and error is necessary. Fortunately, software implementing the different possibilities is easily available.

Selection of covariates

The choice of the set of covariates on which to base a risk score is guided by clinical judgment, practical considerations, e.g., what covariates are generally available, and avoidance of over-training. Unlike risk scores that are developed using standard approaches in which variables are selected according to their individual predictive power, covariates that are selected by the computer in ML approaches may not have predictive value by themselves. Testing performance as covariates are added or removed from the training set is also a powerful tool. An important reason to limit the number of covariates is that if a score is to be useful in clinical settings, it needs to be based on widely available and easily accessible data and it should be easy to calculate. Although the addition of a particular variable may improve predictive accuracy of a ML generated risk score, this may represent a pyrrhic victory if that variable is costly or not readily accessible to providers. As discussed in the following section, we have recently developed a score for predicting mortality risk in patients with HF that is based on just eight covariates (diastolic blood pressure and seven covariates from CBC (complete blood count) and CMP (comprehensive metabolic panel)), all of which are readily available in patients with HF.

External validation and bias-testing

Finally, biases due to the properties of the cohort on which the risk score is built are probably inevitable. Validations in external cohorts, e.g., from a different medical center or from a different country, are therefore crucial. A limited choice of widely available covariates also helps in making external validation possible. In addition, examining the performance of the score, stratified by other properties such as age, gender, ethnicity, and in-patient vs out-patient status, is very important, particularly when these data are not included as covariates.
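
Stratified performance checks of this kind reduce to computing the same discrimination metric separately within each subgroup. The sketch below assumes the area under the ROC curve (AUC) as the metric, computed as the fraction of (event, non-event) patient pairs that the score ranks correctly; the record fields and function names are hypothetical.

```python
def auc(scores, outcomes):
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    Ties count as half. Returns None if a group lacks events or non-events.
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        return None
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


def auc_by_subgroup(records, key):
    """Compute the AUC separately for each value of a stratifying property."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {
        value: auc([r["score"] for r in rows], [r["died"] for r in rows])
        for value, rows in groups.items()
    }


records = [
    {"setting": "in-patient", "score": 0.9, "died": 1},
    {"setting": "in-patient", "score": 0.2, "died": 0},
    {"setting": "in-patient", "score": 0.8, "died": 1},
    {"setting": "in-patient", "score": 0.1, "died": 0},
    {"setting": "out-patient", "score": 0.3, "died": 1},
    {"setting": "out-patient", "score": 0.7, "died": 0},
    {"setting": "out-patient", "score": 0.6, "died": 1},
    {"setting": "out-patient", "score": 0.4, "died": 0},
]
by_setting = auc_by_subgroup(records, "setting")
```

In this toy data the score separates in-patients perfectly but ranks out-patients worse than chance, the kind of discrepancy that pooled performance figures would hide.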

Application of machine learning in predicting risk in heart failure

Importance of knowing risk for future events

The ability to accurately predict risk of clinical events in patients with HF provides important information to providers and patients alike. Determination of the risk of an event helps identify patients who are most (or least) likely to derive benefit from targeted interventions for preventing future events. Assessment of a patient’s risk for these events is particularly valuable when the proposed treatment option is costly or associated with serious side effects. Knowledge of a patient’s risk for future events when integrated along with clinical, social, and other information provides a framework for therapeutic recommendations by providers and is helpful to patients and their families in planning for the future.

Approaches to predicting risk

Risk of future events can be determined in a variety of ways including assessment of patient characteristics, clinical status, test results, or biomarkers. Risk scores which combine information from a variety of the above domains using standard statistical analysis methods have also been developed to predict risk in patients with HF [8,9,10,11,12,13]. These risk scores, however, demonstrate only modest predictive power, especially when applied to populations outside the one used for their derivation [14,15,16,17]. A recent analysis of the predictive accuracy of several commonly used risk scores including the CHARM (Candesartan in Heart Failure-Assessment of Reduction in Mortality), GISSI-HF (Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto Miocardico-Heart Failure), MAGGIC (Meta-analysis Global Group in Chronic Heart Failure), and SHFM (Seattle Heart Failure Model) scores in a large European registry of patients with HF indicated only modest accuracy of all of these scores in predicting risk of death at 1 and 2 years [18]. The authors of this analysis concluded that the limited accuracy of these scores made physicians reluctant to use them in clinical practice and that more precise predictive tools are needed.

Machine learning for predicting incident HF risk

Machine learning techniques have been used to predict cardiovascular (CV) events, including the risk of incident HF. In a study of patients in the Multi-Ethnic Study of Atherosclerosis (MESA) cohort, Ambale-Venkatesh et al. used random survival forest (RSF) to predict six CV outcomes including new onset HF [19]. They identified the top-20 variables which were predictive of each outcome from a total of 735 variables derived from imaging studies, health questionnaires, and biomarker analysis. While the model predicted incident HF with an AUC of 0.84, this was only a modest improvement compared to the MESA-HF score (AUC of 0.8). A limitation of this risk score is that the top predictive variables included several laboratory values which are not widely available, e.g., tissue necrosis factor alpha soluble receptor and interleukin-2 soluble receptor. This model has not yet been validated in external populations and its generalizability is questionable. Segar et al. used a similar RSF technique to identify predictors of incident HF among patients with type 2 diabetes mellitus (DM) in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial [20]. The RSF models demonstrated somewhat better discrimination than the Cox-based method (C-index 0.77 [95% CI 0.75–0.80] vs. 0.73 [0.70–0.76], respectively). From the identified predictors, an integer-based risk score for 5-year HF incidence was created, the WATCH-DM (Weight [BMI], Age, hyperTension, Creatinine, HDL-C, Diabetes control [fasting plasma glucose], QRS Duration, MI, and CABG) risk score. Each 1-unit increment in the risk score is associated with a 24% higher relative risk of HF within 5 years. The WATCH-DM risk score can easily be applied in clinical practice. Both the risk score and the RSF-based risk prediction model were externally validated in a cohort of individuals with DM using the Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT) [21].
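
To give a feel for what a 24% higher relative risk per point means across larger score differences, the arithmetic can be sketched as follows. The sketch assumes that the per-point increase compounds multiplicatively, as it would under a proportional-hazards model; that assumption, and the function name relative_risk, are not taken from the WATCH-DM publication.

```python
def relative_risk(score_difference, per_point_increase=0.24):
    """Relative risk between two patients whose scores differ by some points.

    Assumes the per-point increase compounds multiplicatively.
    """
    return (1.0 + per_point_increase) ** score_difference


one_point = relative_risk(1)    # 1.24 by definition
five_points = relative_risk(5)  # 1.24 ** 5, roughly 2.93
```

Under this assumption, a patient scoring 5 points higher than another would carry close to three times the relative risk of HF within 5 years.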

Machine learning for predicting future events in patients with heart failure

Researchers have also applied ML to predict morbidity and mortality in patients with known HF (Table 1). Although numerous models have been created to predict HF readmissions, most have demonstrated only limited discriminative properties. Frizzell et al. used several ML algorithms to predict all-cause readmissions 30 days after discharge from a HF hospitalization in patients included in the GWTG-HF registry [22]. All of the models developed in this study showed modest discriminatory power, with C statistics consistently around 0.62. Awan et al. demonstrated a similar AUC of 0.62 using a multi-layer perceptron-based approach to predict risk of 30-day HF readmission or death in a population of patients above age 65 years admitted with HF [25]. Golas et al. were able to demonstrate modest improvement in risk prediction using several deep learning algorithms in a population of HF patients admitted within a large healthcare system [24]. The model developed using deep unified networks from >3500 variables from the EHR demonstrated the best performance, with an AUC of 0.705 for prediction of 30-day readmission. To our knowledge, these models have not been externally validated nor directly compared to traditional risk assessment tools, so it is not known if they can be applied to broad HF populations.

Table 1 Characteristics and outcomes in selected studies using machine learning to predict mortality and hospitalization in patients with heart failure

Machine learning models have overall been more successful at predicting mortality in patients with HF. In 2018, Ahmad et al. applied random forest modeling to identify predictors of 1-year survival in patients enrolled in the Swedish HF Registry [28]. Their model demonstrated excellent discrimination for survival with a C-statistic of 0.83. Additionally, they used cluster analysis to successfully identify 4 clinically relevant subgroups of HF with marked differences in 1-year survival and response to therapeutics. Samad et al. used ML techniques to incorporate a panel of 57 echocardiographic measurements with clinical variables to predict 5-year mortality in all patients undergoing echocardiography, including 15,492 patients with HF. For the patients with HF, the random forest model that was developed resulted in an AUC of 0.8, a significant improvement compared to the Seattle HF Model (AUC 0.63) [9, 30]. Using a deep learning algorithm, Kwon et al. have also demonstrated superior predictive abilities compared to traditional risk scores for predicting in-hospital, 1-year, and 3-year mortality for patients with acute HF [29]. For in-hospital mortality, their ML algorithm demonstrated an excellent AUC of 0.88, compared to the AUC of 0.728 from the GWTG-HF score [10]. They were also able to successfully predict long-term mortality, with an AUC of 0.782 for 1-year mortality and an AUC of 0.813 for 3-year mortality. This algorithm outperformed the MAGGIC score, which demonstrated AUCs of 0.718 and 0.729 for 1- and 3-year mortality, respectively [13]. While these models have demonstrated success at risk prediction, they have not been externally validated and have not yet been widely incorporated into clinical care.

Jing et al. were recently able to demonstrate that, in addition to risk-stratifying patients, ML can be used to identify individuals most likely to benefit from interventions with evidence-based therapies [31]. Using data from the EHR, they trained ML models to predict 1-year all-cause mortality in 26,971 patients with HF. Their model included clinical variables, diagnostic codes, electrocardiogram and echocardiographic measurements, and 8 evidence-based “care gaps”: flu vaccine, blood pressure of <130/80 mmHg, A1c of <8%, cardiac resynchronization therapy, and active medications (angiotensin-converting enzyme inhibitor/angiotensin II receptor blocker/angiotensin receptor-neprilysin inhibitor, aldosterone receptor antagonist, hydralazine, and beta-blocker). Their best performing model demonstrated an AUC of 0.77 for 1-year all-cause mortality, which was superior to the performance of the Seattle HF Model (AUC 0.57) [9]. Of the 13,238 living patients, 2844 (21.5%) were predicted to die within 1 year based on the estimated mortality rate. Simulating closure of the 8 care gaps that existed in these patients resulted in a 1.7% improvement in the absolute mortality rate, with 231 additional patients predicted to survive beyond 1 year. These findings require prospective evaluation and external validation, but they are promising and highlight the potential utility of ML methods to guide clinical action and to identify patients most likely to benefit from optimization of evidence-based therapies.
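
The percentages quoted from this simulation follow directly from the patient counts, as the short check below shows. Only the three counts come from the text above; the variable names are invented for this sketch, and the relative reduction is a derived figure that is not reported above.

```python
living_patients = 13238
predicted_deaths = 2844       # predicted to die within 1 year
additional_survivors = 231    # after simulated closure of the 8 care gaps

# Predicted 1-year mortality before closing care gaps (reported as 21.5%).
predicted_mortality_pct = 100 * predicted_deaths / living_patients

# Absolute improvement in mortality (reported as 1.7%).
absolute_improvement_pct = 100 * additional_survivors / living_patients

# Derived here, not reported in the text: share of predicted deaths averted.
relative_reduction_pct = 100 * additional_survivors / predicted_deaths
```

Distinguishing the absolute improvement (about 1.7 percentage points) from the relative reduction (about 8% of predicted deaths) matters when the result is compared with effect sizes reported elsewhere.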

Derivation and validation of MARKER-HF

The Machine learning Assessment of RisK and EaRly mortality in Heart Failure (MARKER-HF) risk score for predicting mortality in patients with HF was developed using non-parametric analysis methods to incorporate interactions between variables that have prognostic value [32]. It was derived and internally validated in patients who were identified at the time of the first mention of a HF diagnosis in either the in-patient or out-patient setting in their University of California, San Diego Healthcare System EHR. Patients who met entry criteria were divided into training and validation cohorts. MARKER-HF was constructed in the training cohort using a boosted decision tree model to discriminate between patients at the extremes of risk of death: patients who died within 90 days were considered “high risk,” while those known to be alive after 800 days were designated “low risk.” Data from complete blood count, comprehensive metabolic panel, vital sign measurement, electrocardiogram, and echocardiogram, all performed within 7 days of the patients’ index HF event, were used to construct the model. The number of variables selected was determined by balancing inclusiveness, so that patients with missing data would not be lost, against the need to avoid over-training, which could reduce the accuracy of the score in other populations. Ultimately, a composite of eight variables (diastolic blood pressure, creatinine, blood urea nitrogen, hemoglobin, white blood cell count, platelets, albumin, and red blood cell distribution width) that discriminated with a high degree of accuracy between patients at low and high risk was identified. Predictive accuracy was demonstrated across the entire spectrum of risk and was confirmed in the validation cohort. Comparison of MARKER-HF predicted survival in the training and validation cohorts is depicted in Fig. 2, Panel A. No evidence of over-training was detected.
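
The labeling of patients at the extremes of risk can be written down compactly. The sketch below is one reading of the description above, not the published implementation: it assumes that “within 90 days” includes day 90, that a patient who died after day 800 still counts as known to be alive after 800 days, and that patients between the two extremes receive no training label. The function and argument names are hypothetical.

```python
HIGH_RISK_MAX_DAYS = 90    # died within this many days of the index event
LOW_RISK_MIN_DAYS = 800    # known to be alive beyond this many days


def training_label(days_to_death, days_known_alive):
    """Return 1 (high risk), 0 (low risk), or None (no training label).

    days_to_death: days from index event to death, or None if no death recorded.
    days_known_alive: days from index event to the last time the patient
        was known to be alive.
    """
    if days_to_death is not None and days_to_death <= HIGH_RISK_MAX_DAYS:
        return 1
    if days_known_alive > LOW_RISK_MIN_DAYS:
        return 0
    return None  # intermediate or insufficient follow-up


labels = [
    training_label(days_to_death=30, days_known_alive=29),      # early death
    training_label(days_to_death=None, days_known_alive=1200),  # long survivor
    training_label(days_to_death=400, days_known_alive=399),    # intermediate
    training_label(days_to_death=None, days_known_alive=300),   # short follow-up
]
```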

Fig. 2

Panel A. Comparison of 1-year survival probabilities as a function of MARKER-HF for the training and validation cohorts in the UCSD cohort. The values obtained in the training and validation cohorts do not differ significantly, indicating no significant overtraining. Panel B. The ROC curves for MARKER-HF in the University of California, San Diego (UCSD), University of California, San Francisco (UCSF) and BIOlogy Study to TAilored Treatment in Chronic Heart Failure (BIOSTAT-CHF) populations. The C-statistics obtained are consistent with each other within their 95% confidence intervals. Panel C. ROC curves for NT-proBNP and MARKER-HF. MARKER-HF demonstrates superior prediction compared to the biomarker.

The ability of a risk score to maintain its accuracy in clinically relevant subgroups of the HF population and in other independent populations is essential if it is to be widely used in clinical practice. Consequently, we assessed the predictive accuracy of MARKER-HF in patients according to sex, race, in-patient vs out-patient status and acuity at the time of identification (determined by whether the index diagnosis was HF or pulmonary edema) and found it to perform equally well in all subgroups. External validation demonstrated that MARKER-HF maintained its predictive accuracy in populations followed in the University of California, San Francisco (UCSF) Healthcare System and the European based BIOSTAT-CHF Registry [32] (Fig. 2, Panel B).

Finally, to determine if MARKER-HF was superior to other predictors of risk, we compared its performance to NT-proBNP, a well-validated HF biomarker, and to other risk scores used to predict mortality. While MARKER-HF scores tracked reasonably well with levels of NT-proBNP, MARKER-HF proved a much more reliable predictor of mortality than did the biomarker, which had an AUC of 0.69, as shown in Fig. 2, Panel C. Initially, NT-proBNP was excluded as a covariate during the derivation of MARKER-HF due to its low availability (~50%). When it was added to the other eight variables in the MARKER-HF score, predictive accuracy was not significantly improved. MARKER-HF also proved to be superior to the Intermountain Risk Score (IMRS) [7], the Get With the Guidelines-HF (GWTG-HF) score [11], and the Acute Decompensated Heart Failure Registry (ADHERE) risk score [8] in predicting mortality risk in the UCSD, UCSF, and BIOSTAT-CHF populations.

Future directions

By virtue of their ability to assess large data bases and find unsuspected associations between covariates, ML approaches offer a powerful new approach for risk assessment in patients with HF. The ability to accurately predict risk of future events is critical for providers in making recommendations about the advisability of specific therapies and for patients and their families who must plan for the future. In this review, we have outlined criteria for extracting and managing data that we believe are essential in developing ML based risk scores. In all cases, external validation in independent populations is needed to determine generalizability and comparison with available tools is required in order to know if they are superior to other approaches. For risk scores to be effective, they should require input of a manageable number of variables that are easily accessible and widely collected in routine patient care, so as not to constrain their use.

Finally, although this review has focused on the use of ML to develop risk scores for hospitalization and mortality in patients with HF, ML approaches can be applied to a variety of situations in either this or other populations. There is great need for novel approaches for calculating risk of other adverse events known to occur in patients with HF, such as stroke, atrial fibrillation, and sudden cardiac death, as well as the risk of adverse consequences of specific therapies designed to treat these conditions. Machine learning approaches may also be useful in future clinical trials in HF by helping to determine which patients to enroll.