Introduction

Accurate risk stratification remains a central theme in all stages of the care of patients with cardiovascular disease. Indeed, the likelihood that any patient will benefit from a given therapeutic intervention is a function, in part, of the risk associated with the intervention itself versus the risk that the patient will have an adverse event if no intervention is performed. Informed clinical decision making necessitates gauging patient risk using available clinical information.

A number of societal guidelines recommend the use of validated risk scores in the initial evaluation of patients with suspected coronary disease [1,2,3]. The use of accurate risk scores helps to ensure that patients who are at high risk of adverse outcomes are quickly identified and assigned a therapy that is appropriate for their level of risk. Nevertheless, risk stratification is far from a perfect science, and risk scores often fail to identify patients at high risk of adverse outcomes. This problem is compounded by the fact that only a minority of patients with cardiovascular disease experience the gravest adverse outcomes. Moreover, while the prevalence of adverse events in high-risk populations is, by definition, large, the absolute number of events is also large among patients who are predicted to be low risk by traditional risk prediction metrics. This low risk-high number dilemma is frequently encountered in many areas of cardiovascular clinical research [4]. As such, adequately identifying patient subgroups who are truly at high risk of adverse events remains a clear unmet clinical need, and novel methods are needed to realize the full potential of clinical risk stratification from existing clinical observations. Machine learning, and deep learning in particular, holds the potential to robustly identify high-risk patient subgroups, suggest personalized interventions that can reduce a given patient’s risk, and help ensure that appropriate resources are allocated to the patients who need them most.
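To make the arithmetic behind this dilemma concrete, consider the following purely illustrative calculation (the group sizes and event rates below are hypothetical and are not drawn from any cited study):

```python
# Illustrative arithmetic only -- these cohort sizes and event rates are
# hypothetical, not taken from any of the studies cited in this review.
n_high, rate_high = 1_000, 0.20    # small high-risk group, 20% event rate
n_low, rate_low = 20_000, 0.03     # large low-risk group, 3% event rate

events_high = n_high * rate_high   # 200 events
events_low = n_low * rate_low      # 600 events

# Most events occur in the nominally "low-risk" group, even though its
# per-patient event rate is much lower.
print(f"High-risk events: {events_high:.0f}")
print(f"Low-risk events:  {events_low:.0f}")
```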

This review does not strive to survey all of the relevant literature on deep learning in cardiovascular medicine. Rather, it is written for the practicing clinician and aims to provide intuitive explanations of how deep learning models actually work and where they are most applicable. As the use of these models becomes ubiquitous in the clinical arena, health care providers will need to evaluate them critically in order to determine the clinical usefulness of any given machine learning approach. Our goal is to provide a general framework for understanding what advantages these models hold and what considerations limit their broad applicability.

Conventional approaches to risk stratification

The term machine learning is believed to have been coined in 1959 by Arthur Samuel, an engineer and scientist who pioneered artificial intelligence [5]. He described it as “programming computers to learn from experience.” There are diverse examples of machine learning in the clinical literature, ranging from straightforward approaches like logistic regression and Cox proportional hazards modeling to more esoteric techniques like deep learning, which is described in the next section. Indeed, the former methods have been a part of the clinical literature for some time [6,7,8]. Therefore, while the term machine learning has only recently entered the medical lexicon, a number of existing clinical risk scores were developed and refined using approaches that fall under this umbrella term. The list of such models is too lengthy to review exhaustively here. Instead, we focus on some approaches that are commonly used to assess patient risk.

One of the earliest models for quantifying the risk of adverse cardiovascular outcomes was developed by Killip et al. in 1967, in which 250 patients were divided into four simple classes of increasing severity of illness, ranging from no clinical signs of heart failure to cardiogenic shock [9]. The primary goal of that study was to trial an improved workflow for cardiac intensive care, but the data collected over the course of the study revealed patterns in patient survival based on class (now called the Killip class). The utility of these classes for identifying high-risk patients has been borne out in a number of studies, and they remain a part of the clinical assessment of patients who present with an acute myocardial infarction.

Over time, increasingly sophisticated statistical techniques have been used to develop risk stratification models. Both the Framingham risk score, which quantifies the risk of adverse events (death from coronary heart disease, nonfatal MI, angina, stroke, transient ischemic attack, intermittent claudication, and heart failure) in patients with no prior history of cardiac disease, and the Global Registry of Acute Coronary Events (GRACE) score, which quantifies all-cause mortality in patients who present with an acute coronary syndrome (ACS), were developed using Cox proportional hazards regression [10, 11]. Another class of risk scores, developed from and named for the Thrombolysis in Myocardial Infarction (TIMI) study groups, was designed specifically for patients who present with symptoms consistent with an ACS. Here, features that were discriminatory with respect to the combined outcome of all-cause mortality, new or recurrent MI, or severe recurrent ischemia in the study cohort were selected using logistic regression, and seven features were retained in the final model. To use the risk score itself, the physician simply counts the number of features that are present; the resulting count estimates the short-term risk of mortality after an ST-segment elevation MI, or of a combined outcome of all-cause mortality, new or recurrent MI, or severe recurrent ischemia requiring revascularization after a non-ST-segment elevation ACS [12, 13].
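As a rough illustration of how such count-based scores work, the sketch below simply tallies the number of risk criteria present for a patient. The feature names are hypothetical placeholders rather than the actual TIMI criteria, which are defined in the original publications [12, 13].

```python
# Minimal sketch of a count-based integer risk score in the spirit of TIMI.
# The criteria below are hypothetical placeholders, not the real TIMI features.
def count_based_risk_score(patient: dict, criteria: list[str]) -> int:
    """Return the number of risk criteria present for a patient."""
    return sum(1 for c in criteria if patient.get(c, False))

criteria = ["criterion_1", "criterion_2", "criterion_3"]   # placeholders
patient = {"criterion_1": True, "criterion_2": False, "criterion_3": True}

score = count_based_risk_score(patient, criteria)   # 2
# The integer score is then mapped to a short-term event rate via a lookup
# table derived from the study cohort.
```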

Regression modeling has also found a role in quantifying patient risk in disorders other than ischemic heart disease. Pocock et al., for example, performed a meta-analysis of 39,372 heart failure patients from 30 different studies. They used multivariable piecewise Poisson regression to identify features that are predictive of mortality at 3 years. These features were then converted into an integer risk calculator, called the Meta-analysis Global Group in Chronic Heart Failure (MAGGIC) score, with higher values corresponding to greater risk [14]. Similarly, the Seattle Heart Failure Model was developed on a cohort of 1125 patients using a multivariate Cox proportional hazards model; it provides estimates of 1-, 2-, and 3-year mortality [15, 16].

Logistic regression and proportional hazard models are advantageous because they are easy to interpret: each clinical feature in the model has an associated weight that corresponds to how important that feature is for the model arriving at a particular result. However, such models are relatively simple and cannot necessarily capture complex mechanisms relating observations and outcomes of interest.
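To illustrate this interpretability, the brief sketch below fits a logistic regression model to synthetic data and reads off its coefficients. The feature names and data are invented for illustration only.

```python
# Sketch: fitting a logistic regression risk model and reading its weights.
# Feature names and data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # e.g., age, SBP, creatinine (standardized)
# Simulated outcome whose log-odds depend linearly on the first two features.
y = (rng.random(500) < 1 / (1 + np.exp(-(1.0 * X[:, 0] + 0.5 * X[:, 1])))).astype(int)

model = LogisticRegression().fit(X, y)
for name, w in zip(["age", "sbp", "creatinine"], model.coef_[0]):
    # Each weight is the change in log-odds per one-unit change in the
    # (standardized) feature, which is what makes the model easy to interpret.
    print(f"{name}: odds ratio ~ {np.exp(w):.2f}")
```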

What is deep learning?

The diverse, nonuniform terminology in the medical literature unfortunately tends to obfuscate the meaning of the term “deep learning.” Deep learning is a subfield of machine learning that strives to find powerful abstract representations of data using complex artificial neural networks (ANNs) that are then used to accomplish some prespecified task. While these abstract data representations are powerful ways to describe clinical data, they are difficult to comprehend and explain; that is why they are, indeed, “abstract.”

ANNs correspond to a class of machine learning algorithms whose structure is inspired by the structure of the human brain and by how humans are believed to compute [17, 18]. A neural network consists of interconnected artificial neurons that pass information between one another. A typical ANN contains an input layer, composed of several artificial neurons that take clinically meaningful data as input. The input layer then passes the clinical data to other inner, or “hidden,” layers, each of which performs a series of relatively simple computations. At each layer, more abstract representations of the input data are obtained. Eventually, the information is passed to an output layer that yields a clinically meaningful quantity (Fig. 1).

Fig. 1

In our applications, a neural network acts as a function that takes some observations as input and produces a prediction of outcomes as the output (a). This function is generated by combining many simple functions (represented by circular nodes that process information), each of which takes all the outputs of the previous layer as its input, which renders the network “fully connected” (b). These simple functions are strictly increasing and include parameters (\( \overrightarrow{w}^{(i)}, b^{(i)} \) for each node), which are chosen by training the network (c). Each layer can be thought of as an abstraction of the data, which is eventually separable in the last layer if the model works well. The output of the last layer is the probability of an adverse event, which a clinician may use to inform her clinical decisions (d).
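For readers who wish to see the structure of Fig. 1 expressed in code, the following minimal sketch (written in PyTorch, one of several possible frameworks) builds a small fully connected network that maps ten hypothetical clinical features to the probability of an adverse event. The layer sizes are arbitrary illustrative choices.

```python
# A minimal sketch of the fully connected network drawn in Fig. 1:
# an input layer of clinical features, two hidden layers, and an output
# giving the probability of an adverse event.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(in_features=10, out_features=32),  # input layer -> hidden layer 1
    nn.ReLU(),                                   # simple nonlinear "neuron" function
    nn.Linear(32, 16),                           # hidden layer 1 -> hidden layer 2
    nn.ReLU(),
    nn.Linear(16, 1),                            # hidden layer 2 -> output layer
    nn.Sigmoid(),                                # squashes the output to a probability
)

x = torch.randn(1, 10)   # one patient, ten clinical features
risk = model(x)          # estimated probability of an adverse event
```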

Deep learning models, in practice, correspond to neural networks that contain several hidden layers. These models, originally referred to as multilayer perceptrons, were popularized in the early 1980s for applications such as image and speech recognition, then receded in popularity in favor of simpler, easier to train, and perhaps more explainable models [19, 20]. In recent years, however, deep neural network (DNN) learning has resurged dramatically both because of the availability of so-called “big data” and the development of computational methods that facilitate the training of large neural networks. In many of today’s applications, these networks can be quite large, having on the order of 10⁵–10⁶ artificial neurons and millions of modifiable parameters. Parenthetically, as the size of clinical datasets is typically much smaller, care must be taken when implementing these models to ensure that they are not overtrained.

While the structure of ANNs, and of DNNs in particular, is inspired by the structure of neurons in the human brain, these models are best thought of as universal function approximators. Indeed, it has been mathematically proven that, under certain constraints, any continuous function on a compact space can be represented by a neural network [21, 22]. These models therefore form an efficient platform for generating functions that model complex relationships between patient characteristics/features and outcomes. This highlights an important difference between DNNs and simpler methods like logistic regression, which models the relationship between the outcome (i.e., the logarithm of the odds) and patient features as a linear function. By contrast, a DNN corresponds to a complex, highly nonlinear function that takes patient information as input (including medical images) and outputs the corresponding outcome. An additional advantage of DNNs is that they can use input data in “raw” form, with little preprocessing.
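Schematically, the universal approximation results state that, under appropriate conditions, any continuous function \( f \) on a compact domain can be approximated arbitrarily well by a single-hidden-layer network of the form

\[ f(\overrightarrow{x}) \approx \sum_{i=1}^{N} v_i \, \sigma\!\left( \overrightarrow{w}^{(i)} \cdot \overrightarrow{x} + b^{(i)} \right), \]

where \( \sigma \) is a suitable nonlinear activation function and the number of hidden neurons \( N \) may need to be large [21, 22].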

Deep learning models can, in principle, capture complex, nonlinear relationships between patient features and outcomes and can therefore meet the basic requirement of strong average predictive performance. However, because these models generate abstract representations of the input data, it can be very difficult to understand what the model has learned and, consequently, why the model arrives at a particular result. Moreover, understanding when the model will fail (i.e., which patients are most likely to receive an incorrect prediction) can be just as challenging.

Evaluating deep learning risk models

Standard performance metrics, such as the area under the receiver operating characteristic curve (AUC), accuracy, and sensitivity/specificity, provide useful information for gauging how a risk model will perform, on average. Nevertheless, these metrics do not by themselves offer any interpretative insights, nor do they help the user understand how the model will perform on any individual patient. The upshot is that conventional statistical metrics of success are not always sufficient to determine the clinical utility of a deep learning model.
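As a point of reference, the brief sketch below shows how these population-level metrics are typically computed from a vector of predicted probabilities. The data are synthetic, and the 0.5 decision threshold is an arbitrary illustrative choice.

```python
# Sketch: computing population-level performance metrics from model
# probabilities. Data are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])

auc = roc_auc_score(y_true, y_prob)

y_pred = (y_prob >= 0.5).astype(int)   # threshold chosen for illustration
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# These numbers summarize average behavior over a cohort; they say nothing
# about whether the model's prediction for an individual patient is reliable.
print(f"AUC={auc:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```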

When evaluating applications of machine learning to medical problems, there are particular criteria that must be considered given our current understanding of human physiology and the reality of medical practice (Fig. 2). In addition to performing well, on average, on the population of interest, a good algorithmic solution should ideally also:

  1. Provide information about potential failure modes; i.e., indicate when it is likely to yield a false result;

  2. Be explainable, in the sense that clinicians can understand why the model arrives at a particular result.

Fig. 2

Issues that hinder the clinical acceptance of deep learning models.

Although determining when a model will fail is challenging, it is an essential task. Formally, this can be understood as identifying, a priori, patient characteristics or subgroups that are associated with incorrect predictions. The development of methods that identify such “failure modes” is a nascent area of research within the machine learning community, with most of the published work appearing in specialized machine learning conferences or non-peer-reviewed online preprint archives and little associated work appearing in the clinical literature. Nevertheless, insights into when a model will fail can often be garnered if the model itself is explainable; i.e., understanding how and why a model arrives at a particular result often provides clues as to how the model can yield an incorrect result.

Recently, a new method was described for identifying when a given clinical risk score will yield unreliable results [23••]. The approach identifies, a priori, patient cohorts associated with reduced model accuracy, discriminatory ability, and poor calibration. Application to the GRACE risk model correctly identifies patient cohorts where the GRACE score has reduced performance. Advantages of the method are that it is straightforward to implement and that it can be applied to any risk model, regardless of how the risk model was developed—thereby making the approach appropriate for deep learning models. General methods along these lines will likely play an increasingly important role in determining when complex risk models are expected to yield useful predictions.

In addition to deciphering when a given model is likely to fail, developing methods that “explain” what a model has learned is an important part of any comprehensive strategy to maximize clinical acceptance. Nevertheless, conceptions of explainability or interpretability are diverse, and it is difficult to pin down exactly what these terms mean in the context of machine learning models. In his article, “The Mythos of Model Interpretability,” Zachary Lipton identifies five types of interpretability for machine learning models: trust, causality, transferability, informativeness, and fairness [24]. Of particular interest for medical algorithms are causality and informativeness. Causality describes whether the relationships discovered by the model are truly causal or merely correlative. While causality in machine learning is an active area of research, it remains very difficult to tease out causal relationships from a retrospective analysis of any dataset [25]. An informative deep learning model provides some intuition to support how it arrives at a given result. In order to impart useful intuitions, however, one needs to translate the abstract representations learned by a deep learning model into language that is easily understood by the health care practitioner. In short, in the medical context, we ideally need models that yield insights that are translatable into the language of physiology (Fig. 2).

There are a limited number of tools that have been used to provide interpretations/explanations of what a deep neural network has learned. Shapley values, Gradient-weighted Class Activation Mapping (Grad-CAM), and saliency maps represent a class of methods that can provide insight into which input features are most responsible for a risk model’s prediction [26,27,28]. Grad-CAM and saliency maps, in particular, are typically used with convolutional neural networks (described below) and provide insight into the relative importance of different parts of an image for a specific prediction [29]. For example, consider a model trained to distinguish between different objects, such as dogs and humans. A saliency map may reveal that pixels corresponding to the legs (four for a dog and two for a human) are most dispositive. Hence, for a task as simple as differentiating humans from dogs, saliency maps provide easily understood “explanations.” For more complex classification tasks, however, saliency maps may not yield such readily interpretable insights. Indeed, these methods generally do not provide information about how the data in the highlighted regions were used to arrive at a particular decision, nor do they necessarily provide any causal insights. More generally, it has been argued that attempts to explain deep models are inherently flawed because such post hoc explanations can never have true fidelity with respect to the original complex model [30•]. In this vein, the use of interpretable models has an advantage in that such models are designed to yield explanations that can be understood by domain experts. Nevertheless, it is not clear that commonly used interpretable models can capture the complex nonlinear relationships described above in a manner that yields clear explanations. A compromise may be to combine mechanistic/physiologic models with deep learning models to enhance both explainability and predictive performance; this is an active area of research.
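As an illustration of the general idea behind gradient-based attribution, the sketch below computes a simple saliency map by taking the gradient of a model’s predicted risk with respect to its input pixels. This is a generic example (not the Grad-CAM algorithm itself), and the model is assumed to be an already trained image-based risk predictor.

```python
# A minimal sketch of a gradient-based saliency map: the magnitude of the
# gradient of the predicted risk with respect to each input pixel indicates
# which parts of the image most influence the prediction.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image has shape (1, channels, H, W); returns per-pixel importance (1, H, W)."""
    model.eval()
    image = image.clone().requires_grad_(True)
    risk = model(image).sum()        # assumes the model outputs a single scalar risk
    risk.backward()                  # gradients of the risk w.r.t. the pixels
    return image.grad.abs().max(dim=1).values
```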

It has been argued that clinicians should embrace black box models rather than strive to develop explanations that provide insight into how a model arrives at a particular result [31]. Proponents of this thesis argue that clinical decision-making is frequently rooted in an incomplete understanding of the disease process in question and of how the potential intervention actually works. Hence, requiring deep learning models to be explainable holds them to a higher standard than other methods used to inform clinical decision making and may further stymie innovation in this space.

While there is merit to this argument, there is little doubt that clinical decisions are grounded in some understanding of the disease process. Indeed, it is precisely this, albeit imperfect, understanding that guides our therapeutic choices. By contrast, deep learning models represent an unprecedented level of opaqueness with respect to clinical understanding. When a model is a black box and only statistical measures of its overall performance are available, additional information is needed to determine whether a model’s prediction is appropriate for a specific patient. While the identification of model failure modes and explainability are distinct concepts, they are related. Failure mode analyses strive to identify patient subgroups where the model has reduced performance, and a comprehensive understanding of how a complex model arrives at a particular result provides further assurance that the model is appropriate for a given patient with a given set of clinical characteristics. Explanations that are inconsistent with our understanding of the underlying pathophysiology, for example, should not be trusted.

In sum, it is our view that deep learning models for any clinical application should be evaluated against these criteria, in addition to standard statistical measures of performance. In what follows, we discuss several recent applications of deep learning methods for cardiovascular risk stratification and evaluate them relative to the criteria discussed above.

Deep learning for risk prediction

Deep learning for image classification has a relatively extensive literature. Indeed, the ImageNet challenge—a worldwide competition for classifying millions of curated images—has led to the development of many sophisticated algorithms for image classification [32]. In a number of applications, these image classification algorithms have been modified and fruitfully applied to clinical images to quantify patient risk, although to date they have mainly been used for automatic disease diagnosis from pathology slides and radiological scans [33,34,35,36,37,38]. These algorithms are usually implemented using a class of DNNs called convolutional neural networks (CNNs). CNNs are inspired by the structure of the mammalian visual cortex, where each neuron “sees” a small region of the visual field, called the receptive field of that neuron [39]. In a CNN, the information contained in adjacent groups of pixels of an image, analogous to the receptive field, is summarized using a mathematical operation called a convolution to create an abstraction of the information in the image [40].
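The sketch below illustrates the convolution operation itself: each output value summarizes a small local patch of the input image, analogous to a receptive field. The image size, kernel size, and number of feature maps are arbitrary illustrative choices.

```python
# Sketch of a single convolutional layer: each output value summarizes a
# small patch (the "receptive field") of the input image.
import torch
import torch.nn as nn

image = torch.randn(1, 1, 64, 64)   # one grayscale image
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
feature_maps = conv(image)          # shape (1, 8, 64, 64)

# Each of the 8 feature maps is an abstraction of the image built from
# local 3x3 neighborhoods; stacking such layers yields a CNN.
```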

In cardiology, deep learning work has been focused on the automatic interpretation of cardiac images, with few applications to the development of models that directly quantify patient risk [41]. Recent studies have highlighted the ability of CNNs to identify echocardiographic windows using the images alone [42, 43], correctly segment the left ventricle in both cardiac CT images and cardiac MRIs [44, 45], and accurately detect cardiac MR motion artifacts [46]. The use of CNNs to garner insights into the risk of future adverse outcomes, however, is still a nascent area of investigation.

A recent study that uses medical image data to assess cardiovascular risk was published by Poplin et al. [47••]. In that work, the authors used a CNN to predict age, gender, smoking status, systolic blood pressure, diastolic blood pressure, and, most importantly, major adverse cardiovascular events (MACE) within 5 years from the time that retinal fundus images were acquired. The dataset used to develop and validate the model was obtained from the UK Biobank and EyePACS (a retinal image database consisting of images obtained during routine diabetic screening in clinics in the USA). They report an AUC of 0.70 for predicting MACE at 5 years using their deep learning algorithm. This performance exceeds that of predictions based on single risk factors such as age and systolic blood pressure. However, it does not outperform an existing, simpler proportional hazards model, SCORE (Systematic COronary Risk Evaluation), proposed by Conroy et al. in 2003 [48]. In addition to predicting risk, the authors used saliency maps, described above, to attempt to explain their algorithm; these maps highlight portions of the retinal images that contributed significantly to the model’s predictions. The usefulness of these saliency maps is limited, however, because they give no information about the mechanism by which certain features of the retina relate to cardiovascular risk, or about whether the deep learning model has recapitulated that mechanism.

Recently, there have been attempts to extend classification algorithms, originally designed to analyze medical images, to other types of data in the Electronic Medical Record (EMR). The EMR can be divided into two types of data: structured and unstructured. Structured medical data refers to what can be found in the pre-existing fields within the electronic medical record, e.g., lab results, vital signs, and demographic information. Unstructured data refers to what appears in medical notes written by health care practitioners. In a recent study, Mayampurath et al. assembled structured data from the electronic health record into a visual format that could then be used to train a CNN to predict in-hospital outcomes [49]. Essentially, the EMR is converted to a two-dimensional medical image, which enables the use of standard machine learning techniques appropriate for medical image processing. The image itself maps time on one axis and 156 clinical variables (including vital signs, laboratory results, medications, diagnostic tests, and nurse examinations), recorded over the first 48 h of admission, on the other axis. Overall, the discriminatory ability of the best performing CNN (the authors considered more than one) was 0.91, suggesting that the method holds considerable promise.
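The sketch below illustrates the general idea of such a representation: structured EMR measurements are arranged into a two-dimensional array with time on one axis and clinical variables on the other. The dimensions mirror those described above, but the data, layout details, and imputation scheme are illustrative assumptions rather than the exact preprocessing used by the authors.

```python
# Sketch: arranging structured EMR data as a 2D "image" (time x variables)
# so that image-oriented models (CNNs) can be applied. Values and the
# imputation scheme are illustrative placeholders.
import numpy as np

n_hours, n_variables = 48, 156
emr_image = np.full((n_hours, n_variables), np.nan)

# measurements: (hour, variable_index, value) tuples pulled from the EMR
measurements = [(0, 3, 98.6), (1, 3, 99.1), (2, 10, 120.0)]
for hour, var, value in measurements:
    emr_image[hour, var] = value

emr_image = np.nan_to_num(emr_image, nan=0.0)   # simple placeholder imputation
# emr_image can now be treated as a single-channel "image" of shape (48, 156).
```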

A significant advantage here is that the authors can leverage methods used to “interpret” what CNNs have learned about images to help explain why their deep learning model arrives at a particular result. In their work, the authors used a standard method, Gradient-weighted Class Activation Mapping (Grad-CAM), to understand which clinical features are most important for discriminating between patients who die in hospital and those who do not [28]. Not surprisingly, the method finds that vital signs, interventions (e.g., mechanical ventilation), and administered medications were important for distinguishing between those who would have an in-hospital event and those who would not. Of interest, the model does suggest that simple nursing examinations, represented by Morse and Braden scores, may be important for predicting in-hospital mortality. It is also noteworthy that there are many different ways to organize data arising from the EMR into two-dimensional representations, and not all visual representations will carry the same prognostic information; the authors of this study experimented with only three ways of organizing the data.

While these results are encouraging, the problem of predicting in-hospital mortality using 48 h of admission data may be, relatively speaking, not that difficult. For example, one would likely do fairly well predicting in-hospital mortality using a simplified set of input features that includes where the patient is admitted (ICU vs. hospital floor), vital sign trajectories during the first 48 h (higher death rates are expected in patients who become hypotensive soon after admission), and whether the patient requires mechanical ventilation or inotropic support soon after admission. As the authors do not compare their method to a simpler approach, such as a logistic regression model built on a rich set of clinical features, it is not clear whether a CNN is truly necessary for this task.

One very popular data source for machine learning is the electrocardiogram, because it is routinely measured, cheap to acquire, and apparently rich in information, some of which may not be easily discernible by humans. In addition, a variety of deep learning methods exist that can effectively handle time series data, much like that arising from single-lead and multiple-lead ECGs. Many of these approaches have already been applied to the interpretation and classification of electroencephalographic signals [50].

Attia et al. also mined the ECG for new information by attempting to predict left ventricular systolic dysfunction, as determined by transthoracic echocardiography (TTE), from the 12-lead ECG using a convolutional neural network [51]. As LV dysfunction itself is a powerful predictor of subsequent heart failure, the resulting network indirectly identifies patients at elevated risk of adverse events [52]. By traditional statistical metrics (e.g., AUC), their classifier performed extremely well, with some exceptions (positive predictive value). The low positive predictive value (PPV) tells us that the model produces many false positives, but, crucially, it does not help us predict when the model will fail; i.e., for which types of patients. The work also does not provide insight into the details of the relationship between the ECG and LV dysfunction. For example, some determination of which segments of the ECG contribute to the prediction would be highly informative and of scientific interest.
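The effect of outcome prevalence on PPV is worth making explicit. The following illustrative calculation, with hypothetical numbers that are not taken from Attia et al., shows how even a sensitive and specific model yields a low PPV when the outcome is uncommon.

```python
# Illustrative arithmetic only: with a low-prevalence outcome, even a model
# with good sensitivity and specificity yields a low PPV.
# These numbers are hypothetical, not from Attia et al.
sensitivity, specificity, prevalence = 0.90, 0.90, 0.05

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(f"PPV ~ {ppv:.2f}")   # ~0.32: most positive predictions are false positives
```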

Myers et al. applied a recurrent neural network (RNN)—a structure used to analyze time-series data—to continuous ECG data, along with a set of patient features, to predict the risk of death 1 year after non-ST segment elevation myocardial infarction (NSTEMI) [53]. For these studies, samples from the ST segments of each beat were identified and extracted in an automated fashion and then used as input to the RNN. The resulting neural network, which incorporates information from approximately 1 min of continuous ECG data, had improved predictive and discriminatory ability relative to a logistic regression model that used the same patient features and summary information from the admission 12-lead ECG. Nevertheless, the complexity of the model makes it difficult to understand precisely how and why the model arrives at a particular result. Consequently, while the model itself has improved performance relative to existing methods, the ultimate clinical utility of the method remains to be determined.
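For readers unfamiliar with recurrent architectures, the sketch below outlines, under several assumptions, the general shape of such a model: an LSTM reads a sequence of per-beat ST-segment samples, and its final hidden state is combined with static patient features to produce a risk estimate. The layer sizes, the choice of an LSTM, and the fusion strategy are illustrative and are not the exact model of Myers et al.

```python
# A minimal sketch of a recurrent network over per-beat ST-segment samples,
# fused with static patient features. Architecture choices are illustrative.
import torch
import torch.nn as nn

class ECGRiskRNN(nn.Module):
    def __init__(self, samples_per_beat=32, n_patient_features=10, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=samples_per_beat, hidden_size=hidden,
                           batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_patient_features, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, beats, patient_features):
        # beats: (batch, n_beats, samples_per_beat); features: (batch, n_features)
        _, (h_n, _) = self.rnn(beats)                      # final hidden state
        combined = torch.cat([h_n[-1], patient_features], dim=1)
        return self.head(combined)                         # predicted risk

model = ECGRiskRNN()
risk = model(torch.randn(2, 60, 32), torch.randn(2, 10))   # ~1 min of beats
```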

“All models are wrong, but some are useful”

The recent, notable successes of deep learning approaches argue that they will have a place in the pantheon of methods used to build risk stratification models. However, it is not always clear when these approaches should be chosen over more standard methods such as logistic regression or Cox proportional hazards regression.

In general, given a powerful set of independent patient characteristics, good results will be obtained regardless of the model used. In this instance, simple regression models may indeed be preferred over deep learning models, as they can offer additional insights into the relative importance of the chosen features. The main advantage of deep learning, in terms of model performance, is that it can be used when one does not have good insight into which clinical features have the greatest predictive value. For example, the degree of ST-segment elevation/depression is known to have prognostic significance in patients who present with symptoms consistent with an acute coronary syndrome, but there are likely other ECG characteristics associated with adverse outcomes, so methods that use the entire ECG signal are likely to capture subtle ECG features that carry prognostic information. By design, deep learning models can utilize raw data, such as the raw ECG signal (i.e., the samples themselves), and therefore do not require insight into the type of features that may be important for a particular prediction task. Hence, there is a role for deep learning when one has incomplete knowledge of the prognostic information that may be most relevant for quantifying patient risk. Nevertheless, it is important to keep in mind that the performance of deep learning models depends heavily on the size and quality of the dataset used.

Data availability is a significant barrier to the deployment of the type of models described in this review. For medical problems, data must be fully de-identified to be shared widely, but there is no useful standard for de-identification: given that perfection is impossible on large-scale data, what percentage of missed protected health information (PHI) is permitted? Large-scale projects like the Medical Information Mart for Intensive Care (MIMIC) have had success with automating de-identification, but they too expect errors and have an explicit protocol for reporting PHI found in the data [54]. In general, there are few large, general medical datasets, and a significant fraction of studies (including many of those discussed here) use MIMIC or the UK Biobank dataset. Those who do develop their own unique datasets often clutch that data close to their chests for a variety of reasons, not the least of which is the immense amount of work required to reliably de-identify it.

We believe, from our own and others’ clinical experience, that deep models will not be adopted into medical practice if they are not powerful, explainable, and capable of providing detailed failure mode information. None of the work discussed here hits all three marks, and the wider literature is largely barren in this regard. Future work must emphasize these factors if it is to impact patient outcomes in the clinic. Through this effort, perhaps we can design, if not “true,” then truly useful models.