Keywords

Introduction

A vast number of machine learning techniques have been proposed for solving a rich set of problems. As we discussed in the Introduction, many clinical problems fall into a few categories, some of which are more heavily researched than others. We call these categories analytic tasks and, in this chapter, we consider six such tasks, which fall into two broader categories: predictive modeling and exploratory analysis.

Figure 1 depicts the lay of the land for predictive modeling and exploratory analysis, tabulating the most common techniques. We will address causal modeling separately in chapter “Foundations of Causal ML”.

Fig. 1

Lay of the land of biomedical ML. Tabular categorization of major machine learning methods based on modeling task (columns) and predictor type (rows). See the text for abbreviations and details

In predictive modeling, the goal is to assign values to one or more variables, called outcomes (aka response or dependent variables), using the known values of other variables (aka predictor variables, independent variables, or features). Somewhat abusing ordinary language, “predictive” in the context of ML predictive modeling does not necessarily imply forecasting future events. Any pattern recognition falls under this category, including forecasting future events, prognosis, diagnosis, and recognizing past events (e.g., retroactive diagnosis). Also, “predictor” and “predictor variable” are often used interchangeably, although it is usually clear from context whether a predictor refers to a variable (feature) or a full model.

In contrast, exploratory analysis aims to model the relationships among many variables, none of which is designated as an outcome or predictor variable. For example, predicting patients’ risk of mortality (outcome) based on current diagnoses and laboratory results (predictor variables) is a predictive modeling task because variables corresponding to mortality are designated as outcome, variables corresponding to diagnoses and laboratory results are designated as predictors, and we predict the future unknown value of an outcome using known values of the predictor variables. Conversely, understanding a patient population in terms of common comorbidities that co-occur with diabetes is an exploratory analysis task, because there is no particular outcome to predict.

Predictive Modeling Tasks

Within predictive modeling, we distinguish between several tasks based on the outcome type. In the rest of this chapter, we focus on three of them: classification, regression and time-to-event modeling. These are the outcome types and corresponding tasks most frequently encountered in biomedical ML.

Continuous outcomes. Continuous outcomes are measured on a continuous scale. Continuous variables can be measured on an interval scale (which does not have a well-defined, true 0 point) or on a ratio scale (which does have a well-defined 0 point). For example, length is a ratio variable: a length of zero indicates that the object has no length, and whether we measure length in inches, centimeters, or miles, 0 length is the same. Conversely, temperature (in °F or °C) is an interval variable, because 0 °F or 0 °C does not mean that the object has no temperature. Furthermore, 0 temperature depends on the scale we use: 0 °F and 0 °C do not designate the same temperature.

Another relevant distinction from a modeling perspective is the distribution of the outcome. Commonly used distributions include the Gaussian and exponential distributions for continuous outcomes and the Poisson and negative binomial distributions for counts. Prediction problems with a continuous outcome are referred to as regression problems.

Categorical outcomes. Categorical variables take a value from a finite set of distinct values. For example, color (red, amber, green), grade (A, B, C, F), or risk category (low, medium, high) are categorical variables. Binary (also known as binomial) variables are categorical variables that have exactly two levels (they can take one of two values), while multinomial variables have more than two levels.

Categorical variables with multiple levels can be further classified as nominal or ordinal variables. In the case of ordinal variables, the levels are ordered (e.g., good, better, best), while for nominal variables, the levels are not ordered (e.g., colors). Prediction problems with categorical outcomes are referred to as classification problems. If the outcome is binary, we have binary classification; if the outcome is multinomial, we have multi-class classification (aka n-ary or polychotomous classification).

Time-to-event outcomes. The measurement of interest is the time between a particular time point (known as index date or index time) and an event of interest. The quintessential example is survival, where the measure of interest is the time between the start of the study (index date) and death (the event of interest). The predictive modeling task that predicts time-to-event outcomes is referred to as time-to-event modeling or survival analysis (when the outcome is survival).

Sequence outcomes. Sequences are ordered sets of observations and when the outcome of interest is a sequence, we have a sequence prediction problem. Examples include genomic sequence (ordered set of nucleotides) prediction, text synthesis or translation (predicting an ordered set of words), or trajectory mining (predicting future sequences of e.g., disease states).

Structured outcomes. The outcome of the predictive model can also be a complex structure such as a graph or the actual structure of an entity (e.g. protein structure prediction). For techniques to discover causal structure, see the chapter “Foundations of Causal ML”.

Exploratory Analysis Tasks

Density estimation. The goal of density estimation (also encompassing discrete probability functions) is to infer the (often multi-dimensional) probability distribution underlying observed data. The simplest form of density estimation is unidimensional scaled histograms. For example, one might be interested in describing the probability distribution of blood glucose in a population. Density estimation can also be performed on multi-dimensional data and techniques exist for both low and high-dimensional data. Density estimation has several natural uses, including discovery of multiple modes of data, clustering and outlier detection.
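A minimal sketch of unidimensional density estimation on synthetic data (the values and variable names are illustrative only):

```python
# Density estimation sketch: a scaled histogram and a kernel density estimate
# of a synthetic, bimodal "blood glucose" sample.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# mixture of a normoglycemic and a hyperglycemic subpopulation (synthetic)
glucose = np.concatenate([rng.normal(95, 10, 800), rng.normal(180, 25, 200)])

hist, edges = np.histogram(glucose, bins=30, density=True)   # scaled histogram
kde = gaussian_kde(glucose)                                  # kernel density estimate
grid = np.linspace(glucose.min(), glucose.max(), 200)
print(grid[np.argmax(kde(grid))])                            # location of the dominant mode
```

Multiple modes in the estimated density, as in this example, are one of the signals that density estimation can surface for downstream clustering or outlier detection.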

Clustering. Clustering creates a grouping of the observations in a data set such that observations that belong to the same group (cluster) are more similar to each other than to observations that belong to a different group. Clustering can be used, for example, for subpopulation discovery, where well-separated groups of observations can represent subpopulations at different states of health or groups of patients with different disease etiology.

Clustering can be achieved based on many principles, one of which is based on data density. In that case, clusters are high-density regions in the data, separated from each other by low-density regions.
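A brief sketch of density-based clustering under these assumptions (synthetic data; scikit-learn's DBSCAN is one representative algorithm):

```python
# Density-based clustering sketch: clusters are high-density regions separated
# by low-density regions; points in low-density regions are labeled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(center, 0.3, size=(100, 2)) for center in (0.0, 3.0, 6.0)])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # e.g., three dense clusters, plus -1 for any noise points
```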

Outlier detection. Outliers are observations that are dissimilar to most other observations. Outliers may either fall into low-density regions, or they may behave very differently from model-based expectations (model-based outliers). For example, in a hospitalized population, outliers can be patients who have an unusually long hospital length of stay (LoS), say, above 9 days; alternatively, they may have a LoS of less than 9 days that is nonetheless unusually long for the disease they were admitted for. The first example describes patients who fall into a low-density region overall (very few patients in a patient population stay hospitalized for more than 9 days), while the latter patients fall into a low-density region among patients admitted with the same disease.
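The two notions of outlier above can be illustrated with a small sketch on a hypothetical length-of-stay table (the column names, thresholds, and data are illustrative only):

```python
# Outlier sketch: globally long stays vs. stays unusual for the admitting diagnosis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "diagnosis": rng.choice(["pneumonia", "appendectomy"], size=1000),
    "los_days": rng.gamma(shape=2.0, scale=2.0, size=1000),
})

global_outlier = df["los_days"] > 9                       # low-density region overall
z_within = df.groupby("diagnosis")["los_days"].transform(
    lambda s: (s - s.mean()) / s.std())                   # standardize within each diagnosis
conditional_outlier = (z_within > 3) & ~global_outlier    # unusual only relative to the diagnosis
print(global_outlier.sum(), conditional_outlier.sum())
```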

Temporal Characteristics of the Data

A further categorization of methods is based on the temporal characteristics of the data. It is very common for healthcare data to be temporal, thus several AI/ML as well as classical statistical techniques have been developed specifically to take advantage of various temporal characteristics.

Cross-sectional data. This data captures the state of a sample from a population at a specific point in time. The state information often contains temporal information about the past implicitly, through the definitions of the variables, but this information is not explicitly modeled. The implicit temporal information is typically abstracted to different time scales and granularities. The vast majority of machine learning techniques expect cross-sectional data.

Cross-sectional data can also be used for predictive modeling in which the outcome occurs at a future time relative to the index date of the predictor variables. For example, when modeling the 7-year risk of diabetes, the outcome, diabetes, must occur (or not) within 7 years, but the predictor variables are evaluated at a particular point in time (the index date) and changes to them over the 7 years are not of interest.

Longitudinal data. Measurements for a patient population are taken repeatedly over time. Measurements are not necessarily taken at the same time for everyone, and not all measurements are taken each time. Routinely collected clinical test data falls into this category. At most encounters with the health system, some aspect of a patient’s health is measured and recorded. Most patients have more than one encounter, and at each encounter, different measurements (e.g., lab tests) can be taken.

Time-series data. Similarly to longitudinal data, in time series data, several measurements are taken over time, but unlike longitudinal data, time-series data focuses on a single sampling unit. If we aim to model the glucose trajectory of a single individual over a long period of time, then we are solving a time-series modeling problem; if we aim to model the glucose control of a population of patients over time, then we have a longitudinal data modeling problem.

Figure 1 tabulates some of the techniques from this chapter. The columns correspond to the various analytic tasks, while the rows correspond to the temporal characteristics of the methods. Naturally, several methods can be used for multiple tasks (with appropriate modifications) and with data sets having multiple temporal characteristics. We either put the methods into the categories where they are most prominent (e.g. SVM into classification), or into shared categories (e.g. many techniques that are used for classification can also be used for regression).

Method Labels

In the following sections, we are going to describe the major machine learning methods in terms of their primary use, additional secondary uses, key operating principles, operating characteristics and properties, and provide a context for their use that helps assess their appropriateness for different modeling tasks. We also mention when and why the use of a method is not recommended.

For each method (or family of methods), we are going to present highly digested and operationally-oriented information in what we call a Method Label. A Method Label is similar to a drug label, presenting the most vital information about a method at-a-glance.

Format of method labels

Main Use

This entry describes the main purpose of the model.

What kind of tasks can it solve?

Within that task, are there specific problems that this method is best suited for?

For example, linear regression solves predictive analytic problems with continuous outcomes.

Context of use

In practice, when is this method used? This can be a subset of the intended use or a superset of the intended use.

For example, ordinary least squares regression is designed for Gaussian outcomes, but it is often used for a wide range of continuous outcomes.

Secondary use

This entry describes potential situations where the method can also be used.

This may not be the primary intended use of the methods; or this may not be the model that is most appropriate for the use case.

For example, SVM can be used for regression, although its primary use is classification.

Pitfalls

Pitfall 3.1.4.1. This entry lists negative consequences of using this method under certain conditions.

For example, an SVM, when used for causal problems, leads to wrong causal effect estimates.

Principle of operation

A short description of how the method works.

For example, linear regression approximates the outcome as a linear function of the predictors using maximum likelihood estimates of the regression parameters.

Theoretical properties and Empirical evidence

This entry describes any known theoretical properties the method may have and the assumptions linked to them, as well as empirical evidence for the method performance.

Best practices

Best practice 3.1.4.1. This entry provides prescriptive practice recommendations about when and how to use (or not use) the method.

References

Key literature related to the above.

Readers who cannot delve into technical details can still benefit greatly from the information provided in the Method Labels.

Chapter Layout

We begin (in section “Foundational Methods”) by describing the foundational methods for predictive modeling of cross-sectional data. Section “Ensemble Methods” is devoted to ensemble methods, which use the foundational techniques from “Foundational Methods” to address issues related to model stability and performance; and section “Regularization” is devoted to regularization, which addresses high dimensionality or, more broadly, constrains model complexity. The subsequent three sections address feature selection and dimensionality reduction, time-to-event outcomes, and longitudinal data, respectively. We close the chapter with a brief mention of a few more methods that the reader should be aware of. As the reader will observe, we weave classical and modern statistical methods together with mainstream ML methods, since this reflects modern ML practice and there are significant mathematical, conceptual, and computational commonalities between the fields.

Foundational Methods

Foundational methods in this chapter will refer to first-order methods that:

  (a) Are of high theoretical and/or practical value on their own, or

  (b) Are of high theoretical and/or practical value in conjunction with other higher-order methods.

Ordinary Least Squares (OLS) Regression

Regression traces back to Sir Francis Galton, who in 1875 described the relationship between the weight of sweet pea seeds and the weight of the seeds of their mother plants. This work also gave rise to the correlation coefficient: Karl Pearson, Galton’s biographer, developed the mathematical formulation of the Pearson correlation coefficient in 1896 [1].

Given a matrix X of predictor (independent) variables and an outcome (dependent) variable y, the ordinary linear regression model is

$$ y\sim Normal\left( X\beta, {\sigma}^2\right) $$

where β is a vector of coefficients. The outcome is assumed to be normally distributed with mean Xβ and variance σ2. In other words, the outcome has a deterministic component Xiβ for the ith observation and a random component, which is Gaussian noise with mean 0 and variance σ2. The objective is to find the coefficient vector β that makes the observations y most likely, that is, to maximize the Gaussian log likelihood

$$ \mathscr{l}\left(\beta \right)=\mathrm{const}-{\sum}_i\frac{{\left({y}_i-{X}_i\beta \right)}^2}{2{\sigma}^2}, $$

where i iterates over the observations. The coefficient vector β that maximizes the log likelihood is the same vector that minimizes the least squares error ∑i(yi − Xiβ)2, hence the name Ordinary Least Squares regression.
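A minimal numerical sketch of the above on synthetic data (the data and variable names are illustrative):

```python
# OLS sketch: the coefficients that maximize the Gaussian log likelihood are
# exactly the least squares solution.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)       # Gaussian noise

X1 = np.column_stack([np.ones(n), X])                   # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)       # least squares solution
residuals = y - X1 @ beta_hat
sigma2_hat = residuals @ residuals / (n - X1.shape[1])  # noise variance estimate
print(beta_hat, sigma2_hat)
```

In practice the same fit is obtained from any standard statistical package; the point of the sketch is only that the maximum likelihood and least squares solutions coincide.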

As a least squares estimator, OLS is “BLUE” (the best linear unbiased estimator) [2]. The coefficient estimates are normally distributed, allowing a Wald-type test of their significance. The least squares problem is convex; thus, when a solution exists, it is the global solution.

Assumptions. The assumptions follow from the model: (1) for all observations, the noise component has constant variance (σ2). Having uniform variance across the observations is referred to as homoscedasticity. (2) The errors of the observations are independent. (3) Observations are identically distributed. (4) The mean of the observations is a linear combination of the predictor variables. The effect of the predictors on the outcome is thus linear and additive.

Expressive capability. OLS, in its native form, is only able to express linear and additive effects. By explicitly including transformations of the original variables, the linear effect assumption can be relaxed. Explicitly including interaction terms of the predictor variables can relax the additivity assumption. OLS has no ability to automatically discover interactions or nonlinearities, so these need be hand-crafted by the data scientist.
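A short sketch of such hand-crafting on synthetic data (scikit-learn's PolynomialFeatures is one of several ways to construct the transformed and interaction terms explicitly):

```python
# Hand-crafted non-linearities and interactions for a linear model.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - X[:, 1] + 0.7 * X[:, 0] * X[:, 1] + rng.normal(size=300)

# degree=2 adds squared terms and the pairwise interaction x1*x2 as explicit columns
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(expanded, y)
print(model.coef_)   # coefficients for x1, x2, x1^2, x1*x2, x2^2
```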

Predictor and outcome variable types. Ordinary linear regression assumes that the observations (rows of X) are independently sampled; thus, it is most appropriate for predictor variables that can be represented as regular tabular data and a continuous outcome variable that follows a Gaussian distribution.

Sample size requirement. Ordinary linear regression is not appropriate for high-dimensional data, where the number of predictor variables is similar to the number of observations or exceeds them. As a rule of thumb, 10 observations per predictor variable is recommended. This is one of the least sample-intensive techniques, thus when the number of observations is low and the number of predictor variables is low, ordinary linear regression may be the most appropriate modeling technique.

Main use and its context. OLS is intended to solve regression problems with Gaussian outcomes. Guarantees about the solution hold true only for this use case.

Interpretability. For a covariate (predictor variable) Xj with coefficient βj, every unit increase in Xj is associated with an increase of βj in the outcome, holding all other predictor variables fixed.

As a result, OLS is highly interpretable, and also fit for use with causal estimation once the causal structure is known (see chapter “Foundations of Causal ML”).

We recommend using OLS as a “default” algorithm in low dimensional data unless a generalized linear model with a different link function is more appropriate (see GLM). Building an OLS model, even if its performance is expected to be inferior to more advanced regression techniques, is recommended, because the cost of building an OLS model is minimal, the model is highly interpretable, and it can reveal data problems, biases, design problems, and potentially other issues. As we will see, the potentially higher predictive performance of other methods needs to be evaluated from the perspective of trading interpretability for performance, and in some applications, higher interpretability can balance out some performance deficit.

Optimality. The coefficients found by maximizing the likelihood are unbiased. Also, the log likelihood function is concave (equivalently, the least squares objective is convex), thus the global maximum is easy to find.

Method label: ordinary least squares regression

Main Use

 • Regression problems

 • Continuous, preferably Gaussian outcomes

 • Cross-sectional data

Context of use

 • First choice, most common regression method in low dimensional data

 • When highly interpretable model is required

Secondary use

 • May achieve acceptable performance for some non-Gaussian continuous outcomes

Pitfalls

Pitfall 3.2.1.1. In high-dimensional data, coefficients may be biased or cannot be estimated

Pitfall 3.2.1.2. OLS is negatively affected by high collinearity

Principle of operation

 • Least squares estimation (or, equivalently, maximizing the Gaussian log likelihood)

Theoretical properties and empirical evidence

 • OLS is a consistent, efficient estimator of the coefficients

 • The coefficients are asymptotically normally distributed

 • Minimizing the least squares objective is a convex optimization problem. If a solution exists, numerical solvers will find the global solution.

 • Highly interpretable and causally consistent models

 • Vast literature in health sciences documenting successful applications

 • Because it is a simple model, the risk of overfitting in low sample settings is lower than for complex models

 • By the same token it can fail to capture highly non-linear data generating functions

Best practices

Best practice 3.2.1.1. Unless a generalized linear model is more appropriate, OLS is a good default technique.

Best practice 3.2.1.2. Building an OLS, even if it is known not to produce optimal predictive performance, can reveal data problems, biases, etc.

References

 • Numerous good textbooks describe OLS regression. Below is one example. Tabachnick & Fidell. Using Multivariate Statistics. Pearson, 2019

Generalized Linear Models (GLM)

Generalized linear models (GLMs) were first introduced by Nelder and Wedderburn in 1972. GLM was born out of the desire to model a broader range of outcome types than Gaussian outcomes and was enabled by advancements in statistical computing [3]. The defining characteristic of GLMs is that the outcome follows an exponential family distribution and a link function linearizes the relationship between the outcome’s expectation and the predictor variables. A fully specified GLM has the following components:

  1. The distribution of y

  2. A linear predictor η = Xβ

  3. The link function g(μ) that links the expectation μ = E(y) to the linear predictor: η = g(μ); or equivalently, μ = g−1(η).

As an example, let us consider logistic regression, the GLM for outcomes with a binomial (Bernoulli) distribution. The distribution of yi is Bernoulli with parameter μi = Pr(yi = 1), and the link function is the logit:

$$ \mathrm{logit}\left(\Pr \left({y}_i=1\right)\right)=\log\ \frac{\Pr \left({y}_i=1\right)}{1-\Pr \left({y}_i=1\right)}={X}_i\beta . $$

The objective is to find the coefficient vector β that maximizes the likelihood of observing y, which in the case of logistic regression is the binomial likelihood.

GLMs are frequently used for modeling outcomes that follow other exponential family distributions, including multinomial, Poisson, and negative binomial [4].
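A minimal sketch of logistic regression as a GLM with a logit link, on synthetic data (variable names are illustrative):

```python
# Logistic regression sketch: the linear predictor eta = X beta + b is mapped
# through the inverse link (the logistic function) to a Bernoulli probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
eta = 0.8 * X[:, 0] - 1.2 * X[:, 1] + 0.3      # linear predictor
p = 1.0 / (1.0 + np.exp(-eta))                 # inverse logit link: mu = g^{-1}(eta)
y = rng.binomial(1, p)                         # Bernoulli outcome

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)               # approximately recovers the coefficients
print(clf.predict_proba(X[:3]))                # predicted class probabilities
```

Note that scikit-learn's LogisticRegression applies mild L2 regularization by default; dedicated GLM routines in statistical packages fit the unpenalized maximum likelihood estimates and also report standard errors and Wald tests.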

Expressive capability. The link function does not change the model’s expressive capability: the relationship between η and the predictor variables remains linear. The only difference from OLS is that the linear predictor is transformed through the link function, so a GLM can model specific families of non-linear relationships between the predictors and the outcome.

Predictor and outcome variable types. GLMs are best suited for cross-sectional data (with independent and identically distributed observations), and the outcome has to follow the distribution that corresponds to the chosen link function.

Prediction task. GLMs can solve classification problems (logistic and multinomial links) and regression problems where the outcome follows any of the exponential family distributions.

Theoretical properties. The GLM is a maximum likelihood estimator for the exponential family distributions. Some instances of GLM, such as an overdispersed GLM, do not correspond to an actual exponential family distribution. In such cases, a variance function can be specified and the GLM becomes a quasi-likelihood estimator [5]. Both estimators (maximum likelihood and quasi-likelihood) are consistent and efficient. They yield coefficient estimates that are normally distributed, and thus the Wald test can be used for testing their significance. Both the negative log likelihood and the negative quasi-likelihood are convex, thus when a solution exists, it is a global solution and solvers can typically find it efficiently.

Method label: generalized linear models

Main Use

 • Predictive modeling with outcomes that follow a distribution from the exponential family (i.e., relationship of outcome and predictor variables can be non-linear)

 • Most common applications are classification (logistic regression), estimating count outcomes (Poisson regression) and exponential outcomes

 • Cross-sectional data

Context of use

 • First-pass/comparator classifier in low dimensional problems with limited need for input interaction modeling

 • Highly interpretable model

Secondary use

 • Applicable also to deviations from the exponential family, most typically where the sample variance is higher than theoretically expected under the corresponding exponential family distribution (over-dispersion)

 • Logistic regression may offer acceptable performance in classification problems, where the linear additive assumption is mildly violated

Pitfalls

Pitfall 3.2.2.1. In high-dimensional data, coefficients may not be estimable

Pitfall 3.2.2.2. Tendency to overfit in the presence of high collinearity

Principle of operation

 • When the outcome follows an exponential family distribution, GLM is a maximum likelihood estimator

 • When the outcome does not follow an exponential family distribution, GLM can be a quasi-likelihood estimator

Theoretical properties and empirical evidence

 • GLM provides consistent, efficient estimates of coefficients

 • The coefficients are asymptotically normally distributed

 • Minimizing the negative log likelihood or quasi-likelihood is a convex optimization problem. If a solution exists, numerical solvers will find the global solution efficiently.

 • Highly interpretable models

 • GLM can have high performance even when the assumptions are violated. Chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs” discusses comparisons between logistic regression and more modern ML techniques

Best practices

Best practice 3.2.2.1. Use GLM as first pass, or main comparator classifier

Best practice 3.2.2.2. Building a GLM, even if it is known not to produce optimal predictive performance, can reveal data problems, biases, etc.

References

 • Walter Stroup. Generalized Linear Mixed Models, CRC Press, 2003

 • P. McCullagh, JA Nelder. Generalized Linear Models, CRC Press, 1989

Ordinal Regression Models

There are two main strategies for modeling ordinal outcomes using GLMs. The first one is cumulative logits and the second one is proportional odds [6].

Consider an ordinal outcome variable with J levels, 1 < 2 < … < J. Under the cumulative logits strategy, J−1 logistic regression models are fit; the jth model is a classifier distinguishing y ≤ j versus y > j. Under the proportional odds strategy, again, J−1 models are built, but these models share all coefficients except for the intercept. The jth model is

$$ \mathrm{logit}\left(\Pr \left(y\le j\right)\right)={\alpha}_j+ X\beta, $$

where αj is the level-specific intercept and β are the slopes shared across the J-1 models.

Notice that the cumulative logits strategy can use any binary component classifier, while the proportional odds GLM is a special case of multi-task learning.
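A brief sketch of the cumulative-logits strategy on synthetic data (fitting J−1 separate binary classifiers; all names and thresholds are illustrative):

```python
# Cumulative-logits sketch: the j-th model distinguishes y <= j from y > j.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2))
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400)
y = np.digitize(latent, bins=[-1.0, 0.0, 1.0]) + 1        # ordinal levels 1 < 2 < 3 < 4

J = 4
models = {j: LogisticRegression().fit(X, (y <= j).astype(int)) for j in range(1, J)}

x_new = np.array([[0.2, -0.4]])
cumulative = {j: m.predict_proba(x_new)[0, 1] for j, m in models.items()}   # Pr(y <= j | x)
print(cumulative)
```

Because the J−1 models are fit independently, the estimated cumulative probabilities are not guaranteed to be monotone in j; the proportional odds model enforces this by sharing the slopes.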

Key reference: Agresti A. Categorical Data Analysis, second edition. Chapter 7.2. Wiley-Interscience, 2002.

Artificial Neural Networks (ANNs)

For main milestones in the development of ANNs see chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs”. In this section, we focus on the general form of ANNs for cross-sectional data. Image and language model applications are discussed in Chapters “Considerations for Specialized Health AI and ML Modelling and Applications: NLP” and “Considerations for Specialized Health AI and ML Modelling and Applications: Imaging—Through the perspective of Dermatology”.

Artificial neural networks (or neural networks, NNs, for short) can be thought of as regression models stacked on top of each other, and each such regression model is called a layer. Each layer is a multivariate regression model, with potentially multivariate inputs as well as multivariate outputs. The outputs from each layer are transformed using nonlinear functions, called activation functions, and then passed to the subsequent layer as its input. The final layer (aka the output layer) provides the network’s output(s).

Figure 2 shows an example NN with two hidden (aka encoding) layers and an output layer. The output from this network is

$$ \hat{y}={f}_3\left({b}_3+{W}_3{f}_2\left({b}_2+{W}_2{f}_1\left({b}_1+{W}_1X\right)\right)\right), $$

where fi(·) is the activation function, bi are the biases (or bias vectors), and Wi are the weights (weight matrices) of the connections coming into layer i. In layer i, the input is multiplied by Wi, the biases bi are added, and the result is passed through the activation function fi(·), producing the output from layer i. The input to the first layer is X, and the output from the topmost layer, the third layer in this example, is the prediction \( \hat{y} \). Common activation functions include the sigmoid (logistic) function, ReLU (rectified linear unit), and softmax.
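A small numerical sketch of this forward pass (the weights below are random placeholders rather than trained values, and the row-vector convention X·W is used instead of W·X):

```python
# Forward pass of a two-hidden-layer fully connected network.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 10))                      # 5 observations, 10 predictors

W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)  # hidden layer 1: 10 -> 16 units
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)    # hidden layer 2: 16 -> 8 units
W3, b3 = rng.normal(size=(8, 1)), np.zeros(1)     # output layer: 8 -> 1 unit

h1 = relu(X @ W1 + b1)                            # f1(b1 + W1 X)
h2 = relu(h1 @ W2 + b2)                           # f2(b2 + W2 f1(...))
y_hat = sigmoid(h2 @ W3 + b3)                     # f3(b3 + W3 f2(...)), e.g., a class probability
print(y_hat.ravel())
```

Training consists of adjusting the Wi and bi (typically by gradient-based optimization of a loss) so that ŷ matches the observed outcomes.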

Fig. 2

An example NN with three layers

Expressive power. NNs are universal function approximators: they can express essentially any relationship between the predictors and the output without distributional restrictions. This includes non-linear relationships as well as interactions.

Predictor and outcome types. The basic form of NNs introduced here is most appropriate for cross-sectional data without any special structure. However, when the data has special structure, corresponding network architectures have been proposed for many such cases. For example, Convolutional Neural Networks (CNNs) [7] have been proposed for image data, Recurrent Neural Networks (RNNs) [8] and Long Short-Term Memory (LSTM) networks [9] for sequence data, Transformers [10] for language models, Graph Neural Networks [11] for graph data, etc. We discuss some of these architectures in chapters “Considerations for Specialized Health AI and ML Modelling and Applications: NLP” and “Considerations for Specialized Health AI and ML Modelling and Applications: Imaging—Through the perspective of Dermatology”.

Sample size. The sample size required for NNs can be very large. The deeper the network (the more layers it has), the more parameters it has and the larger the required sample size. Many practical applications of deep learning use millions of parameters. Although the traditional statistical rule of thumb, that the number of required observations is approximately 10 times the number of parameters, does not hold for deep learning—deep networks can operate on data with fewer samples—the required sample size is still very large.

The largest GPT-3 language model, which has 175 billion parameters, was trained on 45 TB of text data taking 355 GPU-years [12]. Certain structures (most notably convolution) and regularization can alleviate the sample size requirement to some degree.

Interpretability. The key to NNs is to automatically transform the original data space into a new representation that is more amenable to the predictive modeling task at hand. A side effect of this automatic transformation is that the meaning of the original space is lost and the meaning of the resulting variables is often unknown. Thus, NNs are considered black-box (uninterpretable) models. One way to interpret them is by “local approximation” of the NN using an interpretable model, such as a multivariate regression model, fitted over input-output pairs sampled from the NN model.

NNs in Less Data-Rich Environments

Two significant shortcomings of NNs are the sample size requirement and the training cost. Several strategies exist that aim to alleviate these shortcomings.

Transfer learning. Training neural networks, especially highly performant, large networks, is very expensive not only in terms of compute time but also in terms of required sample. Large pre-trained generic models (so-called foundation models) are available in many application areas, including language processing and computer vision. These generic models transform the input space (say, written English text) into a representation that is more amenable to carrying out language modeling tasks than the original representation. To solve a specific language-related task, such as distinguishing patients with and without dementia, a foundation language model (with many pre-trained encoding layers) can be extended with a task-specific layer so that the new model performs the actual classification task. Only the task-specific layer needs to be trained.
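A minimal sketch of this pattern in PyTorch (the encoder below is a stand-in for a real pretrained foundation model, and all sizes and names are illustrative):

```python
# Transfer learning sketch: freeze a pretrained encoder, train only a task-specific head.
import torch
import torch.nn as nn

# placeholder for a large pretrained encoder (in practice, loaded from a model hub)
pretrained_encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))
for param in pretrained_encoder.parameters():
    param.requires_grad = False                    # freeze the pretrained layers

task_head = nn.Linear(128, 2)                      # new task-specific layer (e.g., dementia vs. not)
model = nn.Sequential(pretrained_encoder, task_head)

optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)   # only the head is updated
loss_fn = nn.CrossEntropyLoss()

# one illustrative training step on a synthetic batch of encoder inputs
features = torch.randn(32, 768)
labels = torch.randint(0, 2, (32,))
optimizer.zero_grad()
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
```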

Incorporating domain knowledge. Another avenue to reduce training cost is to incorporate domain knowledge as follows: when synthetic data for a domain can be generated, NNs can be pre-trained using synthetic data to learn a representation of the application. This pre-trained model can then be further refined using real data to solve specific problems in that domain. Besides pre-training, several other methods exist, which include augmenting the input space of an NN with output from physical models (e.g., climate models) or ascribing meaning to hidden nodes and constraining the connections among such hidden nodes based on what is possible in the real world. See [13] for a survey of incorporating domain knowledge into machine learning models.

Method label: artificial neural networks (including deep learning)

Main Use

 • Solving predictive modeling problems. Classification, including classification with very many classes, is most common, but it can also solve regression and time-to-event problems

 • In data-rich environments, ANNs can produce highly performant models

Context of use

 • ANNs are recommended when the sample size is very large and a very complex function needs to be modeled

 • Neural networks work best for specific applications, with network architectures specifically designed for that application. Such applications include image analysis, text and audio modeling, text synthesis, etc.

Secondary use

 • Modeling distributions using GANs and auto-encoder variants

Pitfalls

Pitfall 3.2.4.1. ANNs can fail when sample size is not large

Pitfall 3.2.4.2. ANNs are innately black-box models. Their use in applications where transparency is important may be problematic

Pitfall 3.2.4.3. ANNs do not reduce the number of features needed for prediction and this may be an important requirement in many biomedical problems. However, using strong feature selectors before running the ANN modeling may be a good combination for some problems (but can be detrimental in others)

Pitfall 3.2.4.4. Training cost is high due to (a) large cost of training a single model, and (b) large number of models that need be trained to explore the immense hyperparameter space

Pitfall 3.2.4.5. ANNs do not have either formal or empirically competitive causal structure discovery capabilities

Pitfall 3.2.4.6. Even when a causal structure is known, estimating causal effects with ANNs leads to biased results because the ANN is not designed to condition on known confounders and may introduce other effect estimation biases (e.g., due to blocking mediator paths and opening M-structure paths)

Principle of operation

 • Minimizing a penalized loss

 • Linear combinations of inputs transformed through a non-linear activation function layer-by-layer yields a non-linear model

Theoretical properties and empirical evidence

 • ANNs with at least two hidden (encoding) layers and an unbounded number of units can be universal function approximators

 • NNs are most commonly trained using gradient-based optimizers. Because the objective function of NNs can be arbitrarily complex with multiple local optima, the optimizers may fail to reach the globally optimal solution

 • In several biomedical problems deep learning and other ANN learners have exhibited superior accuracy (especially in image recognition). There is also significant evidence that in several clinical domains not involving images they do not outperform vanilla logistic regression (see Chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs”)

Best practices

Best practice 3.2.4.1. Deep learning is most recommended for predictive modeling in large imaging datasets. Other domains may also be good candidates. In all cases additional (alternative and comparator) methods should be explored at this time within the same error estimation protocols (see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”)

Best practice 3.2.4.2. At this time ANNs are not suitable for causal discovery and modeling. Formal causal methods should be preferred (chapter “Foundations of Causal ML”)

Best practice 3.2.4.3. ANNs are not suitable for problems where explainability and transparency are required, or when large reduction of the feature space is important to model application

References

 • Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press, 2016

 • Discussion and references in Chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs”

Support Vector Machines

Support Vector Machines (SVMs) are a family of methods that can be used for classification, regression, outlier detection, clustering, feature selection, and a special form of learning called transductive learning [14,15,16]. SVMs use two key principles: (1) regularization and (2) kernel projection.

SVM regularization: SVMs cast the classification or regression problem as a non-linear quadratic optimization problem in which the solution to predictive modeling is expressed as a “data fit loss + parameter penalty” objective function. Intuitively, and as depicted in Fig. 3, each object used for training and subsequent model application is represented as a vector of measurements in a space of relevant dimensions (the input variables). We will discuss regularization in the general (non-SVM) setting in section “Regularization”.

Fig. 3

SVMs and the classification problem as geometrical separation. In the top panel, a geometrical representation of a 2-class predictive modeling (classification) problem with 2 input dimensions (x1, x2) is depicted. Each subject is represented by a dot (i.e., a 2-dimensional vector). Blue dots are negative instances and red dots are positive ones. The line that separates negatives from positives—while maximizing the distance between classes—is the solution to the SVM problem. The instances at the border of each class are the “support vectors”. In the figure we also see the mathematical expression of the classifier hyperplane and its instantiation for the three support vectors of the example. Such problems are easily solved by modern software

Binary classification is formulated in SVMs as a geometrical problem of finding a hyperplane (i.e., the generalization of a straight line from 2 dimensions to n dimensions) such that all the instances above the hyperplane belong to one class and all instances below the hyperplane to the other. Translating this geometrical problem into linear algebra constraints is a straightforward algebraic exercise. Every variable has a weight, and collectively these weights determine the hyperplane decision function.

To ensure a model with good generalization performance and resistance to overfitting, the SVM learning procedure requires two elements: (a) the hyperplane must be such that the number of misclassified instances is minimized (the “data fit loss” part of the objective function to be minimized); and (b) a generalization-enforcing constraint, the so-called “regularizer”—the total sum of squared weights of all variables—must also be minimized. Specifically, the regularizer must be minimized subject to the locations and labels of the training data fed into the algorithm. This is an instance of a quadratic programming problem that can be solved exactly and very fast.

$$ \mathit{\operatorname{Minimize}}\kern0.5em \frac{1}{2}{\sum}_{j=1}^n{w}_j^2\kern0.75em \mathrm{subject}\ \mathrm{to}\kern0.75em {y}_i\left(\overrightarrow{w}\cdot \overrightarrow{x_i}+b\right)-1\ge 0\kern1.25em \mathrm{for}\ i=1,\dots, N $$

A “soft margin” formulation of the learning problem in SVMs allows for handling noisy data (and, to some degree, non-linearities). The primary method for modeling non-linear decision functions is kernel projection, which works pictorially as follows (Fig. 4).

Fig. 4

Non-linearly separable classification by mapping from an original space (x1, x2) to a different space (with commonly higher number of dimensions) by using kernel functions

In a non-linearly separable problem, there is no straight line (hyperplane) that accurately separates the two classes. SVMs (and other kernel techniques) use a mapping function that transforms the original variables (x1, x2 in Fig. 4) into SVM-constructed features such that there exists a straight line (hyperplane) that separates the data in the new space. Once the solution is found in the mapped space, it is reverse-transformed to the original input variable space, where it corresponds to a non-linear decision surface. Because the mapping function is very expensive to compute, special kernel functions are used that allow solving the SVM optimization without incurring the expense of calculating the full mapping. In mathematical terms, the above takes the following form:

$$ f(x)=\mathit{\operatorname{sign}}\left(\overrightarrow{w}\overrightarrow{x}+b\right) $$
$$ \overrightarrow{w}={\sum}_{i=1}^N{\alpha}_i{y}_i\overrightarrow{x_i}. $$

When data is mapped into higher-dimensional features space \( \varPhi \left(\overrightarrow{x}\right) \),

$$ f(x)=\mathit{\operatorname{sign}}\left(\overrightarrow{w}\varPhi \left(\overrightarrow{x}\right)+b\right) $$
$$ \overrightarrow{w}={\sum}_{i=1}^N{\alpha}_i{y}_i\varPhi \left(\overrightarrow{x_i}\right) $$

Combining them into a classifier yields

$$ f(x)=\mathit{\operatorname{sign}}\left({\sum}_{i=1}^N{\alpha}_i{y}_i\varPhi \left(\overrightarrow{x_i}\right)\circ \varPhi \left(\overrightarrow{x}\right)+b\right)=\mathit{\operatorname{sign}}\left({\sum}_{i=1}^N{\alpha}_i{y}_i\mathrm{K}\left(\overrightarrow{{\mathrm{x}}_{\mathrm{i}}},\overrightarrow{x}\right)+b\right). $$

The above computations are extremely quick to execute, allowing the SVM to explore an astronomical-size space of non-linear interaction effects in quadratic running time and without over-fitting. Let’s demonstrate the remarkable computational and sample efficiency that kernel projection affords with an example: a relatively simple non-linear SVM with a polynomial kernel of degree 3 or 5 and between 2 and 100 input variables, compared against the corresponding classical regression model in terms of the number of interaction-effect parameters that need to be explicitly constructed and estimated and the sample size needed, under a conservative requirement of 5 sample instances per parameter.

As can be seen from Table 1 (adapted from [15]), for a dataset with 100 variables and up to fifth-degree polynomial interaction effects, classical regression would need more than 96 million parameters to be explicitly constructed and a sample size of more than 482 million to estimate the model’s parameters. By comparison, the SVM algorithm explores the same space in time quadratic in the number of variables (i.e., in practice, in seconds on a regular personal computer). Moreover, the SVM generalization error is independent of the number of variables and is bounded by a function of the number of support vectors, which is smaller than or equal to the available sample size (see chapter “Foundations and Properties of AI/ML Systems” for more details).

Table 1 Comparison of non-linear SVM vs classical regression in terms of number of parameters. N denotes the (often very small) sample size available in practice

One way to think of the effects of regularization is that by forcing weights to be as small as possible, all variables that are not relevant or are superfluous to the predictive modeling will tend to have zero or near zero weights and are effectively “filtered” out of the model. Equivalently, the minimization of weights entails that the separation between classes is geometrically maximized and statistical machine learning theory shows that this often leads to more generalizable models. In yet another view, regularization entails that the target function is smooth, in the sense that small differences in the input variables result in small changes to the response variable’s values.

Additional aspects of SVMs include: primal and dual formulations of the learning problem (suitable for low dimensionality/high sample, or high dimensionality/small sample situations, respectively), using only dot product representations of the data, Structural Risk Minimization (i.e., the model complexity classes are neatly organized in nested classes so that model selection can be orderly and efficient), and known bounds on error. These bounds do not depend on the number of input variables but only on the support vectors (which are at most equal to the sample size N), demonstrating the power of SVMs to self-regulate their complexity and avoid overfitting (see also chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI” for a comprehensive discussion of overfitting and underfitting). SVMs can also be used to perform feature selection for other classifiers. While SVMs output scores rather than probabilities, these scores can be converted to calibrated probabilities in a post-hoc manner [15].
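A short sketch of a non-linear (polynomial-kernel) SVM with post-hoc probability calibration, on synthetic data (scikit-learn is used for illustration; the kernel, its degree, and the regularization constant C are hyperparameters that would normally be tuned):

```python
# Kernel SVM sketch: regularization (C) + kernel projection (polynomial kernel),
# with scores converted to calibrated probabilities post hoc.
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 1.0).astype(int)    # non-linearly separable classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="poly", degree=3, C=1.0)                # maximum-margin classifier in kernel space
calibrated = CalibratedClassifierCV(svm, cv=5)           # post-hoc score-to-probability conversion
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test[:3]))              # calibrated class probabilities
print(calibrated.score(X_test, y_test))                  # held-out accuracy
```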

Method label: support vector machines

Main Use

 • Solving classification and feature selection problems

 • SVMs can produce highly performant models in low sample size/large dimensionality settings

 • Dominant performance in certain domains (e.g., gene expression and other omics and multi-modal based classifiers, text classification)

Context of use

 • SVMs are suitable for clinical data, omic data, text and other unstructured data as well as combinations

 • Especially suited to non-linear, noisy and small sample data.

 • Can be combined with other classifiers and feature selectors to form strong analysis stacks and protocols

 • Are very fast to train and to run model inferences

Secondary use

 • Applicable, but not as first choice, to regression, clustering, and outlier detection

Pitfalls

Pitfall 3.2.5.1. SVMs are unsuitable for causal discovery. The features they select are not interpretable causally. SVM variable weights are not causally valid even if true causal features only are included in the model

Pitfall 3.2.5.2. Linear SVMs are easily interpretable. Non-linear SVMs require additional steps for explanation and are innately “black box” models

Pitfall 3.2.5.3. Error bounds are loose and typically cannot be used to guide model selection

Principle of operation

 • Maximum margin (gap) classifiers with hard or soft margins.

 • Regularization

 • Quadratic Programming formulation of learning problem guarantees optimal solution in tractable time

 • Kernel projection enables fast exploration of immense spaces of non-linearities

 • Structural risk minimization ensures that kernel hyper-parameters relate monotonically to error

Theoretical properties and empirical evidence

 • Can model practically any function

 • Known error bounds

 • Extremely sample and computationally efficient

 • Overfitting resistant

 • Best of class performance in several domains

Best practices

Best practice 3.2.5.1. Primary choice for omics, text classification, and combined clinical/molecular/text tasks

Best practice 3.2.5.2. Secondary choice for feature selection (with Markov boundary methods being first choice). In very small sample situations where Markov boundary methods may suffer, SVM feature selection can be first choice

Best practice 3.2.5.3. SVM weights features or models should not be interpreted causally

Best practice 3.2.5.4. Explain SVMs by converting them to interpretable models via meta-learning or other approaches; and convert scores to probabilities when needed

References

 • Statnikov A, Aliferis CF, Hardin DP, Guyon I. A Gentle Introduction to Support Vector Machines in Biomedicine, Vol. 1: Theory and Methods. World Scientific, 2011

 • Statnikov A, Aliferis CF, Hardin DP, Guyon I. A Gentle Introduction to Support Vector Machines in Biomedicine, Vol. 2: Case Studies and Benchmarks. World Scientific, 2012

 • Vapnik V. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013

 • Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF. Text categorization models for high-quality article retrieval in internal medicine. Journal of the American Medical Informatics Association. 2005;12(2):207–216

 • Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005;21(5):631–643

Naïve Bayesian Classifier (NBC) and Bayesian Networks (BNs)

BNs and the theoretical Optimal Bayes Classifier (OBC) are covered in the AI reasoning under uncertainty and machine learning theory sections of chapter “Foundations and Properties of AI/ML Systems”. Causal BNs are instrumental for causal discovery and modeling; they are covered in chapter “Foundations of Causal ML”. The Markov boundary feature selection methods have their origins in BN theory and are covered in the previous chapters and in section “Feature Selection and Dimensionality Reduction” of the present chapter. We thus define here the Naïve Bayes (NB) classifier and then provide a unified method label that spans NB and BN classification techniques.

Naïve Bayes (NB)

NB is a highly restricted simplification of the complete (brute force) application of Bayes’ Theorem in classification.

For an observation vector xi and corresponding outcome yi, which takes one of the values c1, c2, …, cm, the predicted probability that the outcome has value cj can be computed through the Bayes formula

$$ \Pr \left({y}_i={c}_j|{x}_i\right)=\frac{\Pr \left({x}_i|{y}_i={c}_j\right)\Pr \left({y}_i={c}_j\right)}{\Pr \left({x}_i\right)}. $$

where Pr(xi) = ∑j Pr (xi| yi = cj) Pr (yi = cj). Suppose we constructed a probability table for Pr(xi| yi); for binary predictor variables, this table would have size exponential in the number of predictors. This would also lead to difficulties in estimating the large-sample probabilities from a small-sample dataset, because the dataset would have to be large enough to contain a sufficient number of observations for every single combination of predictor values, however low its probability. To reduce this burden, the Naïve Bayes classifier assumes that the predictor variables are conditionally independent of each other given the outcome value. This simplifies the computation of Pr(xi| yi) to

$$ \Pr \left({x}_i|{y}_i\right)={\prod}_k\Pr \left({x}_{ik}|{y}_i\right) $$

where xik is the kth element (component) of the vector xi. Under the Naïve Bayes assumption, we only need to estimate Pr(xik| yi) for each variable xk, which reduces the sample size and compute time required for the estimation from exponential to linear in the number of variables.
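A minimal sketch of this factorization in practice, using binary predictors generated to be conditionally independent given the class (synthetic data; scikit-learn's BernoulliNB is one of several NB implementations):

```python
# Naive Bayes sketch: Pr(x | y) is factored into per-variable terms Pr(x_k | y).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(9)
y = rng.binomial(1, 0.4, size=500)                        # outcome class
# each binary predictor depends on the class but not on the other predictors
X = np.column_stack([rng.binomial(1, np.where(y == 1, p1, p0))
                     for p0, p1 in [(0.2, 0.7), (0.5, 0.9), (0.3, 0.3)]])

nb = BernoulliNB().fit(X, y)
print(nb.predict_proba(np.array([[1, 1, 0]])))            # Pr(y = c | x) via Bayes' rule
```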

The problem with NB is that by adopting unrealistic assumptions of conditional independence, it introduces serious errors in distributions where features are not independent given the target class.

Also, if we allow outcomes to not be mutually exclusive (e.g., a patient may have several diseases simultaneously), then we need to incorporate \( 2^m \) values in the outcome variable, which exponentially increases (in the number m of outcome values, e.g., diagnostic categories) the number of probabilities that need be estimated and stored. Hence it is common to see the added NB assumption of mutually exclusive and exhaustive values of the target variable. Of course, in medical domains this assumption is very commonly violated as well.

In summary, the problems with brute-force Bayes are intractability and high error in the estimates of joint instantiations (because the real-life sample size is practically never enough to estimate an exponential number of parameters). The problem with NB is that its assumptions rarely hold in numerous biomedical domains.

In the early years of AI/ML, before the advent of BNs (which can decompose the joint distribution and store only the smallest number of probabilities needed to accurately represent it), NB was used widely. All modern benchmarks suggest, however, that NB is no longer a competitive classifier (unless we can tolerate large departures from predictive optimality in order to save storage space). Finally, it is worth noting the work of [17], which shows that under specific target functions and loss functions, NB can perform well even though its nominal assumptions are violated. These types of distributions are rare in biomedicine, however, and better alternatives exist.

Method label: Naïve Bayes (NB), Bayesian Networks (BNs), causal BNs (CBNs) and Markov Boundary methods

Main Use

 • Classification, causal structure discovery, feature selection

Context of use

 • Classification under a range of sufficient assumptions that guarantee asymptotic correctness: Naïve Bayes (NB) have highly restrictive assumptions, while BNs (see chapter “Foundations and Properties of AI/ML Systems”) have no restrictions on functions and distributions they can model

 • Flexible classification (chapter “Foundations and Properties of AI/ML Systems”) where at model deployment time the inputs can be any subset of variables and the output can also be any subset of variables (while the rest are unobserved)

 • Causal structure discovery (under specific broad assumptions—chapter “Foundations of Causal ML”)

 • Causal effect estimation (under specific broad assumptions—chapter “Foundations of Causal ML”)

 • Modeling equivalence classes of full or local models (see chapter “Foundations of Causal ML” and “Foundations and Properties of AI/ML Systems”)

 • Closely related to Markov boundary feature selection algorithms (see section on “Feature Selection and Dimensionality Reduction”)

Secondary use

 • Optimal Bayes classifier (OBC) (see chapter “Foundations and Properties of AI/ML Systems”): can be used for theoretical analysis of optimality of the large-sample error of any classifier

 • NB is often used as a minimum baseline comparator in benchmark studies

 • Modeling the full joint distribution of data with BNs (e.g., for simulation or re-simulation purposes)

 • Guiding experiments with hybrid causal discovery and active experimentation

Pitfalls

Pitfall 3.2.6.1. Using NB simply because it is computationally fast and has small storage requirements without paying attention to whether data properties match assumptions may lead to high error

Pitfall 3.2.6.2. Not every probabilistic graphical model is a BN (see chapter “Foundations and Properties of AI/ML Systems”)

Pitfall 3.2.6.3. Not every BN is causal and not every BN learning algorithm guarantees valid causal discovery

Pitfall 3.2.6.4. Bad or uninformed assignment of priors may lead Bayesian algorithms astray

Pitfall 3.2.6.5. Discovery algorithms for BNs and CBNs vary widely in output quality and efficiency

Pitfall 3.2.6.6. BNs and CBNs properties hold in the large sample. In small samples results may be suboptimal

Pitfall 3.2.6.7. Approximating the OBC with Bayesian model averaging over a small number of models may be far from the theoretically ideal OBC performance

Pitfall 3.2.6.8. BN predictive inference with unconstrained models is computationally intractable in the worst case, although average-case algorithms exist with good performance (see chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs”)

Pitfall 3.2.6.9. Discrete BNs are better developed in practice than continuous-function BNs

Principle of operation

 • Application of Bayes’ theorem combined with various assumptions about the data, heuristic model search procedures, and algorithms designed to infer from data the most likely model that generated it or estimate model-averaged predictions over a set of models

 • Causal discovery in addition relies on distributional assumptions such as the causal Markov condition (CMC), the faithfulness condition (FC) and causal sufficiency (CS)

 • Newer algorithms can discover local or partial models, and overcome violations of CS and FC and worst case complexity

Theoretical properties and empirical evidence

 • All Bayesian classifiers have well understood properties ensuring valid predictions, reliable causal structure discovery and unbiased causal effect estimation when the assumptions hold

 • Closely tied to Markov boundary and causal feature selection

 • CBNs are the backbone of learning causal models in a sound and scalable manner (see chapter “Foundations of Causal ML”)

 • Large body of validation in causal discovery. Large body of applications and benchmarks for classification and feature selection

Best practices

Best practice 3.2.6.1. NB has limited utility in modern health applications and is not a recommended method in usual circumstances

Best practice 3.2.6.2. Use BNs when flexible classification is needed

Best practice 3.2.6.3. Use CBNs when causal structure discovery and causal effect estimation are needed

Best practice 3.2.6.4. Use when modeling equivalence classes of full or local causal or Markov boundary models is needed

Best practice 3.2.6.5. Markov boundary (identified via specialized algorithms) is typically the feature selection method of choice

Best practice 3.2.6.6. Use for modeling full joint distributions (e.g., for simulation or re-simulation purposes) while also preserving causal structure

Best practice 3.2.6.7. Use for guiding experiments in the presence of information equivalences

References

 • See references (and discussions thereof) in chapters “Foundations and Properties of AI/ML Systems,” “Foundations of Causal ML,” “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems,” “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models,” and “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring Problems, and the Role of BPs”

K-Nearest Neighbor

The k-nearest neighbor (k-NN) classifier and regressor were introduced in 1951 [18]. k-NN is categorized as a “lazy” classifier/regressor in the sense that it does not construct an explicit model from the data: decisions are made directly from the training data set rather than by a model trained on it. Specifically, the class of an instance is the (weighted) majority class of its k nearest neighbors, where k is a user-supplied hyperparameter. In the case of k-NN regression, the estimated value of an instance is the (possibly weighted) average of the values of its k nearest neighbors.

The critical component of the k-NN classifier/regressor is the similarity function used to determine the k nearest neighbors. Defining an appropriate similarity function is non-trivial, and is particularly difficult when (i) the data are high-dimensional or (ii) the importance of the variables varies greatly. The similarity function can often be learned from data; this problem is known as distance metric learning.

When we consider the applicability of a method, we usually think about sample size or the form of the decision boundary between the positive and negative classes in a classification problem. In the case of k-NN, we may need to consider the local density of positive and negative instances. If a clear (low-density) separation exists between the positive and negative classes, kNN can be successful, regardless of the shape of the decision boundary; if a clear separation does not exist, kNN will likely not perform well (Fig. 5).

Fig. 5
A set of 2 scatterplot graphs of X 2 versus X 1. Graph A consists of 3 clusters of positive plots surrounded by negative ones. Graph B consists of three clusters of positive. The negative plots are all over the graph.

Two hypothetical two-dimensional problems that can be solved using k-NN classification. Blue ‘-’ signs indicate instances of the negative class; orange ‘+’ the positive class. In the left pane, the positive clusters have a much higher density of positive instances than negative instances, making this an easy problem. In the right pane, there are three positive clusters, and the density of positive (orange) instances in each cluster is similar to that of the negative instances, which makes the problem more difficult

As we said earlier, kNN classifiers do not build an explicit model; instead, they have to determine the k nearest neighbors at classification time. This makes “training” the model cheap (there is no training), but actual model application can become expensive. Additionally, this makes the classification less robust, as the classification of a new instance depends on the particular training sample, especially for small k. Increasing k can reduce the dependence on the specific training sample and can also improve robustness in the presence of noise; however, it makes the classification less local, and locality is the essence of kNN modeling.

The success of kNN classification depends on the amount of noise in the training sample, the choice of k (which again depends on the noise in the training sample), and a distance function that defines the neighborhood of the instances to be classified. Large-sample analysis of k-NN shows that in the sample limit with k=1 the error is at most double that of the optimal Bayes Classifier, whereas with larger k it can approximate the optimal Bayes Classifier [19].
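
A minimal sketch of k-NN classification in Python with scikit-learn is shown below; X_train and y_train are hypothetical placeholders, and both k and the distance metric are treated as hyperparameters to be tuned by model selection, as discussed above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: X_train (n_samples x n_features), y_train (labels).
# k (n_neighbors), the neighbor weighting, and the distance metric are all
# hyperparameters; here they are chosen by cross-validated grid search.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={
        "n_neighbors": [1, 5, 15, 31],
        "weights": ["uniform", "distance"],   # optionally weight neighbors by distance
        "metric": ["euclidean", "manhattan"],
    },
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)    # selected k, weighting, and metric
```

For regression problems, KNeighborsRegressor can be substituted for the classifier in the same way.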

Method label: k-Nearest Neighbor (kNN)

Main Use

 • Classification and regression problems

Context of use

 • Classification or regression based on similarity between instances. Its use is most appropriate when a similarity function can be easily constructed

Secondary use

 • Density estimation

Pitfalls

Pitfall 3.2.7.1. The kNN asymptotic error is strongly influenced by the choice of k

Pitfall 3.2.7.2. The convergence of the kNN error to its large-sample value as a function of sample size is not well characterized [19]

Pitfall 3.2.7.3. In high-dimensional problems the distance metric values become similar for all instance pairs, which degrades accuracy

Pitfall 3.2.7.4. Applying kNN is computationally intensive unless special data structures are used

Pitfall 3.2.7.5. kNN needs to store the entire training data set

Pitfall 3.2.7.6. kNN will produce poor results when the wrong similarity function is used

Principle of operation

 • Prediction is based on the local density of the k nearest neighbors of the problem instance

Theoretical properties and empirical evidence

 • kNN is non-parametric

 • It can handle arbitrary decision boundaries

 • In the sample limit, the error of the 1-NN classifier (k = 1) is no more than twice the error of the Optimal Bayes Classifier [18]

 • Typical similarity functions used in kNN do not perform well in high-dimensional problems

 • Usually underperforms most modern classifiers in most applications

Best practices

Best practice 3.2.7.1. Use as comparator not as primary classifier.

Best practice 3.2.7.2. Optimize k with model selection.

Best practice 3.2.7.3. Use adaptive kNN for high dimensional data.

Best practice 3.2.7.4. Explore via model selection the right distance metric for the data at hand

References

 • Several textbooks, including [19,20,21]

 • Cover, T. and Hart, P., 1967. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), pp.21–27

Decision Tree Learning

A decision tree is a predictive model that can be used for classification or regression. Figure 6 depicts an example decision tree for classification built over discrete variables.

Fig. 6
A flowchart of the demographic variable classified into lab test j and imaging result k. The lab test j is divided into biopsy result n and probability equals 0.80. Imaging result k is divided into probability equals 0.20 and 0.65. The biopsy result is divided into probability equals 0.96 and 0.05.

Illustration of a decision tree classifying patients as having a positive outcome (e.g., successful treatment) depicted as green, or negative outcome (red). Nodes depicted with blue outline are internal nodes corresponding to observable variables (e.g., demographic, lab test, biopsy, or imaging test results). Branches correspond to the values of each variable. Nodes without children are called “leaves” and correspond to DT classifier decisions. Each decision has an associated probability for the values of the response (here, positive patient outcome)

Learning a decision tree from data is called decision tree induction. Induction algorithms partition the input space into a number of hyperrectangles such that the hyperrectangles are statistically homogeneous for one or the other target class. For example, in Fig. 6, the leaf with probability of a positive outcome =0.80 corresponds to patients with Demographic variable i = False and Lab Test j = Positive.

Induction of the optimal decision tree is NP-hard [20]. Algorithms used in practice achieve computational efficiency by trading off tree optimality against time complexity. Commonly used algorithms (e.g., ID3 [21]) work on the principle of recursive partitioning. Starting with the entire data set, a splitting attribute is selected as the root of the tree, and the data set is split into multiple partitions based on the value of the splitting attribute, such that the partitions are maximally enriched in instances of one class or another (equivalently, they have optimal predictivity up to that point). Each partition is then further split using the same strategy recursively until no more partitioning is possible (i.e., the algorithm runs out of sample or informative features). The various decision tree induction algorithms differ in the way the splitting criterion is selected, the stopping criterion, and the way they handle categorical, multi-level categorical, and continuous attributes. Often, DT induction algorithms do not have a global objective function; they operate greedily and offer no guarantees of optimality.
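
A minimal sketch of decision tree induction with scikit-learn follows; X_train, y_train, and feature_names are hypothetical placeholders. Printing the fitted tree as text illustrates the recursive partitioning and anticipates the rule interpretation discussed below.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data and feature names.
# Recursive partitioning with an entropy-based splitting criterion; depth is
# capped here only to keep the printed tree small.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)

# Each root-to-leaf path printed below corresponds to one conjunctive rule.
print(export_text(tree, feature_names=list(feature_names)))
```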

Main Properties of Decision Trees

  1. 1.

    Expressive power:

    1. (a)

      DTs can model any discrete function.

    2. (b)

      Predictions in discrete space are made at the leaf nodes and each leaf corresponds to a hyperrectangle. The decision boundary is therefore complex and non-parametric, and the sides of the hyperrectangles are parallel to the coordinate axes that span the input space.

  2. 2.

    Logic-based and concept-learning interpretation of DTs. Each path from the root of the tree to a leaf represents a conjunction (logical AND) of conditions. A tree can thus be translated into a set of conjunctive sentences that collectively define the target as a logical concept. There is therefore a close correspondence between DTs, logic, and concept learning.

    E.g., in the tree of Fig. 6 the Concept “(most likely) Positive Outcome” is defined by the DT as:

    (Positive Outcome = True) iff:

    (((Demographic variable i = False) AND (Lab Test j = Positive)) OR

    ((Demographic variable i = False) AND (Lab Test j = Negative) AND (Biopsy Result n = Negative)) OR

    ((Demographic variable i = True) AND (Imaging Result k = Positive)))

  3. 3.

    Rule set interpretation of DTs. A DT can also be thought of as a collection of rules of the type: “if the variables on a path have the observed instantiations depicted by the tree, then the decision is determined by the corresponding leaf’s probability”. Hence a DT is a system of rules, each one sufficient (when applicable) to establish a decision.

    E.g., in the tree of Fig. 6 the Concept “Positive Outcome” is defined by the DT as:

    RULE 1

    IF ((Demographic variable i = False) AND (Lab Test j = Positive)) THEN

    Outcome = Positive with probability 0.80

    RULE 2

    IF ((Demographic variable i = False) AND (Lab Test j = Negative) AND (Biopsy Result n = Negative)) THEN Outcome = Positive with probability 0.96

    RULE 3

    IF ((Demographic variable i = True) AND (Imaging Result k = Positive))

    THEN Outcome = Positive with probability 0.65

    RULE 4

    IF ((Demographic variable i = False) AND (Lab Test j = Negative) AND (Biopsy Result n = Positive)) THEN Outcome = Positive with probability 0.05

    RULE 5

    IF ((Demographic variable i = True) AND (Imaging Result k = Negative))

    THEN Outcome = Positive with probability 0.20.

    • The rules corresponding to a decision tree are individually correct when they match problem instances and hence fully modular (i.e., one rule does not affect the others and can be applied in isolation).

    • Moreover, the rules do not need to be chained: each inference made by the tree on a patient is the result of applying the single applicable rule.

    • The rules are mutually exclusive.

    • Finally, once the tree is constructed, the order of the variables in each path/rule does not matter: the conditions can be checked in any order. The order in which variables are considered during DT construction from data, however, is very important for the quality of the induced DT.

    Notice that such rule sets are trivial to understand. By comparison, rule-based systems in which rules have to be chained in complex forward/backward sequences have logic that cannot be understood easily by examining individual rules.

  4. 4.

    Subpopulation discovery/clustering interpretation of DTs. Each path of a DT describes a subpopulation that has a defined probability for the target variable, and the group members are homogeneous in their probability for the target and in their values for the features along the path. Thus a DT provides a grouping/clustering of the feature space/individual observations that is tied to the outcome. In our running example:

    GROUP 1: everyone who has:

    ((Demographic variable i = False) AND (Lab Test j = Positive))

    GROUP 2: everyone who has:

    ((Demographic variable i = False) AND (Lab Test j = Negative) AND (Biopsy Result n = Negative))

    Etc. for the other paths.

    From the tree:

    Group1 will have 0.80 probability of Outcome = Positive

    Group2 will have 0.96 probability of Outcome = Positive

    Etc. for the other groups.

    Because DTs can be understood in several ways (rules, concepts, groups) and in a modular manner, they are an exemplary model of interpretable and explainable ML (as long as the trees are of modest size).

  5. 5.

    A key problem with interpretability is subtree replication. Especially in the presence of noise, the exact same subtrees can appear under multiple nodes across the tree, which creates redundancy, hindering interpretation.

  6. 6.

    Decision trees are susceptible to overfitting. During tree induction, some method of protection against overfitting needs to be applied. Such methods include empirical testing on a validation set and regularizers (e.g., a minimum allowed number of instances per leaf, a maximum allowed tree depth, or a maximum allowed model complexity as part of the stopping criterion). Some of these can be enforced during learning, others after the tree has been created (a minimal code sketch follows this list).
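
As a minimal sketch of such overfitting protections (assuming scikit-learn and hypothetical X_train, y_train), the common regularizers can be tuned by cross-validated model selection rather than fixed a priori:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data. The grid covers typical DT regularizers:
# tree depth, minimum instances per leaf, and cost-complexity pruning strength.
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={
        "max_depth": [2, 3, 5, None],
        "min_samples_leaf": [1, 10, 50],
        "ccp_alpha": [0.0, 0.001, 0.01],   # post-hoc cost-complexity pruning
    },
    cv=5,
)
search.fit(X_train, y_train)
```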

Method label: Decision Trees

Main Use

 • Classification with maximum interpretability

 • As component of ensemble and boosted algorithms

Context of use

 • DTs are highly interpretable: a decision tree can be translated into a set of rules. Interpretation is the main reason to use a decision tree

 • DTs are non-parametric, non-additive models that can represent linear and nonlinear relationships.

 • DTs are commonly used as a base learner in an ensemble

 • Baseline comparator

Secondary use

 • Regression problems (regression trees)

 • Explaining black box ML models by converting them to functionally equivalent DTs

Pitfalls

Pitfall 3.2.8.1. A single DT on its own typically does not have very high performance

Pitfall 3.2.8.2. Depending on how they are used, DTs can be unstable (which, incidentally, makes them a good choice for bagging)

Pitfall 3.2.8.3. If not regularized, DTs have a tendency to overfit

Pitfall 3.2.8.4. High dimensionality is a problem when feature selection has not been pre-applied

Principle of operation

 • Recursive partitioning of the data. The inductive algorithm may build a DT following a partitioning strategy, or may follow other procedures (e.g. genetic algorithms or other search). Once a DT is built, however, it encodes a partitioning of the data

Theoretical properties and empirical evidence

 • DT induction is NP-hard. Greedy algorithms are used, which provide no guarantees of optimality

 • Unless regularization and feature selection are applied, they have a tendency to overfit

 • DTs are highly interpretable

 • In an ensemble learning context, DTs can handle very high dimensionality, noise, and can identify interaction terms

Best practices

Best practice 3.2.8.1. Use DT for interpretable modeling alone or in conjunction with other methods

Best practice 3.2.8.2. Use for target variable-specific subpopulation discovery

Best practice 3.2.8.3. Use as baseline comparator method

Best practice 3.2.8.4. Use in ensembling (boosting or bagging (Random Forest)) algorithms

References

Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993

Several textbooks e.g. [19,20,21]

Clustering

The family of clustering techniques deals with grouping objects into meaningful categories (e.g., subjects into disease groups or treatment-response groups). Typically, the produced clusters contain objects that are similar to one another but dissimilar to objects in other clusters [19, 20]. As has been shown mathematically, there is no single measure of similarity that is suitable for all types of analyses, nor is clustering the most powerful method for every type of analysis even when it is applicable. As a result, this generally useful set of methods is often abused or misused [22]. A minimal sketch of legitimate exploratory use is shown below; we then demonstrate the dangers with two figures.
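
The sketch below illustrates exploratory use with a hypothetical data matrix X and an assumed choice of k-means with three clusters; no outcome labels are involved, so the resulting groups must not be interpreted predictively or causally.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data matrix X (rows = subjects, columns = variables).
# Standardize so that no single variable dominates the Euclidean similarity.
X_std = StandardScaler().fit_transform(X)

# Exploratory grouping only: the number of clusters and the metric are
# analyst choices that inductively bias the result.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```

Cluster sizes and centroids can then be inspected for summarization and visualization purposes.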

Figures 7 and 8 show that it is not possible for a clustering algorithm to anticipate predictive use of the clusters if the clustering algorithm does not have information about the labels (or other assumptions that amount to such information).

Fig. 7
An illustration of a cluster of distributed data points.

Demonstration of the fallacy of using unsupervised clustering for predictive modeling. Question: what is a good clustering of the above data?

Fig. 8
A set of 3 illustrations of a cluster of dual data points. The data points are separated by a concave upward line in A, a right diagonal line in B, and a vertical line in C.

Demonstration of the fallacy of using unsupervised clustering for predictive modeling, continued. Effect of the target function that generates data labels. As shown, identifying “good” (in the predictive sense) clusters of the above (and any) data absolutely depends on the values of the target function. Case (a) left, case (b) middle, and case (c) right involve exactly the same data in terms of input space but with different target functions defined over the input data (positive class depicted in red, negative in black). As a result, classifiers (blue lines) have to take the class labels into account in order to be accurate. The resulting classifiers are radically different and cannot be identified without reference to the target function that generates the class assignments

Best Practice 3.2.9

Do not use unsupervised clustering to produce groups that you intend to use predictively. Use predictive modeling, instead. Once accurate and interpretable classifiers have been built, subpopulations or other useful clusters can be extracted (see section on Decision Trees for a detailed example).

Clarifying Misconceptions About Unsupervised/Supervised Learning, Similarity/Distance-Based Learning, and Clustering

  1. (a)

    As explained in chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”, a common way that unsupervised clustering apparently yields decent predictive performance (e.g., in some parts of the high-throughput genomics literature) is that a pre-selection of features was performed on the data based on how strongly the features correlated with the outcome. The analysts in these cases inductively biased the clustering toward classification of the desired response variable. This inductive bias is not enough to lead a clustering algorithm to optimal accuracy levels, however. Moreover, if such feature pre-selection is not done in a nested cross-validated manner (see chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”), the resulting classification accuracy estimates will be highly biased (in the error estimation sense) and generalization will suffer.

  2. (b)

    Because clustering is often defined in terms of similarity, it may be tempting to think that all similarity-based classification is predictively flawed. In reality, powerful classifiers exist that use distance and similarity functions (e.g., kNN, SVMs; see the corresponding sections in the present chapter); however, they are supervised (i.e., they approximate a particular target function that generates the data), which, along with the rest of their design, makes them capable of accurate predictive modeling.

  3. (c)

    Conversely, and to reiterate the earlier point, the existence of good similarity-based predictive modeling does not mean that an unsupervised method, such as clustering, can be successful for predictive modeling.

  4. (d)

    Finally, users of learning methods such as BNs and Causal Probabilistic Graphical Models (CPGMs) do not specify a target response variable, and some may confuse them with unsupervised methods. In reality, because they model the joint probability distribution (and, for CPGMs, the underlying causal generating function), they are supervised learners for all variables, each of which can be used as a potential target response at model execution time (see also chapter “Foundations and Properties of AI/ML Systems”, section on flexible modeling with BNs).

Importance of Feature Selection for Similarity-Based Classification

If the features containing all information about the response variable are not included in the training data, then the predictive accuracy of similarity-based classifiers (and of any other sufficiently powerful classifier family) will be compromised relative to the best classifier that could otherwise be built. Compromised performance may also result from using feature selectors that admit large numbers of redundant or irrelevant features; these can overwhelm classifiers that, for example, lack sufficiently strong regularization or other anti-overfitting measures. In high-dimensional spaces it is critical to apply sound feature selection algorithms so that the choice of features enhances, rather than hinders, classification.

The following table summarizes essential properties of clustering algorithms.

Method label: clustering

Main Use

 • Group data for exploratory analysis, summarization and visualization purposes

Context of use

 • Summarize, visualize and explore data (e.g., outliers, apparent distribution mixtures, etc.)

Secondary use

 • Often used inappropriately for causal discovery and classification purposes

Non-recommended Uses and Pitfalls

Pitfall 3.2.9.1. Clustering variables should not be used to infer that they are causally related (a common mistake in genomics literature)

Pitfall 3.2.9.2. Clustering individuals should not be used to build classifiers (also a common mistake in genomics literature)

Pitfall 3.2.9.3. Choice of similarity metric, algorithm, and parameters inductively biases results toward specific groupings. There is no such thing as “unbiased” clustering analysis

Principle of operation

 • Typically unsupervised method (i.e., clustering algorithms do not have access to response variable values)

 • Group together similar data instances and keep dissimilar instances apart

 • In other versions, cluster variables, or cluster variables and instances simultaneously

 • Use similarity metrics and algorithms that employ those to create clusters

 • Clusters can be overlapping or mutually exclusive (soft vs hard clustering)

 • Clusters can be hierarchically organized or distinct

Theoretical properties and Empirical evidence

 • Across all possible uses, all clustering algorithms are on average equivalent (“Ugly Duckling Theorem”)

 • Clustering lacks the inductive bias, information capacity, and computational characteristics needed for causal discovery

Best practices

Best practice 3.2.9.1. Do not use clustering to discover causal structure

Best practice 3.2.9.2. Do not use clustering to create accurate classifiers

Best practice 3.2.9.3. Tailor the use of clustering algorithm and metric to the problem at hand

Best practice 3.2.9.4. Derive predictive subgroups from properly-built and validated classifiers (Decision Trees are particularly good candidates—See “Decision Tree Learning”)

Best practice 3.2.9.5. Perform sensitivity analysis to study the impact of the choice of parameters, metrics and algorithms

Best practice 3.2.9.6. Repeat and summarize multiple runs of randomized clustering algorithms

Best practice 3.2.9.7. Either start from causal and predictive algorithms and use clustering to summarize and visualize their results, or start from clustering analysis of preliminary data to inform the subsequent design and analysis with focused techniques

References

 • Dupuy, A. and Simon, R.M., 2007. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute, 99(2), pp.147–157.

 • Several textbooks [19,20,21]

Ensemble Methods

So far, a single predictive model has been trained on a training data set and the prediction from this model is the final prediction. In ensemble learning, an ensemble, i.e., a set of models, is trained on the training data and their outputs are combined into a final prediction. Figure 9 illustrates the ensemble learning process. A set of m models, called base learners, L1, …, Lm are trained on their corresponding data sets X1, …, Xm. The data set Xi can be the data set X itself or a sample from it. Predictions from the m base learners are combined using the meta learner L*. The meta learner can be as simple as majority voting or as complex as a neural network. The different ensemble learning methods are distinguished by (i) how they generate the data sets Xi from X, (ii) the base learners they use, and (iii) the meta learner.

Fig. 9
A flowchart of the ensemble learning. The dataset X is divided into X 1, X 2, X m minus 1, and X m. They are followed by L 1, L 2, L m minus 1, and L m, respectively. They lead to L to the power asterisk.

Generic form of Ensemble Learning

The key motivation for using ensembles is improving predictive performance. This is achieved in four possibly overlapping ways. First, if we assume that the base learners make mistakes independently, as long as the base learners achieve better (lower) than random error rate, the ensemble will have a lower error rate than the individual base learners. In practice, the base learners do not make errors independently, but the ensemble still tends to achieve better performance than the base learners. Second, the ensemble learning framework allows us to combine models of different characteristics, potentially allowing for compensating for biases inherent in some of the methods. Third, ensembles of various types of base learners can expand the base learners’ expressive capabilities: the ensemble can express much more complex relationships than base learners. Fourth, the ensemble may reduce the variance of a collection of models built on small samples.

Model Stacking

Model stacking is essentially the generic form of ensemble learning. The data sets Xi can be the original data set (X), a bootstrap resample of X, or a subsample of X. For high-dimensional data sets, Xi can be a random projection of X, which helps most when X consists of highly redundant features. For low dimensional X, Xi can consist of random linear combinations of the features in X. The base learners can be of any type and no uniformity is required: the ensemble can contain different types of models. Finally, the meta learner can be any sufficiently powerful ML algorithm; as of late, a common choice is neural networks.

Expressive ability. The relationship that ensembles can model between the inputs and the output depends on the base learners’ expressive ability. In practice, model stacking can increase the base learners’ expressive ability.

Example. We show a stack of logistic regression models, where both the base learners and the meta learner are logistic regression models. The ensemble is applied to the problem depicted in Fig. 10.

Fig. 10
A dual scatterplot of the X 2 versus X 1. The plots are clustered on the corners, separated by dotted, broken, and straight lines, forming an irregular plus sign at the center.

Illustration of an ensemble of 10 logistic regression models with a logistic regression meta learner on a two-dimensional classification problem

This is a two-class classification problem using a two-dimensional data set with strong interactions (almost an exclusive OR). The two colors (orange and blue) represent the two classes. The ensemble consists of 10 logistic regression models, each trained on a random subsample of the original data X. The meta learner is also a logistic regression model. The solid line in the figure is the contour line corresponding to 20% probability of positive, the dotted line to 50%, and the dashed line to 80% probability of positive class. Although the individual logistic regression models cannot solve this problem, the ensemble can.

Note on the difference between stacked logistic regressions and neural networks. This example of stacked logistic regressions appears similar to a 2-layer neural network with sigmoid activation; however, in a 2-layer neural network the “logistic regression models” on the first layer are optimized together with the second layer, while in stacked regression the first-layer models are constructed first (on random subsamples) with the actual output as the dependent variable, their coefficients are fixed, and only then is the second-layer model (the meta learner) constructed. While the second-layer logistic regression is being fitted, the coefficients of the first layer are not modified.
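
A minimal sketch of the stacked logistic regressions described above is given below (hypothetical X, y; scikit-learn assumed): base models are fitted on random subsamples, their coefficients are then frozen, and the meta learner is fitted on their outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-classification data: X (n x p), y with labels in {0, 1}.
rng = np.random.default_rng(0)
n = len(y)

# Step 1: fit base learners on random subsamples; their coefficients are then fixed.
base_models = []
for _ in range(10):
    idx = rng.choice(n, size=n // 2, replace=False)
    base_models.append(LogisticRegression().fit(X[idx], y[idx]))

# Step 2: fit the meta learner on the (frozen) base-learner outputs.
Z = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
meta = LogisticRegression().fit(Z, y)

# Prediction (here applied back to the training data for illustration):
p_hat = meta.predict_proba(Z)[:, 1]
```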

Method label: model stacking

Main Use

 • Classification or regression

Context of use

 • This is a generic form of ensemble learning

 • There are four main reasons for using ensemble learning:

  – Ensembling can reduce error by taking advantage of independent errors of the base learners

  – Base learners have different inductive biases

  – The ensemble can expand the base learners’ hypothesis spaces by combining their spaces

  – The ensemble may reduce variance of base learners

Secondary use

 • Other types of learning (not only classification or regression) can be ensembled

Pitfalls

Pitfall 3.3.1.1. Interpretability suffers

Pitfall 3.3.1.2. Increased computational cost

Pitfall 3.3.1.3. Potentially additional data may be needed for training the ensemble

Principle of operation

 • Multiple base learners are built and then a meta learner is trained on the output of the base learners and the actual outcome. The prediction from the meta learner is the prediction of the ensemble

Theoretical properties and empirical evidence

 • Stacking draws its formal foundations from fundamentals of ML theory plus the theory of weak learner boosting and bagging

 • Several stacking algorithms exhibit best-in-class performance in various domains [23, 24]. Individual algorithms (e.g., AdaBoost, random forests) have distinct characteristics (described below)

Best practices

Best practice 3.3.1.1. Consider stacking as a high priority choice of algorithm when high performance is needed, base algorithms do not perform well, and interpretability is not a strong requirement

References

 • Wolpert (1992). “Stacked Generalization”. Neural Networks 5(2): 241–259

 • Breiman, Leo (1996). “Stacked regressions”. Machine Learning. 24: 49–64.

 • Textbooks [20]

Bagging

Bagging, also known as bootstrap aggregation, forms the data sets for the base learners as a bootstrap resample (i.e., sample with replacement) of the original dataset. The base learner can be any learning algorithm and the “meta learner” is simply majority voting (or average in the continuous outcome case).

The key benefit of bagging is reducing the variance of the base learner. If a classifier’s predictions are sensitive to minor perturbations in the data, the generalization error is negatively affected by the variance of the base learner. Bagging can help reduce such random fluctuations across possible training samples. If the base learner is robust to small perturbations, then the generalization error is caused mainly by bias in the base learner, and bagging cannot help. Thus, bagging is most useful in conjunction with base learners that have high variance (such as decision trees).

Figure 11 illustrates bagging. The left panel shows a one-dimensional data set, where the horizontal axis shows the data (“x”) and the vertical axis is simply a jitter-plot visualization (i.e., the y axis spreads the corresponding x points randomly so that data points on the x axis are not plotted on top of each other). Red points represent the positive class and black points the negative class. The true decision boundary for the positive class is x ≤ −2 or x ≥ 2; this is shown as two dashed lines. We constructed 200 decision stumps (decision trees of depth 1, with a single binary split of the input dimension) on bootstrap samples of this data set. Such stumps can accurately learn the target function in x ≥ 2 or in x ≤ −2. The right panel shows the 200 decision boundaries from these 200 decision trees as dotted gray lines. The horizontal axis is the data dimension (“x”) and the vertical axis is the probability (from the trees) of an instance with the corresponding x value belonging to the positive class. The blue line corresponds to the bagged prediction using the previously constructed 200 trees.

Fig. 11
A set of 2 graphs. A, A scatterplot of jitter versus X. One of the plots is clustered at the center while the other 2 are on the sides, separated by vertical lines. B, A line graph of the probability of positive class versus x. The line follows a parallel trend, descends, moves flat, rises, and moves flat.

Illustration of bagging

Bagging can also help increase the expressive capability of the base learner. Figure 11 demonstrates, in a simple example, that improved accuracy can be achieved by bagging weak learners. Bagging decision stumps, for example, can form a decision boundary that would otherwise require deeper (more than one-level) trees. Figure 11 also shows that the bagged classifier smooths the decision surface.

A bagged set of models is also more resilient to overfitting than any single model. Overfitting that arises because a predictor randomly aligns with the outcome in a specific sample is unlikely to recur in a different sample; therefore, only one (or very few) trees are affected by that particular random alignment. Other random alignments can occur in other samples, but they tend to average out. Thus overfitting is still quite possible at the level of the individual model, but much less so at the level of the bag of models.
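
A minimal sketch of the experiment described above (bagging decision stumps on bootstrap resamples and averaging their predicted probabilities) might look as follows; X and y are hypothetical, and the code assumes both classes appear in every resample.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: X (n x p), y with labels in {0, 1}.
rng = np.random.default_rng(0)
n = len(y)

# Fit 200 decision stumps, each on a bootstrap resample (sampling with replacement).
stumps = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

# The "meta learner" is simple averaging of the stumps' predicted probabilities
# (equivalently, majority voting on the hard labels).
p_bagged = np.mean([s.predict_proba(X)[:, 1] for s in stumps], axis=0)
```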

Random Forests

A Random Forest (RF) is an ensemble classification method that uses decision trees as the base learning algorithm and combines the following four ideas: (a) bagging of decision trees, (b) random feature projection, (c) off-training-sample error estimation, and (d) model complexity restrictions.

The RF generates Xi, the data set on which the ith tree is grown, by bootstrapping once at the start of modeling. The features are randomized at each node expansion step of each tree. Feature randomization increases the independence among the trees in the forest, while bagging aims to reduce the variance of the predictions of each tree. The predictions from the trees in the forest are combined using majority voting or averaging (for continuous outcomes).
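
In practice these four ideas are packaged in standard libraries; a minimal scikit-learn sketch (hypothetical X_train, y_train) is shown below.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data. Bootstrapping and per-node feature randomization
# are handled internally; tree size is restricted via min_samples_leaf.
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",    # random subset of features considered at each split
    min_samples_leaf=5,     # model complexity restriction
    oob_score=True,         # off-training-sample (out-of-bag) error estimate
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.oob_score_)        # internal estimate; still validate on independent data
```

Consistent with Best practice 3.3.3.2 below, the out-of-bag estimate should not replace an independent, unbiased error estimate.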

Method label: random forest

Main Use

 • Classification

Context of use

 • RFs achieve high predictive performance

Secondary use

 • Regression, time-to-event (Random Survival Forest)

Pitfalls

Pitfall 3.3.3.1. As with all ensemble techniques interpretability suffers

Pitfall 3.3.3.2. If the size of the trees is not restricted, RFs can overfit

Principle of operation

 • Bagging trees

 • Random projection of the features

 • Off-training sample error estimation

 • Model complexity restrictions

Theoretical properties and Empirical evidence

 • Overfitting is controlled

 • In practice, RFs can handle very high dimensional problems

 • Excellent empirical performance observed in benchmarks across biomedical domains [25]

Best practices

Best practice 3.3.3.1. Use as primary or high priority choice when high predictivity is required and interpretability is not of high importance

Best practice 3.3.3.2. Do not rely on internal error estimation but use an independent unbiased error estimator

Best practice 3.3.3.3. When feature selection is important, combine with an external feature selection algorithm

Best practice 3.3.3.4. Control DT size using, as a starting point, the recommendations of the inventor in the original publication [24]

References

 • Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5–32

 • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, second edition, Chapter 15. Springer, 2009

 • Tan, Steinbach, Karpatne, Kumar. Introduction to data mining, second edition. Chapter 4.10.6., 2018, Pearson

Boosting

Boosting creates an ensemble of base models sequentially, where the ith base model is aimed at correcting errors made by the ensemble of the previous i-1 base models. The resulting ensemble is an additive model, where the final prediction is the sum of the predictions from the base models. The data set Xi for the ith model can be the data set X itself, a bootstrap resampled version of X, or a weighted version of X to emphasize difficult-to-classify instances.

Consider a boosted ensemble of m models. The prediction for an instance Xi is

$$ {y}_i=f\left({\sum}_{j=1}^m{m}_j\left({X}_i\right)\right) $$

where f(·) is the link function and mj(·) is the jth model. Similarly to GLMs, f(·) is the identity for Gaussian outcomes, logit/expit for binomial outcomes, exp for counts and survival outcomes, etc. The ensemble is built up iteratively, following a gradient descent procedure

$$ {M}_{j+1}={M}_j+\gamma \frac{\partial l}{\partial {M}_j}, $$

where Mj is the ensemble after the jth iteration, l is the log likelihood of Mj, and γ is the learning rate. For the exponential family of distributions, including the Gaussian, binomial (and multinomial), and Poisson (survival) outcomes, the derivative of the log likelihood is the residual. Thus the (j + 1)st model is fitted to the residuals of the ensemble Mj, which gives a very straightforward interpretation: each new base model is built on the errors (residuals) of the current ensemble and thus aims to correct the mistakes made by the previous j base models.
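
For a continuous (Gaussian) outcome with squared-error loss, the residual-fitting procedure above can be sketched directly; X and y are hypothetical and decision stumps serve as base learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: X (n x p), continuous outcome y.
learning_rate = 0.1
n_iter = 200

pred = np.full(len(y), y.mean())      # M_0: constant initial prediction
stumps = []
for _ in range(n_iter):
    residual = y - pred               # gradient of the Gaussian log likelihood
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += learning_rate * stump.predict(X)   # M_{j+1} = M_j + gamma * new model
    stumps.append(stump)
```

The number of iterations (n_iter here) plays the role of the hyperparameter h discussed next and should be chosen by model selection.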

Gradient boosting can overfit, but typically only slowly with the number of modeling iterations. The number of base models in a boosted model (the number of iterations to perform) is a hyperparameter h. It has been observed that models consisting of substantially more base models than optimal still overfit only minimally. A potential reason is that once a model starts to overfit, adding further base models has very little impact. However, boosting will eventually overfit, so overfitting should be controlled through h.

Gradient Boosting Machines (GBM) are a special case of boosting, where the base learners are decision trees, often decision stumps (1 level-deep trees). Similar to bagging, boosting can also expand the base model’s expressive capability.

AdaBoost is another special case of boosting, which relates to generic gradient boosting in the same way that the Fisher scoring (or Newton-Raphson) optimization algorithm relates to gradient descent. AdaBoost is specifically designed for binomial (and multinomial) outcomes.

The (j + 1)st model is added to the ensemble using

$$ {M}_{j+1}={M}_j+\gamma {\left(\frac{\partial^2l}{\partial {M}_j^2}\right)}^{-1}\frac{\partial l}{\partial {M}_j}, $$

where \( \frac{\partial^2l}{\partial {M}_j^2} \) is a diagonal matrix containing the second derivatives of the log likelihood l. For binomial outcomes, the second derivative is p(1-p), where \( {p}_i= expit\left({M}_j\left({X}_i\right)\right)=\frac{\exp \left({M}_j\left({X}_i\right)\right)}{1+\exp \left({M}_j\left({X}_i\right)\right)} \).

AdaBoost is implemented by weighting the ith training instance by (pi(1 − pi))−1 and using the residuals ri = yi − pi as the dependent variable.
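
In practice AdaBoost is available in standard libraries; a minimal scikit-learn sketch (hypothetical X_train, y_train) is shown below, where the library's default base learner is a decision stump.

```python
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical binary-classification training data. The default base learner is
# a depth-1 decision tree (decision stump); the number of base models and the
# learning rate control the amount of boosting.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
ada.fit(X_train, y_train)
```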

Method label: gradient boosting machines (GBM)

Main Use

 • Classification

Context of use

 • GBMs offer high predictive performance

 • Interpretability is limited to variable importance

Secondary use

 • Exponential family distributions, including time-to-event

Pitfalls

Pitfall 3.3.4.1. As with all ensemble techniques, interpretability suffers

Pitfall 3.3.4.2. If the complexity parameter is not controlled, GBMs can overfit

Principle of operation

 • Gradient Boosting with decision stumps as base learners

 • Gradient descent (GBM) or Fisher-scoring (AdaBoost) in model space

 • Can be thought of as successively reducing the residual errors from previous cycles of modeling.

Theoretical properties and empirical evidence

 • Very expressive

 • Overfitting can be controlled

 • Can handle very high dimensional data

 • For AdaBoost, theoretical (but loose) error bounds are proven.

 • Certain boosting algorithms (e.g., Adaboost) have sensitivity to noise

 • Top performer in several types of problems/data

Best practices

Best practice 3.3.4.1. Use as primary or high priority choice when high predictivity is required and interpretability is not of high importance

Best practice 3.3.4.2. When feature selection is important, combine with an external feature selection algorithm

Best practice 3.3.4.3. Control overfitting by restricting the number of iterations (number of trees)

Best practice 3.3.4.4. If data is noisy, prefer noise-robust variants

Best practice 3.3.4.5. Select the appropriate link function for exponential family outcomes

References

 • Schapire, R.E., 1990. The strength of weak learnability. Machine learning, 5, pp.197–227

 • Freund, Y., 1999, July. An adaptive version of the boost by majority algorithm. In proceedings of the twelfth annual conference on computational learning theory (pp. 102–113)

 • G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172–181

 • Long, P.M. and Servedio, R.A., 2008, July. Random classification noise defeats all convex potential boosters. In proceedings of the 25th international conference on machine learning (pp. 608–615)

Regularization

A problem is considered high-dimensional when the number of predictors is comparable to or exceeds the number of observations. When the number of predictors equals the number of observations, an OLS regression fit can be an exact fit, with 0 training error. Such a model is most likely overfitted. For other types of models as well, when the number of predictors is very close to or exceeds the number of observations, overfitting becomes highly likely. We start our discussion with an explanation of regularization, a general solution to the high-dimensionality problem, and next we discuss ways in which regularization can be added to various modeling techniques.

Regularization from a Bias/Variance Perspective

The material builds on the BVDE in chapter “Foundations and Properties of AI/ML Systems”. When a model is overfitting and has high variance and low bias, it can be advantageous to increase bias provided that it reduces variance. One way to achieve this is to reduce complexity and another way, the subject of this section, is regularization [26].

We previously saw how SVMs perform regularization and that the resulting RFE-SVM feature selector is one of the strongest feature selectors (second only to Markov boundary methods in empirical performance across various domains). See section “Support Vector Machines” for SVM regularization, and section “Feature Selection and Dimensionality Reduction” for feature selectors. See also chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring Problems, and the Role of Best Practices” on when features should be interpreted causally or not.

Next, we examine the elastic net family of regularizers, which we will call Maximum Likelihood (ML) regularizers.

Let l(β) denote the log likelihood function, with β representing the model parameters. Model fitting without regularization solves

$$ {\max}_{\beta }l\left(\beta \right). $$

ML regularization adds a penalty term P(β)

$$ {\max}_{\beta }\ l\left(\beta \right)-\lambda P\left(\beta \right), $$

where λ controls the amount of regularization. The different ML regularization methods differ in the form of P(β). Table 2 shows the most common regularization terms.

Table 2. Common penalty (regularization) terms

ML regularization modifies the coefficients by pulling them away from the maximum likelihood estimate. The maximum likelihood estimate is unbiased, thus regularization introduces bias in the hope of reducing variance. The various ML regularization methods differ in the way they modify the coefficients. The lasso shrinks the coefficients towards 0 and can set some of them exactly to 0; this allows the lasso to be used as a feature selector. The lasso penalty is therefore in principle most useful when some of the features are not truly related to the outcome. In contrast, the ridge penalty shrinks coefficients towards zero without actually setting them to 0. The elastic net blends these two penalties.

The adaptive lasso [27] also addresses the situation where some of the variables are not related to the outcome. However, by weighting the penalty on the individual coefficients, it aims to shrink important variables less and to eliminate variables that are not related to the outcome (shrink their coefficients all the way to 0). The property that, with probability approaching one, the adaptive lasso assigns non-zero coefficients precisely to the relevant variables is called the oracle property. Feature selection is discussed in more detail in the section “Feature Selection”.

The adaptive lasso proceeds in two steps. First, a regular lasso model is constructed. In the second step, an adaptive lasso model is constructed, with the penalty weight of each coefficient being the inverse of the (absolute value of the) corresponding coefficient from the first lasso.
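
A minimal sketch of this two-step procedure follows (hypothetical standardized X and outcome y; scikit-learn assumed). It uses the standard reformulation in which weighting each coefficient's penalty by the inverse of the initial estimate is equivalent to rescaling the corresponding column of X.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Hypothetical data: standardized X (n x p), continuous outcome y.
# Step 1: initial (regular) lasso fit.
beta_init = LassoCV(cv=5).fit(X, y).coef_

# Step 2: adaptive lasso via column rescaling. The penalty weight for coefficient j
# is 1 / |beta_init_j| (a small epsilon avoids division by zero).
w = 1.0 / (np.abs(beta_init) + 1e-8)
X_scaled = X / w                                        # divide each column j by w_j
beta_scaled = Lasso(alpha=0.1).fit(X_scaled, y).coef_   # alpha should be tuned by CV
beta_adaptive = beta_scaled / w                         # back to the original scale
```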

We will discuss overfitting in chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”. There are many reasons for overfitting, including high dimensionality with many irrelevant features, and highly correlated features in a problem of moderate dimensionality. In the former case, the lasso or adaptive lasso is most appropriate: the lasso will discard some of the features. In the latter case, lasso and ridge have different effects. When overfitting occurs as a result of high collinearity, the OLS estimator tends to set the coefficients of collinear variables to very high positive and similarly high negative values. The lasso will select one of the correlated features and set the coefficients of the others to zero, while ridge will set the coefficients to similar values across the correlated features. Ridge thus prevents the coefficients from taking extreme values.

Bias and correctness. Since the penalties aim to trade bias for variance, the estimates are likely biased. The lasso has feature selection ability; however, it is not guaranteed to find the correct support (the exact set of variables that are predictive of the outcome). The adaptive lasso in theory finds the correct support and may help reduce or even eliminate the bias.

Method label: penalized regression

Main Use

 • Predictive modeling with exponential family outcomes

Context of use

 • High-dimensional data

 • Data with collinear variables

Secondary use

 • Lasso can be used for variable selection

Pitfalls

Pitfall 3.4.1.1. The models assume linearity and additivity. They are not appropriate if these assumptions are violated

Pitfall 3.4.1.2. Penalized regression does not have the same interpretability as unregularized regression models. Specifically, the coefficients cannot be used to estimate effects of an exposure on the outcome controlling for confounders given to the model

Pitfall 3.4.1.3. The theory of optimality of the ML regularized regression does not address the selection of strongly vs weakly relevant vs irrelevant features which is essential to feature selection

Pitfall 3.4.1.4. Extensive benchmark results show weak empirical feature selection performance across many datasets, loss functions and comparator algorithms

Principle of operation

 • Regression models

 • Biases the coefficient estimates to reduce variance (bias-variance tradeoff)

Theoretical Properties and empirical evidence

 • They tend to give biased estimates (on purpose)

 • BVDE give theoretical support to the means of operation

 • “Oracle property” (as defined by [27])

Best practices

Best practice 3.4.1.1. Penalized regression can operate in high dimensional datasets that classic regression cannot handle at all

Best practice 3.4.1.2. Use as comparator method along with others as appropriate for the application domain

Best practice 3.4.1.3. May be useful for feature selection, but not as first-choice methods

Best practice 3.4.1.4. When non-linear regularized models are needed, consider link functions that model the non-linearity, as well as kernel SVMs and kernel regression

References

Hastie, T., Tibshirani, R., Friedman, J.H. and Friedman, J.H., 2009. The elements of statistical learning: Data mining, inference, and prediction (Vol. 2, pp. 1–758). New York: Springer

Regularizing Groups of Variables

In the previous section, Lasso was used for variable selection, primarily in the context when many variables could be assumed irrelevant to the outcome. Variables may be related to each other and form groups. It may be useful to select variables on a per-group basis [28].

Assume that variables form K groups with pk variables in the kth group. Let β(k) denote the coefficients of the variables in group k. With l denoting the log likelihood function, the group lasso is formulated as

$$ {\max}_{\beta }\ l\left(\beta \right)-\lambda {\sum}_{k=1}^K\sqrt{p_k}{\left|\left|{\beta}^{(k)}\right|\right|}_2. $$

The sparse group lasso [29] formulation

$$ {\max}_{\beta }\ l\left(\beta \right)-\lambda \alpha {\left|\left|\beta \right|\right|}_1-\lambda \left(1-\alpha \right){\sum}_{k=1}^K\sqrt{p_k}{\left|\left|{\beta}^{(k)}\right|\right|}_2 $$

also has a (regular) lasso penalty, allowing for selecting variables on a per-group basis and further selecting variables within the groups.

Regularizing Partial Correlations

The precision matrix, that is, the inverse of the variance-covariance matrix, is a key parameter of a multivariate normal distribution. Regularization is necessary for estimating the precision matrix when sufficiently many observations are not available.

Let Θ denote the precision matrix and S the sample covariance matrix. The regularized estimate of Θ is computed as

$$ {\max}_{\varTheta \succeq 0}\ \log\ \det\ \varTheta - tr\left(S\varTheta \right)-\lambda {\sum}_{j\ne k}\left|{\varTheta}_{jk}\right|. $$

The inverse covariance matrix encodes the partial correlations. In the multivariate normal case, two random variables Xi and Xj are independent if their covariance is 0, and they are conditionally independent (conditioned on all other variables) if the ijth element of the inverse covariance matrix is 0 [30].
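
The optimization above is implemented, for example, in scikit-learn's graphical lasso; a minimal sketch with a hypothetical data matrix X follows. Larger values of the penalty parameter yield sparser precision matrices, i.e., more partial correlations set exactly to zero.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical data matrix X (rows = observations, columns = variables).
model = GraphicalLasso(alpha=0.05).fit(X)   # alpha is the regularization strength

precision = model.precision_                # regularized inverse covariance matrix
# Zero off-diagonal entries imply conditional independence given all other variables.
cond_indep = np.isclose(precision, 0.0)
```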

Regularization to Constrain the Search Space of Models

So far, we have shown examples of regularization to avoid overfitting, either by computing a sparse solution (some parameters/coefficients are set to exactly 0) or by introducing bias to reduce variance. We have done so mostly in the context of regression.

Regularization is more general. It can be broadly viewed as a means of constraining the search space of models in order to confer some desirable property on the model. It is not limited to regression, and although it is often used in conjunction with likelihoods, this is not a requirement: it can be used with any kind of quasi-likelihood or arbitrary loss function. Constraining the search space of models is arguably the most common use of regularization, and providing an exhaustive overview is impractical because new applications are continuously being developed. In this section, we simply show some examples.

Knowledge distillation. Deep learning models are regarded as complex, highly performing models. Very complex DL models have been trained in many areas and can be further modified for particular applications (see chapters “Considerations for Specialized Health AI and ML Modelling and Applications: NLP,” “Considerations for Specialized Health AI and ML Modelling and Applications: Imaging—Through the perspective of Dermatology”). These modified models are still very large, and deploying such models into resource-limited environments is difficult. The teacher-student paradigm consists of a large teacher model used to train a small student model. Knowledge distillation is the process of generating smaller student models (models with fewer parameters), deployable in resource-limited environments, that are trained on the basis of the more complex teacher models. Knowledge distillation is often implemented using regularization: the student model is regularized so that it resembles the teacher model in various aspects [31].
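
A minimal sketch of one common form of distillation objective is shown below (an assumed formulation, not the only one): the student is fitted to a weighted combination of the usual loss on the true labels and a regularization term pulling its temperature-softened outputs toward the teacher's.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, y_onehot,
                      temperature=2.0, alpha=0.5):
    """Hypothetical distillation objective: alpha weighs the hard-label loss
    against the teacher-matching (regularization) term."""
    p_student = softmax(student_logits)
    hard = -np.mean(np.sum(y_onehot * np.log(p_student + 1e-12), axis=1))

    p_teacher_T = softmax(teacher_logits, temperature)
    p_student_T = softmax(student_logits, temperature)
    soft = -np.mean(np.sum(p_teacher_T * np.log(p_student_T + 1e-12), axis=1))

    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft
```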

Learning DAGs. Traditionally, learning DAGs is a combinatorial optimization problem. However, the NOTEARS method introduced a penalty term, applicable to weight matrices, that enforces DAG-ness when this weight matrix is used as an adjacency matrix [32].
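
The NOTEARS acyclicity penalty itself is simple to state; a minimal sketch is shown below, where W is a candidate weighted adjacency matrix and the penalty equals zero exactly when W encodes a DAG.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """Acyclicity penalty h(W) = tr(exp(W * W)) - d from the NOTEARS method.
    h(W) = 0 if and only if the weighted adjacency matrix W corresponds to a DAG;
    in NOTEARS it is added (with a multiplier) to the model's fitting loss."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d     # W * W is the elementwise square
```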

Dropout (Neural Networks)

Dropout layers in neural networks aim to reduce overfitting due to noise. In dropout, a pre-defined proportion of nodes (the dropout rate, a hyperparameter) is dropped from the hidden layers and possibly from the input layer. “Dropping from the network” means that the inputs and outputs of these nodes are severed and thus these nodes no longer influence the prediction. The set of nodes to be dropped is selected at random in each epoch. After the epoch, the nodes are restored [33].

Dropout layers have properties both from regularization as well as from ensemble learning. Clearly, they are similar to regularization, because they constrain the network architecture by dropping some nodes.

The ensemble perspective can be explained as follows. One way to reduce the risk of overfitting in a neural network would be to build an ensemble of neural networks, i.e. multiple neural network models with different parameterizations. Given the high cost of training neural networks, this approach is impractical. Instead, dropout re-configures the network temporarily, which means that the network being trained in each epoch has a (slightly) different architecture. Over the training epochs, a range of different network architectures are explored.

An alternative explanation of dropout is that the potentially high number of parameters a network has, over possibly many layers, makes the network susceptible to co-adaptation, where multiple parts (sets of nodes) of the network are optimized so that some parts correct for errors made by other parts. This co-correction allows the network to fit noise easily, which is undesirable. Dropping nodes out of the network at random breaks these co-adaptation patterns.

Dropout can be applied to the input layer as well. In this case, the network temporarily ignores some of the input features. This is similar to introducing noise into the data to make the model more robust.

When a neural network model is used to make predictions, all nodes are used, i.e., no dropout is applied at prediction time. Since the network was trained with some of the nodes missing, the learned weights may be too large when all nodes are active. To correct for this, the weights are scaled down at prediction time in proportion to the dropout rate (i.e., multiplied by the retention probability).
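A minimal sketch of this classic dropout scheme (random masking during training, scaling by the retention probability at prediction time) is shown below; the layer size and dropout rate are arbitrary, and the scaling is applied to the activations, which has the same effect on the layer's output as scaling the outgoing weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_rate=0.5, training=True):
    """Classic dropout: during training each node is severed with probability
    drop_rate; at prediction time all nodes are kept and the output is scaled
    by the retention probability (1 - drop_rate)."""
    if training:
        keep_mask = rng.random(activations.shape) >= drop_rate
        return activations * keep_mask
    return activations * (1.0 - drop_rate)

hidden = rng.normal(size=(4, 8))              # activations of one hidden layer
print(dropout(hidden, 0.5, training=True))    # roughly half the nodes zeroed out
print(dropout(hidden, 0.5, training=False))   # all nodes kept, scaled down
```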

Why Regularized Models and Other Predictive Modeling Should Not Be Used For Reliable Causal Modeling

Regularization, with the exception of penalizing the precision (partial correlation) matrix, has a profound impact on a modeling technique’s ability to condition on variables.

Figure 12 depicts a dataset with two variables A and B and an outcome (target) T. The target T is binary; points in red correspond to a positive outcome and points in black to a negative outcome. The data-generating causal relationships are A → T and A → B. This means that B is independent of T given A; in other words, A alone is sufficient to classify the instances, and B contains no additional information. From a causal perspective, the association of B with the target is confounded by A (their common cause). Any valid causal method must be able to differentiate correlation from causation, i.e., determine which correlations are causal and which are confounded. This does not happen in the example: the decision surface (orange line) of the (logistic) ridge regression gives the same weight to both A and B. Ridge regression is not designed to recognize that B and T are conditionally independent given (conditioned on) A. Contrast this with the decision surface (blue line) produced by unpenalized logistic regression, which assigns almost zero weight to the confounded variable (B) and nearly all the weight to the true cause (A), since logistic regression is capable of correctly conditioning on any set of confounders (and thus of correctly estimating direct causal effects).

Fig. 12
A dual-line and a dual scatterplot of B versus A. The lines are logistic and ridge regressions. The line logistic is vertical at 0. The line ridge is diagonally left. The 2 sets of plots are clustered in the top right and bottom left corners.

Illustration of the ability and inability to condition on a variable

Ridge regression is not the only method that has this problem. The maximal-margin decision boundary that an SVM would select is similar to the orange line; and most classifiers, including modern regularized regression methods, principal component analysis and other classical dimensionality reduction methods, as well as all predictive modeling without causal properties, will make similar errors.
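The phenomenon in Fig. 12 is easy to reproduce in simulation. The sketch below uses a made-up data-generating process (A causes both B and T) and fits a heavily penalized and an effectively unpenalized logistic regression with scikit-learn; with sufficient samples, the near-unpenalized fit typically concentrates the weight on the true cause A, while the ridge fit spreads it across A and the confounded B. Exact numbers depend on the seed and sample size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
A = rng.normal(size=n)                                 # true cause of T
B = A + 0.5 * rng.normal(size=n)                       # confounded correlate (A -> B)
T = (A + 0.5 * rng.normal(size=n) > 0).astype(int)     # outcome driven only by A
X = np.column_stack([A, B])

ridge = LogisticRegression(penalty="l2", C=0.01, max_iter=1000).fit(X, T)  # strong penalty
unpen = LogisticRegression(penalty="l2", C=1e6, max_iter=1000).fit(X, T)   # effectively unpenalized

print("ridge coefficients:       ", ridge.coef_)   # weight spread over A and B
print("unpenalized coefficients: ", unpen.coef_)   # weight concentrated on A
```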

Figure 13 illustrates another example where Lasso-penalized regression (and other non-causal techniques) will fail to correctly condition on variables. In this example, we have seven variables: A, B, …, E, S, and the target variable T. A, B, …, E are direct causes of T. Variable S, which is not causal for T (but is confounded with it via A, …, E), synthesizes information from the causal variables A, B, …, E. In this setup, S and T are independent conditioned on A, B, …, E. However, since S synthesizes information from A, B, …, E, it can contain more information about T than any proper subset of A, B, …, E. Thus, when building a predictive model for T, a penalized model such as Lasso can prefer a set of predictor variables that contains S over the correct set A, B, …, E. In contrast, unpenalized logistic regression (and any sound causal algorithm) will correctly identify that S is independent of T given the other variables and will assign a zero coefficient to S given the true causes.

Fig. 13
An illustration represents the blocks A, B, C, D, and E are mapped to the blocks S and T.

Simple example where a confounded correlate synthesizes information from multiple true causes

For large-scale benchmark studies comparing all modern predictive modeling algorithms with causal algorithms, see refs. [34, 35]. These references also explain why various additional non-causal methods fail to model causality.

The broader implication is that, because discovering causal structure and estimating effects require sound conditional independence tests and conditional association estimation, regularized regression and other purely predictive methods cannot be used for causal structure discovery or causal effect estimation, even for simple questions (e.g., direct causal effect estimation) and even when the complete set of confounders is measured and included in the model. Although unpenalized regression can be used for causal effect estimation, note that this has to be guided by knowledge of the causal structure (or elicitation of it using complex causal modeling algorithms).

See chapter “Foundations of Causal ML” for more details and best practices.

Feature Selection and Dimensionality Reduction

Variable selection (aka feature selection) for predictive modeling has received considerable attention during the last three decades in a variety of data science fields [36, 37]. Feature selection and dimensionality reduction are the techniques of choice for taming high dimensionality in diverse big data applications. Intuitively, feature selection for prediction aims to select a subset of variables for constructing a diagnostic or predictive model for a given classification or regression task. Ideally, this selected set should include all variables with unique information and discard variables whose information is subsumed by the selected set (since they add no information to the classifier). Dimensionality reduction maps the original data onto a smaller number of dimensions so that fitting models is faster, less prone to overfitting, and less sample intensive (all problems created by high dimensionality and collectively known as the “curse of dimensionality”).

Feature Selection

Key concepts of the theory of feature selection were introduced in chapter “Foundations and Properties of AI/ML Systems”. Here we will extend that material with a refinement of the types of feature selection problems typically addressed in practice, their relative complexity, examples, and algorithms that are used for feature selection.

Motivation and Standard Feature Selection Problem

In practice, reducing the number of features in clinical prediction models can, first, reduce costs and increase ease of deployment. Second, a more parsimonious model can be easier to interpret. Third, identifying the most impactful features sometimes helps gain an understanding of the underlying process that generated the data. Finally, many classifiers do not perform well with very high-dimensional data: they may overfit, exhibit excessive compute times, or even fail to fit models.

Feature selection can be approached from three high-level theoretical perspectives. (1) The first is that of overfitting/underfitting. In high-dimensional datasets, feature selection can reduce overfitting, but in lower dimensions, if not conducted properly, it can introduce either overfitting or underfitting (though not both at the same time). It can underfit relative to a model that has more features (i.e., the resulting model does not have enough capacity); and it can overfit if it is overly influenced by random variations in the data, such that a model with the same number of features could perform worse on the training set but better on a test set. (2) Feature selection in moderate or low dimensions can be used specifically to produce a parsimonious (and thus potentially more interpretable and practical) model, but this may induce underfitting if the feature selection is not optimal. (3) Suboptimal feature selection can introduce instability when the selection of features is influenced by random perturbations in the data. To what extent such instability affects predictivity depends on whether the unstable features carry the same information about the target. It is therefore important to deploy feature selection methods that are both theoretically sound and empirically strong.

The standard feature selection problem (chapter “Foundations and Properties of AI/ML Systems”) is typically defined as:

Find the smallest set of variables S in the training data such that the predictive accuracy of the best classifier that can be fit for response T from the data is maximized.

A commonly used classification of feature selection considers three primary categories of methods: wrapper methods, filter methods and embedded methods [36]. We will further elaborate this taxonomy by considering whether the feature selection methods have a strong theoretical framework and properties (formal) vs not (heuristic).

Heuristic Algorithms

Wrapper methods. These methods conduct a heuristic search in the space of all subsets of the variables to assess each subset’s predictive information with respect to the selected predictive model. Wrapper methods treat the classifier as a black box: they fit a model to the data with a particular feature set and evaluate it, then build a model with a different feature set and re-evaluate the new model, continuing until some stopping criterion is met. This approach is computationally expensive, often overfits and underperforms, and its performance varies greatly across implementations.

a. Stepwise regression. Historically, it has been used broadly for statistical regression modeling, but its use has declined because (a) standard implementations have been shown to overfit and to be unstable [38], and (b) regularized regression has substantially alleviated the need for it.

These methods can start from an empty model and use a forward single-variable inclusion step, iterated with a backward single-variable elimination step, until no improvement can be made. Backward elimination starts with a full model (a model containing all predictor variables) and eliminates one feature at a time, removing the feature with the least statistical significance (or, equivalently, the one that improves the objective criterion the least). Stepwise feature selection stops when all features in the model achieve a certain level of significance (e.g., a p-value of 0.05) and no other significant feature can be added. Alternatively, a penalized objective criterion (e.g., AIC, described later) can be used, and the stepwise feature selection process terminates when this penalized objective is maximized.

b. SVM-RFE (recursive feature elimination) is an example of a more recent (and surprisingly powerful) wrapper method. SVM-RFE builds an SVM model and examines the contribution of the features in the model. In each iteration, the 50% of features with the lowest importance are eliminated, a model is re-fit, and its accuracy is evaluated in a test set. The process iterates recursively until predictivity drops. Due to the SVM’s resilience to high dimensions, SVM-RFE can produce a stable, highly accurate, and non-overfitted set of features. What it typically lacks is minimality (although in practice it often selects parsimonious models). A code sketch follows this list.

c. Bagging for feature selection. In an attempt to improve the stability of feature selection methods and reduce their bias, bagging can be used. Models are constructed on bootstrapped versions of the training data using a base feature selection technique, and features that appear in some percentage (e.g., 50%) of the models are selected. The tendency of a feature selection method to select a feature because it randomly appears better than another can be mitigated by using multiple samples.
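As a concrete illustration of SVM-RFE, the sketch below uses scikit-learn's RFE wrapper around a linear SVM with a halving schedule (step=0.5 removes the 50% of features with the smallest weights in each iteration). For simplicity the number of features to retain is fixed in advance; the cross-validated variant (RFECV) approximates the "stop when predictivity drops" rule described above. The synthetic dataset is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Made-up high-dimensional classification data.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# step=0.5: drop the 50% of features with the lowest |SVM weight| per iteration.
svm_rfe = RFE(estimator=LinearSVC(C=1.0, dual=False, max_iter=5000),
              n_features_to_select=10, step=0.5)
svm_rfe.fit(X, y)
print(svm_rfe.get_support(indices=True))   # indices of the retained features
```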

Univariate Selection Filtering. Filter methods select (or pre-select) features before the learning algorithm is applied to the data. Unlike wrapper methods, filter methods do not use a predictive model for evaluating the features, but rather use statistical criteria to select features. The advantage of this approach is that it is agnostic of the learning algorithm and can be much more computationally efficient than fitting the model to the data.

Univariate variable screening (aka univariate association screening, UAS, or univariate association filtering, UAF) is the most commonly used filter method. It pre-selects the set of variables that have a significant association with the outcome at a predefined significance level; or, in the case of UAF, variables are ranked based on their univariate association with the outcome and the top k variables are selected. Any common measure of association (e.g., correlation, signal-to-noise ratio, G2, etc.) can be used. The rationale for univariate variable screening is that variables without a univariate association with the outcome are (often) not relevant to the outcome. A key advantage of variable screening is the saving in computational effort, since the dimensionality of the problem can be reduced early in the analysis.
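A minimal sketch of univariate association filtering with scikit-learn is shown below; it ranks features by an ANOVA F statistic (one common choice of association measure) and keeps the top k. The synthetic data and the value of k are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Made-up high-dimensional data with a few informative variables.
X, y = make_classification(n_samples=300, n_features=1000,
                           n_informative=15, random_state=0)

# Rank every variable by its univariate association with the outcome
# (here the ANOVA F statistic) and keep the top k.
uaf = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_reduced = uaf.transform(X)
print(uaf.get_support(indices=True))   # indices of the retained variables
```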

Embedded methods. In embedded methods, the modeling technique itself incorporates a mechanism that reduces the influence of irrelevant variables. Examples of this approach are regularization techniques such as SVMs, Lasso, and similar methods. These methods, and the mechanism by which they eliminate features, are described in section “Regularization”.

Feature Selector Algorithms Based on Formal Theories of Relevancy

The previously described feature selection methods are essentially heuristic because they do not utilize a principled framework for optimal feature selection. In the remainder we will discuss formal feature selection frameworks (i.e., Kohavi-John and Markov Boundary) and will describe algorithms that conduct provably optimal feature selection.

Kohavi-John and Markov Boundary framework for Standard predictive feature selection problem. Kohavi and John [37] decompose the standard feature selection problem as follows:

  • A feature X is strongly relevant if removal of X alone will result in performance deterioration of an optimal Bayes classifier built on all data.

  • A feature X is weakly relevant if it is not strongly relevant and there exists a subset of features, S, such that the performance of the optimal Bayes classifier fit with S is worse than the performance using S ∪ {X}.

  • A feature is irrelevant if it is not strongly or weakly relevant.

Intuitively, choosing the strongly relevant features provides the minimal set of features with maximum information content and thus solves the standard feature selection problem (since a powerful classifier in the small sample, or the optimal Bayes classifier in the large sample, will achieve maximum predictivity with this set).

Recall from chapter “Foundations and Properties of AI/ML Systems” that a set S is the Markov boundary of variable T (S = MB(T)) if S renders T independent of every subset of the remaining variables given S, and S is minimal (it cannot be reduced without losing this conditional independence property). This is MB(T) in the probabilistic sense.

Tsamardinos and Aliferis connected the Kohavi-John relevancy concepts with BNs and Markov boundaries as follows: in faithful distributions (see chapters “Foundations and Properties of AI/ML Systems” and “Foundations of Causal ML”) there is a BN representing the distribution and mapping the dependencies and independencies so that:

  1. The strongly relevant features to T are the members of MB(T).

  2. Weakly relevant features are variables, not in MB(T), that have a path to T.

  3. Irrelevant features are not in MB(T) and do not have a path to T.

Thus, in faithful distributions, the Markov boundary MB(T) is the solution to the standard feature selection problem.

Local causally augmented feature selection problem and causal Markov boundary. In faithful distributions with causal sufficiency (see chapter “Foundations of Causal ML”) there is a causal BN that is consistent with the data-generating process and can be inferred from data. In it, the strongly relevant features are the members of MB(T), and they comprise the solution to the local causally augmented feature selection problem of finding:

  1. The direct causes of T

  2. The direct effects of T

  3. The direct causes of direct effects of T

Thus, in faithful distributions with causal sufficiency, the causal (graphical) MB(T) coincides with the probabilistic MB(T). Inducing the probabilistic MB(T) then:

  1. Solves the standard predictive feature selection problem, and

  2. Solves the local causally augmented feature selection problem.

Equivalency-augmented feature selection problem and Markov boundary equivalence class. In faithful distributions the MB(T) exists and is unique [39]. However, in non-faithful distributions where target information equivalences exist (“TIE distributions”) we may have more than one MB(T) [40]. The number of Markov boundaries can be exponential in the number of variables, and in empirical tests Statnikov and Aliferis extracted tens of thousands of Markov boundaries before terminating the experiments [40].

In TIE distributions:

  1. The Kohavi-John definitions of relevancy break down, since there are no Kohavi-John strongly relevant features any more, only weakly relevant and irrelevant ones. This is because if S1 and S2 are both in the MB equivalence class {MBi(T)}, then T is independent of S1 given S2 and independent of S2 given S1.

  2. The 1-to-1 relationship between the probabilistic and the graphical (causal) MB(T) breaks down. A variable can be a member of some MBi(T) without having a direct causal or causal spouse relationship with T.

  3. The standard predictive feature selection problem is solved by the smallest member of the equivalence class {MBi(T)}.

  4. The equivalency-augmented feature selection problem is to find the equivalence class of all probabilistic MBi(T).

These feature selection problem types can be further subdivided as shown in Fig. 14. The problem types depicted were chosen on the grounds that (a) applied ML papers often aim to solve them, and (b) proofs of feature selection soundness often use these subtypes as the goal. They are organized from simpler (lower complexity or hardness) at the bottom to harder toward the top.

Fig. 14
A list of 10 feature selection problem types is subdivided into 4 levels from easiest to hardest. 1 and 2 are easiest, 3 to 6 are easy, 7 and 8 are hard, and 9 and 10 are the hardest.

A taxonomy of progressively harder feature selection problems. From bottom to top of the figure, ten FS problem types of increasing complexity are depicted. Problems 1–5 are addressed with simple association criteria and can be tackled with regularized algorithms. Problems 6–7 correspond to the standard feature selection problem and require specialized algorithms. Problem 8 corresponds to the causally-extended standard feature selection problem and requires specialized algorithms. Problems 9–10 correspond to the causally-extended standard feature selection problem with equivalence classes and require specialized algorithms.

We will further elaborate on the nature of these problem types by explaining subtypes 3 and 10 in the next two figures. These explanations along with the material of chapter “Foundations and Properties of AI/ML Systems” and especially the definitions of standard, causal and equivalency class problems, should make obvious their relevance to many real-life tasks (Figs. 15 and 16).

Fig. 15
An illustration represents the blocks S 1 ellipsis S k are mapped to a block T. A block L 1 ellipsis L n.

Demonstration of FS Problem type 3 in the feature selection complexity taxonomy. The response variable is depicted in black. Green variables are the ones we seek to retain and red the ones to be discarded. Variables starting with S depict strongly relevant variables (i.e., cannot be discarded without loss of information—see chapter “Foundations and Properties of AI/ML Systems”). Variables starting with I depict irrelevant variables (i.e., can be discarded without loss of information—see chapter “Foundations and Properties of AI/ML Systems”). In domains with the data structure depicted and Faithful distributions, this problem is easily solvable by selecting all and only variables with non-zero marginal (i.e. univariate) association with the response. Regularized methods typically give zero weights to such irrelevant variables under these conditions

Fig. 16
A block flow chart of the feature selection problem type 10. R 1 ellipsis R k is followed by M D C 1 ellipsis M D C k. The chart ends with M D E prime m, M D E n, and M S p prime 1.

Demonstration of feature selection problem type 10 in the FS complexity taxonomy. The response variable is depicted in black. Plain green variables are the ones we seek to retain and differentiate. Red ones are to be differentiated (deep vs. pale red) and discarded from predictive modeling. Variables starting with M depict Markov boundary variables (they cannot be eliminated without loss of predictive accuracy). An example MB is {MDC1, …, MDCk, MDEm, …, MDEn, MSP1}. Variables with names starting with MDC are members of the MB and direct causes of T. Variables with names starting with MDE are members of the MB and direct effects of T. Variables with names starting with MSP are members of the MB and direct causes of direct effects of T (i.e., “spouses” of T). Variables starting with I depict irrelevant variables (i.e., they have no information about T and can be discarded without loss of information). Variables starting with R depict redundant variables (i.e., they have information about T but should be discarded, without loss of information, if the most compact models are needed). Variables with the same base name, with and without the “prime” mark, are equivalent in information content with respect to the response T (for simplicity we only consider context-free equivalence [40]). A variable or variable set can have a large number of equivalent variables/sets (not depicted here for simplicity). By substituting equivalent variables, we obtain equivalent Markov boundaries. For example, the MB {MDC1, …, MDCk, MDEm, …, MDEn, MSP1} is equivalent to {MDC1, …, MDC’k, MDEm, …, MDEn, MSP1}, to {MDC1, …, MDCk, MDEm, …, MDEn, MSP’1}, to {MDC1, …, MDCk, MDEm, …, MDE’n, MSP1}, and so on. An exponentially large number of equivalent MBs can exist in a distribution. The feature selection/causal FS problem depicted requires highly specialized algorithms and cannot be solved with simple regularization or variable filtering

Modern Markov Boundary Algorithms—Faithful Distributions

We do not cover MB algorithms of only historical significance; for a review see [34]. Modern MB algorithms with guaranteed correctness, sample efficiency, and excellent empirical performance are instantiations of a broad family called GLL and include several HITON variants (HITON-MB, HITON-PC, interleaved or not, with symmetry correction or not, with additional wrapping or not, etc.), as well as the MMMB and MMPC algorithms. They can be instantiated for recovery of full Markov boundaries or of direct causal edges only. The IAMB family also exhibits sound and computationally efficient behavior on real data but is not as sample efficient.

These algorithms have been extensively tested and compared to all major feature selectors, including wrappers, UAF, SVM-RFE, Lasso, LARS, LARS-EN, etc. See the benchmarks in [34, 41] for experiments covering in total more than 120 algorithms, more than 270 datasets/tasks, and multiple loss functions and data types. Consistent with the theory of feature selection, the Markov boundary algorithms in these benchmarks return the smallest feature sets and lead to classifiers with maximum predictivity. They also achieve near-perfect empirical discrimination among strongly relevant, weakly relevant, and irrelevant features, and are causally consistent.

Modern Markov Boundary Algorithms—TIE Distributions

Statnikov et al. [40, 42] invented the TIE* and iTIE* algorithm families, which can extract the full equivalence class of Markov boundaries. The algorithms are correct and efficient, and in typical usage they utilize GLL algorithms as subroutines. In [40, 42], extensive experiments are reported on dozens of real and simulated datasets over multiple domains, including comparisons with all available comparator methods for feature-selection equivalence-class discovery. According to these benchmarks, MB equivalence classes are common, their sizes vary across domains, and TIE* algorithms recover them with great accuracy.

We next provide a summary table (Table 3) with properties of widely used feature selection approaches, comparing their ability to tackle feature selection problems 1–10.

Table 3 Comparative capabilities of current FS strategies and algorithms, including simple univariate filtering, regularized regression, SVM-RFE, and Markov boundary methods, across the FS complexity categories of Fig. 14. “+” = can solve, “−” = cannot solve

We close this section with the method label for feature selection methods.

Method label: feature selection (FS)

Main Use

 • Finding a small set of variables that has all information about the response (FS)

Context of use

 • Can be used to reduce the number of inputs to a classifier or regressor model so that:

  – Overfitting is avoided

  – Learning is faster

  – Deployment of a model is easier, faster, cheaper

  – The ML models are more understandable

 • Causal FS methods also reveal local causal structure around the response variable

 • Some learning algorithms have embedded FS (for example Decision Tree learning and Random Forests). In such cases adding formal FS algorithms often further enhances their performance

Secondary use

 • Data simplification, compression and visualization

 • Clustering and subgrouping based on FS transforms of the data

Pitfalls

Pitfall 3.5.1.2. FS methods that are not designed for causal discovery cannot be interpreted causally and any estimates of causal effects will be biased

Pitfall 3.5.1.3. FS itself can be overfitted to the data if the model selection protocols used are not well designed (see chapter on overfitting)

Pitfall 3.5.1.4. FS, like any other component of analysis, needs to be tailored to the data and problem. Using a default FS everywhere may lead to suboptimal results

Pitfall 3.5.1.5. The fact that a classifier has embedded FS, DR, or regularization does not mean that it cannot benefit from explicit FS

Principle of operation

Highly-dependent on the specific FS method

 • Markov boundary FS is based on Bayesian Network theory and is additionally concordant with Kohavi-John FS theory in faithful distributions. In non-faithful distributions MB FS has strong advantages over Kohavi-John FS

 • RFE-SVM is based on fitting SVM models and performing wrapping over models with progressively smaller feature sets chosen based on the SVM weights

 • Univariate association filtering (UAF) rank-orders variables according to association with the response and chooses the top k variables

 • Wrapping is heuristic search over the space of all possible subsets, each one evaluated for a specific classifier and loss function of interest

 • Stepwise regression procedures originated in statistics and examine a series of regression models by iteratively including and discarding variables according to inclusion and exclusion criteria while conducting tests of statistical significance of model improvement at each step

Theoretical properties and empirical evidence

 • Markov boundary FS accurately solves the standard FS problem by finding the smallest subset of variables that has all the information in the data about the response. The worst-case computational complexity of inferring the MB is exponential in the number of variables, but the real-life running time of best-of-class algorithms on common data is very manageable. In faithful distributions with causal sufficiency, the Markov boundary solves a causal version of the standard FS problem: it finds the direct causes, direct effects, and direct causes of direct effects of the response. Equivalence-class MB induction recovers the whole set of MBs in the data. Excellent empirical performance in most domains

 • RFE-SVM is not guaranteed to find the optimal FS solution and is not causally valid. Computational complexity is low order polynomial. It is very robust to small sample size and has very good performance in many domains

 • Univariate association filtering (UAF) is not guaranteed to give the smallest set of variables with all information about the response. The top-ranked UAF variables need not be causally related to the response. Computational complexity is very small and sample efficiency is high

 • Wrapping is learner-specific, computationally intensive, and tends to overfit. It is typically not suitable for causal discovery

 • Stepwise regression procedures are relatively fast but do not guarantee correct results and tend to overfit

Best practices

Best practice 3.5.1.1. Markov boundary procedures are the first choice for FS when a modest sample size (or more) is available, regardless of how high the dimensionality is. They are particularly appropriate when a causal interpretation of findings is desired, when we wish to have consistent and coherent predictive and valid causal models, and when we wish to find equivalence classes of optimal feature sets or optimal classifiers

Best practice 3.5.1.2. SVM-RFE is a first choice for very small sample sizes and high-dimensional data when causal conclusions are not sought

Best practice 3.5.1.3. UAF is common in genomics. Contrary to common over-interpretation by some researchers, the top-ranked variables are not strongly suggestive of biologically/mechanistically/causally important or even valid factors. UAF has a place, however, when sample sizes are extremely small

Best practice 3.5.1.4. Generic wrapping and stepwise procedures should be (and are increasingly) retired from practice

References

[34, 36, 37, 38, 39, 40, 41, 42]

Dimensionality Reduction

As we mentioned earlier, the main objective of dimensionality reduction is to transform a high-dimensional space into a lower-dimensional representation. While feature selection achieves a lower dimensional space by keeping a subset of the original features without modifying the actual features, dimensionality reduction combines several of the original features into new features.

Dimensionality reduction techniques can be categorized as supervised vs unsupervised and linear vs nonlinear.

  1. Supervised versus unsupervised. Supervised techniques can use supervising information (such as the outcome) to guide the dimensionality reduction. For example, linear discriminant analysis uses the class label to help project a multi-dimensional feature space into a lower-dimensional representation that maximally distinguishes among the classes. Unsupervised dimensionality reduction does not use outcome information. In this section, we focus on unsupervised dimensionality reduction; high-dimensional classification and regression are handled in other parts of this chapter.

  2. Linear versus nonlinear. The transformation that reduces a high-dimensional space into a lower-dimensional representation can be linear or nonlinear. New features created by linear dimensionality reduction techniques are linear combinations of the original features, while nonlinear dimensionality reduction uses nonlinear combinations. For example, autoencoders are arbitrarily complex nonlinear transformations (they may increase or reduce the dimensionality), while classic principal component analysis (PCA) is linear. Nonlinear dimensionality reduction is also known as manifold learning ([43], Chapter 20). Unsupervised dimensionality reduction has a vast literature; here, we focus on two classical approaches. For other popular dimensionality reduction techniques, the reader is referred to [44].

Principal Component Analysis (PCA)

Given a data matrix X, with columns as variables and rows as observations, find a matrix U = [u1, u2, …, up] such that (i) the ui’s are orthogonal to each other and (ii) each subsequent principal component (or component, for short) ui captures a maximal portion of the remaining variance.

Figure 17 depicts an illustration of PCA. The left pane shows the original Gaussian data. The variance of the data along both dimensions is approximately equal. PCA transforms this space into a new representation, the principal component (PC) space. Data in the PC space are depicted in the right pane; the horizontal axis corresponds to the first PC and the vertical axis to the second. As we can see, the variance (and thus the information content) of the data along the first PC is much larger than along the second. If we had to create a lower-dimensional (i.e., one-dimensional) representation of the original data, we could choose the first PC as this new dimension, as it would capture much more variability than any of the original variables. In fact, among all linear combinations of the original variables, the first PC captures the highest amount of variance (under the constraint that the total variance of the original and the transformed space must be equal).

Fig. 17
A set of 2 scatterplot graphs of original data and P C space. A. X 2 versus X 1. The plots are scattered diagonally right. B. Comp 2 versus Comp 1. The plots are distributed horizontally at the center.

Illustration of PCA. The left pane depicts a two-dimensional synthetic dataset. The blue and orange lines are the axes of the transformed space. The right pane depicts the same data set in the principal component space. The horizontal axis is the first and the vertical axis is the second principal component. The variance of the data is much higher along the first principal component than along the second

Properties of PCA
  1. The components computed by PCA are linear combinations of the original features. Thus PCA is a linear dimensionality reduction method.

  2. Each vector ui is an eigenvector of XᵀX for centered X.

  3. The ith PC has variance λi, where λi is the eigenvalue corresponding to the ith eigenvector.

  4. The sum of the eigenvalues, λ1 + λ2 + … + λp, equals the total variance.
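The sketch below performs PCA "by hand" with NumPy on made-up two-dimensional Gaussian data and numerically checks properties 2–4: the PCs are eigenvectors of the covariance of the centered data, their variances equal the corresponding eigenvalues, and the eigenvalues sum to the total variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.4], [1.4, 1.0]], size=500)

Xc = X - X.mean(axis=0)                      # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigen-decomposition (property 2)
order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
eigvals, U = eigvals[order], eigvecs[:, order]

scores = Xc @ U                              # the data expressed in the PC space
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))   # property 3
print(np.isclose(eigvals.sum(), np.trace(cov)))           # property 4 (total variance)
```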

Exploratory Factor Analysis

The motivation behind factor analysis is that a (relatively) small number of unobservable factors can explain the observed variables. For example, “intelligence” is a quantity that is not directly observable, so it is measured through a battery of tests that are believed to be related to intelligence. In this example, the test results are the observations and intelligence is the latent factor.

Given an m × n observation matrix X consisting of n observations (columns) and m features (rows), suppose we wish to explain these observations by p factors. The factor model then has the form

$$ X-M=\varLambda F+\varepsilon $$

where M is the mean matrix containing the row means of X in its rows, Λ is the (m × p) loadings matrix, F is the (p × n) factor matrix, and ε is the error matrix (m × n) with mean 0 and finite variance.

Assumptions. We assume that

  1. 1.

    F and ε are independent

  2. 2.

    The factors in F are independent of each other

  3. 3.

    F is centered.

PCA can be viewed as a special case of factor analysis where Λ is orthogonal.
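A minimal sketch of exploratory factor analysis using scikit-learn is shown below. Note that scikit-learn's convention puts observations in rows (the transpose of the m × n notation above); the simulated loadings and factors are illustrative only.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 2))                       # latent factors (n x p)
Lambda = rng.normal(size=(2, 6))                    # made-up loadings (p x m)
X = F @ Lambda + 0.3 * rng.normal(size=(300, 6))    # observed variables plus noise

fa = FactorAnalysis(n_components=2).fit(X)
loadings = fa.components_        # estimated loadings (factors x observed variables)
factor_scores = fa.transform(X)  # estimated factor values for each subject
print(loadings.shape, factor_scores.shape)
```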

Method label: dimensionality reduction (DR)

Main Use

 • Computes a lower dimensional representation of a high-dimensional features space

Context of use

 • Lower dimensional mapping can help with visualization

 • Can be used to reduce the number of inputs to a classifier or regressor model so that:

  – Overfitting is avoided

  – Learning is faster

 • May reveal structural properties of the domain

 • Some learning algorithms have embedded DR (for example deep learning and other ANNs)

 • Factor analysis: Estimate the values of unobserved factors through multiple observed variables

Secondary use

 • Data simplification, compression and visualization

 • Clustering and subgrouping based on DR transforms of the data

Pitfalls

Pitfall 3.5.2.1. Some nonlinear methods can be very computation intensive

Pitfall 3.5.2.2. DR does not reduce the number of inputs that need to be measured in order to deploy a model. Many expensive, dangerous (to measure), or unnecessary inputs that FS would discard still have to be measured for model deployment when DR is used

Pitfall 3.5.2.3. DR methods that are not designed for causal discovery cannot be interpreted causally and any estimates of causal effects will be biased

Pitfall 3.5.2.4. DR itself can be overfitted to the data if model selection protocols using it are not well-designed (see chapter “Evaluation”)

Pitfall 3.5.2.5. DR, like any other component of analysis, needs to be tailored to the data and problem. Using a default DR everywhere may lead to suboptimal results

Pitfall 3.5.2.6. The fact that a classifier has embedded DR does not mean that it cannot benefit from FS or regularization

Principle of operation

 • Greatly differs by the method

 • Generally, variables in the transformed representation are required to have some sort of independence of each other, and they collectively capture a maximal amount of information across the full distribution

 • Embedded DR constructs lower-dimensional transforms of the original data in a way that is consistent with the inductive bias of the embedding learner

Theoretical properties and empirical evidence

PCA:

 • The number of PCs does not exceed the sample size

 • PCs are uncorrelated with (orthogonal to) one another

 • PCs cannot be interpreted as causal factors and loadings cannot be interpreted as causal effect sizes

EFA:

 • It is a probabilistic model

 • PCA is a special case of EFA when loadings vectors are orthogonal

 • Causal interpretation of hidden factor effects on measured variables under strong assumptions

Best practices

Best practice 3.5.2.1. When eliminating expensive, dangerous and unnecessary inputs by predictor models is beneficial, then use FS instead of DR

Best practice 3.5.2.2. For prediction of specific outcomes, FS targeting these outcomes should be the methods of choice

Best practice 3.5.2.3. Using a top-2 PC data transform is a staple of data visualization for exploratory purposes

Best practice 3.5.2.4. Neither PCA nor EFA should be over-interpreted causally, predictively, or otherwise

Best practice 3.5.2.5. PCA for classification can be overfitted, so it needs to be treated like any other data operation by the model selection and error estimation protocol

References

 • For an overview, see chapter 20 in Murphy KP. Probabilistic Machine Learning: An Introduction. MIT Press, 2022

 • Hinton, G.E. and Roweis, S., 2002. Stochastic neighbor embedding. Advances in neural information processing systems, 15

Time-to-Event Outcomes

Survival data (aka time-to-event data) describe the distribution of time until an event occurs. This event can be the failure of a device, the incidence of a disease, the recurrence of a disease, an adverse event, or death. Time is the number of days, weeks, months, years, etc. from the beginning of follow-up until the event; alternatively, it can be calendar time, such as the subject’s age at the time of the event. We tend to think of events as negative, such as death (after all, the field of survival analysis is named after studying survival time, the time to death), but the event can also be positive, such as discharge from hospital. In the following, we use the terms “survival” and “time-to-event” interchangeably as long as the context clarifies the use, and we also use the terms “event”, “failure”, and “death” interchangeably, unless this causes confusion.

Analytic tasks involving a time-to-event outcome are analogous to those for other outcome types. The main tasks are (1) estimating the time-to-event (or the survival probability distribution S(t)); (2) testing whether two time-to-event distributions are statistically different; and (3) assessing whether one or more covariates (e.g., exposures) significantly affect the survival distribution.

The need for survival analysis. At first glance, time-to-event could be viewed as a continuous quantity and modeled using one of the many known non-negative distributions; however, this approach breaks down for the following reasons. First, some subjects never experience the event of interest within the practical time frame of the study. Discarding these patients (with unknown time-to-event) loses information, because we know that these patients did not experience an event until the end of the study; in other words, their time-to-event is not missing completely, it has been bounded from below. Second, some subjects are lost to follow-up before the study ends. Again, discarding such patients because their time-to-event is missing discards useful information (namely, that they had not experienced an event up to the time they were lost to follow-up). Both of these situations are referred to as right censoring (see the terminology section below). Third, in a study where the outcome is not death, many enrollees may have already experienced the event before enrollment; if this is allowed, cases with time-to-event = 0 can have high probability. Moreover, parametric distributions handle the general properties of time-to-event data poorly, for two further reasons. Fourth, outliers (extreme survivors) are common, and they can become influential points for some distributions. Fifth, many parametric distributions have parameters that relate mathematically to their moments (mean, variance); censored data can make the estimation of the moments on which the model parameters depend difficult, thus compromising the model.

Pitfall 3.6.1

In most practical settings, it is a significant pitfall to model time-to-event/survival using ordinary predictive modeling classification or regression.

Best Practice 3.6.1

When modeling time-to-event outcomes, specialized methods, such as the ones described in this section, should be used, at minimum as comparators with conventional techniques.

Terminology

Let T be a random variable with Ti denoting the time at which an event happened to subject i. Let f(t) denote the density of T and let F(t) denote the cumulative density of T. The cumulative density is referred to as the failure function and is defined as

$$ F(t)=\Pr\ \left(T\le t\right)={\int}_0^tf\left(\tau \right)\ d\tau . $$

The survival (or survivor) function is the complement of the failure function and is defined as the probability that a subject survives beyond a particular time t

$$ S(t)=\Pr \left(T>t\right)={\int}_t^{\infty }f\left(\tau \right) d\tau =1-F(t). $$

Properties of the survival function. The survival function is monotone non-increasing, equals 1 at time 0, and decreases to 0 as time approaches infinity [45].

Often, instead of the survival function, we model the instantaneous “probability” of an event. The hazard function is the instantaneous “probability” per unit time that the event occurs exactly at time t, given that the patient has survived at least until time t,

$$ h(t)=\underset{\varDelta t\to 0}{\lim}\frac{\Pr \left(t\le T<t+\varDelta t\ |\ T\ge t\right)}{\varDelta t} $$

Properties of the hazard function. The hazard function can be thought of as the “velocity” of the failure function, i.e., the rate of change in the failure function. Since the survival function is non-increasing, the failure function is non-decreasing and h(t) is non-negative. The hazard is not a true probability; it is a rate [45].

The cumulative hazard is

$$ H(t)={\int}_0^th\left(\tau \right)\ d\tau . $$

The hazard and survival functions are linked to each other through the following relationship [46]. By taking the derivative of ln S(t), we get

$$ \frac{d\ \ln\ S(t)\ }{dt}=\frac{dS(t)/ dt}{S(t)}=-\frac{f(t)}{S(t)}=-h(t), $$

which leads to

$$ S(t)=\exp \left(-H(t)\right) $$

Figure 18 shows the survival (left) and the hazard (right) functions for the diabetes dataset in [47]. The horizontal axis corresponds to follow-up time (in years). For visualization purposes, we show points (in grey) on the actual hazard “curve”; there is one point for every follow-up day. The hazard estimates can change frequently in any direction as long as they remain non-negative. To further improve interpretability, a smoothed version of the hazard curve is also presented in black. The survival curve is a non-increasing step function starting at 1 at time 0 and approaching 0 as time goes to infinity. It appears smooth in this figure because of the high resolution (daily) and the large sample size, but it is nonetheless a step function. Note that the survival function relates to the lack of an event (the probability of not having an event), while the hazard function relates to experiencing an event (the rate of having an event).

Fig. 18
A set of 2 line graphs. A. Survival function versus follow-up. The line descends from (0, 1.0) to (6, 0.93). B. Hazard versus follow-up. The line follows a concave downward trend. Data are estimated.

Illustration of the Survival and Hazard functions. The left panel shows the survival function, while the right panel shows the smoothed hazard function for the diabetes data set in [47]

Censoring

When a patient is lost to follow-up and is no longer observable, the time-to-event beyond the time of dropout cannot be observed. This is not a typical missing data problem, as it might first appear, because we have partial observations: the event did not occur while the subject was under observation. This partial observability is called censoring.

Left censoring happens when the event takes place before the subject enters observation. We know that the event has already occurred at time 0, but we do not know when. Right censoring happens when the event takes place after the subject is no longer observed. We know that the event did not take place during the observation period, but we do not know when, or whether, it occurred afterwards. Common reasons for right censoring are that the study ended, the subject was lost to follow-up, or the subject withdrew from the study. Finally, interval censoring brackets the time of the event between two time points: we know that the event did not take place before the first time point and that it had already occurred by the second time point.

Let C denote the time to censoring with density g() and cumulative density G(). With \( \tilde{T} \) denoting the true time-to-event, the subject’s follow-up time T is \( T=\min \left(C,\tilde{T}\right) \). Let δ denote the event type: δ =1 if an event took place (\( \tilde{T}\le C \)); and δ = 0 if the subject got censored (\( \tilde{T}>C \)).

Censoring is random if \( {\tilde{T}}_i \) is independent of Ci given Xi, where Xi is the covariate vector of observation i. Random censoring assumes that subjects who are censored at time t are similar, in terms of their survival experience, to the subjects remaining in the study. Independent censoring is a related concept: when a study has subgroups of interest, independent censoring is satisfied if censoring is random in all subgroups. Uninformative censoring happens when the distributions of Ci and \( {\tilde{T}}_i \) do not share parameters [46, 48].

Competing risks arise when we have multiple outcomes of interest and the occurrence of one outcome prevents us from observing another. As an example, consider heart disease and mortality as two outcomes of interest. If a patient dies (from a cause other than heart disease), we can no longer observe the patient’s time to heart disease. In this case, we may have a complete observation of the time-to-death, but we only have partial information about the time to heart disease: we only know that it is greater than the time-to-death.

Inference About Survival

In this section, we discuss methods to summarize the time-to-event distribution of a population. First, the time-to-event distribution can be summarized into a statistic (a single number) much in the same way as the mean or median summarizes aspects of a typical distribution. The fundamental difference is censoring: some subjects may not experience an event and thus their exact time-to-event is unknown. Next, we describe the time-to-event distribution as a function of time. We show methods to estimate the survival function and equivalently the cumulative hazard function. Finally, we present methods of constructing confidence intervals around the survival and cumulative hazard functions.

Summary Statistics of Survival

A concise way of describing the survival distribution is by presenting summary statistics. Commonly used summary statistics of typical statistical distributions include the mean, the standard deviation, and the median. In survival analysis, however, in the presence of censoring, it is desirable to account for the follow-up times when computing summary statistics. Below, in Table 4, we describe some of the commonly used survival statistics [45].

Table 4 Common statistics to summarize survival time distributions

Estimating the Survival Function

We present two estimators of the survival function: the Kaplan-Meier and the Nelson-Aalen estimator. They yield very similar results.

The Kaplan-Meier estimator is more commonly used for estimating survival itself, and this is the preferred method for exploring and visualizing time-to-event data.

The Nelson-Aalen estimator, on the other hand, estimates the cumulative hazard function, and is mostly utilized by other methods, such as the Cox Proportional Hazards model.

Kaplan-Meier (Product Limit) Estimator

Let the index j iterate over the distinct time points tj when an event took place. Let us assume that there are J such time points. The product limit formula is

$$ {\displaystyle \begin{array}{rl}\hat{S}\left({t}_j\right)=P\left(T>{t}_j\right)&=P\left(T>{t}_j\ |\ T>{t}_{j-1}\right)\,P\left(T>{t}_{j-1}\right)\\ {}&=\left[1-P\left(T={t}_j\ |\ T>{t}_{j-1}\right)\right]P\left(T>{t}_{j-1}\right)\\ {}&=\left(1-{h}_j\right)S\left({t}_{j-1}\right),\end{array}} $$

where hj is the hazard at time tj. Expanding this formula yields the Kaplan-Meier estimate

$$ \hat{S}\left({t}_j\right)={\prod}_j\left(1-{\hat{h}}_j\right)={\prod}_j\left(1-\frac{d_j}{n_j}\right), $$

where dj is the number of events and nj is the number of patients at risk at time tj.
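The product-limit formula translates directly into code. The sketch below is a bare-bones NumPy implementation operating on follow-up times and event indicators (1 = event observed, 0 = right-censored); the tiny example dataset is made up for illustration.

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit (Kaplan-Meier) estimate of the survival function."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    event_times = np.unique(time[event == 1])          # distinct event times t_j
    surv, s = [], 1.0
    for tj in event_times:
        n_j = np.sum(time >= tj)                       # number at risk at t_j
        d_j = np.sum((time == tj) & (event == 1))      # number of events at t_j
        s *= 1.0 - d_j / n_j                           # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

t = [2, 3, 3, 5, 6, 8, 9, 9, 12, 15]   # follow-up times (illustrative)
e = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]     # 1 = event, 0 = censored
times, S_hat = kaplan_meier(t, e)
print(np.column_stack([times, S_hat]))
```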

Nelson-Aalen Estimator

The Nelson-Aalen estimator estimates the cumulative hazard as

$$ \hat{H}(t)={\sum}_{j:{t}_j\le t}\frac{d_j}{n_j}. $$

The relationship between the cumulative hazard and the survival function can be used to estimate survival, yielding the Breslow formula

$$ \hat{S}\left({t}_j\right)=\exp \left(-\hat{H}(t)\right)={\prod}_{j:{t}_j\le t}\exp\ \left(-\frac{d_j}{n_j}\right) $$

Comparison of the Kaplan-Meier and the Breslow (Nelson-Aalen) estimators (Fig. 19). Since exp(−hj) ≈ 1 − hj for small hj, the Kaplan-Meier and Breslow estimates are very similar and asymptotically equal. The Breslow estimate has uniformly lower variance but is upward biased [46]. When ties are present in the data, the Kaplan-Meier estimate is more accurate. Fleming and Harrington proposed a modification of the Breslow estimate that introduces a small jitter to break ties in the follow-up times.

Fig. 19
A multiple-line graph of survival function versus follow-up. The lines are Kaplan, Meier, Nelson, and Aalen. All the lines descend from (0, 1.0) to (6, 0.92). Data are estimated.

The Kaplan-Meier and the Nelson-Aalen survival curves for the diabetes data set [47]. The two curves are so close that they are virtually indistinguishable

Confidence Intervals for the Survival Curves

Whenever a survival analysis is conducted, it is imperative to present confidence intervals (CIs). Statistical packages routinely offer such estimates. However, when survival analysis is conducted with less conventional time-to-event modeling methods, the packages implementing these methods often offer no facilities for CI estimation. We therefore present the fundamentals of estimating CIs for survival and hazard curves.

There are two fundamentally different approaches to constructing the confidence intervals and for each approach there are numerous variants. For brevity, in this section, we focus on one common method for directly estimating the confidence interval of the survival function. The interested reader is referred to Appendix 1 for the other methods.

Greenwood’s formula. We consider constructing the confidence interval in survival space (as opposed to log survival or hazard space). The variance of the log survival function can be estimated using Greenwood’s formula

$$ \widehat{\mathrm{Var}}\left(\log \hat{S}(t)\right)={\sum}_{j:{t}_j\le t}\frac{d_j}{n_j\left({n}_j-{d}_j\right)}, $$

where dj and nj are, as defined previously, the number of events at tj and the number of patients at risk at time tj, respectively. The delta method can be used to derive the variance of the (non-log) survival function, which yields the plain-scale confidence interval

$$ \hat{S}(t)\pm z\sqrt{\hat{S}{(t)}^2{\sum}_{j:{t}_j\le t}\frac{d_j}{n_j\left({n}_j-{d}_j\right)}}, $$

where z is the normal quantile corresponding to the confidence level [46].
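The sketch below extends the bare-bones Kaplan-Meier computation with Greenwood's variance accumulator and the plain-scale interval above. It is illustrative only (for example, it does not guard against the edge case nj = dj) and assumes SciPy for the normal quantile.

```python
import numpy as np
from scipy.stats import norm

def km_with_greenwood_ci(time, event, conf_level=0.95):
    """Kaplan-Meier estimate with a plain-scale Greenwood confidence interval."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    z = norm.ppf(0.5 + conf_level / 2.0)
    s, var_sum, rows = 1.0, 0.0, []
    for tj in np.unique(time[event == 1]):
        n_j = np.sum(time >= tj)
        d_j = np.sum((time == tj) & (event == 1))
        s *= 1.0 - d_j / n_j
        var_sum += d_j / (n_j * (n_j - d_j))            # Greenwood accumulator
        half = z * np.sqrt(s ** 2 * var_sum)
        rows.append((tj, s, max(s - half, 0.0), min(s + half, 1.0)))
    return rows

t = [2, 3, 3, 5, 6, 8, 9, 9, 12, 15]   # same illustrative data as above
e = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
for tj, s, lo, hi in km_with_greenwood_ci(t, e):
    print(f"t={tj:>4}  S={s:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```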

Method label: Kaplan-Meier (KM) estimator of survival curves

Main Use

 • Estimate survival curves

 • Visualize the survival curves

Context of use

 • Non-parametric modeling

 • Predict survival probability at time t

 • The data does not meet the assumptions of more sophisticated survival models (e.g., Cox regression)

Secondary use

 • Checking the proportional hazards assumptions

Pitfalls

Pitfall 3.6.1.1. Estimating the effect of covariates is difficult: a separate curve must be computed for each covariate combination, so the method does not scale beyond a very small number of covariates

Principle of operation

 • Non-parametric estimator

Theoretical properties and empirical evidence

 • In biomedicine, its use is practically expected in every publication involving survival analysis

Best practices

Best practice 3.6.1.1. Plotting the KM curve can reveal data problems. Consider the complementary log-log plot of the KM curve

References

Recommended textbooks include [45, 46, 48]

Comparing Survival Curves

Comparing the estimated survival curves from two or more populations. Two survival curves are considered statistically equivalent when the data support the hypothesis that the two curves are identical and any apparent difference between them is merely due to random variation in the samples used to estimate the curves.

In this section, we focus on the log-rank test; extensions of the log-rank test are described in Appendix 1. Consider a group variable, which divides the population into G groups. At each unique event time, j = 1, …, J, the association between grouping and survival can be assessed. The null hypothesis is that the hazard at time tj is the same across all groups for all j. The alternative hypothesis is that the hazard differs between the groups at at least one tj.

Let ngj denote the number of subjects at risk in group g at time tj and let dgj denote the number of failures in group g at time tj. For simplicity, we concentrate on the two-sample test, where G = 2. The expected number of failures in group 1 at time tj is

$$ {e}_{1j}=\frac{n_{1j}}{n_{1j}+{n}_{2j}}\left({d}_{1j}+{d}_{2j}\right) $$

The observed number of failures across time in group g is Og = ∑jdgj and the expected number of failures is Eg = ∑jegj. The log-rank statistic becomes

$$ Z=\frac{{\left({O}_g-{E}_g\right)}^2}{\mathrm{Var}\left({O}_g-{E}_g\right)}, $$

and the variance can be estimated from the hypergeometric distribution. Z follows a χ2 distribution with 1 degree of freedom and can be used as a test of curve equivalence [48].
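The two-sample log-rank statistic can be computed directly from the quantities defined above. The sketch below follows the O − E formulation with the hypergeometric variance and assumes group labels coded as 1 and 2; it is a minimal illustration rather than a substitute for a vetted implementation.

```python
import numpy as np
from scipy.stats import chi2

def two_sample_logrank(time, event, group):
    """Two-sample log-rank test: accumulate observed (O) and expected (E)
    failures for group 1 across event times, plus the hypergeometric variance."""
    time, event, group = (np.asarray(a) for a in (time, event, group))
    O1 = E1 = V = 0.0
    for tj in np.unique(time[event == 1]):
        at_risk = time >= tj
        n1, n2 = np.sum(at_risk & (group == 1)), np.sum(at_risk & (group == 2))
        d1 = np.sum((time == tj) & (event == 1) & (group == 1))
        d2 = np.sum((time == tj) & (event == 1) & (group == 2))
        n, d = n1 + n2, d1 + d2
        O1 += d1
        E1 += n1 * d / n
        if n > 1:
            V += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    Z = (O1 - E1) ** 2 / V
    return Z, chi2.sf(Z, df=1)             # chi-square statistic and p-value

t = [2, 3, 3, 5, 6, 8, 9, 9, 12, 15]       # illustrative data
e = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
g = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(two_sample_logrank(t, e, g))
```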

Cox Proportional Hazards Regression

Two important uses of regression models are to assess the effect of covariates on the hazard and to make predictions. The regression models we consider fall into two categories: semi-parametric and parametric models. Semi-parametric models, the Cox proportional hazards regression in particular, model the hazard as the product of a non-parametric baseline hazard function (which is a function of time) and a (time-invariant) multiplicative effect of the covariates. The covariates thus have a proportional (multiplicative) effect on the baseline hazard. Fully parametric models make a distributional assumption about the cumulative hazard (as a function of time) and model the parameters of this distribution as a linear additive function of the covariates. In this section, we focus on the Cox proportional hazards model (aka Cox model, Cox PH); fully parametric models are discussed in the section "Parametric Survival Models".

The proportional hazards assumption. Figure 20 illustrates the proportional hazards assumption using the diabetes example from [47]. The left panel shows the cumulative hazard of diabetes as a function of follow-up time in years. The orange curve corresponds to patients with impaired fasting glucose (IFG) and the blue curve to patients with healthy glucose. At all time points, the ratio of the cumulative hazard along the orange curve to that along the blue curve is constant, 6.37. In other words, having IFG (versus not having IFG) confers a proportional, 6.37-fold increase in diabetes risk, and this ratio remains constant across time. To translate this into the terminology of Cox models, the baseline hazard corresponds to patients without IFG (covariate x = 0), who have a time-dependent risk of diabetes depicted by the blue curve. Patients with IFG (x = 1) experience a risk (hazard) that is proportionally (6.37 times) higher across the entire timeline (orange curve).

Fig. 20
A set of 2 dual-line graphs. The lines are baseline and I F G. A. Cumulative hazard versus follow-up. Both lines follow an increasing trend. B. Survival function versus follow-up. The lines follow a decreasing trend.

Proportional Hazards Assumption. The left panel shows the cumulative hazard of patients with normal glucose (in blue) and impaired fasting glucose (IFG) (in orange) as a function of follow-up time (in years). The ratio of the underlying hazards of the orange line to the blue line is constant across time. The right panel transforms the cumulative hazard into survival probability

The Cox model. Let X be the covariate matrix, and let Xi denote the covariate vector for subject i. The hazard at time t is modeled as

$$ {h}_i(t)={h}_0(t)\exp \left({X}_i\beta \right) $$
(1)

where hi(t) is the hazard of the ith subject at time t, h0(t) is the baseline hazard (common across all subjects) at time t, and β are regression coefficients. The cumulative hazards can be expressed as

$$ {H}_i(t)={H}_0(t)\exp \left({X}_i\beta \right)=\exp \left({X}_i\beta \right)\ {\int}_0^t{h}_0\left(\tau \right) d\tau $$

showing that the covariates increase (or decrease) the cumulative hazard proportionally relative to the baseline cumulative hazard. For additional details about the model (e.g. the partial likelihood function), see Appendix 1.

Assumptions.

  1.

    The proportional hazards assumption: the covariates have a proportional (multiplicative) effect on the hazard relative to the baseline hazard.

    Consider two subjects, i and j, with covariate vectors Xi and Xj, respectively. The hazard ratio of these two subjects is

    $$ \frac{H_i(t)}{H_j(t)}=\frac{H_0(t)\exp\ \left({X}_i\beta \right)\ }{H_0(t)\exp \left({X}_j\beta \right)}=\frac{\exp \left({X}_i\beta \right)\ }{\exp \left({X}_j\beta \right)\ } $$

    and is constant with respect to time (the \( \frac{\left(\exp \left({X}_i\beta \right)\right)}{\left(\exp \left({X}_j\beta \right)\right)} \) ratio does not depend on time). The name proportional hazards reflects the fact that the hazards of two patients are proportional to each other.

    Continuing with the diabetes example, if patient i has IFG (Xi = 1) and patient j does not (Xj = 0), with β = 1.85, the hazard ratio is exp(1.85) = 6.37. Therefore, the ratio of the hazards between the orange and the blue curves in Fig. 20 is 6.37.

  2.

    Independence. Observations with an event are independent of each other; only observations with an event contribute factors to the partial likelihood.

  3.

    The effect of the covariates is linear and additive on the log-log survival.

Testing the Significance of the Covariates

Generally, in regression, we have two ways to test the significance of a coefficient. The first method is the likelihood ratio test and the second one is the Wald test. Although the proportional hazards regression maximizes a partial likelihood (as opposed to a full likelihood) as it leaves the baseline hazard unspecified, this does not affect the likelihood ratio test and both methods remain applicable.
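
As a hedged illustration (not the chapter's own code), a Cox model can be fitted and both tests read off with the lifelines Python package; the toy dataframe and its column names below are assumptions made only for this sketch, not the chapter's diabetes cohort.

```python
# Sketch: fit a Cox PH model with lifelines and inspect the Wald and
# likelihood ratio tests. The data below are toy values for illustration only.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "years":    [1.2, 3.5, 4.0, 2.2, 6.1, 5.0, 2.8, 7.3],  # follow-up time
    "diabetes": [1,   0,   1,   1,   0,   1,   0,   1],    # 1 = event, 0 = censored
    "ifg":      [1,   0,   1,   0,   1,   1,   0,   0],    # covariate of interest
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years", event_col="diabetes")
cph.print_summary()                                # per-coefficient Wald tests (z, p-value)
cph.log_likelihood_ratio_test().print_summary()    # overall likelihood ratio test
```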

Estimating the Baseline Hazard

Fitting a Cox proportional hazards model does not require the estimation of the baseline hazard. After the model has been fitted, the baseline hazard function is estimated using a variant of the Nelson-Aalen estimator that incorporates the effects of the covariates

$$ {\hat{H}}_0(t)={\sum}_{j:{t}_j\le t}\frac{\delta_j}{\sum_k{R}_k\left({t}_j\right)\exp \left({X}_k\beta \right)}. $$

where Rk(tj) indicates whether subject k is in the risk set at time tj. Notice that when β = 0, this reduces to the Nelson-Aalen estimator from the “Terminology” section.

The variance of the baseline hazard is also based on the Nelson-Aalen estimator:

$$ \mathrm{Var}\left({\hat{H}}_0(t)\right)={\sum}_{j:{t}_j\le t}\frac{d_j}{{\left({\sum}_k{R}_k\left({t}_j\right)\exp \left({X}_k\beta \right)\right)}^2}. $$

Making Predictions

For an individual i, the cumulative hazard can be estimated as

$$ {\hat{H}}_i(t)={\hat{H}}_0(t)\exp \left({X}_i\beta \right) $$

and the corresponding survival can be computed using the Breslow estimator (see section "Estimating the Survival Function".)

$$ {\hat{S}}_i(t)=\exp \left(-{\hat{H}}_i(t)\right) $$
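
The following NumPy sketch spells out both steps above: the Breslow-type baseline cumulative hazard and the resulting survival prediction for a new covariate vector. The coefficient vector beta is assumed to have been estimated already (e.g. by a Cox fit), and all variable names are illustrative.

```python
# Sketch: Breslow baseline cumulative hazard and survival prediction, given an
# already-estimated coefficient vector beta.
import numpy as np

def breslow_baseline_cumhaz(time, event, X, beta, t_grid):
    time, event, X = np.asarray(time), np.asarray(event), np.asarray(X)
    risk_score = np.exp(X @ beta)                              # exp(X_k beta)
    event_times = np.sort(np.unique(time[event == 1]))
    increments = np.array([
        event[time == t].sum() / risk_score[time >= t].sum()   # d_j / sum over the risk set
        for t in event_times
    ])
    return np.array([increments[event_times <= t].sum() for t in t_grid])

def predict_survival(H0_on_grid, x_new, beta):
    # S_i(t) = exp(-H_0(t) * exp(x_i beta))
    return np.exp(-H0_on_grid * np.exp(np.asarray(x_new) @ beta))
```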

Testing the Proportional Hazards Assumption

There are three methods for testing the proportional hazards assumption: (1) visual inspection, (2) formal statistical testing with time-dependent covariates, and (3) Schoenfeld residuals. We describe the first two methods and refer the interested reader to Appendix 1 for a more thorough discussion of the Schoenfeld residuals.

Visual Inspection

The first method is visual inspection of the log-log survival plot. Since under the proportional hazards assumption,

$$ {\hat{S}}_i(t)=\exp \left(-{\hat{H}}_0(t)\exp \left({X}_i\beta \right)\right), $$

its log-log transform is

$$ \log \left(-\log {\hat{S}}_i(t)\right)=\log {\hat{H}}_0(t)+{X}_i\beta . $$

The log-log transforms of two survival curves, corresponding to two different values of Xi, say x1 and x2, differ only in the Xiβ term, which is not a function of time t; thus the two curves should be parallel, with a distance of (x2 − x1)β between them.

To check the validity of the proportional hazards assumption, we plot the log-log transform of the Kaplan-Meier survival curves for two different values of Xi and expect these curves to be parallel.

A benefit of visual inspection is that we can see where (at what t) the violation of the proportional hazards assumption happens and we may also see patterns that suggest functional forms to correct the violation. However, the decision is subjective: no formal test is applied, and hence no test statistic or p-value is obtained to guide the decision as to whether the proportional hazards assumption is violated.

Figure 21 shows the complementary log-log plot of the diabetes data set. The two curves correspond to two levels of the covariate glucose status: the blue line shows patients without impaired fasting glucose (IFG) while the orange line shows patients with IFG. Since the two curves are parallel, the proportional hazards assumption appears to hold for glucose status.

Fig. 21
A dual-line graph of log minus log survival versus log follow-up. The parameters are baseline and I F G. The lines follow an increasing trend.

Log-log survival plot of the diabetes dataset. The blue line corresponds to patients with healthy glucose levels, and the orange line to patients with impaired glucose levels. The log-log plots for the two levels of glucose status (normal versus impaired glucose) are parallel, suggesting that the proportional hazards assumption is acceptable

Figure 22 shows two synthetic examples where the proportional hazards assumption is violated. In both examples, the blue line represents the baseline hazard and the orange line corresponds to some exposure. In the left panel, the effect of the exposure changes from beneficial to harmful at about 2 years. In the right panel, the effect of the exposure (orange line) is quadratically related to (log) time.

Fig. 22
A set of 2 dual-line graphs of log minus log survival versus log follow-up. The parameters are baseline and I F G. The baselines follow an increasing trend. The I F G starts from the middle and follows an ascending trend.

Violations of the proportional hazards assumption. The blue line is the baseline hazard, while the orange line corresponds to some treatment. The left panel shows a violation where the treatment effect “switches over”: while it is beneficial initially, it becomes harmful after some time. The right panel shows a violation where the treatment effect is a function of time; the curve suggests a functional form (quadratic)

Time-Dependent Covariates

The second method is based on time-dependent covariates. Under the proportional hazards assumption, adding regression terms involving interactions between the covariates and functions of time should not improve the fit. To check the validity of the proportional hazards assumption, we fit models of the form

$$ h(t)={h}_0(t)\exp \left( X\beta +\left(X\times g(t)\right)\gamma \right), $$

where g(t) is a vector of functions of time, X × g(t) are covariate-time interactions, and γ is the coefficient vector of the covariate-time interaction terms. Under the proportional hazards assumption, we expect γ = 0.

A benefit of this method is that a statistical test is performed, a p-value is obtained, and thus the decision is objective. A weakness is the need for choosing an appropriate function g(t). Different choices of g() can lead to different conclusions. Common choices include the identity: g(t) = t; the log transform of time: g(t) = log t; and the Heaviside function, where g(t) = 1 if t exceeds a threshold τ and g(t) = 0 otherwise.

In practice different functions of t are tested. The complementary log-log plot can provide hints as to the functional form of the violation.
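
In practice, this check is usually run through library diagnostics rather than hand-built interaction terms. The sketch below uses lifelines' proportional_hazard_test, which computes, for a chosen time transform g(t), a closely related test based on scaled Schoenfeld residuals (the third method, detailed in Appendix 1); the fitted model cph and dataframe df are the ones assumed in the earlier fitting sketch.

```python
# Sketch: proportional hazards diagnostics for the previously fitted Cox model,
# using g(t) = log t as the time transform.
from lifelines.statistics import proportional_hazard_test

result = proportional_hazard_test(cph, df, time_transform="log")
result.print_summary()          # per-covariate test statistic and p-value
# cph.check_assumptions(df) prints similar diagnostics together with advice.
```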

Addressing the Violations of the Proportional Hazards Assumption

The consequences of violating the proportional hazards assumption are usually not dire. Violations do not usually affect the predictions; they mostly affect the error estimates.

Workarounds for the violations exist; however, they end up answering a question that is different from the original research question.

When the data set is large, violations are almost unavoidable. Thus depending on the extent of the violation and purpose of the study, we may opt to ignore the violation.

Suppose a test reports a proportional hazards violation. We start by verifying that the non-proportionality is substantial, because not all non-proportionalities are. Statistically significant non-proportionality can arise from large sample sizes, where even small deviations from proportionality become significant, or from influential points. The former can be ignored; the latter can be removed. To assess whether a non-proportionality is substantial, the key method is visualization. Not only can visualization show whether the non-proportionality is substantial, but it can also suggest a functional form to correct it.

For example, a formal test reports violations for the diabetes data set. However, inspecting the complementary log-log plot (Fig. 21) shows no violation of concern; the statistically significant violation is simply a result of the large sample size (54,700 patients) and is inconsequential to the analysis results.

Once we verified that the violation is substantial and decided to address it, we have several options.

  1.

    The first option is stratified Cox models. If the covariate with the non-proportionality is a factor with relatively few levels, it can be used as a stratification factor in a stratified Cox model. The non-proportional effect now becomes part of the baseline hazard. If the covariate is a quantitative (continuous valued) variable, stratified Cox models can still be constructed, but the variable needs to be categorized (into a few categories) before it can be used as a stratification factor.

  2.

    If the non-proportionality is present in a relatively short timeframe and not in the entire timeline, the timeline can be partitioned into segments in which the proportional hazards assumption holds and separate Cox models can be constructed in each time segment.

  3.

    Finally, if the non-proportionality was detected through method (2) or (3) (see Appendix 1), using time or a transformation of time g(t), adding an interaction term with the appropriate time transformation can resolve the non-proportionality.

Method label: Cox proportional hazards regression

Main Use

 • Regression models for time-to-event outcomes

Context of use

 • Right-censored data

 • Interest is the effect of covariates and making predictions

 • Same interpretability as classical regression models for other outcome types

Secondary use

N/A

Pitfalls

Pitfall 3.6.3.1. The key assumption is the proportional hazards assumption. Often, violation of the proportional hazards assumption is a non-issue; occasionally it can lead to problems

Pitfall 3.6.3.2. The models assume linearity and additivity. Not appropriate if these assumptions are violated

Pitfall 3.6.3.3. High dimensionality is a problem for the unregularized model

Principle of operation

 • It is a semi-parametric regression model

 • The effect of covariates is a proportional (multiplicative) increase/decrease relative to a time-dependent baseline hazard

 • Coefficient estimates are obtained from maximizing a partial likelihood

Theoretical properties and empirical evidence

 • Although a partial likelihood is maximized, the favorable properties of maximum likelihood estimation are preserved: Estimates are consistent, efficient and asymptotically normally distributed

 • The negative log partial likelihood is convex and thus easy to optimize

Best practices

Best practice 3.6.3.1. First-choice model for time-to-event data

Best practice 3.6.3.2. Consider, additionally, whether the problem can be solved as a classification problem, or using survival modeling versions of ML predictive models

Best practice 3.6.3.3. In the presence of substantial violations, different models, including extensions of the Cox PH, may be more appropriate

Best practice 3.6.3.4. Consider the Markov boundary feature selector for survival analysis that results from using Cox proportional hazards models as conditional independence tests within the Markov boundary algorithm

Best practice 3.6.3.5. For high-dimensional data, consider regularized Cox proportional hazards models. Also consider the Cox Markov boundary method described above

Best practice 3.6.3.6. If age is included in the model and is nonlinear, consider an age-scale Cox PH model

References

Kleinbaum DG, Klein M. Survival Analysis: A Self-Learning Text. Springer, 2020

Therneau T, Grambsch P. Modeling Survival Data: Extending the Cox Model. Springer, 2000

Extensions of the Cox Proportional Hazards Regression

Several extensions to the Cox PH model have been proposed. In this section, we review some of them.

Stratified Cox Model

Stratified Cox models allow the population to be divided into different non-overlapping groups, called “strata”. Each stratum has its own baseline hazard and each group may also have its own coefficient vector. The standard form of a stratified Cox model is

$$ {h}_i(t)={h}_{0k}(t)\exp \left({X}_i\beta \right) $$

which assumes a common covariate effect across all strata that is proportional to the stratum-specific baseline hazard, h0k(t) for the kth stratum. The coefficients represent an “average” hazard ratio across the population (regardless of strata). This is the most flexible way of incorporating effects that violate the proportional hazards assumption, but stratified Cox models offer no direct way of assessing the significance of the stratifying factor. An alternative form of the stratified Cox model considers the possibility of some covariates in a stratum (or some of the strata) having an effect that differs from their effect in other strata. Such effects are incorporated as interaction effects between the covariate and the stratum. If all covariates have interactions with the strata, then the resulting Cox model is the same as fitting separate Cox models for each stratum. Naturally, having to estimate separate baseline hazards and interaction terms requires sufficient sample size.
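
A minimal sketch of the standard (shared-coefficient) form in lifelines is shown below; the stratification column 'site' is a hypothetical factor assumed to exist in the toy dataframe from the earlier sketches, and the alternative form would be obtained by adding covariate-by-stratum interaction terms.

```python
# Sketch: stratified Cox model. Each level of 'site' receives its own baseline
# hazard; the coefficient for 'ifg' is shared across strata. 'site' is a
# hypothetical column assumed to exist in df for this illustration.
from lifelines import CoxPHFitter

cph_strat = CoxPHFitter()
cph_strat.fit(df, duration_col="years", event_col="diabetes", strata=["site"])
cph_strat.print_summary()
```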

Recurring Events and Counting Process Cox Model

So far, time-to-event data was described by the triplet {Ti, δi, Xi}, where Ti denotes the time to event, δi the event type (event or censoring), and Xi is the covariate vector. Alternatively, each subject’s timeline can be divided into multiple segments and each segment can be described by a quartet {starti, endi, δi, Xi}, where starti and endi are the two end points of the time segment, δi denotes whether an event occurred in the time segment, and Xi is the covariate vector. This format is called the counting process format. Many applications of the counting process format exist; here we highlight a few.

The first application is the change of the time scale. The term time scale refers to the way time is measured. The triplet format measures time on the study scale, and, specifically, time 0 is when subjects entered the study. The counting process format allows for different time scales. For example, time can be measured as patients’ age, where starti is the age when they entered the study and endi is the age when they experienced an event. We discuss different time scales later in more detail.

Another commonly used application of the counting process format is time-dependent covariate Cox models. Time-dependent covariate Cox models allow for modeling under the assumption that the covariates can change over time. The time scale is divided into multiple segments and each segment can have its own covariate vector. As long as the subjects experience at most one event, the time-dependent covariate Cox model does not cause any complications, even though each subject can contribute multiple observations (rows). This stands in contrast to longitudinal data analysis (section "Longitudinal Data Analysis"), where observations from the same subject are correlated and this causes estimation issues. The key assumption to avoid such estimation problems is that the subjects have at most one event.
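
A hedged sketch of the counting-process layout and a time-dependent covariate Cox fit, using lifelines' CoxTimeVaryingFitter, is shown below; the subjects, the segments, and the 'bmi' covariate are invented purely for illustration.

```python
# Sketch: counting-process format {start, stop, event, covariates}; each row is
# one segment of a subject's follow-up, and covariates may change between rows.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

long_df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "start": [0, 3, 0, 4, 0, 2],
    "stop":  [3, 6, 4, 9, 2, 7],        # (start, stop] follow-up segments
    "bmi":   [27, 31, 24, 25, 30, 22],  # covariate value during the segment
    "event": [0, 1, 0, 1, 0, 1],        # event at the end of the segment
})

ctv = CoxTimeVaryingFitter()
ctv.fit(long_df, id_col="id", start_col="start", stop_col="stop", event_col="event")
ctv.print_summary()
```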

A third application of the counting process format is when subjects can experience multiple events. The timeline can be divided into multiple segments when subjects experience an event, resulting in a separate timeline for the first, second, etc. event. Now, each subject can enter the partial likelihood function multiple times. Several remedies exist. First, we can consider only the first event of all patients. Second, we can use longitudinal data analysis techniques. Analogues of both GEEs and mixed effect models exist for time-to-event outcomes. A third, commonly used option is to initially fit a model ignoring the correlation due to the possibly multiple observations per subject (with event) and then re-compute the error estimates, taking the correlation into account. Chapter 8.2.2 of [19, 46] describes three popular variations of this option in detail.

Age-Scale Models

The term time scale refers to the way time is measured for a time-to-event outcome. Typically, time is measured from a particular event, e.g. enrollment into the study, to the end of study. This is the study time scale. An alternative is calendar scale, where time is measured based on a calendar, e.g. the age of the participant.

Changing the time scale has two important effects. First, the risk sets are different. At first sight it may appear that the age scale can easily be converted into the study scale by Ti = endi − starti; however, the risk sets are different. Consider two patients. The first one enters the study at the age of 40 and suffers a heart attack (event of interest) at the age of 51. The second one enters the study at 55 and suffers a heart attack 5 years later, at the age of 60. On the study-time scale, we have two events, one at 5 and one at 11 years. At the time of the first event, at year 5, we have a risk set of two patients. In contrast, on the age scale, we have two events, one at 51 and one at 60. At both events, the risk set contains only one patient. Since the risk sets are different, the survival estimates (or equivalently the hazard estimates) are different as well. These two time scales yield different results and admit different interpretations.
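
The two-patient example can be reproduced in a few lines of NumPy to show how the risk sets change with the time scale; the arrays below simply encode the ages given in the text.

```python
# Sketch: risk sets for the two-patient example on the study scale vs. the age
# scale (with delayed entry on the age scale).
import numpy as np

entry_age = np.array([40, 55])
event_age = np.array([51, 60])
study_time = event_age - entry_age                # 11 and 5 years on the study scale

for t in np.sort(study_time):                     # study scale: at risk from time 0
    print(f"study time {t}: at risk = {(study_time >= t).sum()}")

for a in np.sort(event_age):                      # age scale: at risk only between
    at_risk = (entry_age < a) & (event_age >= a)  # entry age and event age
    print(f"age {a}: at risk = {at_risk.sum()}")
```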

The second effect of age-scale relates to how age is entered into the model. One option is to use study-scale and add a covariate that represents age; and the other option is to use age-scale. In case of using age scale, age is modeled completely non-parametrically; the baseline hazard is a function of age. As such, the statistical significance of the age effect is difficult to assess. Conversely, when age is added as a covariate, the usual assumptions (linear, additive effect) apply and the baseline hazard is based on time in the study. Whether we use age-scale or study-time scale can also be determined based on whether the model assumptions about age as a covariate are reasonable.

Parametric Survival Models

The Cox proportional hazards model estimates the effects of the covariates first and then estimates the baseline hazard in a non-parametric manner. Non-parametric estimation typically requires more samples than parametric estimation.

Parametric models that model the time-dependent hazard (or equivalently, the survivor) curves in a fully parametric manner, can be more sample efficient if their assumptions are met.

In this section, we model the time-to-event variable T using parametric distributions. Consider X, a covariate matrix, β the regression coefficients, and W the error term. Rather than modeling T directly, we model its natural logarithm as

$$ \log\ T=\mu + X\beta +\sigma W $$

In this model, μ is called a location parameter, σ is called the scale parameter and W is the error term. Similarly to linear regression, in parametric survival models, the coefficients have a linear effect on the location parameter of the distribution of logT.

Principle of operation. Recall from section “Predictive Modeling Tasks” that in OLS regression with covariates X, outcome y, and error term ε, the model can be written as y = Xβ + ε. The error term is assumed to follow a normal (Gaussian) distribution, with location parameter (mean) μ = 0 and scale parameter (standard deviation) σ. The covariates linearly affect the location parameter and the outcome thus has the same distribution as the noise, i.e. Gaussian, but with location parameter μ = Xβ and scale parameter σ (which remains unchanged).

Parametric survival models work analogously. The error term W is assumed to have a particular distribution with location and scale parameters μ and σ, respectively. The outcome logT then follows the same distribution as W, with location parameter μ + Xβ. The model assumes that the covariates affect the location parameter linearly. The various parametric survival models differ in their choice of the distribution of W. We refer the reader to Appendix 1, which discusses several such distributions and the corresponding parametric survival models.

Property [Accelerated Failure Time (AFT)]. The covariates shift the location μ, which accelerates or decelerates the passing of time. This class of models is referred to as accelerated failure time (AFT) models. Let S0(t) denote the survival time distribution when all covariates are 0. The survival time distribution for a subject with covariates X is

$$ {\displaystyle \begin{array}{c}S(t)=\Pr \left(T>t\right)=\Pr \left(\log T>\log t\right)\\ {}=\Pr \left(\mu + X\beta +\sigma W>\log t\right)\\ {}=\Pr \left(\mu +\sigma W>\log t- X\beta \right)\\ {}=\Pr \left(\exp \left(\mu +\sigma W\right)>t\exp \left(- X\beta \right)\right)\\ {}={S}_0\left(t\exp \left(- X\beta \right)\right)\end{array}} $$

The covariates, depending on the sign of Xβ, accelerate or decelerate the passing of time by a factor of exp(−Xβ).

Figure 23 shows an AFT model fitted to the diabetes dataset. The outcome is diabetes-free survival, the horizontal axis is follow-up years. The orange line represents patients with impaired fasting glucose (IFG) and the blue line represents patients with normal fasting glucose. Patients with normal fasting glucose have higher diabetes-free survival probability. If we draw a horizontal line at a particular (diabetes-free) survival probability, and compute the ratio of the time it takes to get to that probability along the orange line versus the blue line, we would find that this ratio is constant, exp(−2.08) = 0.12 in this example. In other words, the time it takes for the diabetes-free survival to drop to a probability P is much shorter (takes 0.12 times as long) for patients with IFG than without.

Fig. 23
A dual-line graph of survival versus log follow-up years. The lines are baseline and I F G. The lines follow a decreasing trend from (0, 1.0) to (6, 0.95) and (6, 0.65). Data are estimated.

Illustration of an accelerated failure time model on the diabetes data set
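
As a hedged sketch, an AFT model of this kind can be fitted with lifelines' WeibullAFTFitter; the dataframe df with columns 'years', 'diabetes', and 'ifg' is the same toy example assumed in the earlier Cox sketches, not the chapter's dataset.

```python
# Sketch: Weibull accelerated failure time model. Coefficients act on log T, so
# exp(coef) is a time ratio; a value below 1 means event times are contracted.
from lifelines import WeibullAFTFitter

aft = WeibullAFTFitter()
aft.fit(df, duration_col="years", event_col="diabetes")
aft.print_summary()
# A negative coefficient for 'ifg' (exp(coef) < 1) would mean diabetes-free time
# is contracted for IFG patients, matching the acceleration-factor interpretation.
```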

Method label: accelerated failure time (AFT) models

Main Use

 • Regression models for time-to-event outcomes

Context of use

 • Right-censored data

 • Interest is the effect of covariates and making predictions

 • Same interpretability as regression models for other outcome types

Secondary use

 

Non-recommended Uses and Pitfalls

Pitfall 3.6.4.1. The key assumption is the accelerated failure time (AFT) assumption. Not appropriate if this assumption is violated

Pitfall 3.6.4.2. The models assume linearity and additivity (location shift). Not appropriate if these assumptions are violated

Pitfall 3.6.4.3. High dimensionality is a problem

Principle of operation

 • Fully parametric model that specifies the full likelihood

 • The error term is assumed to have a location-scale distribution. This ensures that the log survival time has the same distribution. Covariates change the location parameter, accelerating/decelerating the passing of time

Theoretical Properties and empirical evidence

 • Parameter estimates are obtained using maximum likelihood estimation. They are consistent, asymptotically efficient, and asymptotically normally distributed

 • AFT models form a family based on different error distributions, with different properties:

Exponential survival model—Constant hazard assumption

Weibull survival model—AFT and PH

Log-logistic survival model—AFT and proportional odds assumption

Best practices

Best practice 3.6.4.1. Use AFT if the assumptions are met

Best practice 3.6.4.2. Use Cox PH if only the PH assumption is met

References

Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer, 2003

Kleinbaum DG, Klein M. Survival Analysis: A Self-Learning Text. Springer, 2020

Parametric Survival Models Versus Cox PH Models

If the model assumptions of the parametric models are met, the parametric models are more sample efficient. If the assumptions are not met or if we are in doubt, the semi-parametric model is more robust to model misspecification and only requires the proportional hazards assumption.

Appendix 1 describes methods to check the appropriateness of various parametric survival models. In this section, we show one example, comparing the fit from a Weibull model (a particular type of parametric survival model) with a Cox PH model.

The left panel in Fig. 24 shows the complementary log-log plot of the diabetes data set. We continue to use impaired fasting glucose (IFG) as the sole covariate and the two survival curves were computed using the Kaplan-Meier estimator. The two lines corresponding to the two values of this covariate, IFG in orange and non-IFG in blue, are reasonably straight and parallel for the first 6 years. As shown in Appendix 1, the curves being parallel indicates that the proportional hazards assumption holds; if the curves are straight, the AFT assumption holds. Beyond 6 years, the curves turn and become horizontal. They remain parallel but no longer have a constant slope. The turn signals a violation of the AFT assumption; however, the curves remain parallel, indicating that the PH assumption is still met. This might appear to be a small violation; however, a large portion of the population has a follow-up time in excess of 6 years.

Fig. 24
Two graphs. A. A dual-line graph of log minus log survival versus log follow-up. Both lines follow an increasing trend. B. A multiple-line graph of survival versus follow-up. The parameters are no I F G, I F G, Kaplan-Meier, and Weibull. All lines follow a decreasing trend.

Weibull survival model on the diabetes data set. The left panel shows the complementary log-log survival curves. The orange line corresponds to patients with IFG and the blue line to patients without. The right panel shows the survival curves. The solid lines are estimated using the Kaplan-Meier estimator, while the dashed lines are computed from a Weibull model (see Appendix 1 for details). Orange corresponds to patients with IFG, while blue corresponds to patients with healthy fasting glucose

The right panel in Fig. 24 shows the Weibull fit (in dashed lines) and the Kaplan-Meier survival curve (in solid line) for the IFG patients (orange) and non-IFG patients (blue). We can see that the lack of events beyond 6 years caused a substantial bias in the Weibull estimates. We expected this bias based on the violation of the AFT assumption. Since the PH assumption is still met, a Cox model would be a better fit for this data.

Non-Linear Survival Models

The regression models in the previous sections all assume that the covariates have a linear (additive and proportional) relationship with the log hazard or log survival time. To overcome this limitation, the original features X can be transformed through a non-linear, non-additive transformation to serve as the input to the partial or full likelihood function of the above models. Deep-learning based survival models and the Gradient Boosting Machine (GBM) for time-to-event outcomes have taken this approach. The Xβ term in the Cox partial likelihood is replaced by a non-linear, non-additive function f(X). This function is an ANN for deep learning and a GBM for Cox GBM.

A Random Survival Forest (RSF) consists of a collection of B trees. This collection does not directly maximize a likelihood function like the previously discussed methods, so RSF works slightly differently. In RSF, each of the B trees models the cumulative hazard of a patient using the Nelson-Aalen estimator. The cumulative hazard estimates from the B trees are then averaged to obtain an overall prediction for the cumulative hazard [50].

One key issue in time-to-event modeling is censoring. The partial likelihood automatically takes censoring into account, but the full likelihood may not. Deep learning models based on the full likelihood, assuming a Weibull distributed survival time, have been proposed. An alternative to the partial likelihood in the presence of censoring is the censoring unbiased loss (CUL), which is a general method for bias-correcting the unobservable loss. Censoring unbiased deep learning (CUDL) follows this strategy [51, 52].

High-dimensional data. Similar to non-survival regression models, high dimensionality, when the number of predictor variables is large relative to the number of observations, poses a challenge. In non-survival regression, regularizing the likelihood function was one of the solutions. An analogous solution, regularizing the partial likelihood function of the survival model, has been proposed in the form of an elastic-net style Cox model.
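
A hedged sketch of such a regularized fit, using the elastic-net penalty options available in recent versions of lifelines, is shown below; df_highdim stands for a hypothetical wide dataframe and the penalty values are arbitrary placeholders that would normally be tuned by cross-validation.

```python
# Sketch: elastic-net style regularized Cox model. `penalizer` is the overall
# penalty strength and `l1_ratio` the L1/L2 mix applied to the partial likelihood.
from lifelines import CoxPHFitter

cph_reg = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph_reg.fit(df_highdim, duration_col="years", event_col="diabetes")  # df_highdim: hypothetical wide data
cph_reg.print_summary()
```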

Survival models for longitudinal data. When we have longitudinal data, the covariates and the outcome can change over time. We have already discussed extensions to the Cox model that allow for changing predictors (time-dependent covariates) and recurring events. In the general regression setting, longitudinal data is handled through marginal models or through mixed effect models, because the observations become correlated. We have also discussed that in the Cox model, as long as we only have one event per patient, marginal or mixed effect models are not required [46].

Apart from providing the correct error estimates in the longitudinal setting, mixed effect models are also used for separating subject-specific and population effects. Frailty models are the time-to-event outcome analogues of the mixed effect regression models and allow for separating subject-specific effects and population effects.

Longitudinal Data Analysis

Longitudinal data is generated when measurements are taken for the same subjects on multiple occasions. For example, EHR data of patients is longitudinal as the same measurements, e.g. vitals, are taken at multiple encounters. Longitudinal data stands in contrast with single cross-sectional data, where measurements are taken (or aggregated) at a single particular time point. It also contrasts with time series data, where measurements are taken for a single subject (or for few subjects) for a long period of time and inference is conducted within the subject.

Using longitudinal data offers several advantages. (1) It can provide more information about each subject than data from a single cross-section since we observe the subject over a time span. (2) It also allows for a crossover study design, where a patient can be a control patient for himself: When a subject experiences an exposure during the study period, he/she is a “control” subject before the exposure and is an “exposed” patient after the exposure. (3) it also allows for separating aging effects from intervention effects. Finally, (4) it allows for separating subject-specific effects from population effects [5, 53].

Figure 25 shows an illustrative synthetic data set. Five subjects are followed over 10 time periods and a measurement is taken in each time period. The left panel shows a plot of the data set. The horizontal axis represents time, and the vertical axis is the measurement. We can see an overall upward trend: as time increases the measured values increase. We fitted a linear regression model to the entire data, which is shown as the bold black line. This model is a population-level model and it confirms this increasing trend. We also fitted a regression line, shown as dashed gray lines, to each individual subject. These are called individual-level lines. We can see that most (all five in this sample) subjects also exhibit an increasing trend, but their initial points (y-intercepts) vary, and their slopes also vary. Some methods allow for modeling individual effects such as the per-subject intercept and per-subject slope.

Fig. 25
2 graphs. A. A multiple-line graph of the outcome versus time. A line and 5 dotted lines on the sides follow an increasing trend. B. A scatterplot of the error versus index. The plots in different shades are scattered in a linear pattern.

Longitudinal Data Illustration. Five subjects are followed over 10 time periods and a measurement is taken in each time period. The left panel shows a plot of the data set. The horizontal axis represents time and the vertical axis is the measurement. The bold black line depicts an overall trend (population trend) and the 5 dashed lines represent the (individual) trends of the five subjects. The right panel shows the error relative to the population trend. The horizontal axis is an index, grouped by subject. Different colors represent different subjects

These advantages of longitudinal data analysis, however, come at a price. The multiple observations of the same patients are correlated with each other, which violates the i.i.d. (independent, identically distributed) assumption that most analytic methods make.

The right panel in Fig. 25 shows the error relative to the population-level regression model (the bold black line in the left panel). The horizontal axis is simply an observation index and the vertical axis is the (signed) error (residual). Observations from the five subjects are grouped together along the horizontal axis in increasing order of time: index 1–10 corresponds to the 10 time points of the first subjects, etc. Different subjects are depicted in different colors. We can see that the errors of each subject (errors depicted in the same color) tend to form clusters. Within a subject, once we know the error of one observation, errors of the other observations will typically not differ as much as errors from a different subject. This means that errors of the same subject are correlated with each other. There is also a trend within each subject: as time increases, the errors tend to increase or decrease. This is due to the differences in the growth rates of the different subjects (the differences across the slopes of the gray lines in the left panel).

If we assumed that the errors in the right panel were generated from 50 independent observations, we would estimate the variance of the outcome to be about 1 (with errors ranging between −2 and 2). Once we account for the fact that the observations came from 5 different subjects, the spread of the error becomes the range covered by the same color, and the variance becomes approximately 0.57; after also accounting for the differences in individual growth rates, the error variance drops to approximately 0.1. Such a reduction in the noise variance leads to much improved estimates and is very beneficial for detecting significant effects from exposures.

The data is balanced when measurements for all subjects are taken at the same time points.

When the data is balanced, coefficient estimates, whether they are computed using methods for longitudinal data or for cross-sectional data, will be similar, albeit with substantially different error estimates. If the purpose of the analysis is prediction for previously unseen subjects, no individual effect estimates will be available; thus the results obtained from the regular regression models will be very similar to those obtained from the longitudinal models.

Conversely, when the design is not balanced, methods specifically designed for longitudinal data should be used. Also, when the significance of the coefficients needs to be estimated, or estimating errors is important, or individual (within-subject) effects are of interest, or if predictions are to be made for previously seen patients (whose individual effect sizes are already estimated), methods specifically designed for longitudinal data should be used (regardless of whether the design is balanced or not).

As we mentioned earlier, the key drawback of using longitudinal data is the correlation among the observations of the same subject. All methods in this section address this correlation. Moreover, linear mixed models (LMMs) can additionally model within-subject variability, while generalized estimating equations (GEE) offer improved coefficient estimates at lower computation cost relative to LMMs. Both of these techniques are described in later sections.

Terminology and Notation

The sampling unit of the analysis is a subject or a patient and we index the sampling units i = 1, …, N. The analytic units are observations. Each patient can have multiple observations, indexed by j = 1, …, ni, taken at ni different occasions (time points). The time of these occasions is denoted by tij, the time of the jth occasion for the ith patient.

The design is balanced, if all subjects share the same time points.

Let yij denote the response variable (of patient i at occasion j) and let X be covariates. The covariates for subject i can be time-invariant (constant across time) or they can vary across time (a situation referred to as time-varying covariates). The vector of time-invariant covariates for subject i is denoted by Xi and the vector of time-varying covariates from subject i at occasion j is denoted by Xij.

Random effects are effect estimates that are computed for observation units that are thought of as a random sample from a population. In contrast, fixed effects are effect estimates computed for specific observation units. The within-subject effects are random effects, because the corresponding units, namely the subjects, are thought of as a (hopefully) representative random sample (the discovery cohort) from a population of patients. We could have conducted our study with a different random sample from the same population and we would expect similar results. Conversely, the time effects are fixed effects, because we wish to know the effect of a specific time period j on the outcome. The time points are not a representative random sample from a population of time points, they represent periods of exposure to the intervention. If we conducted our study using different time periods, say 2 months exposures as opposed to 2 days, we would certainly expect to get different results.

The questions we ask about longitudinal data are similar to and are a superset of the questions we ask about cross-sectional data. These questions include:

  1.

    Are two sets of observations (yi1, yi2, …, yin and yk1, yk2, …, ykn), one for patient i and the other one for patient k, different?

  2.

    Are observations at different time points j and k different (y·j=?y·k)? Or more broadly, describe the changes in observations over time.

  3.

    Making predictions. We may wish to predict the value of the observation at a particular time point for a subject we have observed before; or we may want to predict the value of an observation for a subject that we have not seen before.

  4.

    Estimate the effect of exposures.

  5.

    Estimate subject-specific effects.

ANOVA and MANOVA for Repeated Measures

Before the advent of more advanced and flexible analysis methods, repeated measures ANOVA and MANOVA were the first-choice methods for analyzing repeated measures data. In this chapter, we focus on the more advanced methods (which subsume ANOVA and MANOVA), and a detailed discussion of ANOVA and MANOVA is presented in Appendix 2. Given their historic importance and hence presence in the health sciences literature, we still provide method labels for them below.

Method Label: Repeated Measures ANOVA

Main Use

 • ANOVA for repeated measures data

Context of use

 • Single-sample or multiple-sample ANOVA

 • Assumes the data to be in the PP (person-period) format

 • Assessing the significance of time effects and treatment effects

Secondary use

 

Pitfalls

Pitfall 3.7.2.1. Repeated measures ANOVA is not a predictive model

Pitfall 3.7.2.2. Repeated measures ANOVA assumes compound symmetry; not appropriate when this assumption is violated

Principle of operation

 • Operates on the same principle as most ANOVA methods

 • See Appendix 2 for detailed models

Theoretical properties and empirical evidence

 • Requires balanced design

 • Assumes the compound symmetry

 • Performs statistical tests of time effect and treatment effects

 • Contrasts can be used to perform specific tests (e.g. difference between two treatment levels)

Best practices

Best practice 3.7.2.1. Also consider the random intercept LMM. The LMM is more flexible and contains the ANOVA specification as a special case

References

 • Hedeker D, Gibbons RD. Longitudinal Data Analysis. Wiley, 2006. Chapter 2

Method label: repeated measures MANOVA

Main Use

 • MANOVA for repeated measures data

Context of use

 • Single-sample or multiple-sample MANOVA

 • Assumes the data to be in the PL (person-level) format

 • Assessing the significance of time effects and treatment effects

Secondary use

 

Pitfalls

Pitfall 3.7.2.3. Repeated measures MANOVA is not a predictive model

Pitfall 3.7.2.4. Repeated measures MANOVA, in its original form, does not allow for missing observations

Principle of operation

 • Operates on the same principle as most ANOVA /MANOVA methods

 • See Appendix 2 for detailed models

Theoretical properties and Empirical evidence

 • Requires balanced design

 • In contrast to ANOVA, it does not make the compound symmetry assumption, but it does not allow missing values

 • Performs statistical tests of time effect and treatment effects

 • Contrasts can be used to perform specific tests (e.g. difference between two treatment levels)

Best practices

Best practice 3.7.2.2. Also consider LMMs

References

 • Hedeker D, Gibbons RD. Longitudinal Data Analysis. Wiley, 2006. Chapter 3

Linear Mixed Effect Models

The key difference between methods developed for longitudinal data and for cross-sectional data lies in their ability to take within-subject correlations into account. Linear Mixed Effect Models (LMM), the subject of the present section, aim to partition the variance-covariance matrix into within-subject and between-subject variances.

If differentiating and estimating within-subject versus between-subject variance is of interest, then Linear Mixed Effect Models should be used.

Model Specification and Principle of Operation. Regular regression models model the outcome as a combination of deterministic “fixed” effects and a random noise

$$ {y}_i={\beta}_0+{X}_i\beta +{\varepsilon}_i, $$

where β0 is an intercept, β is a vector of coefficients for the fixed effects imparted by the covariates Xi and ε is a normally distributed noise term with mean 0 and variance σ2.

Mixed effects regression models, similarly to regular regression models, allow for fixed effects, but they further partition the “noise” into different anticipated random effects. Different types of LMM models differ in the random effects they anticipate, which in turn, confers different structures on the variance-covariance matrix.

Let the subscript i correspond to the subject and j to the (index of) the occasion when the subject was observed. Let Xij denote the covariate vector and yij the response of subject i at occasion j. The time point of this occasion is tij.

Mixed effect models are often expressed in the hierarchical format. The first-level model is on the level of the population

$$ {y}_{ij}={\beta}_{0i}+{X}_{ij}\beta +{t}_{ij}{\beta}_{ti}+{\varepsilon}_{ij} $$

and the second-level (subject-level) models define the models for the (subject-specific) intercept β0i and (time) trend βti for subject i. Mixed effect models are a family of models that chiefly differ in the way β0i and βti are defined.

Assumptions. Different definitions lead to different variance-covariance matrices based on different assumptions, however, all mixed effect models share some common assumptions.

  1.

    As in all linear models, the fixed effects, Xij, are assumed to have a linear (additive and proportional) relationship with yij. This assumption can be relaxed by including a priori known interactions and nonlinearities.

  2.

    Time enters the mixed effect models explicitly (tij). This allows for observation times to vary across subjects. In many models, time has a linear additive effect on the response, however, models with curvilinear relationships will be discussed later.

  3.

    The structure of the variance-covariance matrix is specified through a random intercept and/or trend. This allows for the dimension of the variance-covariance matrix to vary across patients, which in turn, allows for a differing number of observations across subjects. The second and third properties combined make mixed effect models appropriate for the analysis of longitudinal data that is not of repeated measures design (observation times vary) or for repeated measures design with missing observations.

  4.

    Models in this chapter assume an outcome with Gaussian distribution, but mixed effect models have been extended to the exponential family outcomes through a linkage function that linearizes these outcomes. These models, Generalized Mixed Effect Models, are the mixed-effect analogues of GLMs.

In the following sections, we describe specific mixed effect models, their assumptions, relationships between covariates, time and outcome they can represent, and the variance-covariance matrix forms these assumptions yield.

Random Intercept Models

Random intercept models are mixed effect models with a subject-specific random intercept effect but only with a population average trend effect. The second level models are thus

$$ {\beta}_{0i}={\beta}_0+{\upsilon}_i $$
$$ {\beta}_{ti}={\beta}_t $$

The subject-specific intercept β0i is decomposed into a population average effect β0 and a subject-specific random effect υi. The time effect βti is simply the population average trend (slope) βt (without a subject-specific random effect). Thus, the random intercept model decomposes the “noise” into a subject-specific random effect υi and the actual noise at the jth occasion εij.

It is further assumed that

$$ {\upsilon}_i\sim N\left(0,{\sigma}_{\upsilon}^2\right) $$
$$ {\varepsilon}_i\sim N\left(0,{\sigma}_e^2\right) $$

This yields a block-diagonal variance-covariance matrix. Each block corresponds to a subject and is of the form

$$ {\varSigma}_i=\left[\begin{array}{cccc}{\sigma}_v^2+{\sigma}_e^2& {\sigma}_v^2& \cdots & {\sigma}_v^2\\ {}{\sigma}_v^2& {\sigma}_v^2+{\sigma}_e^2& \cdots & {\sigma}_v^2\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{\sigma}_v^2& {\sigma}_v^2& \cdots & {\sigma}_v^2+{\sigma}_e^2\end{array}\right] $$

This form of variance-covariance matrix is referred to as compound symmetry. It assumes that the covariance between observations of the same subject is constant over time. This is often unrealistic: observations closer to each other in time are typically more correlated than observations further away in time.
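
A minimal sketch of a random intercept model in Python's statsmodels is shown below, on synthetic data shaped like the chapter's illustration (five subjects, ten occasions); all names and simulation settings are our own assumptions.

```python
# Sketch: random intercept linear mixed model with statsmodels. Subjects share a
# population intercept and slope but each gets its own random intercept.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_time = 5, 10
subject = np.repeat(np.arange(n_subj), n_time)
time = np.tile(np.arange(n_time), n_subj)
u = rng.normal(0, 1.0, n_subj)                      # subject-specific intercepts v_i
y = 2.0 + 0.5 * time + u[subject] + rng.normal(0, 0.3, n_subj * n_time)
df_long = pd.DataFrame({"y": y, "time": time, "subject": subject})

# groups= defines the clusters; the default random-effects structure is a
# random intercept per group, i.e. the compound symmetry structure above.
ri_fit = smf.mixedlm("y ~ time", df_long, groups=df_long["subject"]).fit()
print(ri_fit.summary())            # fixed effects plus variance components
```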

Random Growth Models

Random growth models, in addition to the subject-specific random intercept, also have a random slope for time. This allows (i) for changes (slopes) to vary across subjects and (ii) for time to enter the variance-covariance matrix. The second-level model is

$$ {\beta}_{0i}={\beta}_0+{\upsilon}_{0i} $$
$$ {\beta}_{ti}={\beta}_t+{\upsilon}_{ti} $$

Similarly to the way the intercept was decomposed into a subject-specific effect υ0i and a population-level effect β0 in the random intercept model, in the random growth model the time effect is also decomposed into a subject-specific effect υti and a population-level time effect βt. It is assumed that

$$ {\upsilon}_{0i}\sim N\left(0,{\sigma}_{\upsilon_0}^2\right),\kern0.28em {\upsilon}_{ti}\sim N\left(0,{\sigma}_{\upsilon_t}^2\right) $$
$$ {\varepsilon}_i\sim N\left(0,{\sigma}_e^2\right) $$

With subjects i and k being independent, the variance-covariance matrix is block-diagonal, with each block representing a patient and taking a form of

$$ {\varSigma}_i={\sigma}_e^2I+{T}_i{\varSigma}_{\upsilon }{T}_i^T $$

where

$$ {T}_i^T=\left[\begin{array}{cccc}1& 1& \cdots & 1\\ {}{t}_1& {t}_2& \cdots & {t}_{n_i}\end{array}\right] $$

and

$$ {\varSigma}_{\upsilon }=\left[\begin{array}{cc}{\sigma}_{\upsilon_0}^2& {\sigma}_{\upsilon_0{\upsilon}_t}\\ {}{\sigma}_{\upsilon_0{\upsilon}_t}& {\sigma}_{\upsilon_t}^2\end{array}\right]. $$

With time entering the covariance matrix, the covariance among the observations of the same patient can change over time.

Polynomial Growth Model

To model non-linear time effects, the level-1 model can be extended with polynomials of time.

Specifically, in vector notation, it becomes

$$ {y}_i={\beta}_{0i}+{X}_i\beta +{T}_i{\upsilon}_i+{\varepsilon}_i. $$

where Ti contains polynomial of ti. To be able to model a quadratic time effect, Ti would be

$$ {T}_i=\left[\begin{array}{ccc}1& {t}_1& {t}_1^2\\ {}1& {t}_2& {t}_2^2\\ {}1& \vdots & \vdots \\ {}1& {t}_{n_i}& {t}_{n_i}^2\end{array}\right]. $$
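Continuing the statsmodels sketch from the random intercept section (and reusing its df_long), the random growth and polynomial growth variants differ only in the random-effects formula; with such a small synthetic data set, convergence warnings for the richer models would not be surprising.

```python
# Sketch: random growth and quadratic growth mixed models via re_formula.
import statsmodels.formula.api as smf

# Random growth: subject-specific intercept and slope for time.
growth_fit = smf.mixedlm("y ~ time", df_long, groups=df_long["subject"],
                         re_formula="~time").fit()

# Quadratic growth: quadratic time in both the fixed and the random parts.
quad_fit = smf.mixedlm("y ~ time + I(time**2)", df_long, groups=df_long["subject"],
                       re_formula="~time + I(time**2)").fit()
print(growth_fit.summary())
```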

Comparison of the Various Model Assumptions

Figure 26 illustrates the difference among the three model types. Four synthetic data sets were generated using four different assumptions. In all four data sets, five subjects were observed at 10 time points. The four data sets are plotted in the four panels. For all four panels, the horizontal axis is the index j of the observations, grouped by subject. Since the key issue in longitudinal data is partitioning the errors (based on these four assumptions), the vertical axis corresponds to the error relative to a population-level model.

Fig. 26
A set of 4 scatterplots of random intercept, growth model, quadratic-time random intercept, and quadratic growth. They represent the error versus the index. The plots in different shades are scattered in clusters in graph A and random linear patterns in B, C, and D.

Comparison of the various model assumptions

The first assumption is the random intercept. This causes errors to cluster by subject. The mean of the errors within each subject is that subject’s random intercept effect υi. No other structure can be observed: the scale of the errors remains the same over time.

The second assumption corresponds to the growth model. In addition to clustering due to the random intercept, the plot also shows that the errors consistently increase over time, at a rate that differs across patients. This growth rate is the random slope υit. Observations of the same subject closer together in time have more similar errors (and thus observations) than observations of the same subject further apart in time. This is a violation of the compound symmetry structure, but the random growth model can handle this situation correctly.

The third assumption is quadratic time, random intercept. The data has both linear and quadratic population-level time effect but only a random intercept. We only removed the linear time effect, thus the errors (residuals) form a per-subject parabola, indicative of a quadratic effect. The parabolas have similar shape across patients (although different parts of the same parabola are visible), which suggests that this is a (quadratic) population-level effect, but the parabolas have different foci along the y axis, suggesting a subject-level random intercept.

Finally, the quadratic growth model has both population-level as well as a subject-level quadratic time effect. The quadratic structure is apparent in the parabolic shapes of the within-subject errors, however, the shape of the parabolas change across the patients, suggesting a subject-level effect. Because of the strong population-level quadratic time-effect, it is difficult to see whether the subject-level time effect is only linear or quadratic. The parabolas are located at different positions along the vertical axis, which indicates a subject-level random intercept.

Generalized Linear Mixed Effect Models (GLMM)

Generalized Linear Mixed Effect Models relate to LMMs the same way as Generalized Linear Models (GLMs) relate to linear regression models. GLMMs allow for a link function to link the expectation of the outcome with the linear predictor. Similarly to GLMs, GLMMs can thus be used to solve regression problems with non-Gaussian dependent variables, such as classification problems (logistic outcome), counting problems (Poisson outcome), etc.

Method label: linear mixed effect models (LMM)

Main Use

 • Regression models for longitudinal data

Context of use

 • Longitudinal data with balanced or unbalanced design

 • Separates subject-level effects from population-level effects

 • Predictive modeling with within-subject predictions

 • Accurate error estimates are required or the interest is the statistical significance of covariates

Secondary use

 • Generalized LMM has been developed for non-Gaussian response variables

Pitfalls

Pitfall 3.7.3.2. GEEs can be computationally more efficient and may produce better predictive models. Use LMM when the goal is to identify subject-level effects

Principle of operation

 • Partitions the error into subject-level and population-level components

 • Random intercept model: Assumes a subject-specific intercept

 • Random growth model: Assumes a subject-specific intercept and time-trend

 • Polynomial growth model: Assumes a subject-specific curvilinear time effect

Theoretical properties and empirical evidence

 • See the text for the detailed assumptions

 • ML estimator. Coefficient estimates are consistent, asymptotically normal

Best practices

Best practice 3.7.3.1. Use LMM when the goal is to identify subject-level effects

Best practice 3.7.3.2. If the main purpose is estimating the effect size of covariates or making predictions for previously unseen subjects, GEE can be more computationally effective

References

 • Hedeker D, Gibbons RD. Longitudinal Data Analysis. Wiley, 2006. Chapter 4

Generalized Estimating Equations

As discussed earlier, the key statistical challenge with longitudinal data is the correlation among the observations of the same subject. This challenge is addressed by assuming a variance-covariance matrix for the error when the regression parameters are estimated. In the previous section (“Linear Mixed Effect Models”), we described a method for constructing such a matrix by separating the error variation into a set of subject-specific and a set of population-level effects. These effects define the form of the variance-covariance matrix. An alternative strategy is to assume a functional form for the variance-covariance matrix. This second strategy is the subject of the current section.

In this approach, the parameters that define the variance-covariance matrix are treated as nuisance parameters and the main interest is in the coefficients of the covariates, including time. The variance-covariance parameters are marginalized (integrated out) and hence these models are referred to as marginal models.

Model Specification. The specification of the generalized estimating equations models proceeds similarly to that of generalized linear models (see section “Foundational Methods”). Given a covariate matrix X, the following components are defined.

  1. Linear predictor: ηij = Xijβ;

  2. Link function that connects the expectation of the outcome, μij = E(Yij), to the linear predictor: g(μij) = ηij;

  3. A variance function relating the mean of the outcome to its variance: Var(Yij) = ϕV(μij);

  4. A working variance-covariance matrix parameterized by a: R(a).

The first three components are shared with the generalized linear models; GEEs add the fourth component.
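As an illustrative sketch (with assumed variable names and simulated data), these components map directly onto the GEE interface of statsmodels; the exchangeable working correlation used here is one of the structures described next.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated long-format data with assumed columns "y", "time", "subject".
rng = np.random.default_rng(0)
n_subj, n_obs = 30, 8
subject = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs), n_subj)
y = 1.0 + 0.5 * time + rng.normal(0, 1.0, n_subj)[subject] + rng.normal(0, 0.5, n_subj * n_obs)
df = pd.DataFrame({"y": y, "time": time, "subject": subject})

# Component 1 -> the model formula (linear predictor)
# Components 2-3 -> the `family` argument (link and variance functions)
# Component 4 -> the `cov_struct` argument (working variance-covariance matrix)
model = smf.gee("y ~ time", groups="subject", data=df,
                family=sm.families.Gaussian(),            # identity link, constant variance
                cov_struct=sm.cov_struct.Exchangeable())   # constant within-subject correlation
print(model.fit().summary())   # robust ("sandwich") standard errors by default
```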

Several variance-covariance matrix forms are implemented by statistical software packages and the most common matrices are described below.

  1. Identity: R(a) = I. This assumes that the observations of a subject are independent of each other, reducing the GEE to a regular GLM.

  2. Exchangeable: the off-diagonal elements of R(a) equal ρ. Observations of the same subject have a constant covariance ρ, which does not depend on time. This form is the same as the compound symmetry structure in random intercept models.

  3. Autoregressive: the (j, j′) element of R(a) is \( {\rho}^{\left|j-{j}^{\prime}\right|} \), with j and j′ denoting two time steps, so the covariance among observations of the same subject depends on time. If ρ < 1, then the further apart two observations are in time, the smaller their covariance.

  4. Unstructured: each element of the matrix is estimated from the data.

Among the four structures, we have already encountered the identity and the exchangeable forms, and the unstructured matrix is straightforward to imagine. The autoregressive matrix takes the following form:

$$ R\left(\rho \right)=\left[\begin{array}{ccccc} 1 & \rho & {\rho}^2 & {\rho}^3 & \cdots \\ \rho & 1 & \rho & {\rho}^2 & \cdots \\ {\rho}^2 & \rho & 1 & \rho & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{array}\right]. $$

When |ρ| < 1, increasing powers of ρ become smaller, thus the more distant two observations are in time, the smaller their covariance.
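For concreteness, here is a minimal sketch that constructs this AR(1) working correlation matrix directly from the definition, the (j, j′) entry being ρ raised to the power |j − j′|:

```python
import numpy as np

def ar1_working_correlation(n_obs: int, rho: float) -> np.ndarray:
    """Entry (j, j') of the AR(1) working correlation matrix equals rho**|j - j'|."""
    idx = np.arange(n_obs)
    return rho ** np.abs(idx[:, None] - idx[None, :])

print(ar1_working_correlation(4, 0.5))
# [[1.    0.5   0.25  0.125]
#  [0.5   1.    0.5   0.25 ]
#  [0.25  0.5   1.    0.5  ]
#  [0.125 0.25  0.5   1.   ]]
```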

Figure 27 shows three types of error distributions. For 5 subjects, 10 observations each were generated using independent errors (left panel), exchangeable errors (middle panel), and autocorrelated errors (right panel). The 5 subjects are shown in different colors and their 10 observations are ordered by time along the horizontal axis. The noise is normally distributed with σ = 0.1 in all three cases. The error in the left panel is pure noise and all errors, regardless of which subject they came from, are independent: knowing the error of one observation for a patient provides no information about the error of another observation of the same patient or about any observation of any other subject. In the middle panel, the error has a noise component and a random intercept component: errors are correlated within each subject and subjects are independent of each other. We have seen this correlation structure earlier. Finally, in the right panel, we have autocorrelated errors: two errors of the same subject are more similar to each other the closer they are in time.
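One way to generate errors resembling the three panels of Fig. 27 is sketched below; the autocorrelation parameter ρ = 0.8 is an assumption chosen for illustration rather than the value used in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_obs, sigma = 5, 10, 0.1

# Independent errors: pure noise, no within-subject structure.
independent = rng.normal(0, sigma, size=(n_subjects, n_obs))

# Exchangeable errors: noise plus a subject-specific random intercept.
intercepts = rng.normal(0, sigma, size=(n_subjects, 1))
exchangeable = intercepts + rng.normal(0, sigma, size=(n_subjects, n_obs))

# Autocorrelated (AR(1)) errors: each error carries over a fraction of the previous
# one, scaled so the marginal standard deviation stays at sigma.
rho = 0.8
autocorrelated = np.zeros((n_subjects, n_obs))
autocorrelated[:, 0] = rng.normal(0, sigma, n_subjects)
for t in range(1, n_obs):
    autocorrelated[:, t] = (rho * autocorrelated[:, t - 1]
                            + rng.normal(0, sigma * np.sqrt(1 - rho**2), n_subjects))
```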

Method label: generalized estimating equations (GEE)

Main Use

 • Regression models for longitudinal data

 • A link function can be specified

Context of use

 • Longitudinal data with unbalanced design

 • Most useful when the focus is on coefficient estimates or on making predictions for previously unseen patients

Secondary use

 

Pitfalls

Pitfall 3.7.4.1. No subject-level effects are estimated. Consider LMM if separating the subject-level effects from the population-level effect is desired

Principle of operation

 • Uses estimating equations

 • It is a marginal model. Assumes a parametric form for the working variance/covariance matrix and marginalizes it out

Theoretical properties and empirical evidence

 • Uses M estimation. Specification of the likelihood is not required

 • Solving estimating equations is very computationally efficient

 • Coefficient estimates remain consistent even if the structure of the working variance/covariance matrix is misspecified

Best practices

Best practice 3.7.4.1. Use GEE when predictions for previously unseen subjects are needed

Best practice 3.7.4.2. Use LMM when subject-specific effects are of interest

Best practice 3.7.4.3. GEE can be more computationally efficient than LMM

References

 • Hardin JW, Hilbe JM. Generalized Estimating Equations. Chapman and Hall/CRC, 2002

 • Hedeker D, Gibbons RD. Longitudinal Data Analysis. Wiley, 2006. Chapter 3

Fig. 27
A set of 3 scatterplots of the error versus the observation. The plots in different shades are scattered horizontally in A, clustered in B, and follow a zig-zag pattern in C.

Illustration of the error distributions corresponding to the independent, exchangeable, and autocorrelated variance/covariance structures

Brief Summary of Other Techniques of Interest

Network science. The field of network science [54] offers a completely different approach from conventional predictive modeling and causal discovery methods. Network science leverages the remarkable consistency in the properties of a broad array of systems that are adaptive and robust. Systems that exhibit these measurable properties are called Complex Adaptive Systems (CAS) [55]. The application of network science to problems of health and disease is called Network Medicine [56], and its main idea is as follows: a disease represents a pathologic biological process that emerges, and is sustained over time, because it is embedded in a transformed biologic system that has acquired adaptive properties. Accordingly, if such an adaptive system related to a given disease is identified, the capacity to determine its areas of vulnerability may reveal promising targets or new approaches for treatment. A typical network science analysis proceeds by building network representations of complex systems and then calculating a number of metrics on the network model. Such metrics include: network diameter, characteristic path length, shortest path distribution, degree distribution, and clustering coefficient. The specific structure and properties of the network model help the analyst identify drug or other intervention targets and other important system properties.
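As a minimal illustration of this workflow, the sketch below computes several of these metrics on a synthetic scale-free graph with networkx; the graph is a toy stand-in and not a real disease or interaction network.

```python
import networkx as nx

# A synthetic scale-free network standing in for, e.g., a molecular interaction module.
G = nx.barabasi_albert_graph(n=200, m=2, seed=0)

print("Network diameter:          ", nx.diameter(G))
print("Characteristic path length:", nx.average_shortest_path_length(G))
print("Clustering coefficient:    ", nx.average_clustering(G))
degree_sequence = sorted((d for _, d in G.degree()), reverse=True)
print("Highest-degree hubs:       ", degree_sequence[:5])
```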

Active learning. The field of Active Learning studies methods for the iterative collection of data and corresponding refinement of models until an accurate enough model is built or other termination criteria are met. Active Learning methods address both predictive modeling and causal discovery tasks [57,58,59,60,61].

Outlier detection. Outlier (or novelty) detection methods seek to find extreme or otherwise atypical observations in data. “Super-utilizer” patients are a prototypical example of outliers of great importance for healthcare. Numerous methods for outlier detection have been developed over the years in many fields, including statistics, engineering, computer science, and applied mathematics; they are based on multivariate statistics, density estimation, “1-class” SVMs, clustering, and other approaches [15, 62].
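As an example of one such approach, the sketch below flags atypical rows with a one-class SVM from scikit-learn on simulated utilization features; the feature set, contamination level, and tuning parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Simulated utilization features (e.g., visits, cost, admissions) for 500 patients,
# with a handful of extreme "super-utilizer"-like rows mixed in.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:5] += 6.0

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
labels = detector.predict(X)                 # +1 = typical, -1 = flagged as outlier
print("Flagged rows:", np.flatnonzero(labels == -1)[:10])
```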

Genetic Algorithms (GAs). Genetic Algorithms are heuristic search procedures over the space of models that the analyst wishes to consider. For example, the analyst may use GAs to find a good linear regression, a good SVM, a good Decision Tree, or another model of choice. The search resembles the process of genetic evolution and can be shown to advance rapidly to better models [21]. On the other hand, GAs are computationally very expensive and prone to becoming trapped in local optima (i.e., solutions that cannot be improved in the next reachable steps in any direction in the model search space, although a better solution does exist several steps away). GAs are also used when the analyst does not have good insight into the process that generates the data, or into which method may perform well for the task at hand. When such insight exists, it is typically better to use methods with known properties that guarantee high performance for the desired analysis [63, 64].

Visualization. Visualization methods rely on the capability of the human visual apparatus to decode complex patterns when these patterns are represented in convenient visual forms. Another use of visualization serves explanatory purposes; that is, presenting and explaining results that were obtained via computational means. Interactive data visualizations, where users are allowed to manipulate their views of the data to obtain more information, have been found to be rapid and efficacious in identifying early infection and rejection events in lung transplant patients [65]. Data visualization can also be useful in displaying health care data, such as data coded with the Omaha System, and intraoperative anesthesia data, such as maintenance of blood pressure [66]. Evaluation techniques have been developed to gauge visualization effectiveness in clinical care [67]. Significant challenges exist, however, in implementing visualization more widely in electronic health records, many of them resulting from the highly multivariable nature of case-oriented medical data, which can lead to misleading results. In biological research, heatmaps, clustering, PCA-based visualization, and lately t-SNE are very widely used [68].
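To illustrate the last point, the sketch below produces side-by-side PCA and t-SNE embeddings with scikit-learn and matplotlib; the iris data set is used only as a stand-in for a high-dimensional biomedical matrix, and the t-SNE settings are illustrative defaults.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)   # stand-in for an omics or clinical feature matrix

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=y)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=y)
axes[1].set_title("t-SNE")
plt.tight_layout()
plt.show()
```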

Recommended Level of Understanding for Professional Data Scientists

The information provided in chapters “Foundations and Properties of AI/ML Systems,” “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science,” and “Foundations of Causal ML,” describing fundamental techniques and their properties, aims to provide both a big-picture description of methods and a concise summary of their relative and absolute strengths and weaknesses and the types of outputs (e.g., models) produced by each method.

We recommend that the reader commit to memory the methods information in the above chapters to the extent feasible, especially for the application domain(s) of interest to them. This will help them evaluate, choose, and appropriately apply the right methods, a skill set that, with time and practice, will eventually become second nature.

The professional data scientist, however, should have a much deeper level of understanding that, in addition to the information here, includes knowing the key algorithms of each method family and possessing the ability, for each algorithm, to:

  (a) describe it in pseudocode from memory;

  (b) code it in a programming language of choice;

  (c) trace the algorithm on paper for small but representative example problems;

  (d) describe the algorithm’s function to an expert, a novice, or a lay person at the appropriate level of nuance/simplification;

  (e) recite its key theoretical properties;

  (f) prove the properties or at least outline the essence of the proofs; and

  (g) interpret the algorithm’s output.

These skills are typically developed with a combination of formal training and hands-on experience. The many technical references cited throughout this volume provide a core knowledge base for the technically oriented reader.

Classroom Assignments and Discussion Topics for Chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”

  1. What kind of ML tasks are implied by the following questions? (There could be more than one correct answer.)

     (a) What is a particular patient’s risk of type-2 diabetes mellitus (T2DM) in 7 years?

     (b) How many years will it take for a particular patient to develop T2DM?

     (c) What is the likely next disease a particular patient with T2DM will develop?

     (d) What diseases do patients with T2DM typically develop?

     (e) In patients with T2DM, what other diseases are commonly present?

     (f) What is the average age at which patients develop T2DM?

     (g) At what age is a particular patient going to develop T2DM?

     (h) What is the expected cost of medications (per annum) for an average T2DM patient?

     (i) What is the expected cost of medications (per annum) for a particular patient (given other diseases the patient may suffer from)?

  2. What kind of modeling tasks are described by the following questions? Also, name the outcome type. There can be more than one solution; give as many answers as you can.

     (a) Predicting the length-of-stay for hospitalized patients.

     (b) Predicting whether the length-of-stay of hospitalized patients will exceed 9 days.

     (c) Predicting the risk of developing diabetes (in patients who are not known to have diabetes currently).

       • We wish to know the probability that the patient develops diabetes within 7 years from now.

       • We wish to know the probability that the patient develops diabetes on any day between now and 7 years from now.

       • We wish to know how many days (from now) it will take for a patient to develop diabetes.

     (d) Predicting the type of cancer (e.g., small-cell, non-small-cell, large-cell, squamous cell).

     (e) Given a sequence of diseases a patient has already developed, predict the most likely next disease.

     (f) What kind of diseases do hospitalized patients with a length-of-stay in excess of 4 days suffer from?

     (g) What are the most common reasons (e.g., admitting diagnoses) for receiving opioids?

     (h) Identify distinct patient groups, based on their lab results, in a cohort of pre-operative patients.

  3. What are the most appropriate modeling methods in the following scenarios? Assume you are tasked with building a diabetes risk prediction model that estimates the probability that a patient develops diabetes in 7 years given the patient’s health records.

     (a) Suppose you have 20 different predictor variables, which are reasonably informative and uncorrelated, have 100 patients in your training sample, and each patient has one observation vector (for all 20 predictor variables).

     (b) Suppose you have the same 20 predictor variables as in (a), but now you have 10,000 patients, each contributing one observation vector.

     (c) Suppose you have 200 predictor variables that form highly correlated blocks. Each variable is important and has its own unique effect despite the high correlation. Further, you only have 200 patients, one observation vector per patient.

     (d) Suppose you have 2000 predictor variables, most of which are irrelevant to the task at hand. Unfortunately, you do not know a priori which variables are irrelevant. You have 200 patients, one observation vector per patient.

     (e) Suppose you have 20 different predictor variables, which are reasonably informative and uncorrelated, 1000 patients, and you have 10 observation vectors per patient. These 10 observations were collected at equally spaced time intervals for all patients.

     (f) Suppose you have 20 different predictor variables, which are reasonably informative and uncorrelated, 1000 patients, and you have 10 observation vectors per patient. These 10 observations were collected at exactly the same time for all patients.

     (g) Suppose you have 20 different predictor variables, which are reasonably informative and uncorrelated, 1000 patients, and you have 10 observation vectors per patient. These 10 observations were collected at different times for each patient and the collection time is known.

     (h) Suppose you have 20 different predictor variables, which are reasonably informative and uncorrelated, 1000 patients, and a varying number of observations per patient.

  4. You are tasked with building a survival (time-to-event) model that estimates patients’ risk of developing diabetes at any time within the next 7 years. What kind of model would you use in the same scenarios as in question 3?

  5. You have decided to build classifiers for a classification task. You are given the predictor variables, the outcome, and a training data set. What models would be appropriate under the following scenarios?

     (a) The predictor variables are not highly correlated, all have approximately linear effects, and your training data set contains 10 observations per predictor variable.

     (b) The predictor variables are not highly correlated, all have approximately linear effects, and you have several million observations per predictor variable. Which algorithms are most likely to run into computational problems?

     (c) The predictor variables are not highly correlated, but they may have unknown non-linear effects. You have a sufficient amount of data, but not to the extent where you would expect computational issues.

     (d) The predictor variables are not highly correlated, but they may have unknown non-linear effects. You have only 1 observation per predictor variable.

     (e) The predictor variables are correlated and may have unknown non-linear effects. You have a sufficient amount of data for any algorithm.

     (f) The predictor variables are not highly correlated, all have approximately linear effects, and the clinical experts are asking for an “interpretable model”. Select a model type and explain why (or how) it is “interpretable”.

     (g) The predictor variables are correlated and may have unknown non-linear effects. You need to build an “interpretable” model. The predictive performance “is not the primary concern”.

     (h) The clinical experts are asking for a model that is highly interpretable and has the best possible predictive performance. What do you tell the experts?

  6. A data representation is a collection of features (variables) obtained by transforming the original variables. For example, a variable set obtained through dimensionality reduction is a (lower-dimensional) data representation. In this question, your goal is to create a data representation appropriate under the following conditions.

     (a) Predictor variables are reasonably linear, have no interactions, and sufficient observations exist. We want the resulting variables to be orthogonal to each other.

     (b) Predictor variables are reasonably linear, have no interactions, and sufficient observations exist. We want the resulting variables to be “independent” of each other in some sense. Select a method and explain what “independent” means for that method.

     (c) Predictor variables are reasonably linear, have no interactions, and we wish to have a low-dimensional representation to visualize the data.

     (d) Predictor variables may have nonlinear effects and interactions, and we wish to build a survival model or a classifier (we do not know yet) using the new representation.