
1 Introduction

This chapter presents the main classic machine learning (ML) methods. There is a focus on supervised learning methods for classification and regression, but we also describe some unsupervised approaches. The chapter is meant to be readable by someone with no background in machine learning. It is nevertheless necessary to have some basic notions of linear algebra, probabilities, and statistics. If this is not the case, we refer the reader to Chapters 2 and 3 of [1].

The rest of this chapter is organized as follows. Rather than grouping methods by categories (for instance, classification or regression methods), we chose to present methods by increasing order of complexity. We first provide the notations in Subheading 2. We then describe a very intuitive family of methods, that of nearest neighbors (Subheading 3). We continue with linear regression (Subheading 4) and logistic regression (Subheading 5), the latter being a classification technique. We subsequently introduce the problem of overfitting (Subheading 6) as well as strategies to mitigate it (Subheading 7). Subheading 8 describes support vector machines (SVM). Subheading 9 explains how binary classification methods can be extended to a multi-class setting. We then describe methods which are specifically adapted to the case of normal distributions (Subheading 10). Decision trees and random forests are described in Subheading 11. We then briefly describe some unsupervised learning techniques, namely, for clustering (Subheading 12) and dimensionality reduction (Subheading 13). The chapter ends with a description of kernel methods which can be used to extend linear techniques to non-linear cases (Subheading 14). Box 1 summarizes the methods presented in this chapter, grouped by categories and then sorted in order of appearance.

Box 1: Main Classic ML Methods

  • Supervised learning

    • Classification: nearest neighbors, logistic regression, support vector machine (SVM), naive Bayes, linear discriminant analysis (LDA), quadratic discriminant analysis, tree-based models (decision tree, random forest, extremely randomized trees)

    • Regression: nearest neighbors, linear regression, support vector machine regression, tree-based models (decision tree, random forest, extremely randomized trees), kernel ridge regression

  • Unsupervised learning

    • Clustering: k-means, Gaussian mixture model

    • Dimensionality reduction: principal component analysis (PCA), linear discriminant analysis (LDA), kernel principal component analysis

2 Notations

Let n be the number of samples and p be the number of features. An input sample is thus a p-dimensional vector:

$$ \boldsymbol{x}=\left[\begin{array}{l}\hfill {x}_1\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {x}_p\hfill \end{array}\right] $$

An output sample is denoted by y. Thus, a sample is (x, y). The dataset of n samples can then be summarized as an n × p matrix X representing the input data and an n-dimensional vector y representing the target data:

$$ \boldsymbol{X}=\left[\begin{array}{l}\hfill {\boldsymbol{x}}^{(1)}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\boldsymbol{x}}^{(n)}\hfill \end{array}\right]=\left[\begin{array}{lll}\hfill {x}_1^{(1)}\hfill & \hfill \dots \hfill & \hfill {x}_p^{(1)}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {x}_1^{(n)}\hfill & \hfill \dots \hfill & \hfill {x}_p^{(n)}\hfill \end{array}\right]\kern2em ,\kern2em \boldsymbol{y}=\left[\begin{array}{l}\hfill {y}_1\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {y}_n\hfill \end{array}\right] $$

The input space is denoted by \( I \), and the set of training samples is denoted by \( \mathcal{X} \).

In the case of regression, y is a real number. In the case of classification, y is a single label. More precisely, y can only take one of a finite set of values called labels. The set of possible classes (i.e., labels) is denoted by \( \mathcal{C}=\left\{{\mathcal{C}}_1,\dots, {\mathcal{C}}_q\right\} \), with q being the number of classes. As the values of the classes are not meaningful, when there are only two classes, the classes are often called the positive and negative classes. In this case and also for mathematical reasons, without loss of generality, we assume the values of the classes to be + 1 and − 1.

3 Nearest Neighbor Methods

One of the most intuitive approaches to machine learning is nearest neighbors. It is based on the following intuition: for a given input, its corresponding output is likely to be similar to the outputs of similar inputs. A real-life metaphor would be that if a subject has characteristics similar to those of other subjects who were diagnosed with a given disease, then this subject is likely to also be suffering from this disease.

More formally, nearest neighbor methods use the training samples from the neighborhood of a given point x, denoted by N(x), to perform prediction [2].

For regression tasks, the prediction is computed as a weighted mean of the target values in N(x):

$$ \hat{y}=\sum \limits_{{\boldsymbol{x}}^{(i)}\in N\left(\boldsymbol{x}\right)}{w}_i^{\left(\boldsymbol{x}\right)}{y}^{(i)} $$

where \( {w}_i^{\left(\boldsymbol{x}\right)} \) is the weight associated with x(i) to predict the output of x, with \( {w}_i^{\left(\boldsymbol{x}\right)}\ge 0\forall i \) and \( {\sum}_i{w}_i^{\left(\boldsymbol{x}\right)}=1 \).

For classification tasks, the predicted label is the one with the largest weighted sum of occurrences among the neighbors:

$$ \hat{y}=\underset{{\mathcal{C}}_k\in \mathcal{C}}{\arg \kern0.2em \max}\sum \limits_{{\boldsymbol{x}}^{(i)}\in N\left(\boldsymbol{x}\right)}{w}_i^{\left(\boldsymbol{x}\right)}{\mathbf{1}}_{y^{(i)}={\mathcal{C}}_k} $$

A key parameter of nearest neighbor methods is the metric, denoted by d, that is, a mathematical function that defines dissimilarity. The metric is used to define the neighborhood of any point and can also be used to compute the weights.

3.1 Metrics

Many metrics have been defined for various types of input data such as vectors of real numbers, integers, or booleans. Vectors of real numbers are the most common type of input data, and the most commonly used metric for them is the Euclidean distance, defined as:

$$ \forall \boldsymbol{x},{\boldsymbol{x}}^{\prime}\in \kern.3em I,\kern0.3em \parallel \boldsymbol{x}-{\boldsymbol{x}}^{\prime }{\parallel}_2=\sqrt{\sum \limits_{j=1}^p{\left({x}_j-{x}_j^{\prime}\right)}^2} $$

The Euclidean distance is sometimes referred to as the “ordinary” distance since it is the one based on the Pythagorean theorem and the one that everyone uses in everyday life.

3.2 Neighborhood

The two most common definitions of the neighborhood rely on either the number of neighbors or the radius around the given point. Figure 1 illustrates the differences between both definitions.

Fig. 1

Different definitions of the neighborhood. On the left, the neighborhood of a given point is the set of its five nearest neighbors. On the right, the neighborhood of a given point is the set of points whose dissimilarity is lower than the radius. For a given input, its neighborhood may be different depending on the definition used. The Euclidean distance is used as the metric in both examples

The k-nearest neighbor method defines the neighborhood of a given point x as the set of the k training points closest to x (assuming the training samples are indexed by increasing dissimilarity to x):

$$ N\left(\boldsymbol{x}\right)={\left\{{\boldsymbol{x}}^{(i)}\right\}}_{i=1}^k\kern1em \mathrm{with}\kern1em d\left(\boldsymbol{x},{\boldsymbol{x}}^{(1)}\right)\le \dots \le d\left(\boldsymbol{x},{\boldsymbol{x}}^{(n)}\right) $$

The radius neighbor method defines the neighborhood of a given point x as the set of points whose dissimilarity to x is smaller than the given radius, denoted by r:

$$ N\left(\boldsymbol{x}\right)=\left\{{\boldsymbol{x}}^{(i)}\in \mathcal{X}\kern0.3em |\kern0.3em d\left(\boldsymbol{x},{\boldsymbol{x}}^{(i)}\right)<r\right\} $$

3.3 Weights

The two most common approaches to compute the weights are to use:

  • Uniform weights (all the weights are equal):

    $$ \forall i,\kern0.3em {w}_i^{\left(\boldsymbol{x}\right)}=\frac{1}{\mid N\left(\boldsymbol{x}\right)\mid } $$
  • Weights inversely proportional to the dissimilarity:

    $$ \forall i,\kern0.3em {w}_i^{\left(\boldsymbol{x}\right)}=\frac{\frac{1}{d\left({\boldsymbol{x}}^{(i)},\boldsymbol{x}\right)}}{\sum_j\frac{1}{d\left({\boldsymbol{x}}^{(j)},\boldsymbol{x}\right)}}=\frac{1}{d\left({\boldsymbol{x}}^{(i)},\boldsymbol{x}\right){\sum}_j\frac{1}{d\left({\boldsymbol{x}}^{(j)},\boldsymbol{x}\right)}} $$

With uniform weights, every point in the neighborhood equally contributes to the prediction. With weights inversely proportional to the dissimilarity, closer points contribute more to the prediction than further points. Figure 2 illustrates the different decision functions obtained with uniform weights and weights inversely proportional to the dissimilarity for a 3-nearest neighbor classification model.

Fig. 2

Impact of the definition of the weights on the prediction function of a 3-nearest neighbor classification model. When the weights are inversely proportional to the dissimilarity, the classifier is more sensitive to outliers since the predictions in the close neighborhood of any input are mostly dictated by the label of this input, independently of the number of neighbors used. With uniform weights, the prediction function tends to be smoother
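
For illustration purposes, here is a minimal NumPy sketch of such a classifier, directly mirroring the formulas above (the function name, the toy data, and the choice of k are ours and purely illustrative; in practice, one would typically rely on an existing implementation such as the one in scikit-learn):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weights="uniform"):
    """Predict the label of x with a k-nearest neighbor classifier."""
    # Euclidean distances between x and every training sample
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    neighbors = np.argsort(dists)[:k]
    if weights == "uniform":
        w = np.ones(k) / k
    else:  # weights inversely proportional to the dissimilarity
        inv = 1.0 / (dists[neighbors] + 1e-12)  # avoid division by zero
        w = inv / inv.sum()
    # Weighted sum of occurrences of each label (argmax over the classes)
    labels = np.unique(y_train)
    scores = [(w * (y_train[neighbors] == c)).sum() for c in labels]
    return labels[int(np.argmax(scores))]

# Toy example (made-up data)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3, weights="distance"))
```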

3.4 Neighbor Search

The brute-force method to compute the neighborhood for n points with p features is to compute the metric for each pair of inputs, which has a \( \mathcal{O}\left({n}^2p\right) \) algorithmic complexity (assuming that evaluating the metric for a pair of inputs has a complexity of \( \mathcal{O}(p) \), which is the case for most metrics). However, it is possible to decrease this algorithmic complexity if the metric is a distance, that is, if the metric d satisfies the following properties:

  1. Non-negativity: ∀a, b, d(a, b) ≥ 0

  2. Identity: ∀a, b, d(a, b) = 0 if and only if a = b

  3. Symmetry: ∀a, b, d(a, b) = d(b, a)

  4. Triangle inequality: ∀a, b, c, d(a, b) + d(b, c) ≥ d(a, c)

The key property is the triangle inequality, which has a simple interpretation: the shortest path between two points is a straight line. Mathematically, if a is far from c and c is close to b (i.e., d(a, c) is large and d(b, c) is small), then a is far from b (i.e., d(a, b) is large). This is obtained by rewriting the triangle inequality as follows:

$$ \forall \boldsymbol{a},\boldsymbol{b},\boldsymbol{c},\kern0.3em d\left(\boldsymbol{a},\boldsymbol{b}\right)\ge d\left(\boldsymbol{a},\boldsymbol{c}\right)-d\left(\boldsymbol{b},\boldsymbol{c}\right) $$

This means that a lower bound on d(a, b) can be obtained without explicitly computing d(a, b), which allows many candidate neighbors to be discarded during the search. Therefore, the computational cost of a nearest neighbor search can be reduced to \( \mathcal{O}\left(n\log (n)p\right) \) or better, which is a substantial improvement over the brute-force method for large n. Two popular methods that take advantage of this property are the k-dimensional tree (k-d tree) structure [3] and the ball tree structure [4].
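
For illustration, scikit-learn provides implementations of both structures; the short sketch below (with simulated data) shows how a ball tree can be built once and then queried for the k nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import BallTree  # a KDTree class is also available

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # 1000 samples, 3 features (toy data)

tree = BallTree(X, metric="euclidean")  # build the tree once
dist, ind = tree.query(X[:5], k=4)      # 4 nearest neighbors of the first 5 points
print(ind)   # indices of the neighbors (each point is its own nearest neighbor)
print(dist)  # corresponding distances, sorted in increasing order
```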

4 Linear Regression

Linear regression is a regression model that linearly combines the features. Each feature is associated with a coefficient that represents the relative weight of this feature compared to the other features. A real-life metaphor would be to see the coefficients as the ingredients of a recipe: the key is to find the best balance (i.e., proportions) between all the ingredients in order to make the best cake.

Mathematically, a linear model is a model that linearly combines the features [5]:

$$ f\left(\boldsymbol{x}\right)={w}_0+\sum \limits_{j=1}^p{w}_j{x}_j $$

A common convention consists in appending a constant 1 to x so that f(x) can be written as the dot product between the vector x and the vector w:

$$ f\left(\boldsymbol{x}\right)={w}_0\times 1+\sum \limits_{j=1}^p{w}_j{x}_j={\boldsymbol{x}}^{\top}\boldsymbol{w} $$

where the vector w consists of:

  • The intercept (also known as bias) w0

  • The coefficients (w1, …, wp), where each coefficient wj is associated with the corresponding feature xj

In the case of linear regression, f(x) is the predicted output:

$$ \hat{y}=f\left(\boldsymbol{x}\right)={\boldsymbol{x}}^{\top}\boldsymbol{w} $$

There are several methods to estimate the w coefficients. In this section, we present the oldest one which is known as ordinary least squares regression.

In the case of ordinary least squares regression, the cost function J is the sum of the squared errors on the training data (see Fig. 3):

$$ J\left(\boldsymbol{w}\right)=\sum \limits_{i=1}^n{\left({y}^{(i)}-{\hat{y}}^{(i)}\right)}^2=\sum \limits_{i=1}^n{\left({y}^{(i)}-{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)}^2=\parallel \boldsymbol{y}-\boldsymbol{Xw}{\parallel}_2^2 $$

One wants to find the optimal parameters w that minimize the cost function:

$$ {\boldsymbol{w}}^{\star }=\underset{\boldsymbol{w}}{\arg \kern0.2em \min }J\left(\boldsymbol{w}\right) $$

This optimization problem is convex, implying that any local minimum is a global minimum, and differentiable, implying that every local minimum has a null gradient. One therefore aims to find null gradients of the cost function:

$$ {\displaystyle \begin{array}{rlll}{\nabla}_{{\boldsymbol{w}}^{\star }}J& =0& & \\ {}\Rightarrow 2{\boldsymbol{X}}^{\top}\boldsymbol{X}{\boldsymbol{w}}^{\star }-2{\boldsymbol{X}}^{\top}\boldsymbol{y}& =0& & \\ {}\Rightarrow {\boldsymbol{X}}^{\top}\boldsymbol{X}{\boldsymbol{w}}^{\star }& ={\boldsymbol{X}}^{\top}\boldsymbol{y}& & \\ {}\Rightarrow {\boldsymbol{w}}^{\star }& ={\left({\boldsymbol{X}}^{\top}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\top}\boldsymbol{y}& & \end{array}} $$
Fig. 3

Ordinary least squares regression. The coefficients (i.e., the intercept and the slope with a single predictor) are estimated by minimizing the sum of the squared errors

Ordinary least squares regression is one of the few machine learning optimization problems for which there exists a closed formula, i.e., the optimal solution can be computed using a finite number of standard operations such as addition, multiplication, and evaluations of well-known functions (provided that \( {\boldsymbol{X}}^{\top}\boldsymbol{X} \) is invertible). A summary of linear regression can be found in Box 2.
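
For illustration, the closed-form solution can be implemented in a few lines of NumPy on simulated data (this is only a didactic sketch; numerically, one would rather use a least squares solver or an existing implementation such as scikit-learn's LinearRegression):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_w = np.array([2.0, 1.0, -1.0, 0.5])             # intercept + 3 coefficients
y = true_w[0] + X @ true_w[1:] + 0.1 * rng.normal(size=n)

X1 = np.hstack([np.ones((n, 1)), X])                 # prepend a column of ones (intercept)
# w* = (X^T X)^{-1} X^T y, solved as a linear system for numerical stability
w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(w_star)                                        # close to [2.0, 1.0, -1.0, 0.5]
y_pred = X1 @ w_star
```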

Box 2: Linear Regression

  • Main idea: best hyperplane (i.e., line when p = 1, plane when p = 2) mapping the inputs to the outputs.

  • Mathematical formulation: linear relationship between the predicted output \( \hat{y} \) and the input x that minimizes the sum of squared errors:

    $$ \hat{y}={w}_0^{\star }+\sum \limits_{j=1}^p{w}_j^{\star }{x}_j\kern1em \mathrm{with}\kern1em {\boldsymbol{w}}^{\star }=\underset{\boldsymbol{w}}{\arg \kern0.2em \min}\sum \limits_{i=1}^n{\left({y}^{(i)}-{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)}^2 $$
  • Regularization: can be penalized to avoid overfitting (ridge), to perform feature selection (lasso), or both (elastic-net). See Subheading 7.

5 Logistic Regression

Intuitively, linear regression consists in finding the line that best fits the data: the true output should be as close to the line as possible. For binary classification, one wants the line to separate both classes as well as possible: the samples from one class should all be in one subspace, and the samples from the other class should all be in the other subspace, with the inputs being as far as possible from the line.

Mathematically, for binary classification tasks, a linear model is defined by a hyperplane splitting the input space into two subspaces such that each subspace is characteristic of one class. For instance, a line splits a plane into two subspaces in the two-dimensional case, while a plane splits a three-dimensional space into two subspaces. A hyperplane is defined by a vector w = (w0, w1, …, wp), and \( f\left(\boldsymbol{x}\right)={\boldsymbol{x}}^{\top}\boldsymbol{w} \) corresponds to the signed distance between the input x and the hyperplane w: in one subspace, the distance with any input is always positive, whereas in the other subspace, the distance with any input is always negative. Figure 4 illustrates the decision function in the two-dimensional case where both classes are linearly separable.

Fig. 4

Decision function of a logistic regression model. A logistic regression is a linear model, that is, its decision function is linear. In the two-dimensional case, it separates a plane with a line

The sign of the signed distance corresponds to the decision function of a linear binary classification model:

$$ \hat{y}=\operatorname{sign}\left(f\left(\boldsymbol{x}\right)\right)=\left\{\begin{array}{ll}+1\kern1em & \mathrm{if}\kern0.3em f\left(\boldsymbol{x}\right)>0\\ {}-1\kern1em & \mathrm{if}\kern0.3em f\left(\boldsymbol{x}\right)<0\end{array}\right. $$

The logistic regression model is a probabilistic linear model that transforms the signed distance to the hyperplane into a probability using the sigmoid function [6], denoted by \( \sigma (u)=\frac{1}{1+\exp \left(-u\right)} \).

Consider the linear model:

$$ f\left(\boldsymbol{x}\right)={\boldsymbol{x}}^{\top}\boldsymbol{w}={w}_0+\sum \limits_{j=1}^p{w}_j{x}_j $$

Then the probability of belonging to the positive class is:

$$ P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)=\sigma \left(f\left(\boldsymbol{x}\right)\right)=\frac{1}{1+\exp \left(-f\left(\boldsymbol{x}\right)\right)} $$

and that of belonging to the negative class is:

$$ {\displaystyle \begin{array}{rl}P\left(\mathrm{y}=-1|\mathbf{x}=\boldsymbol{x}\right)& =1-P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)\\ {}& =\frac{\exp \left(-f\left(\boldsymbol{x}\right)\right)}{1+\exp \left(-f\left(\boldsymbol{x}\right)\right)}\\ {}& =\frac{1}{1+\exp \left(f\left(\boldsymbol{x}\right)\right)}\\ {}& =\sigma \left(-f\left(\boldsymbol{x}\right)\right)\end{array}} $$

By applying the inverse of the sigmoid function, which is known as the logit function, one can see that the logarithm of the odds ratio is modeled as a linear combination of the features:

$$ \log \left(\frac{P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)}{P\left(\mathrm{y}=-1|\mathbf{x}=\boldsymbol{x}\right)}\right)=\log \left(\frac{P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)}{1-P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)}\right)=f\left(\boldsymbol{x}\right) $$

The w coefficients are estimated by maximizing the likelihood function, that is, the function measuring the goodness of fit of the model to the training data:

$$ L\left(\boldsymbol{w}\right)=\prod \limits_{i=1}^nP\left(\mathrm{y}={y}^{(i)}|\mathbf{x}={\boldsymbol{x}}^{(i)};\boldsymbol{w}\right) $$

For computational reasons, it is easier to maximize the log-likelihood, which is simply the logarithm of the likelihood:

$$ {\displaystyle \begin{array}{rlll}\log \left(L\left(\boldsymbol{w}\right)\right)& =\sum \limits_{i=1}^n\log \left(P\left(\mathrm{y}={y}^{(i)}|\mathbf{x}={\boldsymbol{x}}^{(i)};\boldsymbol{w}\right)\right)& & \\ {}& =\sum \limits_{i=1}^n\log \left(\sigma \left({y}^{(i)}f\left({\boldsymbol{x}}^{(i)};\boldsymbol{w}\right)\right)\right)& & \\ {}& =\sum \limits_{i=1}^n-\log \left(1+\exp \left(-{y}^{(i)}{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)\right)& & \\ {}\log \left(L\left(\boldsymbol{w}\right)\right)& =-\sum \limits_{i=1}^n\log \left(1+\exp \left(-{y}^{(i)}{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)\right)& & \end{array}} $$

Finally, we can rewrite this maximization problem as a minimization problem by noticing that \( {\max}_{\boldsymbol{w}}\log \left(L\left(\boldsymbol{w}\right)\right)=-{\min}_{\boldsymbol{w}}-\log \left(L\left(\boldsymbol{w}\right)\right) \):

$$ \underset{\boldsymbol{w}}{\max}\log \left(L\left(\boldsymbol{w}\right)\right)=-\underset{\boldsymbol{w}}{\min}\sum \limits_{i=1}^n\log \left(1+\exp \left(-{y}^{(i)}{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)\right) $$

We can see that the w coefficients that maximize the likelihood are also the coefficients that minimize the sum of the logistic loss values, with the logistic loss being defined as:

$$ {\ell}_{\mathrm{logistic}}\left(y,f\left(\boldsymbol{x}\right)\right)=\log \left(1+\exp \left(- yf\left(\boldsymbol{x}\right)\right)\right)/\log (2) $$

Unlike for linear regression, there is no closed formula for this minimization. One thus needs to use an optimization method such as gradient descent which was presented in Subheading 3 of Chap. 1. In practice, more sophisticated approaches such as quasi-Newton methods and variants of stochastic gradient descent are often used. The main concepts underlying logistic regression can be found in Box 3.
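
For illustration, here is a minimal NumPy sketch of logistic regression fitted by plain gradient descent on simulated data (the learning rate, number of iterations, and toy data are arbitrary choices of ours; real implementations use more sophisticated solvers):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    """Minimize sum_i log(1 + exp(-y_i x_i^T w)) by gradient descent (y in {-1, +1})."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 for the intercept
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        margins = y * (X1 @ w)
        # gradient of the negative log-likelihood
        grad = -(X1 * (y * sigmoid(-margins))[:, None]).sum(axis=0)
        w -= lr * grad / X1.shape[0]
    return w

# Toy binary classification data (made up)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w = fit_logistic_gd(X, y)
accuracy = np.mean(np.sign(np.hstack([np.ones((100, 1)), X]) @ w) == y)
print(w, accuracy)
```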

Box 3: Logistic Regression

  • Main idea: best hyperplane (i.e., line) that separates two classes.

  • Mathematical formulation: the signed distance to the hyperplane is mapped into the probability to belong to the positive class using the sigmoid function:

    $$ {\displaystyle \begin{array}{cc}f\left(\boldsymbol{x}\right)={w}_0+\sum \limits_{j=1}^p{w}_j{x}_j& \\ {}P\left(\mathrm{y}=+1|\mathbf{x}=\boldsymbol{x}\right)=\sigma \left(f\left(\boldsymbol{x}\right)\right)=\frac{1}{1+\exp \left(-f\left(\boldsymbol{x}\right)\right)}& \end{array}} $$
  • Estimation: likelihood maximization.

  • Regularization: can be penalized to avoid overfitting (ℓ2 penalty), to perform feature selection (ℓ1 penalty), or both (elastic-net penalty).

6 Overfitting and Regularization

The original formulations of ordinary least squares regression and logistic regression are unregularized models, that is, the model is trained to fit the training data as much as possible. Let us consider a real-life example, as it is very similar to human learning. If a person learns the content of a book by heart, they are able to solve the exercises in the book, but unable to apply the theoretical concepts to new exercises or real-life situations. If a person only quickly reads through the book, they are probably unable to solve either the exercises in the book or new ones.

The corresponding concepts are known as overfitting and underfitting in machine learning. Overfitting occurs when a model fits the training data too well and generalizes poorly to new data. Conversely, underfitting occurs when a model does not capture the characteristics of the training data well enough and thus also generalizes poorly to new data.

Overfitting and underfitting are related to frequently used terms in machine learning: bias and variance. Bias is defined as the expected (i.e., mean) difference between the true output and the predicted output. Variance is defined as the variability of the predicted output. For instance, let us consider a model predicting the age of a person from a picture. If the model always underestimates or overestimates the age, then the model is biased. If the predicted ages vary a lot around their mean (the model sometimes greatly overestimates and sometimes greatly underestimates the age), then the model has a high variance.

Ideally, one would like to have a model with a small bias and a small variance. However, the bias of a model tends to increase when decreasing its variance, and the variance of the model tends to increase when decreasing its bias. This phenomenon is known as the bias-variance trade-off. Figure 5 illustrates this phenomenon. One can also notice it by computing the squared error between the true output y (fixed) and the predicted output \( \hat{\mathrm{y}} \) (random variable): its expected value is the sum of the squared bias of \( \hat{\mathrm{y}} \) and the variance of \( \hat{\mathrm{y}} \):

$$ {\displaystyle \begin{array}{rlll}\mathbbm{E}\left[{\left(y-\hat{\mathrm{y}}\right)}^2\right]& =\mathbbm{E}\left[{y}^2-2y\hat{\mathrm{y}}+{\hat{\mathrm{y}}}^2\right]& & \\ {}& ={y}^2-2y\mathbbm{E}\left[\hat{\mathrm{y}}\right]+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2\right]& & \\ {}& ={y}^2-2y\mathbbm{E}\left[\hat{\mathrm{y}}\right]+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2\right]+\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2-\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2& & \\ {}& ={\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2\right]-\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2& & \\ {}& ={\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2-\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2\right]& & \\ {}& ={\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2-2\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2+\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2\right]& & \\ {}& ={\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2+\mathbbm{E}\left[{\hat{\mathrm{y}}}^2-2\hat{\mathrm{y}}\mathbbm{E}\left[\hat{\mathrm{y}}\right]+\mathbbm{E}{\left[\hat{\mathrm{y}}\right]}^2\right]& & \\ {}& ={\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2+\mathbbm{E}\left[{\left(\hat{\mathrm{y}}-\mathbbm{E}\left[\hat{\mathrm{y}}\right]\right)}^2\right]& & \\ {}\mathbbm{E}\left[{\left(y-\hat{\mathrm{y}}\right)}^2\right]& =\underset{{\mathrm{bias}}^2}{\underbrace{{\left(\mathbbm{E}\left[\hat{\mathrm{y}}\right]-y\right)}^2}}+\underset{\mathrm{variance}}{\underbrace{\mathrm{Var}\left[\hat{\mathrm{y}}\right]}}& & \end{array}} $$
Fig. 5

Illustration of underfitting and overfitting. Underfitting occurs when a model is too simple and does not capture the characteristics of the training data well enough, leading to high bias and low variance. Conversely, overfitting occurs when a model is too complex and learns the noise in the training data, leading to low bias and high variance
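
This decomposition can be checked numerically. The sketch below (a toy setting of our own choosing) repeatedly fits a deliberately too simple model on noisy training sets and verifies that, at a fixed test point, the mean squared error equals the squared bias plus the variance of the predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)
f = lambda x: x ** 2              # true (non-linear) function
x0, y0 = 0.8, f(0.8)              # fixed test point with its true output

preds = []
for _ in range(5000):
    y_train = f(x_train) + 0.1 * rng.normal(size=x_train.size)  # noisy training set
    coef = np.polyfit(x_train, y_train, deg=1)  # underfitting: linear fit of a parabola
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

mse = np.mean((y0 - preds) ** 2)
bias2 = (preds.mean() - y0) ** 2
variance = preds.var()
print(mse, bias2 + variance)      # the two quantities match (up to rounding)
```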

7 Penalized Models

Depending on the class of methods, there exist different strategies to tackle overfitting.

For neighbor methods, the number of neighbors used to define the neighborhood of any input and the strategy to compute the weights are the key hyperparameters to control the bias-variance trade-off. For models that are presented in the remaining sections of this chapter, we mention strategies to address the bias-variance trade-off in their respective sections. In this section, we present the most commonly used strategies for models whose parameters are optimized by minimizing a cost function defined as the mean loss values over all the training samples:

$$ \underset{\boldsymbol{w}}{\min }J\left(\boldsymbol{w}\right)\kern1em \mathrm{with}\kern1em J\left(\boldsymbol{w}\right)=\frac{1}{n}\sum \limits_{i=1}^n\ell \left({y}^{(i)},f\left({\boldsymbol{x}}^{(i)};\boldsymbol{w}\right)\right) $$

This is, for instance, the case of the linear and logistic regression methods presented in the previous sections.

7.1 Penalties

The main idea is to introduce a penalty term Pen(w) that will constrain the parameters w to have some desired properties. The most common penalties are the ℓ2 penalty, the ℓ1 penalty, and the elastic-net penalty.

7.1.1 ℓ2 Penalty

The ℓ2 penalty is defined as the squared ℓ2 norm of the w coefficients:

$$ {\ell}_2\left(\boldsymbol{w}\right)=\parallel \boldsymbol{w}{\parallel}_2^2=\sum \limits_{j=1}^p{w}_j^2 $$

The ℓ2 penalty forces each coefficient wj not to be too large and makes the coefficients more robust to collinearity (i.e., when some features are approximately linear combinations of the other features).

7.1.2 ℓ1 Penalty

The ℓ2 penalty forces the values of the parameters not to be too large, but does not incentivize making small values exactly zero. Indeed, the square of a small value is even smaller. When the number of features is large, or when interpretability is important, it can be useful to make the model select the most important features. The corresponding metric is the ℓ0 “norm” (which is not a proper norm in the mathematical sense), defined as the number of nonzero elements:

$$ {\ell}_0\left(\boldsymbol{w}\right)=\parallel \boldsymbol{w}{\parallel}_0=\sum \limits_{j=1}^p{\mathbf{1}}_{w_j\ne 0} $$

However, the ℓ0 “norm” is neither differentiable nor convex (which are useful properties to solve an optimization problem, but this is not further detailed for the sake of conciseness). The best convex approximation of the ℓ0 “norm” is the ℓ1 norm (see Fig. 6), defined as the sum of the absolute values of each element:

$$ {\ell}_1\left(\boldsymbol{w}\right)=\parallel \boldsymbol{w}{\parallel}_1=\sum \limits_{j=1}^p\mid {w}_j\mid $$
Fig. 6

Unit balls of the ℓ0, ℓ1, and ℓ2 norms. For each norm, the set of points in \( {\mathbb{R}}^2 \) whose norm is equal to 1 is plotted. The ℓ1 norm is the best convex approximation to the ℓ0 norm. Note that the lines for the ℓ0 norm extend to −∞ and +∞ but are cut for plotting reasons

7.1.3 Elastic-Net Penalty

Both the ℓ2 and ℓ1 penalties have their upsides and downsides. In order to try to obtain the best of both worlds, one can add both penalties to the objective function. The combination of both penalties is known as the elastic-net penalty:

$$ \mathrm{EN}\left(\boldsymbol{w},\alpha \right)=\alpha \parallel \boldsymbol{w}{\parallel}_1+\left(1-\alpha \right)\parallel \boldsymbol{w}{\parallel}_2^2 $$

where α ∈ [0, 1] is a hyperparameter representing the proportion of the ℓ1 penalty compared to the ℓ2 penalty.

7.2 New Optimization Problem

A natural approach would be to add a constraint to the minimization problem:

$$ \underset{\boldsymbol{w}}{\min }J\left(\boldsymbol{w}\right)\kern2em \mathrm{subject}\ \mathrm{to}\kern2em \mathrm{Pen}\left(\boldsymbol{w}\right)<c $$
(1)

which reads as “Find the optimal parameters that minimize the cost function J among all the parameters w that satisfy Pen(w) < c” for a positive real number c. Figure 7 illustrates the optimal solution of a simple linear regression task with different constraints. This figure also highlights the sparsity property of the ℓ1 penalty (the optimal parameter for the horizontal axis is set to zero) that the ℓ2 penalty does not have (the optimal parameter for the horizontal axis is small but different from zero).

Fig. 7

Illustration of the minimization problem with a constraint on the penalty term. The plot represents the value of the loss function for different values of the two coefficients for a linear regression task. The black star indicates the optimal solution with no constraint. The green and orange stars indicate the optimal solutions when imposing a constraint on the ℓ2 and ℓ1 norms of the parameters w, respectively

Although this approach is appealing due to its intuitiveness and the possibility to directly set the maximum amount of penalty allowed on the parameters w, it leads to a minimization problem that is not trivial to solve. A similar approach consists in adding the regularization term to the cost function:

$$ \underset{\boldsymbol{w}}{\min }J\left(\boldsymbol{w}\right)+\lambda \times \mathrm{Pen}\left(\boldsymbol{w}\right) $$
(2)

where λ > 0 is a hyperparameter that controls the weight of the penalty term relative to the mean loss value over all the training samples. This formulation is related to the Lagrangian function of the minimization problem with the penalty constraint.

This formulation leads to a minimization problem with no constraint which is much easier to solve. One can actually show that Eqs. 1 and 2 are related: solving Eq. 2 for a given λ, whose optimal solution is denoted by \( {\boldsymbol{w}}_{\lambda}^{\star } \), is equivalent to solving Eq. 1 for \( c=\mathrm{Pen}\left({\boldsymbol{w}}_{\lambda}^{\star}\right) \). In other words, solving Eq. 2 for a given λ is equivalent to solving Eq. 1 for c whose value is only known after finding the optimal solution of Eq. 2.

Figure 8 illustrates the impact of the regularization term λ ×Pen(w) on the prediction function of a kernel ridge regression algorithm (see Subheading 14 for more details) for different values of λ. For large values of λ, the regularization term dominates the mean loss value, and the prediction function does not fit the training data well enough (underfitting). For small values of λ, the mean loss value dominates the regularization term, and the prediction function fits the training data too well (overfitting). A good balance between the mean loss value and the regularization term is required to learn the best function.

Fig. 8

Illustration of regularization. A kernel ridge regression algorithm is fitted on the training data (blue points) with different values of λ, which is the weight of the regularization in the cost function. The smaller the value of λ, the smaller the weight of the ℓ2 regularization. The algorithm underfits (respectively, overfits) the data when the value of λ is too large (respectively, too small)

Since linear regression is one of the oldest and best-known models, the aforementioned penalties were originally introduced for linear regression:

  • Linear regression with the ℓ2 penalty is also known as ridge [7]:

    $$ \underset{\boldsymbol{w}}{\min}\parallel \boldsymbol{y}-\boldsymbol{Xw}{\parallel}_2^2+\lambda \parallel \boldsymbol{w}{\parallel}_2^2 $$

    As in ordinary least squares regression, there exists a closed formula for the optimal solution:

    $$ {\boldsymbol{w}}^{\star }={\left({\boldsymbol{X}}^{\top}\boldsymbol{X}+\lambda \boldsymbol{I}\right)}^{-1}{\boldsymbol{X}}^{\top}\boldsymbol{y} $$
  • Linear regression with the ℓ1 penalty is also known as lasso [8]:

    $$ \underset{\boldsymbol{w}}{\min}\parallel \boldsymbol{y}-\boldsymbol{Xw}{\parallel}_2^2+\lambda \parallel \boldsymbol{w}{\parallel}_1 $$
  • Linear regression with the elastic-net penalty is also known as elastic-net [9]:

    $$ \underset{\boldsymbol{w}}{\min}\parallel \boldsymbol{y}-\boldsymbol{Xw}{\parallel}_2^2+\lambda \alpha \parallel \boldsymbol{w}{\parallel}_1+\lambda \left(1-\alpha \right)\parallel \boldsymbol{w}{\parallel}_2^2 $$

The penalties can also be added to other models such as logistic regression, support vector machines, artificial neural networks, etc.
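
For illustration, the sketch below computes the ridge closed-form solution in NumPy and fits the corresponding scikit-learn estimators on simulated data (the data and penalty weights are arbitrary; note that scikit-learn names the regularization weight alpha and, for elastic-net, uses l1_ratio for the mixing parameter α, with slightly different scaling conventions for lasso and elastic-net):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 1.0
# Ridge closed form: w* = (X^T X + lambda I)^{-1} X^T y (no intercept, for simplicity)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Corresponding scikit-learn estimators
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)      # l1 penalty: sparse coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)

print(w_ridge)
print(ridge.coef_)   # matches the closed form
print(lasso.coef_)   # some coefficients are exactly zero
print(enet.coef_)
```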

8 Support Vector Machine

Linear and logistic regression take into account every training sample in order to find the best line, which is due to their corresponding loss functions: the squared error is zero only if the true and predicted outputs are equal, and the logistic loss is always positive. One could argue that the training samples whose outputs are “easily” well predicted are not relevant: only the training samples whose outputs are not “easily” well predicted or are wrongly predicted should be taken into account. The support vector machine (SVM) is based on this principle (please see Box 4 for an overview of the SVM).

Box 4: Support Vector Machine

  • Main idea: hyperplane (i.e., line) that maximizes the margin (i.e., the distance between the hyperplane and the closest inputs to the hyperplane).

  • Support vectors: only the misclassified inputs and the inputs well classified but with low confidence are taken into account.

  • Non-linearity: decision function can be non-linear with the use of non-linear kernels.

  • Regularization: ℓ2 penalty.

8.1 Original Formulation

The original support vector machine was invented in 1963 and was a linear binary classification method [10]. Figure 9 illustrates the main concept of its original version. When both classes are linearly separable, there exist infinitely many hyperplanes that separate them. The SVM finds the hyperplane that maximizes the margin, that is, the distance between the hyperplane and the closest points of both classes to the hyperplane, while linearly separating both classes.

Fig. 9

Support vector machine classifier with linearly separable classes. When two classes are linearly separable, there exist an infinite number of hyperplanes separating them (left). The decision function of the support vector machine classifier is the hyperplane that maximizes the margin, that is, the distance between the hyperplane and the closest points to the hyperplane (right). Support vectors are highlighted with a black circle surrounding them

The SVM was later updated to handle non-separable classes [11]. Figure 10 illustrates the role of the margin in this case. The dashed lines correspond to the hyperplanes defined by the equations \( {\boldsymbol{x}}^{\top}\boldsymbol{w}=+1 \) and \( {\boldsymbol{x}}^{\top}\boldsymbol{w}=-1 \). The margin is the distance between both hyperplanes and is equal to \( 2/\parallel \boldsymbol{w}{\parallel}_2 \). It defines which samples are included in the decision function of the model: a sample is included if and only if it is inside the margin, or outside the margin and misclassified. Such samples are called support vectors and are illustrated in Fig. 10 with a black circle surrounding them. In this case, the margin can be seen as a regularization term: the larger the margin is, the more support vectors are included in the decision function, and the more regularized the model is.

Fig. 10

Decision function of a support vector machine classifier with a linear kernel when both classes are not strictly linearly separable. The support vectors are the training points within the margin of the decision function and the misclassified training points. The support vectors are highlighted with a black circle surrounding them

The loss function for the SVM is called the hinge loss and is defined as:

$$ {\ell}_{\mathrm{hinge}}\left(y,f\left(\boldsymbol{x}\right)\right)=\max \left(0,1- yf\left(\boldsymbol{x}\right)\right) $$

Figure 11 illustrates the curves of the logistic and hinge losses. The logistic loss is always positive, even when the point is accurately classified with high confidence (i.e., when yf(x) ≫ 0), whereas the hinge loss is equal to zero when the point is accurately classified with good confidence (i.e., when yf(x) ≥ 1). One can see that a sample (x, y) can be a support vector only if yf(x) ≤ 1, that is, only if its hinge loss \( {\ell}_{\mathrm{hinge}}\left(y,f\left(\boldsymbol{x}\right)\right) \) is nonzero or the sample lies exactly on the margin.

Fig. 11

Binary classification losses. The logistic loss is always positive, even when the point is accurately classified with high confidence (i.e., when yf(x) ≫ 0), whereas the hinge loss is equal to zero when the point is accurately classified with good confidence (i.e., when yf(x) ≥ 1)

The optimal w coefficients for the original version are estimated by minimizing an objective function consisting of the sum of the hinge loss values and an ℓ2 penalty term (which is inversely proportional to the margin):

$$ \underset{\boldsymbol{w}}{\min}\sum \limits_{i=1}^n\max \left(0,1-{y}^{(i)}{\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)+\frac{1}{2C}\parallel \boldsymbol{w}{\parallel}_2^2 $$
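
For illustration, this objective can be minimized with plain subgradient descent; the sketch below (toy data, arbitrary step size and number of iterations, intercept included in w and thus penalized for simplicity) is only didactic, and dedicated solvers should be used in practice:

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Minimize sum_i max(0, 1 - y_i x_i^T w) + ||w||^2 / (2C) by subgradient descent."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # intercept included in w (simplification)
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        margins = y * (X1 @ w)
        active = margins < 1                         # samples with nonzero hinge loss
        subgrad = -(X1[active] * y[active][:, None]).sum(axis=0) + w / C
        w -= lr * subgrad / X1.shape[0]
    return w

# Toy linearly separable data (made up)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w = fit_linear_svm(X, y)
print(np.mean(np.sign(np.hstack([np.ones((100, 1)), X]) @ w) == y))  # training accuracy
```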

8.2 General Formulation with Kernels

The SVM was later updated to non-linear decision functions with the use of kernels [12].

In order to have a non-linear decision function, one could map the input space \( I \) into another space (often called the feature space), denoted by \( \mathcal{G} \), using a function denoted by ϕ:

$$ {\displaystyle \begin{array}{rlll}\phi :I& \to \mathcal{G}& & \\ {}\boldsymbol{x}& \mapsto \phi \left(\boldsymbol{x}\right)& & \end{array}} $$

The decision function would still be linear (with a dot product), but in the feature space:

$$ f\left(\boldsymbol{x}\right)=\phi {\left(\boldsymbol{x}\right)}^{\top}\boldsymbol{w} $$

Unfortunately, solving the corresponding minimization problem is not trivial:

$$ \underset{\boldsymbol{w}}{\min}\sum \limits_{i=1}^n\max \left(0,1-{y}^{(i)}\phi {\left({\boldsymbol{x}}^{(i)}\right)}^{\top}\boldsymbol{w}\right)+\frac{1}{2C}\parallel \boldsymbol{w}{\parallel}_2^2 $$
(3)

Nonetheless, two mathematical properties make the use of non-linear transformations in the feature space possible: the kernel trick and the representer theorem.

The kernel trick asserts that the dot product in the feature space can be computed using only the points from the input space and a kernel function, denoted by K:

$$ \forall \boldsymbol{x},{\boldsymbol{x}}^{\prime}\in \kern0.3em I,\kern0.3em \phi {\left(\boldsymbol{x}\right)}^{\top}\phi \left({\boldsymbol{x}}^{\prime}\right)=K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right) $$

The representer theorem [13, 14] asserts that, under certain conditions on the kernel K and the feature space \( \mathcal{G} \) associated with the function ϕ, any minimizer of Eq. 3 admits the following form:

$$ f=\sum \limits_{i=1}^n{\alpha}_iK\left(\cdot, {\boldsymbol{x}}^{(i)}\right) $$

where α solves:

$$ \underset{\boldsymbol{\alpha}}{\min}\sum \limits_{i=1}^n\max \left(0,1-{y}^{(i)}{\left[\boldsymbol{K}\boldsymbol{\alpha } \right]}_i\right)+\frac{1}{2C}{\boldsymbol{\alpha}}^{\top}\boldsymbol{K}\boldsymbol{\alpha } $$

where K is the n × n matrix consisting of the evaluations of the kernel on all the pairs of training samples: ∀i, j ∈{1, …, n}, Kij = K(x(i), x(j)).

Because the hinge loss is equal to zero if and only if yf(x) is greater than or equal to 1, only the training samples (x(i), y(i)) such that y(i)f(x(i)) ≤ 1 can have a nonzero αi coefficient. These points are the so-called support vectors, and this is why they are the only training samples contributing to the decision function of the model:

$$ {\displaystyle \begin{array}{cc}\mathrm{SV}=\left\{i\in \left\{1,\dots, n\right\}\kern0.3em |\kern0.3em {\alpha}_i\ne 0\right\}& \\ {}f\left(\boldsymbol{x}\right)=\sum \limits_{i=1}^n{\alpha}_iK\left(\boldsymbol{x},{\boldsymbol{x}}^{(i)}\right)=\sum \limits_{i\in \mathrm{SV}}{\alpha}_iK\left(\boldsymbol{x},{\boldsymbol{x}}^{(i)}\right)& \end{array}} $$

The kernel trick and the representer theorem show that it is more practical to work with the kernel K instead of the mapping function ϕ. Popular kernel functions include:

  • The linear kernel:

    $$ K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right)={\boldsymbol{x}}^{\top }{\boldsymbol{x}}^{\prime } $$
  • The polynomial kernel:

    $$ K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right)={\left(\gamma \kern0.3em {\boldsymbol{x}}^{\top }{\boldsymbol{x}}^{\prime }+{c}_0\right)}^d\kern1em \mathrm{with}\kern1em \gamma >0,\kern0.3em {c}_0\ge 0,\kern0.3em d\in {\mathbb{N}}^{\ast } $$
  • The sigmoid kernel:

    $$ K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right)=\tanh \left(\gamma \kern0.3em {\boldsymbol{x}}^{\top }{\boldsymbol{x}}^{\prime }+{c}_0\right)\kern1em \mathrm{with}\kern1em \gamma >0,\kern0.3em {c}_0\ge 0 $$
  • The radial basis function (RBF) kernel:

    $$ K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right)=\exp \left(-\gamma \kern0.3em \parallel \boldsymbol{x}-{\boldsymbol{x}}^{\prime }{\parallel}_2^2\right)\kern1em \mathrm{with}\kern1em \gamma >0 $$

The linear kernel yields a linear decision function and is actually identical to the original formulation of the SVM (one can show that there is a mapping between the α and w coefficients). Non-linear kernels allow for non-linear, more complex, decision functions. This is particularly useful when the data is not linearly separable, which is the most common use case. Figure 12 illustrates the decision function and the margin of a SVM classification model for four different kernels.

Fig. 12

Impact of the kernel on the decision function of a support vector machine classifier. A non-linear kernel allows for a non-linear decision function
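
For illustration, these kernels are directly available in scikit-learn's SVC; the snippet below (with simulated, non-linearly separable data and arbitrary hyperparameters) compares them:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linearly separable toy labels

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")  # gamma is ignored for the linear kernel
    clf.fit(X, y)
    print(kernel, clf.score(X, y), "support vectors:", len(clf.support_))
```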

The SVM was also extended to regression tasks with the use of the ε-insensitive loss. Similar to the hinge loss, which is equal to zero for points that are correctly classified and outside the margin, the ε-insensitive loss is equal to zero when the error between the true target value and the predicted value is not greater than ε:

$$ {\ell}_{\varepsilon -\mathrm{insensitive}}\left(y,f\left(\boldsymbol{x}\right)\right)=\max \left(0,|y-f\left(\boldsymbol{x}\right)|-\varepsilon \right) $$

The objective function for the SVM regression method combines the values of the ε-insensitive loss of the training points and the ℓ2 penalty:

$$ \underset{\boldsymbol{w}}{\min}\sum \limits_{i=1}^n\max \left(0,\left|{y}^{(i)}-\phi {\left({\boldsymbol{x}}^{(i)}\right)}^{\top}\boldsymbol{w}\right|-\varepsilon \right)+\frac{1}{2C}\parallel \boldsymbol{w}{\parallel}_2^2 $$

Figure 13 illustrates the curves of three regression losses. The squared error loss grows quadratically with the error, thus penalizing large errors much more heavily than the absolute error loss, which grows linearly with the error. Both losses take small but nonzero values whenever the error is nonzero. On the contrary, the ε-insensitive loss is null when the absolute error is smaller than ε and is otherwise equal to the absolute error loss minus ε.

Fig. 13

Regression losses. The squared error loss takes very small values for small errors and very large values for large errors, whereas the absolute error loss takes small values for small errors and large values for large errors. Both losses take small but nonzero values when the error is small. On the contrary, the ε-insensitive loss is null when the error is small and otherwise equal to the absolute error loss minus ε. When computed over several samples, the squared and absolute error losses are often referred to as mean squared error (MSE) and mean absolute error (MAE), respectively
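
For illustration, here is a small NumPy sketch of the three per-sample regression losses (function names and the value of ε are ours):

```python
import numpy as np

def squared_error(y, y_pred):
    return (y - y_pred) ** 2

def absolute_error(y, y_pred):
    return np.abs(y - y_pred)

def epsilon_insensitive(y, y_pred, epsilon=0.5):
    return np.maximum(0.0, np.abs(y - y_pred) - epsilon)

errors = np.linspace(-2, 2, 9)            # values of y - y_pred
print(squared_error(errors, 0.0))
print(absolute_error(errors, 0.0))
print(epsilon_insensitive(errors, 0.0))   # zero whenever |error| <= 0.5
```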

9 Multiclass Classification

The classification methods that we presented so far, logistic regression and support vector machines, are binary classifiers: they can only be used when there are only two possible outcomes. However, in practice, it is common to have more than two possible outcomes. For instance, differential diagnosis of brain disorders is often between several, and not only two, diseases.

Several strategies have been proposed to extend any binary classification method to multiclass classification tasks. They all rely on transforming the multiclass classification task into several binary classification tasks. In this section, we present the most commonly used strategies: one-vs-rest, one-vs-one, and error correcting output code [15]. Figure 14 illustrates the main ideas of these approaches. But first, we present a natural extension of logistic regression to multiclass classification tasks which is often referred to as multinomial logistic regression [5].

Fig. 14

Main approaches to convert a multiclass classification task into several binary classification tasks. In the one-vs-rest approach, each class is associated with a binary classification model that is trained to separate this class from all the other classes. In the one-vs-one approach, a binary classifier is trained on each pair of classes. In the error correcting output code approach, the classes are (randomly) split into two groups, and a binary classifier is trained for each split

9.1 Multinomial Logistic Regression

For binary classification, logistic regression is characterized by a hyperplane: the signed distance to the hyperplane is mapped into the probability of belonging to the positive class using the sigmoid function. However, for multiclass classification, a single hyperplane is not enough to characterize all the classes. Instead, each class \( {\mathcal{C}}_k \) is characterized by a hyperplane wk, and, for any input x, one can compute the signed distance xwk between the input x and the hyperplane wk. The signed distances are mapped into probabilities using the softmax function, defined as \( \mathrm{softmax}\left({x}_1,\dots, {x}_q\right)=\left(\frac{\exp \left({x}_1\right)}{\sum_{j=1}^q\exp \left({x}_j\right)},\dots, \frac{\exp \left({x}_q\right)}{\sum_{j=1}^q\exp \left({x}_j\right)}\right) \), as follows:

$$ \forall k\in \left\{1,\dots, q\right\},\kern0.3em P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)=\frac{\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k\right)}{\sum_{j=1}^q\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{w}}_j\right)} $$

The coefficients \( {\left({\boldsymbol{w}}_k\right)}_{1\le k\le q} \) are still estimated by maximizing the likelihood function:

$$ L\left({\boldsymbol{w}}_1,\dots, {\boldsymbol{w}}_q\right)=\prod \limits_{i=1}^n\prod \limits_{k=1}^qP{\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}={\boldsymbol{x}}^{(i)}\right)}^{{\mathbf{1}}_{y^{(i)}={\mathcal{C}}_k}} $$

which is equivalent to minimizing the negative log-likelihood:

$$ {\displaystyle \begin{array}{rlll}& -\log \left(L\left({\boldsymbol{w}}_1,\dots, {\boldsymbol{w}}_q\right)\right)& & \\ {}& =-\sum \limits_{i=1}^n\sum \limits_{k=1}^q{\mathbf{1}}_{y^{(i)}={\mathcal{C}}_k}\log \left(P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}={\boldsymbol{x}}^{(i)}\right)\right)& & \\ {}& =\sum \limits_{i=1}^n-\sum \limits_{k=1}^q{\mathbf{1}}_{y^{(i)}={\mathcal{C}}_k}\log \left(\frac{\exp \left({\boldsymbol{x}}^{\left(\boldsymbol{i}\right)\mathbf{\top}}{\boldsymbol{w}}_k\right)}{\sum_{j=1}^q\exp \left({\boldsymbol{x}}^{\left(\boldsymbol{i}\right)\mathbf{\top}}{\boldsymbol{w}}_j\right)}\right)& & \\ {}& =\sum \limits_{i=1}^n{\ell}_{\mathrm{cross}-\mathrm{entropy}}\left({y}^{(i)},\mathrm{softmax}\left({\boldsymbol{x}}^{\left(\boldsymbol{i}\right)\mathbf{\top}}{\boldsymbol{w}}_1,\dots, {\boldsymbol{x}}^{\left(\boldsymbol{i}\right)\mathbf{\top}}{\boldsymbol{w}}_q\right)\right)& & \end{array}} $$

where \( {\ell}_{\mathrm{cross}-\mathrm{entropy}} \) is known as the cross-entropy loss and is defined, for any label y and any vector of probabilities (π1, …, πq), as:

$$ {\ell}_{\mathrm{cross}-\mathrm{entropy}}\left(y,\left({\pi}_1,\dots, {\pi}_q\right)\right)=-\sum \limits_{k=1}^q{\mathbf{1}}_{y={\mathcal{C}}_k}\log {\pi}_k $$

This loss is commonly used to train artificial neural networks on classification tasks and is equivalent to the logistic loss in the binary case.
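
For illustration, here is a short NumPy sketch of the softmax function and of the cross-entropy loss; class labels are encoded as integer indices, and the scores play the role of the signed distances \( {\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k \):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, with max-subtraction for numerical stability."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, probas):
    """Mean cross-entropy loss; y contains class indices in {0, ..., q-1}."""
    n = y.shape[0]
    return -np.mean(np.log(probas[np.arange(n), y]))

# Toy example: n = 3 samples, q = 4 classes (scores are made up)
scores = np.array([[2.0, 0.5, -1.0, 0.0],
                   [0.1, 0.2, 0.3, 0.4],
                   [-1.0, -2.0, 3.0, 0.0]])
y = np.array([0, 3, 2])
probas = softmax(scores)
print(probas.sum(axis=1))          # each row sums to 1
print(cross_entropy(y, probas))
```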

Figure 15 illustrates the impact of the strategy used to handle a multiclass classification task on the decision function.

Fig. 15

Illustration of the impact of the strategy used to handle a multiclass classification task on the decision function of a logistic regression model

9.2 One-vs-Rest

A strategy to transform a multiclass classification task into several binary classification tasks is to fit a binary classifier for each class: the positive class is the given class, and the negative class consists of all the other classes merged into a single class. This strategy is known as one-vs-rest. The advantage of this strategy is that each class is characterized by a single model, so that it is possible to gain deeper knowledge about the class by inspecting its corresponding model. A consequence is that the predictions for new samples take into account the confidence of the models: the predicted class for a new input is the class for which the corresponding model is the most confident that this input belongs to its class. The one-vs-rest strategy is the most commonly used strategy and usually a good default choice.

9.3 One-vs-One

Another strategy is to fit a binary classifier for each pair of classes: this strategy is known as one-vs-one. The advantage of this strategy is that the classes in each binary classification task are “pure”, in the sense that different classes are never merged into a single class. However, the number of binary classifiers that need to be trained is larger for the one-vs-one strategy (\( \frac{1}{2}q\left(q-1\right) \)) than for the one-vs-rest strategy (q). Nonetheless, for the one-vs-one strategy, the number of training samples in each binary classification task is smaller than the total number of samples, which usually makes training each binary classifier faster. Another drawback is that this strategy is less interpretable compared to the one-vs-rest strategy, as the predicted class corresponds to the class obtaining the most votes (i.e., winning the most one-vs-one matchups), which does not take into account the confidence in winning each matchup. For instance, winning a one-vs-one matchup with 0.99 probability gives the same result as winning the same matchup with 0.51 probability, i.e., one vote.

9.4 Error Correcting Output Code

A substantially different strategy, inspired by the theory of error-correcting codes, consists in merging one subset of classes into the positive class and the remaining classes into the negative class, differently for each binary classification task. This assignment is often called the code book and can be represented as a matrix whose rows correspond to the classes and whose columns correspond to the binary classification tasks. The matrix consists only of − 1 and + 1 values that represent the corresponding label for each class and for each binary task. For any input, each binary classifier returns the score (or probability) associated with the positive class. The predicted class for this input is the class whose corresponding vector is the most similar to the vector of scores, with similarity being assessed with the Euclidean distance (the lower, the more similar). There exist advanced strategies to define the code book, but it has been argued that a random code book usually gives as good results as a sophisticated one [16].
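
For illustration, all these strategies are available in scikit-learn; the sketch below wraps a logistic regression base classifier with each of them (the Iris dataset is used here only as a convenient stand-in, and the code size of the output code strategy is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import (OneVsRestClassifier, OneVsOneClassifier,
                                OutputCodeClassifier)

X, y = load_iris(return_X_y=True)        # 3 classes
base = LogisticRegression(max_iter=1000)

strategies = {
    "multinomial": base,                 # handled natively by recent scikit-learn versions
    "one-vs-rest": OneVsRestClassifier(base),
    "one-vs-one": OneVsOneClassifier(base),
    "output code": OutputCodeClassifier(base, code_size=2, random_state=0),
}
for name, clf in strategies.items():
    print(name, clf.fit(X, y).score(X, y))
```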

10 Decision Functions with Normal Distributions

Normal distributions are popular distributions because they are commonly found in nature. For instance, the distribution of heights and birth weights of human beings can be approximated using normal distributions. Moreover, normal distributions are particularly easy to work with from a mathematical point of view. For these reasons, a common model consists in assuming that the training input vectors are independently sampled from normal distributions.

A possible classification model consists in assuming that, for each class, all the corresponding inputs are sampled from a normal distribution with mean vector μk and covariance matrix Σk:

$$ \forall i\kern0.3em \mathrm{such}\kern0.34em \mathrm{that}\kern0.3em {y}^{(i)}={\mathcal{C}}_k,\kern0.3em {\boldsymbol{x}}^{(i)}\sim \mathcal{N}\left({\boldsymbol{\mu}}_k,{\boldsymbol{\Sigma}}_k\right) $$

Using the probability density function of a normal distribution, one can compute the probability density of any input x associated with the distribution \( \mathcal{N}\left({\boldsymbol{\mu}}_k,{\boldsymbol{\Sigma}}_k\right) \) of class \( {\mathcal{C}}_k \):

$$ {p}_{\mathbf{x}\mid \mathrm{y}={\mathcal{C}}_k}\left(\boldsymbol{x}\right)=\frac{1}{\sqrt{{\left(2\pi \right)}^p\mid {\boldsymbol{\Sigma}}_k\mid }}\exp \left(-\frac{1}{2}{\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]}^{\top }{\boldsymbol{\Sigma}}_k^{-1}\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]\right) $$

With such a probabilistic model, it is easy to compute the probability that a sample belongs to class \( {\mathcal{C}}_k \) using Bayes rule:

$$ P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)=\frac{p_{\mathbf{x}\mid \mathrm{y}={\mathcal{C}}_k}\left(\boldsymbol{x}\right)P\left(\mathrm{y}={\mathcal{C}}_k\right)}{p_{\mathbf{x}}\left(\boldsymbol{x}\right)} $$

With normal distributions, it is mathematically easier to work with log-probabilities:

$$ {\displaystyle \begin{array}{rlll}& \log P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)& & \\ {}& \kern1em =\log {p}_{\mathbf{x}\mid \mathrm{y}={\mathcal{C}}_k}\left(\boldsymbol{x}\right)+\log P\left(\mathrm{y}={\mathcal{C}}_k\right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& \kern1em =-\frac{1}{2}{\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]}^{\top }{\boldsymbol{\Sigma}}_k^{-1}\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]-\frac{1}{2}\log \mid {\boldsymbol{\Sigma}}_k\mid +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)& & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& \kern1em =-\frac{1}{2}{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}_k^{-1}\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k& & \\ {}& \kern2em -\frac{1}{2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k-\frac{1}{2}\log \mid {\boldsymbol{\Sigma}}_k\mid +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)& & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& \end{array}} $$
(4)

It is also possible to make further assumptions on the covariance matrices that lead to different models. In this section, we present the most commonly used ones: naive Bayes, linear discriminant analysis, and quadratic discriminant analysis. Figure 16 illustrates the covariance matrices and the decision functions for these models in the two-dimensional case.

Fig. 16
Four panels: naive Bayes with different conditional variances, naive Bayes with identical conditional variances, linear discriminant analysis, and quadratic discriminant analysis.

Illustration of decision functions with normal distributions. A two-dimensional covariance matrix can be represented as an ellipse. In the naive Bayes model, the features are assumed to be independent and to have the same variance conditionally to the class, leading to covariance matrices being represented as circles. When the covariance matrices are assumed to be identical, the decision functions are linear instead of quadratic

10.1 Naive Bayes

The naive Bayes model assumes that, conditionally to each class \( {\mathcal{C}}_k \), the features are independent and have the same variance \( {\sigma}_k^2 \):

$$ \forall k,\kern0.3em {\boldsymbol{\Sigma}}_k={\sigma}_k^2{\boldsymbol{I}}_p $$

Equation 4 can thus be further simplified:

$$ {\displaystyle \begin{array}{rlll}& \log P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)& & \\ {}& =-\frac{1}{2{\sigma}_k^2}{\boldsymbol{x}}^{\top}\boldsymbol{x}+\frac{1}{\sigma_k^2}{\boldsymbol{x}}^{\top }{\boldsymbol{\mu}}_k-\frac{1}{2{\sigma}_k^2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\mu}}_k-p\log {\sigma}_k+\log P\left(\mathrm{y}={\mathcal{C}}_k\right)& & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& ={\boldsymbol{x}}^{\top }{\boldsymbol{W}}_k\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}+s& & \end{array}} $$

where:

  • \( {\boldsymbol{W}}_k=-\frac{1}{2{\sigma}_k^2}{\boldsymbol{I}}_p \) is the matrix of the quadratic term for class \( {\mathcal{C}}_k \).

  • \( {\boldsymbol{w}}_k=\frac{1}{\sigma_k^2}{\boldsymbol{\mu}}_k \) is the vector of the linear term for class \( {\mathcal{C}}_k \).

  • \( {w}_{0k}=-\frac{1}{2{\sigma}_k^2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\mu}}_k-p\log {\sigma}_k+\log P\left(\mathrm{y}={\mathcal{C}}_k\right) \) is the intercept for class \( {\mathcal{C}}_k \).

  • \( s=-\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right) \) is a term that does not depend on class \( {\mathcal{C}}_k \).

Therefore, naive Bayes is a quadratic model. The probabilities for input x to belong to each of the q classes \( {\mathcal{C}}_k \) can then easily be computed:

$$ P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)=\frac{\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{W}}_k\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}\right)}{\sum_{j=1}^q\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{W}}_j\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_j+{w}_{0j}\right)} $$
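To make the derivation concrete, here is a minimal NumPy sketch of this naive Bayes classifier, assuming one shared variance per class as above; the function names and the pooled-variance estimator are illustrative choices, not part of the chapter:

```python
# A minimal NumPy sketch of the naive Bayes model above, assuming one shared
# variance per class (Sigma_k = sigma_k^2 I_p).
import numpy as np

def fit_isotropic_naive_bayes(X, y):
    classes = np.unique(y)
    params = []
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                 # mean vector mu_k
        sigma2 = Xc.var()                    # single variance sigma_k^2, pooled over features
        prior = Xc.shape[0] / X.shape[0]     # P(y = C_k)
        params.append((mu, sigma2, prior))
    return classes, params

def predict_proba_isotropic_naive_bayes(X, params):
    p = X.shape[1]
    scores = []
    for mu, sigma2, prior in params:
        quad = -0.5 / sigma2 * np.sum(X ** 2, axis=1)   # x^T W_k x
        lin = X @ (mu / sigma2)                         # x^T w_k
        w0 = -0.5 / sigma2 * mu @ mu - 0.5 * p * np.log(sigma2) + np.log(prior)  # w_0k
        scores.append(quad + lin + w0)
    scores = np.stack(scores, axis=1)
    scores -= scores.max(axis=1, keepdims=True)   # the class-independent term s cancels out
    proba = np.exp(scores)
    return proba / proba.sum(axis=1, keepdims=True)
```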

With the naive Bayes model, it is relatively common to assume that the conditional variances \( {\sigma}_k^2 \) are all equal:

$$ \forall k,{\boldsymbol{\Sigma}}_k={\sigma}_k^2{\boldsymbol{I}}_p={\sigma}^2{\boldsymbol{I}}_p $$

In this case, Eq. 4 can be even further simplified:

$$ {\displaystyle \begin{array}{rlll}& \log P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)& & \\ {}& =-\frac{1}{2{\sigma}^2}{\boldsymbol{x}}^{\top}\boldsymbol{x}+\frac{1}{\sigma^2}{\boldsymbol{x}}^{\top }{\boldsymbol{\mu}}_k-\frac{1}{2{\sigma}^2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\mu}}_k-p\log \sigma +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)& & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& ={\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}+s& & \end{array}} $$

where:

  • \( {\boldsymbol{w}}_k=\frac{1}{\sigma^2}{\boldsymbol{\mu}}_k \) is the vector of the linear term for class \( {\mathcal{C}}_k \).

  • \( {w}_{0k}=-\frac{1}{2{\sigma}^2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\mu}}_k+\log P\left(\mathrm{y}={\mathcal{C}}_k\right) \) is the intercept for class \( {\mathcal{C}}_k \).

  • \( s=-\frac{1}{2{\sigma}^2}{\boldsymbol{x}}^{\top}\boldsymbol{x}-p\log \sigma -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right) \) is a term that does not depend on class \( {\mathcal{C}}_k \).

In this case, naive Bayes becomes a linear model.

10.2 Linear Discriminant Analysis

Linear discriminant analysis (LDA) makes the assumption that all the covariance matrices are identical but otherwise arbitrary:

$$ \forall k,\kern0.3em {\boldsymbol{\Sigma}}_k=\boldsymbol{\Sigma} $$

Therefore, Eq. 4 can be further simplified:

$$ {\displaystyle \begin{array}{rlll}& \log P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)& & \\ {}& =-\frac{1}{2}{\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]}^{\top }{\boldsymbol{\Sigma}}^{-1}\left[\boldsymbol{x}-{\boldsymbol{\mu}}_k\right]-\frac{1}{2}\log \mid \boldsymbol{\Sigma} \mid +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)& & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& =-\frac{1}{2}\left({\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}^{-1}\boldsymbol{x}-{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k-{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}^{-1}\boldsymbol{x}+{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k\right)& & \\ {}& \kern2em -\frac{1}{2}\log \mid \boldsymbol{\Sigma} \mid +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)-\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& ={\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k-\frac{1}{2}{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}^{-1}\boldsymbol{x}-\frac{1}{2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k+\log P\left(\mathrm{y}={\mathcal{C}}_k\right)-\frac{1}{2}\log \mid \boldsymbol{\Sigma} \mid & & \\ {}& \kern2em -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right)& & \\ {}& ={\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}+s& & \end{array}} $$

where:

  • \( {\boldsymbol{w}}_k={\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k \) is the vector of coefficients for class \( {\mathcal{C}}_k \).

  • \( {w}_{0k}=-\frac{1}{2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}^{-1}{\boldsymbol{\mu}}_k+\log P\left(\mathrm{y}={\mathcal{C}}_k\right) \) is the intercept for class \( {\mathcal{C}}_k \).

  • \( s=-\frac{1}{2}{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}^{-1}\boldsymbol{x}-\frac{1}{2}\log \mid \boldsymbol{\Sigma} \mid -\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right) \) is a term that does not depend on class \( {\mathcal{C}}_k \).

Therefore, linear discriminant analysis is a linear model. When Σ is proportional to the identity matrix (i.e., \( \boldsymbol{\Sigma} ={\sigma}^2{\boldsymbol{I}}_p \)), linear discriminant analysis is identical to naive Bayes with identical conditional variances.

The probabilities for input x to belong to each class \( {\mathcal{C}}_k \) can then easily be computed:

$$ P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)=\frac{\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}\right)}{\sum_{j=1}^q\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{w}}_j+{w}_{0j}\right)} $$

10.3 Quadratic Discriminant Analysis

Quadratic discriminant analysis makes no further assumption on the covariance matrices Σk, which can thus all be arbitrary. Equation 4 can be written as:

$$ {\displaystyle \begin{array}{l}\log P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)\\ {}=-\frac{1}{2}{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}_k^{-1}\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k-\frac{1}{2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k-\frac{1}{2}\log \mid {\boldsymbol{\Sigma}}_k\mid \\ {}\kern1.4em +\log P\left(\mathrm{y}={\mathcal{C}}_k\right)-\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right) \\ {}={\boldsymbol{x}}^{\top }{\boldsymbol{W}}_k\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}+s\end{array}} $$

where:

  • \( {\boldsymbol{W}}_k=-\frac{1}{2}{\boldsymbol{\Sigma}}_k^{-1} \) is the matrix of the quadratic term for class \( {\mathcal{C}}_k \).

  • \( {\boldsymbol{w}}_k={\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k \) is the vector of the linear term for class \( {\mathcal{C}}_k \).

  • \( {w}_{0k}=-\frac{1}{2}{\boldsymbol{\mu}}_k^{\top }{\boldsymbol{\Sigma}}_k^{-1}{\boldsymbol{\mu}}_k-\frac{1}{2}\log \mid {\boldsymbol{\Sigma}}_k\mid +\log P\left(\mathrm{y}={\mathcal{C}}_k\right) \) is the intercept for class \( {\mathcal{C}}_k \).

  • \( s=-\frac{p}{2}\log \left(2\pi \right)-\log {p}_{\mathbf{x}}\left(\boldsymbol{x}\right) \) is a term that does not depend on class \( {\mathcal{C}}_k \).

Therefore, quadratic discriminant analysis is a quadratic model.

The probabilities for input x to belong to each class \( {\mathcal{C}}_k \) can then easily be computed:

$$ P\left(\mathrm{y}={\mathcal{C}}_k|\mathbf{x}=\boldsymbol{x}\right)=\frac{\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{W}}_k\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_k+{w}_{0k}\right)}{\sum_{j=1}^q\exp \left({\boldsymbol{x}}^{\top }{\boldsymbol{W}}_j\boldsymbol{x}+{\boldsymbol{x}}^{\top }{\boldsymbol{w}}_j+{w}_{0j}\right)} $$
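In practice, the three models of this section are available in scikit-learn. The sketch below fits them on synthetic Gaussian blobs (an arbitrary choice); note that scikit-learn's GaussianNB estimates one variance per feature and per class, a slightly more general model than the isotropic one described above:

```python
# A minimal sketch (synthetic Gaussian blobs) fitting naive Bayes, linear
# discriminant analysis, and quadratic discriminant analysis with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

for model in (GaussianNB(), LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    # predict_proba returns P(y = C_k | x = x) for each class
    print(type(model).__name__, model.predict_proba(X[:1]).round(3))
```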

11 Tree-Based Methods

11.1 Decision Tree

Binary decisions based on conditional statements are frequently used in everyday life because they are intuitive and easy to understand. Figure 17 illustrates a general approach when someone is ill. Depending on conditional statements (severity of symptoms, ability to quickly consult a specialist), the decision (consult your general practitioner or a specialist, or call for emergency services) is different. Models with such an architecture are often used in machine learning and are called decision trees.

Fig. 17
A decision tree illustrates how to respond to the severity of symptoms. If the symptoms are mild, it suggests consulting a general practitioner. On the other hand, if the symptoms are severe, it recommends consulting a doctor or calling emergency services.

A general thought process when being ill. Depending on conditional statements (severity of symptoms, ability to quickly consult a specialist), the decision (consult your general practitioner or a specialist, or call for emergency services) is different

A decision tree is an algorithm containing only conditional statements and can be represented with a tree [17]. This graph consists of:

  • Decision nodes for all the conditional statements

  • Branches for the potential outcomes of each decision node

  • Leaf nodes for the final decision

Figure 18 illustrates a decision tree and its corresponding decision function. For a given sample, the final decision is obtained by following its corresponding path, starting at the root node.

Fig. 18
Left: a single split on x1 > −6.26, predicting ŷ = +1 if true and ŷ = −1 otherwise. Right: the corresponding decision function in the (x1, x2) plane.

A decision tree: (left) the rules learned by the decision tree and (right) the corresponding decision function

A decision tree recursively partitions the feature space in order to group samples with the same labels or similar target values. At each node, the objective is to find the best (feature, threshold) pair so that both subsets obtained with this split are the most pure, that is, homogeneous. To do so, the best (feature, threshold) pair is defined as the pair that minimizes an impurity criterion.

Let \( \mathcal{S}\subseteq \mathcal{X} \) be a subset of training samples. For classification tasks, the distribution of the classes, that is, the proportion of each class, is used to measure the purity of the subset. Let pk be the proportion of samples from class \( {\mathcal{C}}_k \) in a given partition:

$$ {p}_k=\frac{1}{\mid \mathcal{S}\mid}\sum \limits_{y\in \mathcal{S}}{\mathbf{1}}_{y={\mathcal{C}}_k} $$

Popular impurity criteria for classification tasks include:

  • Gini index: \( \sum \limits_k{p}_k\left(1-{p}_k\right) \)

  • Entropy: \( -\sum \limits_k{p}_k\log \left({p}_k\right) \)

  • Misclassification: \( 1-\underset{k}{\max }{p}_k \)

Figure 19 illustrates the values of the Gini index and the entropy for a single class \( {\mathcal{C}}_k \) and for different proportions of samples pk. One can see that the entropy function takes larger values than the Gini index, especially for pk < 0.8. Since the sum of the proportions is equal to 1, most classes only represent a small proportion of the samples. Therefore, a simple interpretation is that entropy is more discriminative against heterogeneous subsets than the Gini index. Misclassification only takes into account the proportion of the most common class and tends to be even less discriminative against heterogeneous subsets than both entropy and Gini index.

Fig. 19

Illustration of the Gini index and entropy. The entropy function takes larger values than the Gini index, especially for pk < 0.8, and is thus more discriminative against heterogeneous subsets (in which most classes represent only a small proportion of the samples) than the Gini index
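For concreteness, the three classification impurity criteria can be computed from a vector of class proportions as in the following small sketch (the example proportions are arbitrary):

```python
# A small sketch computing the classification impurity criteria for a vector
# of class proportions (assumed to sum to 1).
import numpy as np

def gini(p):
    return np.sum(p * (1 - p))

def entropy(p):
    p = p[p > 0]                 # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def misclassification(p):
    return 1 - np.max(p)

proportions = np.array([0.7, 0.2, 0.1])
print(gini(proportions), entropy(proportions), misclassification(proportions))
```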

For regression tasks, the mean error from a reference value (such as the mean or the median) is often used as the impurity criterion:

  • Mean squared error: \( \frac{1}{\mid \mathcal{S}\mid}\sum \limits_{y\in \mathcal{S}}{\left(y-\overline{y}\right)}^2\kern1em \mathrm{with}\kern1em \overline{y}=\frac{1}{\mid \mathcal{S}\mid}\sum \limits_{y\in \mathcal{S}}y \)

  • Mean absolute error: \( \frac{1}{\mid \mathcal{S}\mid}\sum \limits_{y\in \mathcal{S}}\mid y-{\mathrm{median}}_{\mathcal{S}}(y)\mid \)

Theoretically, a tree can grow until every leaf node is perfectly pure. However, such a tree would have a lot of branches and would be very complex, making it prone to overfitting. Several strategies are commonly used to limit the size of the tree. One approach consists in growing the tree with no restriction and then pruning the tree, that is, replacing subtrees with nodes [17]. Other popular strategies to limit the complexity of the tree are usually applied while the tree is grown and include setting:

  • A maximum depth for the tree

  • A minimum number of samples required to be at a leaf node

  • A minimum number of samples required to split a given partition

  • A maximum number of leaf nodes

  • A maximum number of features considered (instead of all the features) to find the best split

  • A minimum impurity decrease to split an internal node
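Most of the regularization options listed above correspond directly to hyperparameters of scikit-learn's DecisionTreeClassifier, as in the following sketch (the dataset and the hyperparameter values are arbitrary illustrations):

```python
# A minimal sketch mapping the regularization strategies above to
# scikit-learn hyperparameters; values are arbitrary.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(
    criterion="gini",           # impurity criterion ("entropy" is also available)
    max_depth=4,                # maximum depth of the tree
    min_samples_leaf=5,         # minimum number of samples required at a leaf node
    min_samples_split=10,       # minimum number of samples required to split a node
    max_leaf_nodes=16,          # maximum number of leaf nodes
    min_impurity_decrease=0.0,  # minimum impurity decrease required to split a node
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())
```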

11.2 Random Forest

One limitation of decision trees is their simplicity. Decision trees tend to use a small fraction of the features in their decision function. In order to use more features in the decision tree, growing a larger tree is required, but large trees tend to suffer from overfitting, that is, having a low bias but a high variance. One solution to decrease the variance without much increasing the bias is to build an ensemble of trees with randomness, hence the name random forest [18]. An overview of random forests can be found in Box 5.

To obtain trees that are not perfectly correlated (and thus actually different from each other), each tree is built using only a subset of the training samples obtained with random sampling. Moreover, for each decision node of each tree, only a subset of the features is considered to find the best split.

The final prediction is obtained by averaging the predictions of each tree. For classification tasks, the predicted class is either the most commonly predicted class (hard-voting) or the one with the highest mean probability estimate (soft-voting) across the trees. For regression tasks, the predicted value is usually the mean of the predicted values across the trees.

Box 5: Random Forest

  • Random forest: ensemble of decision trees with randomness introduced to build different trees

  • Decision tree: algorithm containing only conditional statements and represented with a tree

  • Regularization: maximum depth for each tree, minimum number of samples required to split a given partition, etc.

11.3 Extremely Randomized Trees

Even though random forests involve randomness in sampling both the samples and the features, trees inside a random forest tend to be correlated, thus limiting the variance decrease. In order to further decrease the variance of the model (while possibly increasing its bias) by growing less correlated trees, extremely randomized trees introduce more randomness [19]. Instead of looking for the best split among all the candidate (feature, threshold) pairs, one threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is chosen as the splitting rule.
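As an illustration, the sketch below compares a random forest with extremely randomized trees in scikit-learn; the synthetic dataset, number of trees, and cross-validation setting are arbitrary choices:

```python
# A minimal sketch comparing a random forest with extremely randomized trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_informative=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
extra = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
print("extra trees:  ", cross_val_score(extra, X, y, cv=5).mean())
```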

12 Clustering

So far, we have presented classic machine learning methods for classification and regression, which are the main components of supervised learning. Each input x(i) had an associated output y(i). In this section, we present clustering, which is an unsupervised machine learning task. In unsupervised learning, only the inputs x(i) are available, with no associated outputs. As the ground truth is not available, the objective is to extract information from the input data without supervising the learning process with the output data.

Clustering consists in finding groups of samples such that:

  • Samples from the same group are similar.

  • Samples from different groups are different.

For instance, clustering can be used to identify disease subtypes for heterogeneous diseases such as Alzheimer’s disease and Parkinson’s disease.

In this section, we present two of the most common clustering methods: the k-means algorithm and the Gaussian mixture model.

12.1 k-means

The k-means algorithm divides a set of n samples, denoted by \( \mathcal{X} \), into a set of k disjoint clusters, each denoted by \( {\mathcal{X}}_j \), such that \( \mathcal{X}=\left\{{\mathcal{X}}_1,\dots, {\mathcal{X}}_k\right\} \).

Figure 20 illustrates the concept of this algorithm. Each cluster \( {\mathcal{X}}_j \) is characterized by its centroid, denoted by μj, that is, the mean of the samples in this cluster:

$$ {\boldsymbol{\mu}}_j=\frac{1}{\mid {\mathcal{X}}_j\mid}\sum \limits_{{\boldsymbol{x}}^{(i)}\in {\mathcal{X}}_j}{\boldsymbol{x}}^{(i)} $$

The centroids fully define the set of clusters because each sample is assigned to the cluster whose centroid is the closest.

Fig. 20

Illustration of the k-means algorithm. The objective of the algorithm is to find the centroids that minimize the within-cluster sum-of-squares criterion. In this example, the inertia is approximately equal to 184.80 and is the lowest possible inertia, meaning that the represented centroids are optimal

The k-means algorithm aims at finding centroids that minimize the inertia, also known as within-cluster sum-of-squares criterion:

$$ \underset{\left\{{\boldsymbol{\mu}}_1,\dots, {\boldsymbol{\mu}}_k\right\}}{\min}\sum \limits_{j=1}^k\sum \limits_{{\boldsymbol{x}}^{(i)}\in {\mathcal{X}}_j}\parallel {\boldsymbol{x}}^{(i)}-{\boldsymbol{\mu}}_j{\parallel}_2^2 $$

The original algorithm used to find the centroids is often referred to as Lloyd’s algorithm [20] and is presented in Algorithm 1. After initializing the centroids, the following two steps are repeated until convergence (i.e., until the centroids are identical for two consecutive iterations):

  1. The assignment step, where the clusters are updated based on the current centroids

  2. The update step, where the centroids are updated based on the current clusters

When clusters are well-defined, a point from a given cluster is likely to stay in this cluster. Therefore, the assignment step can be sped up thanks to the triangle inequality by keeping track of lower and upper bounds for distances between points and centers, at the cost of higher memory usage [21].

Algorithm 1 Lloyd’s algorithm (aka naive k-means algorithm)
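Algorithm 1 itself is not reproduced here; the following NumPy sketch of the two-step loop is one possible minimal implementation, assuming the centroids are initialized by drawing k training points at random and ignoring the empty-cluster corner case:

```python
# A minimal NumPy sketch of Lloyd's algorithm; the initialization and the
# stopping rule are simple illustrative choices.
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each sample joins the cluster of its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: each centroid becomes the mean of the samples in its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence
            break
        centroids = new_centroids
    inertia = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, inertia
```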

Even though the k-means algorithm is one of the simplest and most used clustering methods, it has several downsides that should be kept in mind.

First, the number of clusters k is a hyperparameter. Setting a value much different from the actual number of clusters may yield poor clusters.

Second, the inertia is not a convex function. Although Lloyd’s algorithm is guaranteed to converge, it may converge to a local minimum that is not a global minimum. Figure 21 illustrates the convergence to such centroids. Several strategies are often applied to address this issue, including sophisticated centroid initialization [22] and running the algorithm numerous times and keeping the best run (i.e., the one yielding the lowest inertia).

Fig. 21
Five panels with inertias of 184.80, 623.67, 953.91, 952.08, and 613.62.

Illustration of the convergence of the k-means algorithm to bad local minima. In the upper figure, the algorithm converged to a global minimum because the inertia is equal to the minimum possible value (184.80); thus, the obtained clusters are optimal. In the four other figures, the algorithm converged to local minima that are not global minima because the inertias are higher than the minimum possible value; thus, the obtained clusters are suboptimal

Third, inertia makes the assumption that the clusters are convex and isotropic. The k-means algorithm may yield poor results when this assumption does not hold, such as with elongated clusters or manifolds with irregular shapes.

Fourth, the Euclidean distance tends to be inflated (i.e., the ratio of the distances of the nearest and farthest neighbors to a given target is close to 1) in high-dimensional spaces, making inertia a poor criterion in such spaces [23]. One can alleviate this issue by running a dimensionality reduction method such as principal component analysis prior to the k-means algorithm.

12.2 Gaussian Mixture Model

A mixture model makes the assumption that each sample is generated from a mixture of several independent distributions.

Let k be the number of distributions. Each distribution Fj is characterized by its probability of being picked, denoted by πj, and its density pj with parameters θj, denoted by pj(⋅; θj). Let Δ = (Δ1, …, Δk) be a vector-valued random variable such that:

$$ \sum \limits_{j=1}^k{\Delta}_j=1\kern1em \mathrm{and}\kern1em \forall j\in \left\{1,\dots, k\right\},\kern0.3em P\left({\Delta}_j=1\right)=1-P\left({\Delta}_j=0\right)={\pi}_j $$

and (x1, …, xk) be independent random variables such that xj ∼ Fj. The samples are assumed to be generated from a random variable x with density px such that:

$$ {\displaystyle \begin{array}{cc}\mathrm{x}=\sum \limits_{j=1}^k{\Delta}_j{\mathrm{x}}_j& \\ {}\forall \boldsymbol{x}\in \mathcal{X},\kern0.3em {p}_{\mathrm{x}}\left(\boldsymbol{x},\boldsymbol{\theta} \right)=\sum \limits_{j=1}^k{\pi}_j{p}_j\left(\boldsymbol{x};{\boldsymbol{\theta}}_j\right)& \end{array}} $$

A Gaussian mixture model is a particular case of a mixture model in which each distribution Fj is a Gaussian distribution with mean vector μj and covariance matrix Σj:

$$ \forall j\in \left\{1,\dots, k\right\},\kern0.3em {F}_j=\mathcal{N}\left({\boldsymbol{\mu}}_j,{\boldsymbol{\Sigma}}_j\right) $$

Figure 22 illustrates the learned distributions from a Gaussian mixture model.

Fig. 22

Gaussian mixture model. For each estimated distribution, the mean vector and the ellipse consisting of all the points within one standard deviation of the mean are plotted

The objective is to find the parameters θ that maximize the likelihood, with \( \boldsymbol{\theta} =\left({\left\{{\boldsymbol{\mu}}_j\right\}}_{j=1}^k,{\left\{{\boldsymbol{\Sigma}}_j\right\}}_{j=1}^k,{\left\{{\pi}_j\right\}}_{j=1}^k\right) \):

$$ L\left(\boldsymbol{\theta} \right)=\prod \limits_{i=1}^n{p}_X\left({\boldsymbol{x}}^{(i)};\boldsymbol{\theta} \right) $$

For computational reasons, it is easier to maximize the log-likelihood:

$$ \log \left(L\left(\boldsymbol{\theta} \right)\right)=\sum \limits_{i=1}^n\log \left({p}_X\left({\boldsymbol{x}}^{(i)};\boldsymbol{\theta} \right)\right)=\sum \limits_{i=1}^n\log \left(\sum \limits_{j=1}^k{\pi}_j{p}_j\left(\boldsymbol{x};{\boldsymbol{\theta}}_j\right)\right) $$

Because the density pX(⋅; θ) is a weighted sum of Gaussian densities, the expression cannot be further simplified.

In order to solve this maximization problem, an algorithm called expectation-maximization (EM) is often applied [24]. Algorithm 2 describes the main concepts of this algorithm. After initializing the parameters of each distribution, a two-step loop is repeated until convergence (when the parameters are stable over consecutive loops):

  • The expectation step, in which the probability for each sample x(i) to have been generated from distribution Fj is computed

  • The maximization step, in which the probability and the parameters of each distribution are updated

Because it is impossible to know which samples have been generated by each distribution, it is also impossible to directly maximize the log-likelihood, which is why we compute its expected value using the posterior probabilities, hence the name expectation step. The second step simply consists in maximizing the expected log-likelihood, hence the name maximization step.

Algorithm 2 Expectation-maximization algorithm for Gaussian mixture models
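Algorithm 2 is likewise available off the shelf: scikit-learn's GaussianMixture fits a Gaussian mixture model with the EM algorithm, as in the sketch below (the synthetic data and the number of components are illustrative choices):

```python
# A minimal sketch fitting a Gaussian mixture model with the EM algorithm
# as implemented in scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most probable distribution for each sample
soft_labels = gmm.predict_proba(X)  # posterior probability of each distribution
print(gmm.weights_, gmm.means_.shape, gmm.covariances_.shape)
```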

Lloyd’s and EM algorithms have a lot of similarities. In the first step, the assignment step assigns each sample to its closest cluster, whereas the expectation step computes the probability for each sample to have been generated from each distribution. In the second step, the update step computes the centroid of each cluster as the mean of the samples in a given cluster, while the maximization step updates the probability and the parameters of each distribution as a weighted average over all the samples. For these reasons, the k-means algorithm is often referred to as a hard-voting clustering method, as opposed to the Gaussian mixture model which is referred to as a soft-voting clustering method.

The Gaussian mixture model has several advantages over the k-means algorithm.

First, the use of normal distribution densities instead of Euclidean distances mitigates the inflation issue in high-dimensional spaces. Second, the Gaussian mixture model includes covariance matrices, allowing for clusters with elliptical shapes, while the k-means algorithm only includes centroids, forcing clusters to have circular shapes.

Nonetheless, the Gaussian mixture model also has several drawbacks, sharing a few with the k-means algorithm.

First, the number of distributions k is a hyperparameter. Setting a value much different from the actual number of clusters may yield poor clusters. Second, the log-likelihood is not a concave function. Like Lloyd’s algorithm, the EM algorithm is guaranteed to converge, but it may converge to a local maximum that is not a global maximum. Several strategies are often applied to address this issue, including sophisticated centroid initialization [22] and running the algorithm numerous times and keeping the best run (i.e., the one yielding the highest log-likelihood). Third, the Gaussian mixture model has more parameters than the k-means algorithm. Therefore, it usually requires more samples to accurately estimate its parameters (in particular the covariance matrices) than the k-means algorithm.

13 Dimensionality Reduction

Dimensionality reduction consists in finding a good mapping from the input space into a space of lower dimension. Dimensionality reduction can either be unsupervised or supervised.

13.1 Principal Component Analysis

For exploratory data analysis, it may be interesting to investigate the variances of the p features and the \( \frac{1}{2}p\left(p-1\right) \) covariances or correlations. However, as the value of p increases, this process becomes increasingly tedious. Moreover, each feature may explain a small proportion of the total variance. It may be more desirable to have another representation of the data in which a small number of features explain most of the total variance, in other words to have a coordinate system adapted to the input data.

Principal component analysis (PCA) consists in finding a representation of the data through principal components [25]. The principal components are a sequence of unit vectors such that the ith vector is the best approximation of the data (i.e., maximizing the explained variance) while being orthogonal to the first i − 1 vectors.

Figure 23 illustrates principal component analysis when the input space is two-dimensional. On the upper figure, the training data in the original space is plotted. Both features explain about the same amount of the total variance, although one can clearly see that both features are strongly correlated. Principal component analysis identifies a new Cartesian coordinate system based on the input data. On the lower figure, the training data in the new coordinate system is plotted. The first dimension explains much more variance than the second dimension.

Fig. 23
Upper panel axes: feature 1 (52.49% of explained variance) and feature 2 (47.51%); lower panel axes: dimension 1 (94.55%) and dimension 2 (5.45%).

Illustration of principal component analysis. On the upper figure, the training data in the original space (blue points with black axes) is plotted. Both features explain about the same amount of the total variance, although one can clearly see a linear pattern. Principal component analysis learns a new Cartesian coordinate system based on the input data (red axes). On the lower figure, the training data in the new coordinate system is plotted (green points with red axes). The first dimension explains much more variance than the second dimension

13.1.1 Full Decomposition

Mathematically, given an input matrix \( \boldsymbol{X}\in {\mathbb{R}}^{n\times p} \) that is centered (i.e., the mean value of each column X:,j is equal to zero), the objective is to find a matrix \( \boldsymbol{W}\in {\mathbb{R}}^{p\times p} \) such that:

  • W is an orthogonal matrix, i.e., its columns are unit vectors and orthogonal to each other.

  • The new representation of the input data, denoted by T, consists of the coordinates in the Cartesian coordinate system induced by W (whose columns form an orthogonal basis of \( {\mathbb{R}}^p \) with the Euclidean dot product):

    $$ \boldsymbol{T}=\boldsymbol{XW} $$
  • Each column of W maximizes the explained variance.

Each column wi = W:,i is a principal component. Each input vector x is transformed into another vector t using a linear combination of each feature with the weights from the W matrix:

$$ \boldsymbol{t}={\boldsymbol{x}}^{\top}\boldsymbol{W} $$

The first principal component w(1) is the unit vector that maximizes the explained variance:

$$ {\displaystyle \begin{array}{l}{\boldsymbol{w}}_1=\underset{\parallel \boldsymbol{w}\parallel =1}{\arg\ \max}\left\{\sum \limits_{i=1}^n{\left({\boldsymbol{x}}^{(i)\top}\boldsymbol{w}\right)}^2\right\}\\ {}\kern2em =\underset{\parallel \boldsymbol{w}\parallel =1}{\arg\ \max}\left\{\parallel \boldsymbol{Xw}{\parallel}_2^2\right\}\\ {}\kern2em =\underset{\parallel \boldsymbol{w}\parallel =1}{\arg\ \max}\left\{{\boldsymbol{w}}^{\top }{\boldsymbol{X}}^{\top}\boldsymbol{Xw}\right\}\\ {}{\boldsymbol{w}}_1=\underset{\boldsymbol{w}\in {\mathbb{R}}^p}{\arg\ \max}\left\{\frac{{\boldsymbol{w}}^{\top }{\boldsymbol{X}}^{\top}\boldsymbol{Xw}}{{\boldsymbol{w}}^{\top}\boldsymbol{w}}\right\}\end{array}} $$

As \( {\boldsymbol{X}}^{\top }\boldsymbol{X} \) is a positive semi-definite matrix, a well-known result from linear algebra is that w(1) is the eigenvector associated with the largest eigenvalue of \( {\boldsymbol{X}}^{\top }\boldsymbol{X} \).

The kth component is found by subtracting the first k − 1 principal components from X:

$$ {\hat{\boldsymbol{X}}}_k=\boldsymbol{X}-\sum \limits_{s=1}^{k-1}\boldsymbol{X}{\boldsymbol{w}}^{(s)}{\boldsymbol{w}}^{(s)\top } $$

and then finding the unit vector that explains the maximum variance from this new data matrix:

$$ {\boldsymbol{w}}_k=\underset{\parallel \boldsymbol{w}\parallel =1}{\arg\ \max}\left\{\parallel {\hat{\boldsymbol{X}}}_k\boldsymbol{w}\parallel \right\}=\underset{\boldsymbol{w}\in {\mathbb{R}}^p}{\arg\ \max}\left\{\frac{{\boldsymbol{w}}^{\top }{\hat{X}}_k^{\top }{\hat{\boldsymbol{X}}}_k\boldsymbol{w}}{{\boldsymbol{w}}^{\top}\boldsymbol{w}}\right\} $$

One can show that the eigenvector associated with the kth largest eigenvalue of \( {\boldsymbol{X}}^{\top }\boldsymbol{X} \) is a solution of this maximization problem.

Therefore, the matrix W is the matrix whose columns are the eigenvectors of \( {\boldsymbol{X}}^{\top }\boldsymbol{X} \), sorted by descending order of their associated eigenvalues.
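A minimal NumPy sketch of this full decomposition, assuming the input matrix fits in memory and ignoring the sign indeterminacy of eigenvectors, could look as follows:

```python
# A minimal NumPy sketch of the full decomposition: the columns of W are the
# eigenvectors of X^T X sorted by decreasing eigenvalue.
import numpy as np

def pca_full(X):
    X = X - X.mean(axis=0)                    # center each column
    eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigenvalues)[::-1]     # descending order of eigenvalues
    W = eigenvectors[:, order]
    T = X @ W                                 # coordinates in the new system
    explained_variance_ratio = eigenvalues[order] / eigenvalues.sum()
    return T, W, explained_variance_ratio
```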

13.1.2 Truncated Decomposition

Since each principal component iteratively maximizes the remaining variance, the first principal components explain most of the total variance, while the last ones explain a tiny proportion of the total variance. Therefore, keeping only a subset of the ordered principal components usually gives a good representation of the input data.

Mathematically, given a number of dimensions l, the new representation is obtained by truncating the matrix of principal components W to only keep the first l columns, resulting in the submatrix W:,:l:

$$ \overset{\sim}{\boldsymbol{T}}=\boldsymbol{X}{\boldsymbol{W}}_{:,:l} $$

Figure 24 illustrates the use of principal component analysis as dimensionality reduction. The Iris flower dataset consists of 50 samples for each of 3 iris species (setosa, versicolor, and virginica) for which 4 features were measured, the length and the width of the sepals and petals, in centimeters. The projection of each sample on the first two principal components is shown in this figure.

Fig. 24
Axes: dimension 1 (92.46% of explained variance) and dimension 2 (5.31%); one color per species (setosa, versicolor, virginica).

Illustration of principal component analysis as a dimensionality reduction technique. The Iris flower dataset consists of 50 samples for each of 3 iris species (setosa, versicolor, and virginica) for which 4 features were measured, the length and the width of the sepals and petals, in centimeters. The projection of each sample on the first two principal components is shown in this figure. The first dimension explains most of the variance (92.46%)

13.2 Linear Discriminant Analysis

In Subheading 10, we introduced linear discriminant analysis (LDA) as a classification method. However, it can also be used as a supervised dimensionality reduction method. LDA fits a multivariate normal distribution for each class \( {\mathcal{C}}_k \), so that each class is characterized by its mean vector \( {\boldsymbol{\mu}}_k\in {\mathbb{R}}^p \) and has the same covariance matrix \( \Sigma \in {\mathbb{R}}^{p\times p} \). However, a set of k points lies in a space of dimension at most k − 1. For instance, a set of 2 points lies on a line, while a set of 3 points lies on a plane. Therefore, the subspace induced by the k mean vectors μk can be used for dimensionality reduction.

There exists another formulation of linear discriminant analysis which is equivalent and more intuitive for dimensionality reduction. Linear discriminant analysis aims to find a linear projection so that the classes are separated as much as possible (i.e., projections of samples from a same class are close to each other, while projections of samples from different classes are far from each other).

Mathematically, the objective is to find the matrix \( \boldsymbol{W}\in {\mathbb{R}}^{p\times l} \) (with l ≤ k − 1) that maximizes the between-class scatter while also minimizing the within-class scatter:

$$ \underset{\boldsymbol{W}}{\max}\mathrm{tr}\left({\left({\boldsymbol{W}}^{\top }{\boldsymbol{S}}_w\boldsymbol{W}\right)}^{-1}\left({\boldsymbol{W}}^{\top }{\boldsymbol{S}}_b\boldsymbol{W}\right)\right) $$

The within-class scatter matrix Sw summarizes the diffusion between the mean vector μk of class \( {\mathcal{C}}_k \) and all the inputs x(i) belonging to class \( {\mathcal{C}}_k \), over all the classes:

$$ {\boldsymbol{S}}_w=\sum \limits_{k=1}^q\sum \limits_{y^{(i)}={\mathcal{C}}_k}\left[{\boldsymbol{x}}^{(i)}-{\boldsymbol{\mu}}_k\right]{\left[{\boldsymbol{x}}^{(i)}-{\boldsymbol{\mu}}_k\right]}^{\top } $$

The between-class scatter matrix Sb summarizes the diffusion between all the mean vectors:

$$ {\boldsymbol{S}}_b=\sum \limits_{k=1}^q{n}_k\left[{\boldsymbol{\mu}}_k-\boldsymbol{\mu} \right]{\left[{\boldsymbol{\mu}}_k-\boldsymbol{\mu} \right]}^{\top } $$

where nk is the proportion of samples belonging to class \( {\mathcal{C}}_k \) and \( \boldsymbol{\mu} ={\sum}_{k=1}^q{n}_k{\boldsymbol{\mu}}_k=\frac{1}{n}{\sum}_{i=1}^n{\boldsymbol{x}}^{(i)} \) is the mean vector over all the input vectors.

One can show that the W matrix consists of the first l eigenvectors of the matrix \( {\boldsymbol{S}}_w^{-1}{\boldsymbol{S}}_b \) with the corresponding eigenvalues being sorted in descending order. Just as in principal component analysis, the corresponding eigenvalues can be used to determine the contribution of each dimension. However, the criterion for linear discriminant analysis is different from the one for principal component analysis: it maximizes the separability of the classes instead of the explained variance.

Figure 25 illustrates the use of linear discriminant analysis as a dimensionality reduction technique. We use the same Iris flower dataset as in Fig. 24 illustrating principal component analysis. The projection of each sample on the learned two-dimensional space is shown, and one can see that the first (horizontal) axis is more discriminative of the three classes with linear discriminant analysis than with principal component analysis.

Fig. 25
Axes: dimension 1 (99.12%) and dimension 2 (0.88%); one color per species (setosa, versicolor, virginica).

Illustration of linear discriminant analysis as a dimensionality reduction technique. The Iris flower dataset consists of 50 samples for each of 3 iris species (setosa, versicolor, and virginica) for which 4 features were measured, the length and the width of the sepals and petals, in centimeters. The projection of each sample on the learned two-dimensional space is shown in this figure
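The two projections of Figs. 24 and 25 can be reproduced in a few lines with scikit-learn, as in the sketch below (shown only to contrast the unsupervised and supervised reductions):

```python
# A minimal sketch contrasting the unsupervised (PCA) and supervised (LDA)
# two-dimensional projections of the Iris flower dataset.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # ignores the labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses the labels

print(X_pca.shape, X_lda.shape)  # both (150, 2)
```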

14 Kernel Methods

Kernel methods allow for generalizing linear models to non-linear models with the use of kernel functions.

As mentioned in Subheading 8, the main idea of kernel methods is to first map the input data from the original input space to a feature space and then perform dot products in this feature space. Under certain assumptions, an optimal solution of the minimization problem of the cost function admits the following form:

$$ f=\sum \limits_{i=1}^n{\alpha}_iK\left(\cdot, {\boldsymbol{x}}^{(i)}\right) $$

where K is the kernel function which is equal to the dot product in the feature space:

$$ \forall \boldsymbol{x},{\boldsymbol{x}}^{\prime}\in \kern0.3em I,\kern0.3em K\left(\boldsymbol{x},{\boldsymbol{x}}^{\prime}\right)=\phi {\left(\boldsymbol{x}\right)}^{\top}\phi \left({\boldsymbol{x}}^{\prime}\right) $$

As this term frequently appears, we denote by K the n × n symmetric matrix consisting of the evaluations of the kernel on all the pairs of training samples:

$$ \forall i,j\in \left\{1,\dots, n\right\},\kern0.3em {K}_{ij}=K\left({\boldsymbol{x}}^{(i)},{\boldsymbol{x}}^{(j)}\right) $$

In this section, we present the extension of two models previously introduced in this chapter, ridge regression and principal component analysis, with kernel functions.

14.1 Kernel Ridge Regression

Kernel ridge regression combines ridge regression with the kernel trick and thus learns a linear function in the space induced by the respective kernel and the training data [2]. For non-linear kernels, this corresponds to a non-linear function in the original input space.

Mathematically, the objective is to find the function f with the following form:

$$ f=\sum \limits_{i=1}^n{\alpha}_iK\left(\cdot, {\boldsymbol{x}}^{(i)}\right) $$

that minimizes the sum of squared errors with an \( {\ell}_2 \) penalization term:

$$ \underset{f}{\min}\sum \limits_{i=1}^n{\left({y}^{(i)}-f\left({\boldsymbol{x}}^{(i)}\right)\right)}^2+\lambda \parallel f{\parallel}^2 $$

The cost function can be simplified using the specific form of the possible functions:

$$ {\displaystyle \begin{array}{rlll}& \sum \limits_{i=1}^n{\left({y}^{(i)}-f\left({\boldsymbol{x}}^{(i)}\right)\right)}^2+\lambda \parallel f{\parallel}^2& & \\ {}& =\sum \limits_{i=1}^n{\left({y}^{(i)}-\sum \limits_{j=1}^n{\alpha}_jK\left({\boldsymbol{x}}^{(j)},{\boldsymbol{x}}^{(i)}\right)\right)}^2+\lambda {\parallel \sum \limits_{i=1}^n{\alpha}_iK\left(\cdot, {\boldsymbol{x}}^{(i)}\right)\parallel}^2& & \\ {}& =\sum \limits_{i=1}^n{\left({y}^{(i)}-{\boldsymbol{\alpha}}^{\top }{\boldsymbol{K}}_{:,i}\right)}^2+\lambda {\boldsymbol{\alpha}}^{\top}\boldsymbol{K}\boldsymbol{\alpha } & & \\ {}& =\parallel \boldsymbol{y}-\boldsymbol{K}\boldsymbol{\alpha } {\parallel}_2^2+\lambda {\boldsymbol{\alpha}}^{\top}\boldsymbol{K}\boldsymbol{\alpha } & & \\ {}\end{array}} $$

Therefore, the minimization problem is:

$$ \underset{\boldsymbol{\alpha}}{\min}\parallel \boldsymbol{y}-\boldsymbol{K}\boldsymbol{\alpha } {\parallel}_2^2+\lambda {\boldsymbol{\alpha}}^{\top}\boldsymbol{K}\boldsymbol{\alpha } $$

for which a solution is given by:

$$ {\boldsymbol{\alpha}}^{\star }={\left(\boldsymbol{K}+\lambda \boldsymbol{I}\right)}^{-1}\boldsymbol{y} $$

Figure 8 illustrates the prediction function of a kernel ridge regression method with a radial basis function kernel. The prediction function is non-linear as the kernel is non-linear.
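A minimal NumPy sketch of kernel ridge regression with a radial basis function kernel, implementing the closed-form solution above, is given below; the kernel bandwidth and regularization strength are arbitrary choices, and scikit-learn provides an equivalent estimator in sklearn.kernel_ridge.KernelRidge:

```python
# A minimal NumPy sketch of kernel ridge regression with an RBF kernel,
# implementing alpha* = (K + lambda I)^{-1} y.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X_train, y_train, lam=1.0, gamma=1.0):
    K = rbf_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)  # alpha*

def predict_kernel_ridge(X_test, X_train, alpha, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# illustrative usage on a toy one-dimensional regression problem
X_train = np.linspace(0, 6, 50)[:, None]
y_train = np.sin(X_train).ravel()
alpha = fit_kernel_ridge(X_train, y_train, lam=0.1, gamma=1.0)
y_pred = predict_kernel_ridge(X_train, X_train, alpha, gamma=1.0)
```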

14.2 Kernel Principal Component Analysis

As mentioned in Subheading 13, principal component analysis consists in finding the linear orthogonal subspace in the original input space such that each principal component explains the most variance. The optimal solution is given by the first eigenvectors of \( {\boldsymbol{X}}^{\top }\boldsymbol{X} \) with the corresponding eigenvalues being sorted in descending order.

With kernel principal component analysis, the objective is to find the linear orthogonal subspace in the feature space such that each principal component in the feature space explains the most variance [26]. The solution is given by the first l eigenvectors \( {\left({\boldsymbol{\alpha}}_k\right)}_{1\le k\le l} \) of the K matrix with the corresponding eigenvalues being sorted in descending order. The eigenvectors are normalized in order to be unit vectors in the feature space.

Finally, the projection of any input x in the original space on the kth component can be computed as:

$$ \phi {\left(\boldsymbol{x}\right)}^{\top }{\boldsymbol{\alpha}}_k=\sum \limits_{i=1}^n{\alpha}_{ki}K\left(\boldsymbol{x},{\boldsymbol{x}}^{(i)}\right) $$

Figure 26 illustrates the projection of some non-linearly separable classification data with principal component analysis and with kernel principal component analysis with a non-linear kernel. The projected input data becomes linearly separable using kernel principal component analysis, whereas the projected input data using (linear) principal component analysis remains non-linearly separable.

Fig. 26

Illustration of kernel principal component analysis. Some non-linearly separable training data is plotted (top). The projected training data using principal component analysis remains non-linearly separable (middle). The projected training data using kernel principal component analysis (with a non-linear kernel) becomes linearly separable (bottom)
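The following sketch illustrates this behavior with scikit-learn on concentric-circle data similar in spirit to Fig. 26 (the dataset and kernel parameters are illustrative choices, not necessarily those used for the figure):

```python
# A minimal sketch of kernel principal component analysis with an RBF kernel
# on concentric-circle data.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# X_kpca is typically (close to) linearly separable, whereas X_pca is not

print(X_pca.shape, X_kpca.shape)
```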

15 Conclusion

In this chapter, we described the main classic machine learning methods. Due to space constraints, the description of some of them was brief. The reader who seeks more details can refer to [5, 6]. All these approaches are implemented in the scikit-learn Python library [27]. A common point of the approaches presented in this chapter is that they use as input a set of given or pre-extracted features. In contrast, deep learning approaches often provide an end-to-end learning setup within which the features are learned. These techniques are covered in Chaps. 3–6.