## 1 Clustering

Clustering is a learning task that aims to decompose a given set of observations into subgroups (clusters) based on data similarity, such that observations in the same cluster are more closely related to each other than observations in different clusters. It is an unsupervised learning task, since it identifies structures in unlabeled datasets, and a classification task, since it can give a label to observations according to the cluster they are assigned to. For a more detailed description of supervised and unsupervised learning please refer to the previous chapter.

This work focuses on the following questions:

• Can we identify distinct patterns even if the class labels are not provided?

• How are the different patterns represented across different outcomes?

To address these questions, we start by describing the basic concepts underlying k-means clustering, which is the best-known and simplest clustering algorithm. We show how the algorithm works using 2D data as an example, perform clustering of time series, and use the information gained with clustering to train predictive models. The k-means clustering algorithm is described next.

### 1.1 K-means Clustering Algorithm

Consider a (training) dataset composed of $$N$$ observations:

\begin{aligned} x_1, x_2,..., x_N \end{aligned}

Initialize K centroids $$\mu _1, \mu _2, ..., \mu _K$$ randomly.

Repeat until convergence:

1. Cluster assignment

Assign each $$x_i$$ to the nearest cluster. For every $$i$$ do:

\begin{aligned} c_i = \underset{j}{\operatorname{arg\,min}}\left\| x_i-\mu _j \right\| ^2, \end{aligned}

where $$c_i$$ is the index of the cluster assigned to $$x_i$$ and $$j=1,2,...,K$$.

2. Cluster updating

Update the cluster centroids $$\mu _j$$. For every $$j$$ do:

\begin{aligned} \mu _j = \frac{1}{N_j}\left[ x_1^j + x_2^j + ... + x_{N_j}^j \right] , \end{aligned}

where $$N_j$$ is the number of observations assigned to cluster $$j$$, $$k=1,2, ..., N_j$$, and $$x_k^j$$ represents observation $$k$$ assigned to cluster $$j$$. Each new centroid corresponds to the mean of the observations assigned in the previous step.
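The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's original listing: the function name `kmeans`, the choice of initializing the centroids as K randomly drawn observations, the stopping rule, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, K, max_iter=100, seed=0):
    """Plain k-means sketch; returns centroids, labels and the centroid history."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centroids as K distinct observations drawn at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    history = [centroids.copy()]
    for _ in range(max_iter):
        # Step 1, cluster assignment: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2, cluster updating: mean of the observations assigned to each cluster.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(K):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        history.append(centroids.copy())
        if converged:
            break
    return centroids, labels, history


# Synthetic 2D data: two groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels, history = kmeans(X, K=2)
print(centroids)
print(len(history) - 1, "iterations")
```

The `history` list holds the centroid positions after each iteration, which is enough to plot how the centers move before they settle.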

### 1.2 Exemplification with 2D Data

Although pairwise plots did not reveal any interesting patterns, some clusters might have emerged after the data were transformed. You can re-run the code for pairwise plots between transformed features, but note that it will be time consuming due to the high dimensionality of the dataset. Features ‘max mean BP’ and ‘mean heart rate’ were chosen for illustrative purposes. The dataset is plotted below:

The number of clusters (K) must be provided before running k-means. It is not easy to guess the number of clusters just by looking at the previous plot, but for the purpose of understanding how the algorithm works 3 clusters are used. As usual, the ‘random_state’ parameter is fixed. The particular value does not matter; fixing it guarantees that repeated runs with the same value always give the same results.

The next example shows how to perform k-means using ‘sklearn’.
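The chapter's own listing and data are not reproduced here, so the following is a minimal sketch on synthetic two-feature data. The data, the `n_init` setting and the `random_state` value are assumptions chosen for illustration; only `KMeans`, `labels_` and `cluster_centers_` come from ‘sklearn’ itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for two features (three loosely separated groups).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[60, 70], scale=3, size=(40, 2)),
    rng.normal(loc=[80, 90], scale=3, size=(40, 2)),
    rng.normal(loc=[100, 110], scale=3, size=(40, 2)),
])

# K must be chosen up front; random_state fixes the random initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2)
kmeans.fit(X)

print(kmeans.labels_[:10])      # cluster index of the first ten observations
print(kmeans.cluster_centers_)  # one row per cluster, one column per feature
```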

The attribute ‘labels_’ gives the labels that indicate to which cluster each observation belongs, and ‘cluster_centers_’ gives the coordinates of cluster centers representing the mean of all observations in the cluster. Using these two attributes it is possible to plot the cluster centers and the data in each cluster using different colors to distinguish the clusters:

The algorithm is simple enough to be implemented using a few lines of code. If you want to see how the centers converge after a number of iterations, you can use the code below, which is an implementation of the k-means clustering algorithm step by step.

Please refer to the online material in order to visualize the plot. It shows the position of the cluster centers at each iteration, until convergence to the final centroids. The trajectory of the centers depends on the cluster initialization; because the initialization is random, the centers might not always converge to the same position.

### 1.3 Time Series Clustering

Time series analysis revealed distinct and interesting patterns across survivors and non-survivors. Next, k-means clustering is used to investigate patterns in time series. The goal is to stratify patients according to their evolution in the ICU, from admission to t = 48 h, for every variable separately. Note that at this point we are back to working with time series information instead of constructed features.

For this particular task and type of algorithm, it is important to normalize data for each patient separately. This will allow a comparison between time trends rather than a comparison between the magnitude of observations. In particular, if the data is normalized individually for each patient, clustering will tend to group together patients that (for example) started with the lowest values and ended up with the highest values, whereas if the data is not normalized, the same patients might end up in different clusters because of the magnitude of the signal, even though the trend is similar.
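To make the effect of per-patient normalization concrete, here is a small sketch. It assumes min-max scaling of each patient's series to the range 0 to 1; the helper name `normalize_per_patient` and the two toy series are hypothetical.

```python
import numpy as np


def normalize_per_patient(series):
    """Min-max scale each row (one patient) to [0, 1] independently."""
    series = np.asarray(series, dtype=float)
    lo = series.min(axis=1, keepdims=True)
    hi = series.max(axis=1, keepdims=True)
    span = hi - lo
    # A constant series has zero range; map it to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (series - lo) / span


# Two hypothetical patients with the same upward trend at different magnitudes.
raw = np.array([[3.0, 4.0, 5.0, 6.0],
                [9.0, 11.0, 13.0, 15.0]])
scaled = normalize_per_patient(raw)
print(scaled)
```

After scaling, both rows are identical, so a distance-based algorithm such as k-means would treat the two patients as having the same trend.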

Missing data is filled forward, i.e., each missing value is replaced with the value preceding it (the last known value at that point in time). If no value precedes a missing value, it is replaced by the next available value.
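With ‘pandas’, this filling strategy can be sketched as a forward fill followed by a backward fill along the time axis. The small table below is made up for illustration, with rows as patients and columns as hours since admission.

```python
import numpy as np
import pandas as pd

# Rows are hypothetical patients, columns are hours since admission.
ts = pd.DataFrame(
    [[np.nan, 7.0, np.nan, 9.0],
     [15.0, np.nan, np.nan, 14.0]],
    columns=[0, 1, 2, 3],
)

# Forward fill along time, then backward fill whatever is still missing at the start.
filled = ts.ffill(axis=1).bfill(axis=1)
print(filled)
```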

#### 1.3.1 Visual Inspection of the Best Number of Clusters for Each Variable

In the next example, k-means clustering is performed for the Glasgow Coma Scale (GCS), for a varying number of clusters (K). Only the training data is used to identify the clusters. The figures show, in order of appearance: cluster centers, percentage of non-survivors in each cluster, and cluster centers and training data in each cluster.

The goal of plotting cluster centers, mortality distribution and data in each cluster is to visually inspect the quality of the clusters. Another option would be to use quantitative methods, typically known as cluster validity indices, that automatically give the “best” number of clusters according to some criteria (e.g., cluster compactness, cluster separation). Some interesting findings are:

• $$K = 2$$

  • shows two very distinct patterns, similar to what was found by partitioning by mortality;

  • but we probably want more stratification.

• $$K = 3$$

  • 2 groups where GCS is improving with time;

  • 1 group where GCS is deteriorating;

  • this is reflected in our ground truth labels, even though we did not provide that information to the clustering: mortality is $$>60\%$$ in one cluster versus $$30\%$$ and $$28\%$$ in the other two clusters.

• $$K = 4$$

  • one more “bad” cluster appears.

• $$K = 5$$

  • Clusters 2 and 4 have similar patterns and similar mortality distribution. GCS is improving with time;

  • Clusters 3 and 5 have similar mortality distribution. GCS is slowly increasing or decreasing with time;

  • Cluster 1 is the “worst” cluster. Mortality is close to $$70\%$$.
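As a complement to visual inspection, a cluster validity index can be computed for each candidate K. The sketch below uses the silhouette score from ‘sklearn’, which is one of several possible indices and is our choice here, not necessarily the one the authors had in mind. The data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three compact, well separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher silhouette means more compact and better separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest silhouette:", best_k)
```

An index like this summarizes compactness and separation only; it says nothing about clinical relevance, so it is best read alongside the plots rather than instead of them.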

In summary, every $$K$$ from 2 to 5 gives an interesting view of the evolution of GCS and its relation with mortality. For the sake of simplicity, this analysis is only shown for GCS. You can investigate on your own the cluster tendency for other variables and decide what is a good number of clusters for all of them. For now, the following K is used for each variable:

#### 1.3.2 Training and Testing

In this work, cluster labels are used to add another layer of information to the machine learning models. During the training phase, clustering is performed on the time series from the training set. Cluster centers are created and training observations are assigned to each cluster. During the test phase, test observations are assigned to one of the clusters defined in the training phase. Each observation is assigned to the most similar cluster, i.e., to the cluster whose center is at the smallest distance. Test observations are not used to identify cluster centers.

The next example trains/creates and tests/assigns clusters using the ‘clustering’ function previously defined. Cluster labels are stored in ‘cluster_labels_train’ and ‘cluster_labels_test’.
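The ‘clustering’ function itself is not reproduced in this section. As a stand-in, the sketch below shows the same train/assign split using ‘KMeans’ from ‘sklearn’ directly, on synthetic normalized series; everything except the variable names ‘cluster_labels_train’ and ‘cluster_labels_test’ is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical normalized time series: rows are patients, columns are time steps.
rising = np.linspace(0, 1, 12)
falling = rising[::-1]
train = np.vstack([rising + rng.normal(0, 0.05, (20, 12)),
                   falling + rng.normal(0, 0.05, (20, 12))])
test = np.vstack([rising + rng.normal(0, 0.05, (5, 12)),
                  falling + rng.normal(0, 0.05, (5, 12))])

# Training phase: cluster centers are learned from the training series only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)
cluster_labels_train = kmeans.labels_

# Test phase: predict() assigns each test series to the nearest training center;
# the centers themselves are not updated.
cluster_labels_test = kmeans.predict(test)
print(cluster_labels_test)
```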

Clustering allowed us to stratify patients according to their physiological evolution during the first 48 h in the ICU. Since cluster centers reflect the cluster tendency, it is possible to investigate the relationship between distinct physiological patterns and mortality and ascertain whether the relationship is expected. For example, cluster 4 and cluster 5 in glucose are more or less symmetric: in cluster 4, patients start with low glucose, which increases over time until it decreases again; in cluster 5, patients start with high glucose, which decreases over time until it increases again. In the first case, mortality is approximately $$45\%$$ and in the second case it is approximately $$30\%$$. Although this is obviously not enough to predict mortality, it highlights a possible relationship between the evolution of glucose and mortality. If a certain patient has a pattern of glucose similar to cluster 4, there may be more reason for concern than if they express the pattern in cluster 5.

By now, some particularities of the type of normalization performed can be noted:

• It hinders interpretability;

• It allows the algorithm to group together patients that did not present significant changes in their physiological state through time, regardless of the absolute value of the observations.

We have seen how clustering can be used to stratify patients, but not how it can be used to predict outcomes. Predictive models that use the information provided by clustering are investigated next. Models are created for the extracted features together with cluster information. This idea is represented in Fig. 10.1.

### 1.4 Normalization

Normalization, or scaling, is used to ensure that all features lie between a given minimum and maximum value, often between zero and one. The maximum and minimum values of each feature should be determined during the training phase and the same values should be applied during the test phase.

The next example is used to normalize the features extracted from the time series.
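A minimal sketch of this kind of scaling is shown below, assuming min-max normalization to the range 0 to 1. The feature values are made up, and the variable names are placeholders rather than those of the original listing.

```python
import pandas as pd

# Hypothetical extracted features for a handful of patients.
X_train = pd.DataFrame({"mean heart rate": [60.0, 80.0, 100.0],
                        "max mean BP": [70.0, 90.0, 130.0]})
X_test = pd.DataFrame({"mean heart rate": [70.0, 110.0],
                       "max mean BP": [100.0, 60.0]})

# Minimum and maximum are determined on the training set only...
train_min = X_train.min()
train_max = X_train.max()

# ...and the same values are applied to both sets.
X_train_norm = (X_train - train_min) / (train_max - train_min)
X_test_norm = (X_test - train_min) / (train_max - train_min)
print(X_test_norm)
```

Because the test set is scaled with the training minimum and maximum, test values can fall slightly outside the 0 to 1 range, as happens for the second test patient here.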

Normalization is useful when solving, for example, least squares problems or when evaluating functions that involve the calculation of distances. Contrary to what was done in clustering, the data is normalized for all patients together and not for each patient individually, i.e., the maximum and minimum values used for scaling are those found in the entire training set.

The next example uses the ‘preprocessing’ package from ‘sklearn’, which performs exactly the same operation:
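One way to do this, sketched here on made-up values, is with ‘MinMaxScaler’: the scaler is fitted on the training set and then applied unchanged to the test set. The array names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices: one row per patient, one column per feature.
X_train = np.array([[60.0, 70.0], [80.0, 90.0], [100.0, 130.0]])
X_test = np.array([[70.0, 100.0], [110.0, 60.0]])

scaler = MinMaxScaler()                       # default feature_range is (0, 1)
X_train_norm = scaler.fit_transform(X_train)  # learns min and max from the training data
X_test_norm = scaler.transform(X_test)        # reuses the training min and max
print(X_test_norm)
```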

### 1.5 Concatenate Predicted Clustering Labels with Extracted Features

In the next example, the ‘get_dummies’ function from ‘pandas’ is used to get dummy variables for the cluster labels obtained through k-means. The idea is to use binary cluster labels, i.e., features indicating “yes/no belongs to cluster k”, as input to the models. This will provide an extra level of information regarding the clinical temporal evolution of the patient in a multidimensional space.

You can add a ‘drop_first’ parameter to the ‘get_dummies’ function to indicate if you want to exclude one category, i.e., whether to get k–1 dummies out of k categorical levels by removing the first level. Because we will perform feature selection, this option does not need to be selected.
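The following sketch shows the idea on a toy example. The feature column, the label values and the choice of `prefix="CL"` with a space separator are assumptions made so that the resulting column names resemble ‘CL 0.0’; the original listing may differ.

```python
import pandas as pd

# Hypothetical extracted feature and k-means labels for one variable.
features = pd.DataFrame({"mean heart rate": [72.0, 95.0, 88.0, 101.0]})
cluster_labels = pd.Series([0.0, 2.0, 1.0, 2.0], name="CL")

# One binary column per cluster: "belongs to cluster k, yes/no".
dummies = pd.get_dummies(cluster_labels, prefix="CL", prefix_sep=" ")
data = pd.concat([features, dummies], axis=1)
print(data.columns.tolist())

# With drop_first=True the first level is removed, leaving k-1 columns.
reduced = pd.get_dummies(cluster_labels, prefix="CL", prefix_sep=" ", drop_first=True)
print(reduced.columns.tolist())
```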

The next example prints the number of observations in the training and test sets, total number of features and a snapshot of the data.

The dataset is now composed of a mixture of summary statistics obtained through simple operations and clustering. Cluster labels are categorized as ‘CL’. For example, ‘CL 0.0’ corresponds to cluster 1, ‘CL 1.0’ to cluster 2 and so on.

In the following “Part 3—Supervised Learning”, classification models will be created in order to predict mortality.