Machine Learning for Patient Stratification and Classification Part 2: Unsupervised Learning with Clustering

Machine Learning for Phenotyping is composed of three chapters and aims to introduce clinicians to machine learning (ML). It guides the reader through the basic concepts underlying machine learning and the tools needed to easily implement it using the Python programming language and Jupyter notebook documents. It is divided into three main parts: part 1, data preparation and analysis; part 2, unsupervised learning for clustering; and part 3, supervised learning for classification.

In this part we will, as an example, perform clustering of time series and use the information gained with clustering to train predictive models. The k-means clustering algorithm is described next.

Cluster assignment
Assign each observation x_i to the nearest cluster. For every i do:

c_i = argmin_j ‖x_i − μ_j‖²,  j = 1, 2, ..., K,

where c_i denotes the cluster assigned to observation x_i and μ_j denotes the centroid of cluster j.

Cluster updating
Update the cluster centroids μ_j. For every j do:

μ_j = (1/N_j) Σ_{k=1}^{N_j} x_{j,k},

where N_j is the number of observations assigned to cluster j, k = 1, 2, ..., N_j, and x_{j,k} represents observation k assigned to cluster j. Each new centroid corresponds to the mean of the observations assigned in the previous step.

Exemplification with 2D Data
Although pairwise plots did not reveal any interesting patterns, some clusters might have emerged after the data were transformed. You can re-run the code for pairwise plots between transformed features, but note that it will be time consuming due to the high dimensionality of the dataset. Features 'max mean BP' and 'mean heart rate' were chosen for illustrative purposes. The dataset is plotted below:

In [26]:
x1 = 'max mean BP'
x2 = 'mean heart rate'
sns.lmplot(x1, x2, data_transf_inv, hue="mortality", fit_reg=False);

The number of clusters (K) must be provided before running k-means. It is not easy to guess the number of clusters just by looking at the previous plot, but for the purpose of understanding how the algorithm works, 3 clusters are used. As usual, the 'random_state' parameter is predefined. The specific value does not matter; fixing it simply guarantees that repeated runs always produce the same results.
The next example shows how to perform k-means using 'sklearn'. The attribute 'labels_' indicates to which cluster each observation belongs, and 'cluster_centers_' gives the coordinates of the cluster centers, i.e., the mean of all observations in each cluster. Using these two attributes it is possible to plot the cluster centers and the data in each cluster, using different colors to distinguish the clusters. The algorithm is simple enough to be implemented in a few lines of code. If you want to see how the centers converge after a number of iterations, you can use the code below, which is an implementation of the k-means clustering algorithm step by step.
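A minimal sketch of the 'sklearn' call (on synthetic data, since the chapter's dataframe is not reproduced here; variable names other than 'labels_' and 'cluster_centers_' are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the transformed features: three well-separated blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 1.0, (30, 2)) for loc in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
labels = kmeans.labels_            # cluster index assigned to each observation
centers = kmeans.cluster_centers_  # mean of the observations in each cluster
```

With 'labels' one can color each observation by cluster, and 'centers' can be overlaid on the same scatter plot.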
In [29]:
# The following code was adapted from http://jonchar.net/notebooks/k-means/
import time
from IPython import display

K = 3

def initialize_clusters(points, k):
    """Initializes clusters as k randomly selected coordinates."""
    return points[np.random.randint(points.shape[0], size=k)]

def get_distances(centroid, points):
    """Returns the distance between a centroid and the observations."""
    return np.linalg.norm(points - centroid, axis=1)

Please refer to the online material in order to visualize the plot. It shows the position of the cluster centers at each iteration, until convergence to the final centroids. The trajectory of the centers depends on the cluster initialization; because the initialization is random, the centers might not always converge to the same position.
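The iteration loop left to the online material can be sketched as follows. This is a self-contained approximation using helpers of the same shape as 'initialize_clusters' and 'get_distances', run on synthetic data, not the original notebook code:

```python
import numpy as np

def initialize_clusters(points, k):
    """Initializes clusters as k randomly selected coordinates."""
    return points[np.random.randint(points.shape[0], size=k)]

def get_distances(centroid, points):
    """Returns the distance between a centroid and the observations."""
    return np.linalg.norm(points - centroid, axis=1)

np.random.seed(42)
# Synthetic 2-D data: two separated blobs
points = np.vstack([np.random.normal(m, 1.0, (25, 2)) for m in (0, 8)])
K = 2
centroids = initialize_clusters(points, K)

for _ in range(50):
    # Cluster assignment: distance of every observation to every centroid
    distances = np.column_stack([get_distances(c, points) for c in centroids])
    classes = distances.argmin(axis=1)
    # Cluster updating: each centroid becomes the mean of its assigned points
    # (an empty cluster keeps its previous centroid)
    new_centroids = np.array([points[classes == j].mean(axis=0)
                              if np.any(classes == j) else centroids[j]
                              for j in range(K)])
    if np.allclose(new_centroids, centroids):  # centers stopped moving
        break
    centroids = new_centroids
```

Plotting 'centroids' inside the loop reproduces the animation described in the text.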

Time Series Clustering
Time series analysis revealed distinct and interesting patterns across survivors and non-survivors. Next, k-means clustering is used to investigate patterns in time series. The goal is to stratify patients according to their evolution in the ICU, from admission to t = 48 h, for every variable separately. Note that at this point we are back to working with time series information instead of constructed features.
For this particular task and type of algorithm, it is important to normalize data for each patient separately. This will allow a comparison between time trends rather than a comparison between the magnitude of observations. In particular, if the data is normalized individually for each patient, clustering will tend to group together patients that (for example) started with the lowest values and ended up with the highest values, whereas if the data is not normalized, the same patients might end up in different clusters because of the magnitude of the signal, even though the trend is similar.
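One way to realize this per-patient scaling is a row-wise min-max normalization; below is a sketch on a hypothetical dataframe where rows are ICU stays and columns are hours since admission (the numbers are illustrative):

```python
import pandas as pd

# Hypothetical time series: rows are ICU stays, columns are hours since admission
ts = pd.DataFrame({0: [80.0, 120.0], 1: [90.0, 110.0], 2: [100.0, 100.0]},
                  index=['stay_1', 'stay_2'])

# Scale each patient (row) to [0, 1] using that patient's own min and max,
# so that clustering compares time trends rather than magnitudes
row_min = ts.min(axis=1)
row_range = ts.max(axis=1) - row_min
ts_norm = ts.sub(row_min, axis=0).div(row_range, axis=0)
```

After scaling, the rising trend of 'stay_1' and the falling trend of 'stay_2' become mirror images, even though their raw magnitudes differ.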
Missing data is filled forward, i.e., each missing value is replaced with the value preceding it (the last known value at that point in time). If there is no information preceding a missing value, it is replaced by the following value.
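With pandas, this corresponds to a forward fill followed by a backward fill (a toy series; the values are illustrative):

```python
import numpy as np
import pandas as pd

hr = pd.Series([np.nan, 72.0, np.nan, np.nan, 80.0])
# Forward fill: carry the last known value forward in time
filled = hr.ffill()
# A leading gap has no preceding value, so fill it backward from the next value
filled = filled.bfill()
```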
In [30]:
# Now we are going to pivot the table in order to have rows corresponding to unique
# ICU stays and columns corresponding to hour since admission.
# This will be used for clustering.
def clustering(variable, ids_clustering, K, *args):
    """Return data for clustering, labels attributed to training observations
    and, if *args is provided, labels attributed to test observations."""
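The pivot step described in the comment can be sketched as follows (a hypothetical long-format table; the column names are ours, not necessarily those of the chapter's dataset):

```python
import pandas as pd

# Hypothetical long-format table: one row per (ICU stay, hour) measurement
long = pd.DataFrame({'icustay_id': [1, 1, 2, 2],
                     'hour': [0, 1, 0, 1],
                     'heart rate': [80.0, 85.0, 70.0, 72.0]})

# Pivot so that rows are unique ICU stays and columns are hours since admission
wide = long.pivot(index='icustay_id', columns='hour', values='heart rate')
```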

Visual Inspection of the Best Number of Clusters for Each Variable
In the next example, k-means clustering is performed for the Glasgow coma scale (GCS), for a varying number of clusters (K). Only the training data is used to identify the clusters. The figures show, by order of appearance: cluster centers; percentage of non-survivors in each cluster; and cluster centers together with the training data in each cluster. The goal of plotting cluster centers, mortality distribution and data in each cluster is to visually inspect the quality of the clusters. Another option would be to use quantitative methods, typically known as cluster validity indices, that automatically give the "best" number of clusters according to some criteria (e.g., cluster compactness, cluster separation). Some interesting findings are:
• K = 2: shows two very distinct patterns, similar to what was found by partitioning by mortality; however, we probably want more stratification.
• K = 3: two groups where GCS is improving with time and one group where GCS is deteriorating. This is reflected in our ground truth labels, even though we did not provide that information to the clustering: mortality is above 60% in one cluster versus 30% and 28% in the other two clusters.
• K = 5: clusters 2 and 4 have similar patterns and similar mortality distributions, with GCS improving over time; clusters 3 and 5 have similar mortality distributions, with GCS slowly increasing or decreasing over time; cluster 1 is the "worst" cluster, with mortality close to 70%.
In summary, every K from 2 to 5 gives an interesting view of the evolution of GCS and its relation with mortality. For the sake of simplicity, this analysis is only shown for GCS. You can investigate on your own the cluster tendency for other variables and decide what is a good number of clusters for all of them. For now, the following K is used for each variable:

Training and Testing
In this work, cluster labels are used to add another layer of information to the machine learning models. During the training phase, clustering is performed on the time series from the training set: cluster centers are created and training observations are assigned to clusters. During the test phase, test observations are assigned to one of the clusters defined in the training phase. Each observation is assigned to the most similar cluster, i.e., to the cluster whose center is at the smallest distance. Test observations are not used to identify cluster centers.
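This train/test behavior can be sketched with 'sklearn', whose 'predict' method assigns new observations to the nearest center learned during 'fit' (synthetic data; the chapter itself uses the 'clustering' function defined earlier):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic training data: two well-separated blobs
rng = np.random.RandomState(0)
train = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in (0, 5)])
test = np.array([[0.1, -0.2], [5.2, 4.9]])  # one point near each blob

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(train)
cluster_labels_train = kmeans.labels_
# Test observations go to the nearest center; they do not move the centers
cluster_labels_test = kmeans.predict(test)
```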
The next example trains/creates and tests/assigns clusters using the 'clustering' function previously defined. Cluster labels are stored in 'cluster_labels_train' and 'cluster_labels_test'. Clustering allowed us to stratify patients according to their physiological evolution during the first 48 h in the ICU. Since cluster centers reflect the cluster tendency, it is possible to investigate the relationship between distinct physiological patterns and mortality and ascertain whether the relationship is expected. For example, cluster 4 and cluster 5 in glucose are more or less symmetric: in cluster 4, patients start with low glucose, which increases over time until it decreases again; in cluster 5, patients start with high glucose, which decreases over time until it increases again. In the first case, mortality is approximately 45% and in the second case it is approximately 30%. Although this is obviously not enough to predict mortality, it highlights a possible relationship between the evolution of glucose and mortality. If a certain patient has a pattern of glucose similar to cluster 4, there may be more reason for concern than if they express the pattern in cluster 5.
By now, some particularities of the type of normalization performed can be noted: • It hinders interpretability; • It allows the algorithm to group together patients that did not present significant changes in their physiological state through time, regardless of the absolute value of the observations.
We have seen how clustering can be used to stratify patients, but not how it can be used to predict outcomes. Predictive models that use the information provided by clustering are investigated next. Models are created for the extracted features together with cluster information. This idea is represented in Fig. 10.1.

Normalization
Normalization, or scaling, is used to ensure that all features lie between a given minimum and maximum value, often between zero and one. The maximum and minimum values of each feature should be determined during the training phase and the same values should be applied during the test phase.
The next example is used to normalize the features extracted from the time series. Normalization is useful when solving, for example, least squares problems or problems involving the calculation of distances. Contrary to what was done in clustering, the data is normalized across all patients together and not for each patient individually, i.e., the maximum and minimum values used for scaling are those found in the entire training set.
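A sketch of this scaling with 'sklearn', where the minimum and maximum are learned from the training set only (a synthetic feature matrix; the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[5.0, 40.0]])

scaler = MinMaxScaler().fit(X_train)      # learn min/max from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test values can fall outside [0, 1]
```

Note that a test value larger than the training maximum scales to a value above one, which is expected and should not be "fixed" by refitting on the test set.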

Concatenate Predicted Clustering Labels with Extracted Features
In the next example, the 'get_dummies' function from 'pandas' is used to get dummy variables for the cluster labels obtained through k-means. The idea is to use binary cluster labels, i.e., features indicating "yes/no belongs to cluster k", as input to the models. This will provide an extra level of information regarding the clinical temporal evolution of the patient in a multidimensional space. You can add a 'drop_first' parameter to the 'get_dummies' function to indicate whether to exclude one category, i.e., whether to get k-1 dummies out of k categorical levels by removing the first level. Because we will perform feature selection later, this option does not need to be selected. The dataset is now composed of a mixture of summary statistics obtained through simple operations and clustering. Cluster labels are prefixed with 'CL'. For example, 'CL 0.0' corresponds to cluster 1, 'CL 1.0' to cluster 2 and so on.
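A sketch of this step on hypothetical cluster labels (the 'CL' prefix follows the chapter's naming; the label values are illustrative):

```python
import pandas as pd

cluster_labels = pd.Series([0.0, 1.0, 2.0, 1.0])
# One binary "belongs to cluster k" column per cluster label
dummies = pd.get_dummies(cluster_labels, prefix='CL', prefix_sep=' ')
# Passing drop_first=True would instead keep k-1 of the k columns
```

The resulting columns 'CL 0.0', 'CL 1.0', 'CL 2.0' can be concatenated with the extracted features before model training.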
In the following "Part 3-Supervised Learning", classification models will be created in order to predict mortality.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.