Abstract
Machine Learning for Phenotyping is composed of three chapters and aims to introduce clinicians to machine learning (ML). It guides the reader through the basic concepts underlying machine learning and the tools needed to implement it easily using the Python programming language and Jupyter notebook documents. It is divided into three main parts: part 1—data preparation and analysis; part 2—unsupervised learning for clustering; and part 3—supervised learning for classification.
Keywords
- Machine learning
- Phenotyping
- Data preparation
- Data analysis
- Unsupervised learning
- Clustering
- Supervised learning
- Classification
- Clinical informatics
1 Learning Objectives and Approach
It is recommended that you follow this chapter using the Jupyter notebook provided at https://github.com/cmsalgado/book_chapter, so that you can experiment with the code on your own. Since we are not implementing most of the algorithms and models ourselves, you will be able to follow the notebook even if you have no prior experience with coding. We will cover a large number of well-known ML algorithms, starting with the simplest (logistic regression and k-nearest neighbors) and ending with more complex ones (random forest), without delving into the underlying mathematical theory in much detail. Instead, we will focus on explaining the intuitions behind the algorithms so that you understand how they learn and how you can use and interpret them for any new problem.
A real-world clinical dataset will be used. It was constructed based on the code provided at https://github.com/YerevaNN/mimic3-benchmarks for the prediction of hospital mortality using data collected during the first two days in the ICU. The data was extracted from the MIMIC-III Clinical Database, a large, publicly available database comprising de-identified electronic health records of approximately 60,000 ICU admissions. The patients stayed in the intensive care unit (ICU) of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III database is described in:
Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data (2016).
Before delving into ML, this chapter presents the data preparation phase, which consists of a series of steps required to transform raw data from electronic health records into structured data that can be used as input to the learning algorithms. This is a crucial part of any ML project. First, it is important to understand what the algorithm is supposed to learn so that you can select appropriate information/variables to describe the problem. In other words, you want to avoid what is frequently referred to as “garbage in, garbage out”. Second, since raw data is usually not handed to you in a format that is ready for ML straight away, chances are that you will need to spend some time preprocessing and analyzing it. You will see that, in the end, your results are highly dependent on decisions made during this part. There is no point in trying to optimize the ML algorithm if you have poor data: your algorithm will not be able to learn. In that scenario, you probably gain more by going back to the data extraction and preprocessing stages and rethinking your decisions; it is better to have good data and a simple algorithm than poor data and a very complex algorithm. In particular, the data preparation phase consists of:
- Exploratory data analysis
- Variable selection
- Identification and exclusion of outliers
- Data aggregation
- Inclusion criteria
- Feature construction
- Data partitioning.
The machine learning phase consists of:
- Patient stratification through unsupervised learning
  - k-means clustering
- Patient classification through supervised learning
  - Feature selection
  - Logistic regression
  - Decision trees
  - Random forest.
2 Prerequisites
In order to run the code provided in this Chapter, you should have Python and the following Python libraries installed:
- NumPy: fundamental package for scientific computing with Python. Provides a fast numerical array structure.
- Pandas: provides high-performance, easy-to-use data structures and data analysis tools.
- Scikit-learn: essential machine learning package in Python. Provides simple and efficient tools for data mining and data analysis.
- matplotlib: basic plotting library.
- seaborn: visualization library based on matplotlib. It provides a high-level interface for drawing statistical graphics.
- IPython: for interactive data visualization.
- Jupyter: for interactive computing.
It is highly recommended that you install Anaconda, which already includes the above packages and more. Additionally, in order to be able to visualize decision trees, you should install pydot and graphviz. Next, we will import the Python libraries that we need for now, as well as the MIMIC-III data.
2.1 Import Libraries and Data
The next example shows the Python code that imports Numpy, Pandas, Matplotlib and IPython libraries:
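A minimal version of these imports, as they might appear in a notebook cell, is sketched below (the aliases follow common convention; the accompanying notebook may differ slightly):

```python
# Standard imports for numerical computing, data handling and plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

# Render matplotlib figures inside the notebook.
%matplotlib inline
```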

The next example shows the Python code that loads the data as a Pandas DataFrame, which is a tabular data structure that makes it easy to do all kinds of manipulations such as arithmetic operations and grouping. The unique ICU stay ID is used as the DataFrame index, to facilitate data manipulation. This is achieved by setting the ‘index_col’ parameter to the name of the index column.
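A sketch of this loading step is shown below; the file name ‘data.csv’ and the index column name ‘icustay’ are assumptions, since the actual names come from the extraction scripts.

```python
# Load the extracted MIMIC-III data, indexed by the unique ICU stay ID.
data = pd.read_csv('data.csv', index_col='icustay')
```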

3 Exploratory Data Analysis
A quick preview of ‘data’ can be obtained using the ‘head’ function, which prints the first 5 rows of any given DataFrame:
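For example:

```python
# Show the first 5 rows of the DataFrame.
data.head()
```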


It tells us that the dataset contains information regarding patient demographics: age, gender, weight, height, mortality; physiological vital signs: diastolic blood pressure, systolic blood pressure, mean blood pressure, temperature, respiratory rate; lab tests: glucose, pH; and scores: Glasgow coma scale. Each observation/row is associated with a time stamp (column ‘hours’), indicating the number of hours elapsed since ICU admission at the time the observation was made. Each ICU stay has several observations for the same variable/column.
We can print the number of ICU stays by calculating the length of the unique indexes, the amount of missing data using the ‘info’ function, and summary statistics using the ‘describe’ function:
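A sketch of these three checks:

```python
# Number of unique ICU stays, column-wise non-null counts, and summary statistics.
print(len(data.index.unique()))
data.info()
data.describe()
```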



The dataset consists of 21,139 unique ICU stays and 1,461,282 observations. All columns with the exception of ‘hours’, ‘mortality’ and ‘day’ have missing information. Looking at the maximum and minimum values it is possible to spot the presence of outliers (e.g. max glucose). Both missing data and outliers are very common in ICU databases and need to be taken into consideration before applying ML algorithms.
4 Variable Selection
In general, you should take into consideration the following criteria when deciding whether to include or exclude variables:
- adding more variables tends to decrease the sample size, because fewer patients are likely to have all of them collected at the same time;
- selecting a high number of variables might bias the dataset towards the selection of a specific group of patients whose characteristics required the measurement of those specific variables;
- variables should be independent, with minimal correlation;
- the number of observations should be significantly higher than the number of variables, in order to avoid the curse of dimensionality.
Rejecting variables with an excessive number of missing values is usually a good rule of thumb. However, it might also reduce predictive power and the ability to detect significant differences. For these reasons, there should be a trade-off between the potential value of the variable in the model and the amount of data available. We already saw the amount of missing data for every column, but we still do not know how much information is missing at the patient level. In order to find out, we are going to aggregate data by ICU stay and look at the number of non-null values, using the ‘groupby’ function together with the ‘mean’ operator. This will give an indication of how many ICU stays have at least one observation for each variable.
Note that one patient might have multiple ICU stays. In this work, for the sake of simplicity, we will consider every ICU stay as an independent sample.
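A sketch of this step, assuming the ICU stay ID is the DataFrame index:

```python
# One row per ICU stay (mean of its observations); the non-null counts then tell us
# how many stays have at least one measurement of each variable.
data_mean = data.groupby(level=0).mean()
data_mean.info()
```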


Based on the previous information, some decisions are made:
- Height can be discarded due to the high amount of missing data;
- Weight and height are typically used in combination (body mass index), since individually they typically provide low predictive power. Therefore, weight was discarded;
- The other variables will be kept. Let us start analyzing the time-variant variables and set aside age and gender for now:
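A possible implementation of these decisions is sketched below; the exact column names (‘height’, ‘weight’, etc.) are assumptions about the extracted dataset:

```python
# Drop the discarded variables and keep a list of the time-variant columns
# ('hours', 'day', 'mortality', 'age' and 'gender' are treated as metadata here).
data = data.drop(columns=['height', 'weight'])
variables = [c for c in data.columns
             if c not in ['hours', 'day', 'mortality', 'age', 'gender']]
```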

These decisions are specific to this case. Other criteria could have been used, for example, checking the correlation between all variables and excluding one variable out of every pair with high correlation, in order to reduce redundancy.
5 Data Preprocessing
5.1 Outliers
5.1.1 Data Visualization
We already saw that outliers are present in the dataset, but we need to take a closer look at the data before deciding how to handle them. Using the ‘seaborn’ library and the ‘boxplot’ function, we can easily create one boxplot for every variable.
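A sketch of the boxplots, assuming the ‘variables’ list defined earlier:

```python
import seaborn as sns

# One boxplot per time-variant variable; separate subplots are used because
# the variables are on very different scales.
fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.flatten(), variables):
    sns.boxplot(y=data[col], ax=ax)
plt.tight_layout()
plt.show()
```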


In some cases, the outliers deviate so much from the norm that it is not even possible to visualize the distribution of the data (minimum, first quartile, median, third quartile, maximum) using boxplots. There are other plot types you can create to investigate the presence of outliers; plotting all points with a scatter plot or using violin plots are two of the options.
5.1.2 Exclusion
Ideally, we should keep extreme values related to the patients’ poor health condition and exclude impossible values (such as negative temperature) and probable outliers (such as heart rate above 250 beats/min). In order to do so, values that fall outside boundaries defined by expert knowledge are excluded. This way, we avoid excluding extreme (but correct/possible) values.
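A sketch of this rule-based exclusion is shown below; the limits are illustrative assumptions, not the ones used in the chapter:

```python
# Replace values outside expert-defined physiological ranges with NaN.
limits = {'heart rate': (0, 250),      # beats/min
          'temperature': (25, 45),     # degrees Celsius
          'glucose': (0, 2000)}        # mg/dL
for col, (low, high) in limits.items():
    if col in data.columns:
        data.loc[(data[col] < low) | (data[col] > high), col] = np.nan
```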


5.1.3 Data Visualization After Outliers Exclusion
The same code can be used to verify the data distribution after the exclusion of outliers. The ‘stripplot’ function allows us to visualize the underlying distribution and the number of observations, and setting x = ‘mortality’ shows the boxplots partitioned by outcome.
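For example (a sketch, reusing the subplot pattern from before):

```python
# Boxplots partitioned by outcome, with the individual observations overlaid.
fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for ax, col in zip(axes.flatten(), variables):
    sns.boxplot(x='mortality', y=col, data=data, ax=ax)
    sns.stripplot(x='mortality', y=col, data=data, color='0.3', size=1, ax=ax)
plt.tight_layout()
plt.show()
```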


5.2 Aggregate Data by Hour
As mentioned before, the dataset contains information regarding the first 2 days in the ICU. Every observation is associated with a time stamp, indicating the number of hours elapsed between ICU admission and the time when the observation was collected (e.g., 0.63 h). To allow for ease of comparison, individual data is condensed into hourly observations by selecting the median value of the available observations within each hour. First, the ‘floor’ operator is applied in order to categorize the hours into 48 bins (hour 0, hour 1, ..., hour 47). Then, the ‘groupby’ function with the ‘median’ operator is applied in order to get the median value of each variable (e.g., heart rate) for each hour of each ICU stay:
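A sketch of the hourly aggregation:

```python
# Bin the time stamps into whole hours (0..47) and take the median per ICU stay and hour.
data['hour'] = np.floor(data['hours']).astype(int)
data_hourly = data.groupby([data.index, 'hour']).median()
```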

The ‘groupby’ operation will create as many indexes as there are grouping keys. In order to facilitate the next operations, a single index is desirable. In the next example, the second index (the ‘hour’ column) is removed from the index and kept as a DataFrame column. Note that the first index corresponds to level 0 and the second index to level 1. Therefore, in order to exclude the second index and keep it as a column, ‘level’ should be set to 1 and ‘drop’ to False.
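A sketch:

```python
# Move the 'hour' level (level 1) of the MultiIndex back to a regular column.
data_hourly = data_hourly.reset_index(level=1, drop=False)
```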

The next example shows the vital signs for a specific ICU stay (ID = 200001). Consecutive hourly observations are connected by lines.
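A sketch of such a plot; the set of columns plotted is an assumption:

```python
# Hourly observations for a single ICU stay, one subplot per variable.
stay = data_hourly.loc[200001]
stay.plot(x='hour', y=variables, subplots=True, figsize=(10, 14), marker='o')
plt.xlabel('hours since ICU admission')
plt.show()
```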


5.3 Select Minimum Number of Observations
We decided to keep all the time-variant variables available. However, as you can see in the previous example, since not all variables have an hourly sampling rate, a lot of information is missing (coded as NaN). In order to train ML algorithms it is important to decide how to handle the missing information. Two options are to replace the missing information with some value or to exclude it. In this work, we will avoid introducing bias resulting from replacing missing values with estimated values (which is not to say that this is never a good option). Instead, we will focus on a complete case analysis, i.e., we will include in our analysis only those patients who have complete information.
Depending on how we create the feature set, complete information can have different meanings. For example, if we want to use one observation for every hour, complete information means no missing data from t = 0 to t = 47, which would lead to the exclusion of the majority of the data. In order to reduce the size of the feature space, one common approach is to use only some portions of the time series. This is the strategy followed in this work. Summary statistics, including the mean, maximum, minimum and standard deviation, will be used to extract relevant information from the time series. In this case, it is important to define the minimum length of the time series before starting to select portions of it. One possible approach is to use all patients who have at least one observation per variable. Since the summary statistics have little meaning if only one observation is available, a threshold of two observations is used.
In the following function, setting ‘min_num_meas = 2’ means that we are selecting ICU stays where each variable was recorded at least once at two different hours. Again, we are using the ‘groupby’ function to aggregate data by ICU stay, and the ‘count’ operator to count the number of observations for each variable. We then exclude ICU stays where any variable was recorded fewer than two times. Section 9.7 will show how to extract features from the time series.
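A sketch of this selection step; the details of the original function are assumptions:

```python
# Keep only ICU stays in which every variable was observed in at least
# 'min_num_meas' different hours.
min_num_meas = 2
counts = data_hourly.groupby(level=0)[variables].count()
valid_stays = counts[(counts >= min_num_meas).all(axis=1)].index
data_hourly = data_hourly.loc[valid_stays]
print('ICU stays kept:', len(valid_stays))
```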


It is always important to keep track of the size of the data while making decisions about inclusion/exclusion criteria. We started with a database of around 60,000 ICU stays, imported the fraction that satisfied certain criteria (a total of 21,140 ICU stays), and are now looking at 6,931 ICU stays.
6 Data Analysis
6.1 Pairwise Plotting
One of the most used techniques in exploratory data analysis is the pairs plot (also called scatterplot matrix). This technique allows us to see both the distribution of single variables and the relationships between every pair of variables. It is easily implemented in Python using the ‘seaborn’ library. The next example shows how to plot the pairwise relationships between variables, together with the histograms of single variables, partitioned by outcome (survival vs non-survival). The argument ‘vars’ is used to indicate the set of variables to plot and ‘hue’ to indicate the use of different markers for each level of the ‘hue’ variable. A subset of the data is used: ‘dropna(axis = 0, how = ‘any’)’ excludes all rows containing missing information.
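A sketch of the pairwise plot; the subset of variables passed to ‘vars’ is an assumption:

```python
# Pairwise relationships and per-variable histograms, partitioned by outcome.
subset = data_hourly.dropna(axis=0, how='any')
sns.pairplot(subset,
             vars=['systolic blood pressure', 'mean blood pressure',
                   'heart rate', 'temperature', 'oxygen saturation'],
             hue='mortality', diag_kind='hist')
plt.show()
```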


Unfortunately, this type of plot only allows us to see relationships in a 2D space, which in most cases is not enough to find any patterns or trends. Nonetheless, it can still tell us important things about the data: if not by showing promising directions for data analysis, then by providing a means to check the data’s integrity. Things to highlight are:
- Hypoxic patients generally have lower SBPs;
- SBP correlates with MAP, which is a nice test of the data’s integrity;
- Fever correlates with increasing tachycardia, also as expected.
6.2 Time Series Plotting
In order to investigate time trends, it is useful to visualize the mean HR partitioned by outcome from t = 0 to t = 47. In order to easily perform this task, the DataFrame needs to be restructured.
The next function takes as input a pandas DataFrame and the name of a variable, and pivots the DataFrame so that columns correspond to time (t = 0 to t = 47) and rows correspond to ICU stays. If ‘filldata’ is set to 1, the function fills missing information using the forward fill method, where each NaN is replaced by the value preceding it. In case no value is available for forward fill, NaNs are replaced by the next value in the time series. The ‘fillna’ function, with the method argument set to ‘ffill’ and ‘bfill’ respectively, allows us to easily perform these two actions. If ‘filldata’ is set to 0, no missing data imputation is performed.
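A sketch of such a function; its name and exact signature are assumptions:

```python
def pivot_variable(df, variable, filldata=0):
    """Pivot one variable into an (ICU stays x hours 0..47) table."""
    pivoted = df.pivot_table(index=df.index, columns='hour', values=variable)
    if filldata == 1:
        # Forward fill along time, then backward fill any leading gaps.
        pivoted = pivoted.fillna(method='ffill', axis=1)
        pivoted = pivoted.fillna(method='bfill', axis=1)
    return pivoted
```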

The next script plots the hourly average of every variable in the dataset, partitioned by outcome. At this point, ‘filldata’ is set to 0. In the section named Time Series Clustering of Part 2 of this tutorial, ‘filldata’ will be set to 1 in order to perform clustering of time series.
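A sketch of the plotting loop, using the pivoting helper above:

```python
# Mean value per hour, partitioned by outcome, for every time-variant variable.
mortality = data_hourly.groupby(level=0)['mortality'].first()
for variable in variables:
    pivoted = pivot_variable(data_hourly, variable, filldata=0)
    plt.plot(pivoted[mortality == 0].mean(axis=0), label='survivors')
    plt.plot(pivoted[mortality == 1].mean(axis=0), label='non-survivors')
    plt.title(variable)
    plt.xlabel('hours since ICU admission')
    plt.legend()
    plt.show()
```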



The physiological deterioration or improvement over time is very different between survivors and non-survivors. While the pairwise plot did not show any differences between the groups, this type of plot reveals very clear differences. Several observations can be made:
- Diastolic BP
  - higher in the survival group;
  - rapidly decreasing during the first 10 h, especially in the survival group, and increasing at a lower rate thereafter;
- Glasgow coma scale
  - higher in the survival group, increasing over time;
  - steady around 8 in the non-survival group;
  - similar between both groups at admission, but diverging thereafter;
- Glucose
  - decreasing over time in both groups;
- Heart rate
  - lower in the survival group;
- Mean BP
  - similar to diastolic BP;
- Oxygen saturation
  - higher in the survival group;
  - low variation from t = 0 to t = 48 h;
- Respiratory rate
  - lower in the survival group, slowly increasing over time;
  - steady around 20 in the non-survival group;
- Systolic BP
  - similar to diastolic and mean BP;
- Temperature
  - low variation from t = 0 to t = 48 h;
  - slightly increasing during the first 10 h;
- pH
  - increasing over time in both groups;
  - pH < 7.35 (associated with metabolic acidosis) during the first 10 h in the non-survival group.
Most of these graphs have fairly interesting trends, but we would not consider the oxygen saturation or temperature graphs to be clinically relevant.
7 Feature Construction
The next step before ML is to extract relevant features from the time series. As already mentioned, the complete time series could be used for ML; however, missing information would have to be filled in or excluded. Also, using the complete time series would result in 48 h × 11 variables = 528 features, which would make the models difficult to interpret and could lead to overfitting. There is a simpler solution: use only a portion of the available information, ideally the most relevant information for the prediction task.
Feature construction addresses the problem of finding the transformation of variables containing the greatest amount of useful information. In this chapter, simple operations will be used to construct/extract important features from the time series:
- Maximum
- Minimum
- Standard deviation
- Mean
Other summary statistics or time-series snapshots could have been used, for example, the median, time elapsed between maximum and minimum, time elapsed between baseline and maximum, difference between baseline and minimum, and others. Alternatively, other techniques can be used for dimensionality reduction, such as principal component analysis (PCA) and autoencoders described in the previous Chapter.
The maximum, minimum, standard deviation and mean summarize the worst, best, variability and average of the patient’s condition from t = 0 to t = 47 h. In the proposed exercises you will do this for each day separately, which will increase the dataset dimensionality but hopefully allow the extraction of more useful information. Using the ‘groupby’ function to aggregate data by ICU stay, together with the ‘max’, ‘min’, ‘std’ and ‘mean’ operators, these features can be easily extracted:
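A sketch of this feature construction step; the handling of the ‘hour’ and ‘mortality’ columns and the column renaming are assumptions:

```python
# Summary statistics per ICU stay; suffixes identify the statistic.
grouped = data_hourly[variables].groupby(level=0)
features = pd.concat([grouped.max().add_suffix('_max'),
                      grouped.min().add_suffix('_min'),
                      grouped.std().add_suffix('_std'),
                      grouped.mean().add_suffix('_mean')], axis=1)
# Keep the outcome alongside the features.
features['mortality'] = data_hourly.groupby(level=0)['mortality'].first()
```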




A DataFrame containing one row per ICU stay was obtained, where each column corresponds to one feature. We are one step closer to building the models. Next, we are going to add the time invariant information—age and gender—to the dataset.
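A sketch, assuming ‘age’ and ‘gender’ are columns in the original DataFrame:

```python
# Time-invariant demographics: one value per ICU stay, joined onto the features.
demographics = data.groupby(level=0)[['age', 'gender']].first()
features = features.join(demographics)
```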


8 Data Partitioning
In order to assess the performance of the models, data can be divided into training, test and validation sets, as exemplified in Fig. 9.1. This is known as the holdout validation method. The training set is used to train/build the learning algorithm; the validation (or development) set is used to tune parameters, select features, and make other decisions regarding the learning algorithm; and the test set is used to evaluate the performance of the algorithm, but not to make any decisions regarding its architecture or parameters.
For simplicity, data is divided into two sets, one for training and another for testing. Later, when performing feature selection, the training set will be divided into two sets, for training and validation.
Scikit-learn is the essential machine learning package in Python, providing simple and efficient tools for data mining and data analysis. The next example shows how to use the ‘train_test_split’ function from the ‘sklearn’ library to randomly assign observations to each set. The size of the sets can be controlled using the ‘test_size’ parameter, which defines the size of the test set and which in this case is set to 20%. When using the ‘train_test_split’ function, it is important to set the ‘random_state’ parameter so that the same results can be reproduced later.
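A sketch of the split; the variable names and the value of ‘random_state’ are arbitrary choices here:

```python
from sklearn.model_selection import train_test_split

# Separate features (X) from the outcome (y) and hold out 20% of the data for testing.
X = features.drop(columns='mortality')
y = features['mortality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
```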

It is useful to create a function that prints the size of data in each set:
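For example:

```python
def print_set_sizes(y_train, y_test):
    """Report the number of observations and non-survivors in each set."""
    print('Training set: {} observations, {} non-survivors'.format(
        len(y_train), int(y_train.sum())))
    print('Test set: {} observations, {} non-survivors'.format(
        len(y_test), int(y_test.sum())))

print_set_sizes(y_train, y_test)
```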

In cases where the data is highly imbalanced, it might be a good option to force an oversampling of the minority class, or an undersampling of the majority class, so that the model is not biased towards the majority class. This should be performed on the training set only, whereas the test set should maintain the class imbalance found in the original data, so that when evaluating the final model a true representation of the data is used.
For the purpose of facilitating clustering interpretability, undersampling is used. However, as a general rule of thumb, and unless the dataset contains a huge number of observations, oversampling is preferred over undersampling because it allows keeping all the information in the training set. In any case, selecting learning algorithms that account for class imbalance might be a better choice.
The next example shows how to undersample the majority class, given a desired size of the minority class, controlled by the parameter ‘perc_class1’. If ‘perc_class1’>0, undersampling is performed in order to have a balanced training set. If ‘perc_class1’ = 0, no balancing is performed.
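A sketch of such a function; the exact semantics of ‘perc_class1’ in the chapter’s notebook may differ slightly:

```python
def undersample(X_train, y_train, perc_class1=0.5, random_state=10):
    """Undersample the majority class (class 0) so that class 1 makes up
    'perc_class1' of the training set; if perc_class1 == 0, do nothing."""
    if perc_class1 > 0:
        minority_idx = y_train[y_train == 1].index
        # Number of majority samples so that class 1 reaches the desired fraction.
        n_majority = int(len(minority_idx) * (1 - perc_class1) / perc_class1)
        majority_idx = (y_train[y_train == 0]
                        .sample(n=n_majority, random_state=random_state).index)
        keep = minority_idx.union(majority_idx)
        return X_train.loc[keep], y_train.loc[keep]
    return X_train, y_train

X_train_bal, y_train_bal = undersample(X_train, y_train, perc_class1=0.5)
```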


In the following “Part 2 - Unsupervised Learning with Clustering”, clustering will be used to identify patterns in the dataset.
Acknowledgements
This work was supported by the Portuguese Foundation for Science and Technology, through IDMEC, under LAETA, project UID/EMS/50022/2019 and LISBOA-01-0145-FEDER-031474 supported by Programa Operacional Regional de Lisboa by FEDER and FCT.