Diving Deeper into Models

Pre-requisites to better understand the chapter: knowledge of the major steps and procedures of developing a clinical prediction model.

Logical position of the chapter with respect to the previous chapters: in the previous chapters, you learned how to develop and validate a clinical prediction model, using logistic regression as the main algorithm. However, several different, more complex algorithms can be used to build a clinical prediction model. In this chapter, the main machine learning based algorithms will be presented to you.

Learning objectives: you will be presented with the definitions of machine learning and of supervised and unsupervised learning. The major algorithms for the latter two categories will be introduced.


What Is Machine Learning?
Machine learning is an application of Artificial Intelligence (AI). AI refers to the ability to program computers (or, more generally, machines) to solve complicated and usually very time-consuming tasks [1]. An example of such a task is the extraction of useful information (data mining) from a large amount of unstructured clinical data ('big data').

How Do We Use Machine Learning in Clinical Prediction
Modelling?
As you learned in previous chapters, after having prepared your data, you develop a clinical prediction model based on available data. In the model, particular properties of your data ('features') will be used to predict your outcome of interest. A particular statistical algorithm is used to learn the 'features' that are most representative and relate them to the predicted outcome. In the previous chapter, only the logistic regression algorithm has been presented to you. However, more complex machine learning-based algorithms exist. These algorithms can be divided into two categories: supervised and unsupervised [2].

Supervised Algorithms
These algorithms learn from 'labelled' data to predict future outcomes [3]. To understand what we mean by labelled data, let us consider the following example. Suppose we are building a model that takes as input some clinical data from patients (e.g. age, sex, tumor staging) and aims at predicting whether a patient will be alive (binary outcome) 1 year after receiving treatment. In our training dataset, the clinical outcome (alive or dead after a certain period of time) is available for every patient: these are labelled data. In supervised learning, the analysis starts from a known training dataset, and the algorithm is used to infer predictions. The algorithm compares its output to the 'labels' and modifies itself accordingly to match the expected values.
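The example above can be sketched in a few lines of code. This is a minimal illustration, assuming scikit-learn (the chapter does not prescribe a library), and the patient data below are invented for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled training data: each row is [age, sex (0/1), tumor stage],
# each label is the known 1-year outcome (1 = alive, 0 = dead).
X = np.array([[55, 0, 1], [72, 1, 3], [48, 0, 2], [81, 1, 4],
              [60, 0, 1], [68, 1, 3], [50, 1, 1], [77, 0, 4]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # the 'labels' the algorithm learns from

# Supervised learning: the algorithm adjusts itself to match the labels.
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[58, 0, 2]]))  # prediction for a new, unseen patient
```

Any supervised algorithm discussed in this chapter could replace `LogisticRegression` here; only the labelled outcome vector `y` makes the problem supervised.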

Unsupervised Algorithms
Unsupervised algorithms are used when the training data are not classified or labelled [4]. A common example of unsupervised learning is clustering a population to see whether the clusters share common properties ('features'). This approach is common in marketing analysis, for example to see whether different products can be assigned to different clusters. In summary, the goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about the data. Unsupervised problems can be further divided into two subgroups: 1. Clustering: the goal is to discover groups that share common features; 2. Association: used to discover rules that describe large portions of the data. For example, again from marketing analysis: people in a certain cluster who buy a product X are more likely to also buy a certain product Y.

Semi-supervised Algorithms
We refer to semi-supervised algorithms when the number of input data points is much greater than the number of labelled data points in the training set [5]. A good example is a large dataset of images of which only a few are labelled (e.g. dog, cat). Although this kind of learning problem is not often mentioned, most real-life machine learning classification problems fall into this category, because labelling data is a very time-consuming task. Imagine, for instance, a doctor who has to annotate (i.e. delineate anatomical or pathological structures on) hundreds of patients' scans that you would like to use as your training dataset.
We will now provide an overview of the main algorithms for each presented category.

Support Vector Machines (SVMs)
SVMs can be used for both classification and regression, although they are mostly used in classification problems [6]. SVMs are based on an n-dimensional space, where n is the number of features you would like to use as input. Imagine plotting all your data in this hyperspace, where each point corresponds to the n-dimensional feature vector of one input sample. For example, if you have 100 input samples and 10 features, 100 vectors of dimension 10 will constitute the hyperspace. The SVM then tries to find the hyperplane (or hyperplanes) that separates your data into two (or more) classes.
How do we find the best hyperplane? If we look at Fig. 9.1a, three possible hyperplanes separate the classes. The key principle is: "Choose the hyperplane that maximizes the separation between classes". In Fig. 9.1a it is easy to affirm that the correct answer is line B. However, what should we choose if we look at Fig. 9.1b?
The definition of margin will help us. In SVMs, the margin is defined as the distance between the hyperplane and the nearest data point of either class (called the "support vector"). With this definition in mind, it becomes clear that the best solution in Fig. 9.1b is line C. So far, however, we have only considered problems where the classes were easily separable by linear hyperplanes. What happens if the problem is more complicated, as shown in Fig. 9.1c? Clearly no linear hyperplane can separate the classes, but visually it seems that a hyper-circle might work. This is where the concept of a kernel comes in: an SVM kernel function makes it possible to transform non-linear spaces (Fig. 9.1c) into linear ones (Fig. 9.1a, b). The most common SVM computational packages [7, 8] offer several kernels, from the well-known radial basis function kernel to higher-order polynomials and sigmoid functions.

What are the most important parameters in a SVM?
- Kernel: the kernel is one of the most important parameters to be chosen. SVMs offer both simple and more complicated kernels. Our suggestion for choosing the kernel is to plot the data projected on some feature axes in order to get a visual idea of whether the problem can be solved with a linear kernel. In general, we discourage starting with more complicated kernels, since they can easily lead to a high probability of overfitting. It can be a good idea to start with a quadratic polynomial and then increase the complexity. Please take into consideration that, in general, complexity also increases computational time (and required computational power).
- Soft margin constant (C): the C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
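The effect of the kernel choice can be seen on a small synthetic problem. This sketch assumes scikit-learn's `SVC` and its `make_circles` helper (two concentric classes, a 2-D analogue of Fig. 9.1c); both parameters discussed above appear explicitly.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Concentric circles: no linear hyperplane can separate the two classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same soft-margin constant C, two different kernels.
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(linear.score(X, y))  # poor: the problem is not linearly separable
print(rbf.score(X, y))     # high: the RBF kernel makes the classes separable
```

Swapping `kernel="rbf"` for `kernel="poly"` with `degree=2` would try the quadratic-polynomial starting point suggested above.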

Advantages/disadvantages of SVMs
Advantages
1. SVMs can be a useful tool when the data are non-regular, for example when they are not regularly distributed or have an unknown distribution, thanks to the flexibility of SVM kernels.
2. Due to kernel transformations, non-linear classification problems can also be solved.
3. SVMs deliver a unique solution, since the optimization problem is convex.

Disadvantages
1. SVMs are a non-parametric technique, so the results might lack transparency ('black box'). For example, with a Gaussian kernel each feature has a different importance (e.g. different weights in the kernel), so it is not trivial to find a straightforward relation between the features and the classification output, as can be done with logistic regression.

Random Forests (RF)
RFs belong to the family of algorithms called decision trees. In a decision tree, the goal is to create a prediction model that predicts an output by combining different input variables [9]. Each node of the tree corresponds to one of the input variables, and each leaf represents a value of the target variable given the values of the input variables along the path from the root to the leaf. Why are random forests called random? The term is justified by the fact that the random forest algorithm trains different decision trees on different subsets of the training data. The RF algorithm is depicted in Fig. 9.2: each tree is split using randomly selected features from the data. This randomness generates models that are not correlated with each other.

What are the most important parameters in a RF?
- Maximum features: the maximum number of features that the RF is allowed to try in an individual tree. Note that increasing the maximum number of features usually increases the model's performance, but not always, since it also decreases the diversity of the individual trees in the RF.
- Number of estimators: the number of trees built before taking the maximum vote or the average of the predictions. A higher number of trees gives better performance but makes the code slower. We suggest keeping this parameter as large as is practical to optimize performance.
- Minimum sample leaf size: the leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. Most of the available studies suggest keeping a value larger than 50.
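The three parameters map directly onto constructor arguments in common implementations. This sketch assumes scikit-learn's `RandomForestClassifier` on synthetic data (the small leaf size here is for the toy dataset only, not the value recommended above).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic labelled dataset: 300 samples, 10 features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees built before votes are combined
    max_features="sqrt",  # maximum features each tree may try at a split
    min_samples_leaf=5,   # minimum leaf size (illustrative; small toy dataset)
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```

Increasing `n_estimators` further would mainly cost training time, matching the trade-off described above.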

Advantages
- The chance of overfitting decreases, since several different decision trees are used in the learning procedure; this corresponds to training and combining different models.
- RFs work well when no particular distribution of the data can be assumed; for example, no data normalization is needed.
- Parallelization: the training of the individual trees can be parallelized (for example across different computational slots).

Disadvantages
- RFs might suffer when the training dataset is small.
- Interpretability: an RF is more a predictive tool than a descriptive tool; it is difficult to see or understand the relationship between the response and the independent variables.
- The time to train an RF algorithm might be longer than for other algorithms; moreover, in the case of categorical variables, the time complexity increases exponentially.

Artificial Neural Networks (ANNs)
Finding an agreed definition of ANNs is not easy; in fact, most literature studies only provide graphical representations of ANNs. The most used definition is the one by Haykin [10], who defines an ANN architecture as a massively parallelized combination of very simple learning units that acquire knowledge during training and store that knowledge by updating their connections to the other units.
ANNs have often been compared to biological neural networks, with the activity of biological neurons compared to the 'activity' of the processing elements in ANNs. The process of training an ANN then becomes very similar to the process of learning in our brain: inputs are applied to the neurons, which change their weights during training to produce the most optimized output.
One of the most important concepts in ANNs is their architecture. With the word architecture we mean how the ANN is structured, meaning how many neurons are present and how they are connected.
A typical architecture of ANNs is shown in Fig. 9.3: the input layer is characterized by input neurons, in our case the number of features we would like to input for our model. The output layer corresponds to the desired output of our model. In case of binary classification problems, for example, the output layer will only have two output neurons, but in case of multiple classifications, the number of output neurons can increase up to number of classes. In between, there is a 'hidden layer', where the number of hidden neurons can vary from few to thousands. Sometimes, in more complicated architectures, there might be several hidden layers.
ANNs are also classified according to the flow of the information: 1. Feed-forward neural networks: information travels in one direction only, from input to output. 2. Recurrent neural networks: data flow in multiple directions. These are among the most commonly used ANNs, due to their capability of learning complex tasks such as handwriting or language recognition.
A special architecture worth mentioning is the convolutional neural network (CNN) [11]. The major feature of CNNs is that they can automatically learn significant features: during the training procedure, each layer learns which features are the most representative. However, this can lead to 'deep learning' being used as a 'black box'. It is beyond the scope of this chapter to dive into the properties of CNNs; interested readers are referred to [12, 13].

What are the most important parameters in an ANN?
- Dropout: dropout is a technique to avoid overfitting [14]. We suggest setting the dropout between 20% and 50%.
- Activation function: activation functions are used to introduce non-linearity into the models. The most common activation function is the rectifier, but we suggest using the sigmoid function for binary predictions and the softmax function for multi-class classification.
- Network weight initialization: these are the weights assigned to the connections between neurons at the start of training. The most common approach is to initialize the weights from a uniform distribution.
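A single forward pass through the architecture of Fig. 9.3 can be written out in a few lines. This is a bare NumPy sketch (no training, no library-specific API): uniformly initialized weights, one hidden layer, and a sigmoid output for the binary case, mirroring the parameters discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_inputs, n_hidden, n_outputs = 3, 5, 1  # input, hidden, and output neurons

# Network weight initialization: drawn from a uniform distribution.
W1 = rng.uniform(-0.5, 0.5, size=(n_inputs, n_hidden))
W2 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_outputs))

x = np.array([0.2, 0.7, 0.1])  # one (normalized) input feature vector
hidden = sigmoid(x @ W1)       # hidden-layer activations
output = sigmoid(hidden @ W2)  # sigmoid output for a binary prediction
print(output)
```

Training would consist of repeatedly adjusting `W1` and `W2` (e.g. by back-propagation) so that `output` matches the labels; dropout would randomly silence a fraction of the hidden units at each training step.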

Advantages/disadvantages of ANNs
Advantages
- In principle, every kind of data can be used to feed an ANN. No particular pre-processing of the data is required, although it is still suggested to use normalized data [15]. In addition, due to the complex structure of their architectures, ANNs can capture complex non-linear relationships between the independent and dependent variables.
- Ability to detect all possible interactions between predictor variables: the hidden layer has the power to detect interrelations between all the input variables. For example, when important relations are not modelled directly in a logistic regression model, neural networks are expected to perform better than logistic regression.

Disadvantages
- 'Black box' approach: in a logistic regression model, the developer can verify which variables are most predictive by looking at the coefficients or the odds ratios. Compared to logistic regression, neural networks are black boxes: after setting up the training data and the network parameters, the ANN 'learns' by itself which input variables are the most important, and it is therefore very difficult to determine how the variables contribute to the output. There is interest in the community in developing regression-like techniques to examine the connection weights and the relations between the input features [16].
- Large computational power required: on a standard personal computer, training a network with back-propagation might require hours to months, compared to logistic regression.
- Prone to overfitting: the ability of an ANN to model interactions and non-linearities implicitly carries a risk of overfitting. Suggestions to limit overfitting are: limiting the number of hidden layers and hidden neurons, adding a penalty term for large weights, or limiting the amount of training using cross-validation [17].

K-means
The goal of this algorithm is to find groups (or clusters) in the data. The number of groups is defined by the variable K. The basic idea of the algorithm is to iteratively assign each data point to one of the K groups based on the input features [18]; points are grouped by similarity of their feature values. The outputs of the algorithm are the centroids of the K clusters and the labels of the training data (one per data point).
The algorithm workflow can be summarized as:
- Data assignment step: each data point is assigned to the nearest centroid, based on the squared Euclidean distance.
- Centroid update step: the centroids are recomputed by taking the mean of the data points assigned to each centroid's cluster.
The algorithm iterates between these two steps until no data points change clusters. Please note that the result may be a local optimum rather than the global one. The algorithm workflow is depicted in Fig. 9.4.
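The two alternating steps above can be sketched directly in NumPy for K = 2 (a minimal illustration on synthetic data, not a production implementation; libraries such as scikit-learn provide optimized versions).

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic groups of points, around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

K = 2
centroids = X[[0, 20]].copy()  # initial guesses (often chosen at random)
for _ in range(10):
    # Data assignment step: nearest centroid by squared Euclidean distance.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    # Centroid update step: mean of the points assigned to each cluster.
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centroids)
```

With a different random initialization the loop can settle in a local optimum, which is exactly the caveat noted above.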

What are the most important parameters in k-means?
- Number of clusters K: there is no pre-defined rule to find the optimal K. Our suggestion is to run the algorithm several times with different values of K and compare the results. One of the most common metrics used to compare results is the mean distance between the data points and their corresponding cluster centroids. However, this metric cannot be used as the only indicator: increasing K will always decrease the distance, down to the extreme case where the distance is zero (K = number of data points). We suggest plotting the mean distance as a function of K; the 'elbow point', where the rate of decrease sharply shifts, can then be used to determine K. More advanced methods include the silhouette method [19] and the G-means algorithm [20].
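The elbow heuristic can be sketched as follows, assuming scikit-learn's `KMeans` (whose `inertia_` attribute is the sum of squared distances to the centroids, a close relative of the mean distance discussed above). The synthetic data have three true groups, so the decrease should flatten after K = 3.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups.
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 4, 8)])

# Within-cluster squared distance as a function of K.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))
```

In practice one would plot these values and look for the K after which further increases give only marginal improvement.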

Advantages/disadvantages of k-means
Advantages
- For large amounts of data, K-means is among the fastest unsupervised clustering algorithms; for example, it is faster than hierarchical clustering. However, increasing K may increase the computational time.

Disadvantages
- It is in general difficult to predict the optimal K, and results can be strongly affected by different choices of K.
- If the data are strongly imbalanced (or contain a high number of outliers), the algorithm does not work well.

Hierarchical Clustering
In contrast to k-means, hierarchical clustering starts by assigning each data point to its own cluster. The basic idea of the algorithm is to build a hierarchy of clusters and then assign points to them [21]. The workflow can be summarized as:
- Assign each data point to its own cluster.
- Find the closest pair of clusters (e.g. using the Euclidean distance) and merge them into a single cluster.
- Repeat: compute the distance between the two nearest clusters and combine them, until all items are merged into a single cluster.
What are the most important parameters in hierarchical clustering?
- Number of clusters: again, there is no general recipe for finding the optimal number of clusters. A practical approach is to build a dendrogram [22] and cut it with the horizontal line that can traverse the maximum vertical distance without intersecting a cluster; the number of vertical lines it crosses gives the number of clusters. An example of a dendrogram is shown in Fig. 9.5.
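The merge-until-one-cluster workflow and the subsequent cut can be sketched as follows, assuming SciPy's `scipy.cluster.hierarchy` module (one common implementation; the chapter does not prescribe it).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic groups of ten points each.
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Build the full merge hierarchy (the dendrogram's underlying structure).
Z = linkage(X, method="ward")

# Cut the hierarchy so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` on the same `Z` would draw a tree like the one in Fig. 9.5, where the cut level can be chosen visually.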

Advantages/disadvantages of hierarchical clustering
Advantages
- The dendrogram produced by the algorithm is quite understandable and easy to visualize.

Disadvantages
- Compared to k-means, a time complexity of at least O(n² log n) is required, where n is the number of data points.
- It is much more sensitive to noise and outliers.

Conclusion
- Two major classes of machine learning algorithms exist: supervised and unsupervised learning. The first class is mainly used to predict outcomes from input features; the second is used to cluster 'unlabelled' data.
- There is no recipe for choosing a specific algorithm, and there is no perfect algorithm. You have been presented with the major advantages and disadvantages of all the listed algorithms. It is useful to remember that extreme algorithm complexity can increase the risk of overfitting.
- We recommend focusing on a very careful preparation of the data before building a model (see previous chapters). In fact, a recent review [23] pointed out how much classification algorithms suffer from poor quality of the input data.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.