Fundamentals of Clinical Data Science, pp 101–120
Prediction Modeling Methodology
Abstract
In the previous chapter, you have learned how to prepare your data before you start the process of generating a predictive model. In this chapter, you will learn how to make a predictive model using very common regression techniques and how to evaluate the performance of a model. In the next chapter we will then look at more advanced machine learning techniques that have become increasingly popular in recent years.
Keywords
Hypothesis testing; Null hypothesis; p-value; α level; Confidence level; Type I error; Type II error; Linear regression; Logistic regression; Bias-variance tradeoff; Dimensionality reduction; Feature selection; Feature extraction; Regularization; Performance metrics; R-squared; Brier score; Confusion matrix; Prevalence; Accuracy; Positive predictive value; Negative predictive value; Sensitivity; Specificity; F1-score; Receiver operating characteristic; Area under the curve; Model discrimination; Model calibration; Hosmer-Lemeshow test; Calibration plot; Internal validation; External validation; Random split; Monte Carlo cross-validation; k-fold cross-validation; Data leakage; Model validation; Cross-validation
8.1 Statistical Hypothesis Testing
A statistical hypothesis is a statement that can be tested by collecting data and making observations. Before you start data collection and perform your research, you need to formulate your hypothesis. An example hypothesis could be: “If I increase the prescribed radiation dose to the tumor, this will also lead to an increase of side effects in surrounding healthy tissues”. The purpose of statistical hypothesis testing is to find out whether the observations are meaningful or can be attributed to noise or chance.
The null hypothesis (often denoted H_{0}) generally states the currently accepted fact. Often it is formulated in such a way that two measured values have no relation with each other. The alternative hypothesis, H_{1}, states that there is in fact a relation between the two values. Rejecting or disproving the null hypothesis gives support to the belief that there is a relation between the two values.
To quantify the probability that a measured value originates from the distribution stated under the null hypothesis, statistical hypothesis tests are used that produce a p-value (e.g., the Z-test or Student’s t-test). The p-value gives the probability of obtaining a value equal to or more extreme than the observed value if the null hypothesis is true. A high p-value indicates that the observed value is likely under the null assumption; conversely, a low p-value indicates that the observed value is unlikely given the null hypothesis, which can lead to its rejection.

Common misinterpretations of the p-value to avoid:

The p-value is not the probability that the null hypothesis is true

The p-value is not the probability of falsely rejecting the null hypothesis (type I error, see below)

A low p-value does not prove the alternative hypothesis
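The α level is the threshold, chosen before the analysis, below which a p-value leads to rejection of the null hypothesis. Confidence levels serve a similar purpose as the α level, and by definition the confidence level + α level = 1. So an α level of 0.05 corresponds to a 95% confidence level.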
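As a minimal sketch of how a p-value is obtained and compared with an α level, the snippet below runs a two-sided Z-test using only the Python standard library. The sample mean, the null mean `mu0`, the standard deviation `sigma` and the sample size are made-up values for illustration, and the population standard deviation is assumed to be known.

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value of a Z-test (population standard deviation assumed known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

alpha = 0.05                      # alpha level; confidence level = 1 - alpha
z, p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)
reject_null = p < alpha
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject_null}")
```

With these illustrative numbers the p-value falls just below 0.05, so the null hypothesis would be rejected at a 95% confidence level but not at a 99% confidence level.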
8.1.1 Types of Error
The two types of errors that can be made regarding the acceptance or rejection of the null hypothesis
Null hypothesis decision  Null hypothesis true  Null hypothesis false

Fail to reject  Correct  Type II error (false negative)
Reject  Type I error (false positive)  Correct
8.2 Creating a Prediction Model Using Regression Techniques
8.2.1 Prediction Modeling Using Linear and Logistic Regression
A prediction model tries to stratify patients for their probability of having a certain outcome. The model then allows you to identify patients who have an increased chance of an event, and this may lead to treatment adaptations for the individual patient. For instance, if a patient has an increased chance of a tumor recurrence the doctor may opt for a more aggressive treatment, or, if a patient has a high risk of getting a side effect, a milder treatment might be indicated.
The outcome variable of the prediction model can be anything, e.g., the risk of getting a side effect, the chance of surviving at a certain time point, or the probability of having a tumor recurrence. Outcome variables can be either continuous or categorical. Continuous variables are described by numerical values and regression models are used to predict them, e.g., linear regression. Categorical variables are restricted to a limited number of classes or categories and we use classification models for their prediction. If the outcome has two categories this is referred to as binary classification, and typical techniques are decision trees and logistic regression (somewhat confusingly, this regression method is well suited for classification because of the shape of the logistic function).
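To make the idea of logistic regression as a binary classifier more concrete, here is a minimal sketch that fits a one-feature logistic model by plain gradient descent. The data (a centred dose and whether a side effect occurred), the learning rate and the number of iterations are hypothetical choices; in practice you would normally use an established library rather than this hand-written fit.

```python
import math

def sigmoid(t):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, y, lr=0.1, n_iter=2000):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        b0 -= lr * sum(errors) / n
        b1 -= lr * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1

# Hypothetical data: centred dose (arbitrary units) and whether a side effect occurred
dose = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
side_effect = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(dose, side_effect)
p_low = sigmoid(b0 + b1 * -3.5)
p_high = sigmoid(b0 + b1 * 3.5)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, p(low dose) = {p_low:.2f}, p(high dose) = {p_high:.2f}")
```

The S-shaped logistic function keeps every prediction between 0 and 1, which is why the output can be read as a probability and rounded to a class.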
8.2.2 Software and Courses for Prediction Modeling
There are many different software packages available for generating prediction models, all of them with different advantages and disadvantages. Some packages are code-based and programming skills are required, e.g., Python, R or Matlab. There are integrated development environments available for improved productivity, like RStudio for R, and Spyder for Python. Additionally, they can have rich open-source libraries tailored specifically towards machine learning, for instance Caret for R [3] and Scikit-learn for Python [4]. Other packages have graphical user interfaces and being able to program is not mandatory, like SPSS, SAS or Orange. Some packages are only commercially available, but many are open-source and have a large user base for support.
A non-exhaustive overview of available software packages for prediction modeling and some of their features
Name  Reference  Coding required  Development environments  Open-source  Learn more (books/tutorials)

R  [6]  Yes  RStudio [7]  Yes  [8]
Python  [9]  Yes  Spyder [10], Jupyter notebooks [11]  Yes  [12]
Matlab  [13]  Yes  Matlab  No  [14]
SPSS  [15]  No  N/A  No  [16]
SAS  [17]  No  N/A  Partly (students)  [18]
Orange  [19]  No  Visual workflows  Yes  -
Weka  [20]  No  N/A  Yes  [21]
RapidMiner  [22]  No  Visual workflows  Partly  -
Free online courses for prediction modeling and machine learning
Course  Organizer/link 

Machine learning  Andrew Ng, Stanford University, Coursera 
Machine learning  Tom Mitchell, Carnegie Mellon University 
Learning from data  Yaser Abu-Mostafa, California Institute of Technology
Machine learning  Nando de Freitas, University of Oxford, https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/
8.2.3 A Short Word on Modeling Time-to-Event Outcomes
Many studies are interested not only in predicting a certain outcome, but additionally take into account the time it takes for this outcome to occur. This is referred to as time-to-event analysis and a typical example is survival analysis. Kaplan-Meier curves are widely used for investigation of the influence of categorical variables [23], whereas Cox regression (sometimes called the Cox proportional hazards model) additionally allows the investigation of quantitative variables [24].
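As an illustration of time-to-event analysis, the following sketch computes a Kaplan-Meier survival estimate for a single group of patients. The follow-up times and the censoring indicators are hypothetical, and the function is a bare-bones version of the product-limit estimator without confidence intervals or group comparison.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up (months); 0 marks a censored patient
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored patients never lower the curve themselves, but they do count in the number at risk up to the moment they leave the study.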
8.3 Creating a Model That Performs Well Outside the Training Set
8.3.1 The Bias-Variance Tradeoff
The bias-variance tradeoff explains the difficulty a prediction model has in generalizing beyond the training set, i.e. performing well in an independent test set (also called the out-of-sample performance). The error of a model in an independent test set can be decomposed into a reducible component and an irreducible component. The irreducible component cannot be diminished; it will always be present no matter how well the model is fitted to the training data. The origin of the irreducible error can, for instance, be an unmeasured yet important variable for the outcome that is to be predicted.
The reducible error can be further decomposed into the error due to variance and the error due to bias [2]. The variance is the error due to the amount of overfitting done during model generation. If you use a very flexible algorithm, e.g., an advanced machine learning algorithm with lots of freedom to follow the data points in the training set very closely, this is more likely to overfit the data. The error in the training set will be small, but the error in the test set will be large. Another way to look at this is that a high variance will result in very different models during training if the model is fitted using different training sets.
Bias relates to the error due to the assumptions made by the algorithm that is chosen for model generation. If a linear algorithm is chosen, i.e. a linear relation between the inputs and the outcome is assumed, this may cause large errors (large bias) if the underlying true relation is far from linear. Algorithms that are more flexible (e.g., neural networks) result in less bias since they can match the underlying true but complex relations more closely.

Flexible algorithms have low bias since they can more accurately match the underlying true relation, but have high variance since they are susceptible to overfitting.

Inflexible algorithms have low variance since they are less likely to overfit, but have high bias due to their problems of matching the underlying true relationship.
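The tradeoff can be made visible with a small simulation. The sketch below repeatedly draws training sets from an assumed true relation `true_f` (quadratic, chosen purely for illustration) with added noise, and compares an inflexible model (always predict the training mean) with a flexible one (1-nearest-neighbour) at a single test point. All settings, such as the noise level and the number of repeats, are hypothetical.

```python
import random

def true_f(x):
    return x * x                      # assumed "true" relation, unknown in practice

def simulate(n_train=20, n_repeats=500, x0=0.9, noise_sd=0.3, seed=0):
    rng = random.Random(seed)
    preds_mean, preds_nn = [], []
    for _ in range(n_repeats):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        # Inflexible model: always predict the mean outcome of the training set
        preds_mean.append(sum(ys) / n_train)
        # Flexible model: copy the outcome of the nearest training point (1-NN)
        nearest = min(range(n_train), key=lambda i: abs(xs[i] - x0))
        preds_nn.append(ys[nearest])

    def bias2_and_variance(preds):
        avg = sum(preds) / len(preds)
        bias2 = (avg - true_f(x0)) ** 2
        variance = sum((p - avg) ** 2 for p in preds) / len(preds)
        return bias2, variance

    return bias2_and_variance(preds_mean), bias2_and_variance(preds_nn)

(bias2_mean, var_mean), (bias2_nn, var_nn) = simulate()
print(f"inflexible (mean): bias^2 = {bias2_mean:.3f}, variance = {var_mean:.3f}")
print(f"flexible (1-NN):   bias^2 = {bias2_nn:.3f}, variance = {var_nn:.3f}")
```

In this setup the mean predictor barely changes between training sets but is systematically off at the test point, whereas the nearest-neighbour predictor is right on average but follows the noise of whichever training point happens to be closest.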
8.3.2 Techniques for Making a General Model

Reducing the number of features in a model (dimensionality reduction) has several advantages:

Lowered chance of overfitting and improved model generalization

Increased model interpretability (depending on the method of dimensionality reduction)

Faster computation times and reduced storage needs
There are many useful dimensionality reduction techniques. The first category of methods to consider is that of feature selection, where we limit ourselves to a subset of the most important features prior to model generation. Firstly, if a feature has a large fraction of missing values it is unlikely to be predictive of the outcome and can often be safely removed. In addition, if a feature has zero or near-zero variance, i.e. its values are all highly similar, this again indicates that the feature is likely to be irrelevant. Another simple step is to investigate the inter-feature correlation, e.g., by calculating the Pearson or Spearman correlation matrix. Features that are highly correlated with each other are redundant for predicting the outcome (multicollinearity). Even though a group of highly correlated features may all be predictive of the outcome, it is sufficient to only select a single feature as the others provide no additional information.
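The three simple filters described above can be sketched as follows. The feature table, the feature names and the thresholds (`max_missing`, `min_variance`, `max_corr`) are hypothetical, and keeping the first feature of a correlated group is only one possible convention.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def filter_features(data, max_missing=0.5, min_variance=1e-8, max_corr=0.9):
    """data maps feature name -> list of values, with None marking a missing value."""
    kept = []
    for name, values in data.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            continue                                  # too many missing values
        mean = sum(present) / len(present)
        if sum((v - mean) ** 2 for v in present) / len(present) < min_variance:
            continue                                  # (near-)zero variance
        # Correlation is computed on rows where both features are present
        redundant = False
        for other in kept:
            pairs = [(a, b) for a, b in zip(values, data[other])
                     if a is not None and b is not None]
            if abs(pearson([p[0] for p in pairs], [p[1] for p in pairs])) > max_corr:
                redundant = True
                break
        if not redundant:
            kept.append(name)
    return kept

# Hypothetical feature table
data = {
    "age":          [61, 48, 75, 52, 66, 70],
    "tumor_volume": [12.0, 30.5, 8.2, 25.1, 41.0, 10.3],
    "tumor_diam":   [2.8, 3.9, 2.5, 3.6, 4.3, 2.7],      # nearly redundant with volume
    "lab_value":    [None, None, 1.2, None, None, 0.9],  # mostly missing
    "scanner_id":   [1, 1, 1, 1, 1, 1],                  # zero variance
}
selected = filter_features(data)
print(selected)
```

Here only `age` and `tumor_volume` survive: the diameter is dropped as redundant, the lab value for missingness and the scanner identifier for having no variance.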
Traditionally, further feature selection is then performed by applying stepwise regression. In each step a feature is either added or removed and a regression model is fitted and evaluated based on some selection criterion. There are many choices for the criterion to choose between models, e.g., the Bayesian information criterion or the Akaike information criterion, both of which quantify the measure of fit of models and additionally add a penalty term for complex models comprising more parameters [26]. In forward selection, one starts with no features and the feature that improves the model the most is added to the model. This process is repeated until no significant improvement is observed. In backward elimination, one starts with a model containing all features, and features are removed that decrease the model performance the least, until no features can be removed without significantly decreasing performance.
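For reference, the two selection criteria mentioned above are commonly written as follows, where k is the number of estimated parameters, n the number of observations and \hat{L} the maximized likelihood of the model. The symbols are notation introduced here for illustration; in both cases a lower value indicates a preferable model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both criteria reward goodness of fit through the likelihood term and penalize complexity through k; the BIC penalty grows with the sample size, so it tends to favour smaller models than the AIC on larger datasets.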
With feature selection we limit ourselves to a subset of features that are already present in the dataset and this is a special case of dimensionality reduction [27]. In feature extraction the number of features is reduced by replacing the existing features by fewer artificial features which are combinations of the existing features. Popular techniques for feature extraction are principal component analysis, linear discriminant analysis and autoencoders [25].
More advanced machine learning algorithms often contain embedded methods for reducing model complexity to improve generalizability. An example is regularization where each added feature also comes with an added penalty or cost [8]. The addition of a feature may increase the model performance but, if the added cost is too high, it will not be included in the final model. This effectively performs feature selection and prevents overfitting. The severity of the cost is a hyperparameter that can be tuned (e.g., through cross-validation, see paragraph “Techniques for internal validation”). Popular regularization methods for logistic regression are LASSO (or L1 regularization) [28], ridge (or L2 regularization), or a combination of both using Elastic Net [29]. The main difference between L1 and L2 regularization is that in L1 regularization the coefficients of unimportant parameters shrink to zero, effectively performing feature selection and simplifying the final model.
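The different behaviour of L1 and L2 penalties can be illustrated in a simplified setting. The sketch below assumes orthonormal features and a squared-error loss, in which case the penalized coefficients have closed forms: ridge rescales each unpenalized coefficient, while LASSO applies soft-thresholding. The coefficient values and the penalty strength `lam` are hypothetical, and real logistic regression fits require an iterative solver instead.

```python
def ridge_shrink(beta, lam):
    """L2 (ridge): every coefficient is scaled towards zero but stays non-zero."""
    return beta / (1.0 + lam)

def lasso_shrink(beta, lam):
    """L1 (LASSO): soft-thresholding; coefficients with |beta| <= lam become exactly zero."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical unpenalized coefficients of four features
unpenalized = [2.0, -1.2, 0.3, -0.1]
lam = 0.5
ridge = [ridge_shrink(b, lam) for b in unpenalized]
lasso = [lasso_shrink(b, lam) for b in unpenalized]
print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", [round(b, 3) for b in lasso])
```

The two weakest features are removed entirely by the L1 penalty, while the L2 penalty keeps all four features with smaller coefficients.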
8.4 Model Performance Metrics
8.4.1 General Performance Metrics
The performance of a prediction model is evaluated by the calculation of performance metrics. We want our model to have high discriminative ability, i.e. high probabilities should be predicted for observations having positive classes (e.g., alive after 2 years of treatment) and low probabilities for negative classes (dead after 2 years of treatment). There is no general best performance metric for model evaluation as this depends strongly on the underlying data as well as the intended application of the model.
Often-used overall performance metrics are R-squared measures of goodness of fit (or R^{2}, also called the coefficient of determination). The R^{2} can be interpreted as the amount of variance in the data that is explained by the model (explained variation). Higher R^{2} values correspond to better models. Examples are Cox and Snell’s R^{2} or Nagelkerke’s R^{2}. R-squared values are mainly used in regression models; for classification models it is more appropriate to look at performance metrics derived from the confusion matrix.
Another popular overall performance measure is the Brier score (or mean squared error) and it is defined as the average of the square of the difference between the predictions and observations. A low Brier score indicates that predictions match observations and we are dealing with a good model.
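Both overall metrics are straightforward to compute by hand. The sketch below shows the ordinary R^{2} for a continuous outcome (not the Cox and Snell or Nagelkerke variants) and the Brier score for a binary outcome; the observed values and predictions are made up for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for o, p in zip(outcomes, probabilities)) / len(outcomes)

# Hypothetical continuous outcome and model predictions
r2 = r_squared([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0])
# Hypothetical binary outcome and predicted probabilities
brier = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1])
print(f"R^2 = {r2:.3f}, Brier score = {brier:.3f}")
```

A perfect model would reach an R^{2} of 1 and a Brier score of 0; always predicting a probability of 0.5 gives a Brier score of 0.25.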
8.4.2 Confusion Matrix
Confusion matrix showing predictions and observations. Many useful performance metrics are derived from the values in the confusion matrix
Prediction  Observation: true  Observation: false  Row-derived metric

True  True positive (TP)  False positive (FP)  Positive predictive value (PPV)
False  False negative (FN)  True negative (TN)  Negative predictive value (NPV)
Column-derived metric  Sensitivity (TPR)  Specificity (TNR)
True positives, also called hits, are positive cases that are correctly classified. True negatives are negative cases that are correctly rejected. False positives, or false alarms, are equivalent to a type I error. False negatives, or misses, are equivalent to a type II error.
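The four cells of the confusion matrix can be counted directly from observed outcomes and predicted probabilities. In this sketch the data and the rounding threshold of 0.5 are hypothetical choices.

```python
def confusion_counts(observed, probabilities, threshold=0.5):
    """Count TP, FP, FN, TN after rounding probabilities at the given threshold."""
    tp = fp = fn = tn = 0
    for obs, prob in zip(observed, probabilities):
        predicted = prob >= threshold
        if predicted and obs:
            tp += 1            # hit
        elif predicted and not obs:
            fp += 1            # false alarm (type I error)
        elif not predicted and obs:
            fn += 1            # miss (type II error)
        else:
            tn += 1            # correct rejection
    return tp, fp, fn, tn

observed = [1, 1, 1, 0, 0, 0, 0, 1]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]
tp, fp, fn, tn = confusion_counts(observed, probabilities)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```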
8.4.3 Performance Metrics Derived from the Confusion Matrix
If we want to consider characteristics not of the population but of the prediction model when applied as a clinical test, we can evaluate sensitivity and specificity. Sensitivity, or true positive rate (TPR, sometimes called recall or probability of detection), is the probability that the model makes a positive prediction for an observation that is truly positive, i.e. TP / (TP + FN). It is a measure of avoiding false negatives, i.e. not missing any diseased patients.
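The commonly used metrics follow from the four confusion matrix counts with a few divisions. The sketch below uses the standard definitions; the counts passed to the function are hypothetical, and no guard against empty rows or columns (division by zero) is included.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Common metrics derived from the four confusion matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate, recall
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value, precision
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }

metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are computed per column of the confusion matrix and therefore do not depend on prevalence, whereas PPV and NPV are computed per row and do.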
8.4.4 Model Discrimination: Receiver Operating Characteristic and Area Under the Curve
By evaluating different thresholds for rounding our model predictions, we can determine many sensitivity and specificity pairs. If we plot the sensitivity versus (1 – specificity) for all these pairs, i.e. the true positive rate versus the false positive rate, we obtain the Receiver Operating Characteristic curve (ROC) [30]. This curve can give great insight into model discrimination performance. It allows for determining the optimal sensitivity/specificity pair of a model so that it can support decision making, and also allows comparison of different models with each other.
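A minimal sketch of this procedure: every distinct predicted score is used once as a threshold, each threshold yields one (false positive rate, true positive rate) pair, and the area under the resulting curve is obtained with the trapezoidal rule. The observations and scores are hypothetical.

```python
def roc_curve(observed, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strictest to loosest."""
    n_pos = sum(observed)
    n_neg = len(observed) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for o, s in zip(observed, scores) if s >= threshold and o == 1)
        fp = sum(1 for o, s in zip(observed, scores) if s >= threshold and o == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

observed = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]
roc = roc_curve(observed, scores)
area = auc(roc)
print(f"AUC = {area:.4f}")
```

An area of 0.5 corresponds to random guessing and an area of 1 to perfect separation of the positive and negative observations.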
8.4.5 Model Calibration
Historically, the focus in evaluating model performance has primarily been on discriminative performance, e.g., by calculating R^{2} metrics, confusion matrix metrics and performing ROC/AUC analysis. Model calibration is however as important as discrimination and should always be evaluated and reported. Model calibration refers to the agreement between subgroups of predicted probabilities and their observed frequencies. For example, if we collect 100 patients for which our model predicts 10% chance of having the outcome, and we find that in reality 10 patients actually have the outcome, then our model is well calibrated. Since the predicted probabilities can drive decision-making it is clear that we want the predictions to match the observed frequencies.
A widely used (but not so effective) way of determining model calibration is by performing the Hosmer-Lemeshow test for goodness of fit of logistic regression models. The test evaluates the correspondence between predictions and observations by dividing the probability range [0, 1] into n subgroups. Typically, 10 subgroups are chosen, but this number is arbitrary and can have a big influence on the final p-value of the test.
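The grouping idea behind calibration assessment can be sketched as follows. The function splits the probability range into equal-width subgroups and compares the expected with the observed number of events per subgroup; a Hosmer-Lemeshow-type statistic then sums the squared differences. The data and the choice of two subgroups are hypothetical, equal-width grouping is an assumption (implementations often group by quantiles of predicted risk instead), and the conversion of the statistic into a p-value is left out because it needs a chi-square distribution function.

```python
def calibration_table(observed, probabilities, n_groups=10):
    """Split [0, 1] into equal-width groups; compare expected and observed event counts."""
    groups = [{"n": 0, "expected": 0.0, "observed": 0} for _ in range(n_groups)]
    for obs, prob in zip(observed, probabilities):
        index = min(int(prob * n_groups), n_groups - 1)   # prob == 1.0 goes in the last group
        groups[index]["n"] += 1
        groups[index]["expected"] += prob
        groups[index]["observed"] += obs
    return [g for g in groups if g["n"] > 0]

def hosmer_lemeshow_statistic(table):
    """Sum of (observed - expected)^2 / (n * p * (1 - p)) over the groups."""
    statistic = 0.0
    for g in table:
        mean_prob = g["expected"] / g["n"]
        variance = g["n"] * mean_prob * (1.0 - mean_prob)
        if variance > 0:
            statistic += (g["observed"] - g["expected"]) ** 2 / variance
    return statistic

observed = [0, 0, 0, 1, 1, 1, 0, 1]
probabilities = [0.1, 0.2, 0.2, 0.3, 0.7, 0.8, 0.8, 0.9]
table = calibration_table(observed, probabilities, n_groups=2)
hl = hosmer_lemeshow_statistic(table)
for g in table:
    print(f"n = {g['n']}, expected = {g['expected']:.1f}, observed = {g['observed']}")
print(f"Hosmer-Lemeshow statistic = {hl:.3f}")
```

Plotting the mean predicted probability against the observed frequency per subgroup gives a calibration plot; a well-calibrated model lies close to the diagonal.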
8.5 Validation of a Prediction Model
8.5.1 The Importance of Splitting Training/Test Sets
In the previous paragraphs different metrics for evaluation of model performance have been discussed. As briefly discussed in paragraph “The bias-variance tradeoff” it is important to compute performance metrics not on the training dataset but on data that was not seen during the generation of the model, i.e. a test or validation set. This will ensure that you are not misled into thinking you have a well-performing model, while it may in fact be heavily overfitted on the training data. Overfitting means that the model is trained too well on the training set and starts to follow the noise in the data. This generally happens if we allow too many parameters in the final model. The performance on the training set is good, but on new data the model will fail. Underfitting corresponds to models that are too simplistic and do not follow the underlying patterns in the data, again resulting in poor performance in unseen data.
Properly evaluating your model on new, unseen data gives a realistic estimate of the generalizability of the model. We differentiate between internal validation, where the dataset is split into a training set for model generation and a test set for model validation, and external validation, where the complete dataset is used for model generation and separate/other datasets are available for model validation.
8.5.2 Techniques for Internal Validation
The advantage of k-fold cross-validation is that each data point is used in a test set only once, whereas in Monte Carlo cross-validation it can be selected multiple times (and other points are not selected for a test set at all), possibly introducing bias. The disadvantage of k-fold cross-validation is that it only evaluates a limited number of splits, whereas Monte Carlo cross-validation evaluates as many splits as you desire by increasing the number of iterations (although you could iterate the entire k-fold cross-validation procedure as well, which is commonly called repeated k-fold cross-validation).
Note that in both Monte Carlo cross-validation and k-fold cross-validation we are generating many models instead of a single final model, e.g., because the feature selection algorithm might select different features or the regression produces different coefficients due to different training data. Cross-validation is used to identify the best method (i.e. data preprocessing, algorithm choice etc.) that is to be used to construct your final model. When you have identified the optimal method you can then train your model accordingly on all the available data.
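The difference between the two splitting schemes can be sketched with two small index generators. The function names, the number of samples, the number of folds, the test fraction and the random seeds are illustrative assumptions; the generators only produce train/test indices and leave the model fitting to you.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices); every sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def monte_carlo_splits(n_samples, n_iterations, test_fraction=0.2, seed=0):
    """Yield independent random splits; a sample may be tested several times or never."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

k_fold = list(k_fold_splits(n_samples=10, k=5))
monte_carlo = list(monte_carlo_splits(n_samples=10, n_iterations=3))
times_tested = [sum(i in test for _, test in k_fold) for i in range(10)]
print(times_tested)   # each sample appears in a k-fold test set exactly once
```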
A common mistake in any method where the dataset is split into training and test sets is to allow data leakage to occur [37]. This refers to using any data or information during model generation that is not part of the training set and can result in overfitting and overly optimistic model performance. It can happen for example when you do feature selection on the total dataset before applying the split. In general it is advised to perform any data preprocessing steps after the data has been split and using only information available in the training set.
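A simple way to avoid this kind of leakage is to learn every preprocessing parameter on the training set and then reuse it unchanged on the test set. The sketch below does this for standardization of a single feature; the values and the helper names `fit_scaler` and `apply_scaler` are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Learn standardization parameters from the training set only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def apply_scaler(values, mean, sd):
    """Standardize values with previously learned parameters."""
    return [(v - mean) / sd for v in values]

# Hypothetical feature values, already split into training and test sets
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 30.0]

mean, sd = fit_scaler(train)                 # no test-set information is used here
train_scaled = apply_scaler(train, mean, sd)
test_scaled = apply_scaler(test, mean, sd)   # the test set reuses the training parameters
print(mean, round(sd, 3), [round(v, 3) for v in test_scaled])
```

Computing the mean and standard deviation on the combined data instead would let the unusually high test value of 30.0 influence how the training data are scaled, which is exactly the leakage described above.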
8.5.3 External Validation
The true test of a prediction model is to evaluate its performance under external validation, i.e. on datasets that are separate from the training dataset. Preferably, this is performed on new data acquired from a different institution. It will indicate the generalizability of the model and show whether it is overfitted on the training data. If this can be performed on multiple external validation sets, this further strengthens the acceptance of the prediction model under evaluation.
It has to be noted that if the datasets intended to be used for external validation are collected by the same researchers that built the original prediction model, this is still not an independent validation. Independent external validation, by other researchers, is the ultimate test of the model generalizability. This requires open and transparent reporting of the prediction model, of inclusion and exclusion criteria for the training cohort and of data preprocessing steps. Additionally, it is encouraged to make the training data publicly available as this allows other researchers to verify your methodology and results and greatly improves reproducibility.
8.6 Summary Remarks
8.6.1 What Has Been Learnt
In this chapter you have learnt about the importance of the bias-variance tradeoff in prediction modeling applications. You have learnt how to generate a simple logistic regression model and what metrics are available to evaluate its performance. It is important not to limit the evaluation to model discrimination only, but to include calibration as well. Finally, we have discussed the importance of separating training and test sets so that we protect ourselves from overfitting. Internal validation strategies such as cross-validation have been discussed, and the ultimate test of a prediction model, independent external validation, has been emphasized.
8.6.2 Further Reading
The field of prediction modeling and machine learning is extremely broad and in this chapter we have only scratched the surface. A good place to start with further reading on the many aspects of prediction modeling is the book “Clinical Prediction Models – A Practical Approach to Development, Validation, and Updating” by Steyerberg [38]. If you are looking to improve your knowledge and simultaneously improve your practical modeling skills the book “An Introduction to Statistical Learning – with Applications in R” by James et al. is highly recommended [8]. Finally, if you want to go in depth and understand the underlying principles of the many machine learning algorithms the go-to book is “The Elements of Statistical Learning – Data Mining, Inference, and Prediction” by Hastie et al. [39].
References
1. Goodman S. A dirty dozen: twelve P-value misconceptions. Semin Hematol. 2008;45(3):135–40.
2. Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S. Hypothesis testing, type I and type II errors. Ind Psychiatry J. 2009;18(2):127–31.
3. Kuhn M, et al. Caret: classification and regression training, 2016.
4. Pedregosa F, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
5. Anaconda distribution: The most popular Python/R data science distribution. [Online]. Available: https://www.anaconda.com/distribution/.
6. R: The R Project for Statistical Computing. [Online]. Available: https://www.rproject.org/.
7. RStudio: Open source and enterprise-ready professional software for R, RStudio. [Online]. Available: https://www.rstudio.com/products/rstudio/.
8. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning – with applications in R. 1st ed. New York: Springer; 2013.
9. Python: The official home of the Python Programming Language. [Online]. Available: https://www.python.org/.
10. Spyder: The Scientific PYthon Development EnviRonment. [Online]. Available: https://pythonhosted.org/spyder/index.html.
11. Jupyter: Open-source web application for live coding, data visualizations, numerical simulation, statistical modeling and more. [Online]. Available: http://jupyter.org/.
12. Müller A, Guido S. Introduction to machine learning with Python. Sebastopol: O’Reilly Media; 2016.
13. Matlab: The easiest and most productive software environment for engineers and scientists. [Online]. Available: https://www.mathworks.com/products/matlab.html.
14. Murphy P. Machine learning: a probabilistic perspective. Cambridge: The MIT Press; 2012.
15. SPSS: The world’s leading statistical software used to solve business and research problems by means of ad-hoc analysis, hypothesis testing, geospatial analysis and predictive analytics. [Online]. Available: https://www.ibm.com/analytics/spssstatisticssoftware.
16. George D, Mallery P. IBM SPSS statistics 23 step by step: Pearson Education; 2016.
17. SAS: SAS/STAT State-of-the-art statistical analysis software for making sound decisions. [Online]. Available: https://www.sas.com/en_us/software/stat.html.
18. SAS/STAT® 13.1 User’s Guide. SAS Institute Inc, 2013.
19. Orange: Open source machine learning and data visualization for novice and expert. [Online]. Available: https://orange.biolab.si/.
20. Weka: Data mining software in Java. [Online]. Available: https://www.cs.waikato.ac.nz/ml/index.html.
21. Witten I, Frank E, Hall M, Pal C. Data mining: practical machine learning tools and techniques. Burlington: Morgan Kaufmann; 2016.
22. RapidMiner Studio: Visual workflow designer for data scientists. [Online]. Available: https://rapidminer.com/products/studio/.
23. Efron B. Logistic regression, survival analysis, and the Kaplan-Meier curve. J Am Stat Assoc. 1988;83(402):414–25.
24. Walters SJ. Analyzing time to event outcomes with a Cox regression model. Wiley Interdiscip Rev Comput Stat. 2012;4(3):310–5.
25. van der Maaten LJP, Postma EO, van den Herik HJ. Dimensionality reduction: a comparative review. Tilburg University Technical Report TiCC TR 2009005; 2009.
26. Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):228–43.
27. Dash M, Liu H. Feature selection for classification. Intell Data Anal. 1997;1(3):131–56.
28. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267–88.
29. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.
30. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.
31. Steyerberg EW, et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology. 2010;21(1):128–38.
32. Moons KGM, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart. 2012;98(9):683–90.
33. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925–31.
34. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 33(3):517–35.
35. Xu QS, Liang YZ. Monte Carlo cross validation. Chemom Intell Lab Syst. 2001;56(1):1–11.
36. Rodriguez JD, Perez A, Lozano JA. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans Pattern Anal Mach Intell. 2010;32(3):569–75.
37. Luo W, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res. 2016;18(12).
38. Steyerberg E. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer-Verlag; 2009.
39. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag; 2001.
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.