Validation of Prediction Models

  • E.W. Steyerberg
Part of the Statistics for Biology and Health book series (SBH)


The purpose of a predictive model is to provide valid outcome predictions for new patients. Essentially, the data set used to develop a model is of interest only insofar as we learn from it for the future. Validation is hence an important aspect of the process of predictive modelling. An important distinction is between internal and external validation. We discuss internal and external validation techniques in this chapter, with illustrations in case studies.



17.1 Internal vs. External Validation, and Validity

A general framework for validation and validity concepts is shown in Fig. 17.1. We develop a model within a representative sample of patients from an underlying population. This underlying population has specific characteristics, e.g. a specific hospital with a certain profile of how patients come to this hospital. By necessity, the sample is historic in nature, although we generally aim for recent data that are representative of current practice. At least we should determine the internal validity (or “reproducibility”) of our predictive model for this underlying population. We do so by testing the model in our development sample (“internal validation”). Internal validation is the process of determining internal validity; it assesses validity for the setting where the development data originated.
Fig. 17.1

A conceptual framework on internal vs. external validation, and validity. We consider a superpopulation, consisting of several subpopulations (referred to as “settings”). We develop a model in sample 1 from setting 1. Internal validation is the process of determining internal validity for setting 1. External validation is the process of determining generalizability to settings 2 to i

Another aspect is the external validity (or “generalizability” / “transportability”) of the prediction model to populations that are “plausibly related.”222 Generalizability is a desired property from both a scientific and a practical perspective. Scientifically speaking, hypotheses and theories are stronger when their generalizability is larger. Practically, we hope to be able to validly apply a prediction model to our specific setting.

The definition of “plausibly related” populations is not self-evident, and requires subject knowledge and expert judgment on epidemiological study design aspects. We consider populations “plausibly related” when they can be thought of as parts of a “superpopulation” (Fig. 17.1). We could also state that we consider populations to which it would be reasonable to apply the previously developed model. Populations will be slightly different, e.g. treated at different hospitals or in different time frames. Various aspects may differ between these populations, e.g. the selection of patients (e.g. referral centre vs. more standard setting), and definitions of predictors and outcome. For example, a superpopulation could be formed by “patients with an acute MI,” with the GUSTO-I data representing one population, defined by the inclusion criteria for this trial, the participating centres, and the time of accrual.

We learn about external validity by testing the model in other samples (sample 2 to i in Fig. 17.1, “external validation”). These samples are fully independent from the development data and originate from different but plausibly related settings. The more often the model is externally validated and the more diverse these settings, the more confidence we gain in the generalizability of the model. This is similar to the approach to assessing any scientific hypothesis.222

17.2 Internal Validation Techniques

Several techniques are available to assess internal validity. Some of the most common techniques in medical research are discussed here (Table 17.1).
Table 17.1

Overview of characteristics of some techniques for internal validation






Technique           Development sample            Validation sample
Apparent            Original 100%                 Original 100%
Split-sample        50–67% of original            Independent 50–33%
Cross-validationa   2×50% to 10×90% of original   Independent 2×50% to 10×10%
Jack-knife          N × (N−1) of original         Independent N × 1 patient
Bootstrap           Bootstrap sample of size N    Original 100%

a More stable cross-validation results are obtained by repeating the cross-validation many times, e.g. 50 times (“multi-fold cross-validation”)

17.2.1 Apparent Validation

With apparent validation, model performance is assessed directly in the sample where it was derived from (Fig. 17.2). Naturally this leads to an optimistic estimate of performance (biased assessment), since model parameters were optimized for the sample. However, we use 100% of the available data to develop the model, and 100% of the data to test the model. Hence, the procedure gives optimistic but stable estimates of performance.
Fig. 17.2

Apparent validation refers to assessing model performance in the sample where the model was derived from

17.2.2 Split-Sample Validation

With split-sample validation, the sample is randomly divided into two groups. This classical approach is inspired by the design of an external validation study; in split-sample validation, however, the split into derivation and test set is made at random. The model is developed in one group (e.g. 50% of the data) and its performance is evaluated in the other (e.g. the other 50% of the data, Fig. 17.3). Typical splits are 50%:50% or 2/3:1/3.
Fig. 17.3

Split-sample validation refers to assessing model performance in a random part of the sample, with model development in the other part

Several aspects need attention when a split-sample validation is performed. If samples are split fully at random, substantial imbalances may occur with respect to distribution of predictors and the outcome. For example, if we perform split-sample validation with a small subsample from GUSTO-I (n=429), the average incidence of 30-day mortality is 5.6% (24/429), but it may easily be 4% in a 50% random part and 7% in another part. Similarly, the distribution of predictors may vary. For predictors with skewed distributions the consequences may be even worse. For example, a random development sample may not contain any patient with shock, which occurred in only 1.6% (7/429). A practical possibility is to stratify the random sampling by outcome and relevant predictors.
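The stratified random split suggested above can be sketched as follows; this is a minimal illustration, and the function and field names are my own rather than from the book:

```python
import random

def stratified_split(rows, outcome_key, val_fraction=0.5, seed=1):
    """Randomly split rows into a development and a validation part,
    sampling within each outcome level so that the event rate is
    (nearly) identical in both parts."""
    rng = random.Random(seed)
    by_level = {}
    for row in rows:
        by_level.setdefault(row[outcome_key], []).append(row)
    dev, val = [], []
    for level_rows in by_level.values():
        level_rows = level_rows[:]
        rng.shuffle(level_rows)
        n_val = round(val_fraction * len(level_rows))
        val.extend(level_rows[:n_val])
        dev.extend(level_rows[n_val:])
    return dev, val

# 429 patients with 24 deaths (5.6%), as in the GUSTO-I subsample
patients = [{"dead30": 1}] * 24 + [{"dead30": 0}] * 405
dev, val = stratified_split(patients, "dead30")
print(sum(p["dead30"] for p in dev), sum(p["dead30"] for p in val))  # 12 12
```

With a fully random split the validation part could easily end up with, say, 4% or 7% mortality; stratification removes that imbalance by construction, since each part receives half of the 24 deaths.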

The drawbacks of split-sample methods are numerous.174,292,374 One major objection relates to variance. Only part of the data is used for model development, leading to less-stable model results compared with development on all available data. Also, the validation part is relatively small, leading to unreliable assessment of model performance. Further, the investigator may be unlucky in the split; the model may show very poor performance in the random validation part. It is only human that the investigator is then tempted to repeat the splitting process until more favourable results are found. Another objection relates to bias: we obtain an assessment of performance for a model based on part of the data, while we want to know the performance of a model based on the full sample.

In sum, split-sample validation is a classical but inefficient approach to model validation. It dates from the time before efficient but computer-intensive methods were available, such as bootstrapping.108 Simulation studies have shown that rather large sample sizes are required to make split-sample validation reasonable.413 But with large samples, the apparent validity is already a good indicator of model performance. Hence, we may conclude that split-sample validation is a method that works when we do not need it. It should be replaced in medical research by more efficient internal validation techniques, and by attempts at external validation.

17.2.3 Cross-Validation

Cross-validation is an extension of split-sample validation, aiming for more stability (Fig. 17.4). A prediction model is again tested on a random part that was left out from the sample, with the model developed in the remaining part. This process is repeated for consecutive fractions of patients. For example, the data set may be split into tenths (each containing 1/10 of the patients), with model development in nine tenths and testing in the remaining tenth, repeated ten times (“ten-fold cross-validation”). In this way, every patient serves once to test the model. The performance is commonly estimated as the average over all assessments.174
Fig. 17.4

Cross-validation refers to assessing model performance consecutively in a random part of the sample, with model development in the other parts. With ten-fold cross-validation, deciles of the sample serve as validation parts

Compared with split-sample validation, cross-validation can use a larger part of the sample for model development (e.g. 90%). This is an advantage. However, the whole cross-validation procedure may need to be repeated several times to obtain truly stable results, for example 50 times ten-fold cross-validation. The most extreme cross-validation is to leave out each patient once, which is equivalent to the jack-knife procedure.108 With large numbers of patients, this procedure is not very efficient.
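Repeated ten-fold cross-validation can be sketched generically as follows; `fit` and `score` are placeholders for an automated modelling strategy and a performance measure, and the toy example (predicting the training event rate, scored by the Brier score) is purely illustrative:

```python
import random

def repeated_cv(data, fit, score, k=10, repeats=50, seed=1):
    """Repeated k-fold cross-validation: shuffle, split into k folds,
    develop on k-1 folds, test on the left-out fold, and average the
    score over all folds of all repetitions for stability."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        for i in range(k):
            test_idx = set(idx[i::k])
            train = [data[j] for j in idx if j not in test_idx]
            test = [data[j] for j in test_idx]
            scores.append(score(fit(train), test))
    return sum(scores) / len(scores)

# toy strategy: predict the training event rate; score by Brier score
fit = lambda ys: sum(ys) / len(ys)
brier = lambda p, ys: sum((y - p) ** 2 for y in ys) / len(ys)
y = [1] * 24 + [0] * 405
result = repeated_cv(y, fit, brier)
print(round(result, 4))
```

Averaging over 50 × 10 = 500 validation folds stabilizes the estimate compared with a single ten-fold run.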

A problem is that cross-validation may not properly reflect all sources of model uncertainty, such as that caused by automated variable selection methods. We provide an example at the book’s website, where we consider the stability of a backward stepwise selection procedure in the large subsample from GUSTO-I (sample 4, n=785, 52 deaths). A ten-fold cross-validation procedure suggests a quite stable selection of “important predictors”: SHO, A65, HIG, and HRT. In contrast, bootstrapping shows much wider variability. The underestimation of variability is easily recognized for jack-knife cross-validation, where the development sample is identical to the full sample except for one patient. Hence, largely the same predictors will generally be selected in each jack-knife sample as in the full sample. Such model uncertainty is better reflected with bootstrap validation.

17.2.4 Bootstrap Validation

As discussed in Chap. 5, bootstrapping reflects the process of sampling from the underlying population (Fig. 17.5). Bootstrap samples are drawn with replacement from the original sample, reflecting the drawing of samples from an underlying population. Bootstrap samples are of the same size as the original sample.108 In the context of model validation, 100–200 bootstraps may often be sufficient to obtain stable estimates, but in one simulation study we reached a plateau only after 500 bootstrap repetitions.401 With current computer power bootstrap validation is a feasible technique for most prediction problems.
Fig. 17.5

Bootstrap validation refers to assessing model performance in the original sample for a model (Model 1*) that was developed in a bootstrap sample (Sample 1*), drawn with replacement from the original sample

For bootstrap validation a prediction model is developed in each bootstrap sample. This model is evaluated both in the bootstrap sample and in the original sample. The first reflects apparent validation, the second test validation in new subjects. The difference in performance indicates the optimism. This optimism is subtracted from the apparent performance of the original model in the original sample.174,108,409,413 The bootstrap was illustrated for estimation of optimism in Chap. 5.
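The optimism-correction procedure can be sketched as follows. This is a minimal illustration rather than the book’s code; the toy modelling strategy (selecting the single best of ten noise predictors by apparent c statistic — a selection step that generates optimism) is purely hypothetical:

```python
import random

def c_statistic(scores, y):
    """Concordance (c) statistic: probability that a random event
    patient has a higher score than a random non-event patient
    (ties count 1/2)."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    conc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return conc / (len(pos) * len(neg))

def fit_best_predictor(X, y):
    """Toy strategy: select the candidate column with the highest
    apparent c statistic (variable selection causes optimism)."""
    return max(range(len(X[0])),
               key=lambda j: c_statistic([row[j] for row in X], y))

def bootstrap_optimism(X, y, fit, n_boot=200, seed=1):
    """Develop a model in each bootstrap sample, compare its apparent
    performance there with its performance in the original sample,
    and average the difference (the optimism)."""
    rng = random.Random(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        j = fit(Xb, yb)
        app = c_statistic([r[j] for r in Xb], yb)   # in bootstrap sample
        test = c_statistic([r[j] for r in X], y)    # in original sample
        diffs.append(app - test)
    return sum(diffs) / len(diffs)

rng = random.Random(7)
y = [int(rng.random() < 0.3) for _ in range(80)]
X = [[rng.random() for _ in range(10)] for _ in y]  # 10 noise predictors
j = fit_best_predictor(X, y)
apparent = c_statistic([r[j] for r in X], y)
optimism = bootstrap_optimism(X, y, fit_best_predictor)
print(round(apparent, 2), round(apparent - optimism, 2))
```

Because the predictors are pure noise, the apparent c statistic exceeds 0.5 by selection alone, and the bootstrap-estimated optimism pulls the corrected estimate back toward 0.5 — the key point being that the selection step is replayed inside every bootstrap sample.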

Advantages of bootstrap validation are various. The optimism-corrected performance estimate is rather stable, since samples of size N are used to develop the model as well as to test the model. This is similar to apparent validation, and an advantage over split-sample and cross-validation methods. Compared with apparent validation, some uncertainty is added by having to estimate the optimism. When sufficient bootstraps are taken, this additional uncertainty is however negligible.

Moreover, simulations have shown that bootstrap validation can appropriately reflect all sources of model uncertainty, especially variable selection.401 The bootstrap also seems to work reasonably in high-dimensional settings of genetic markers, where the number of potential predictors is larger than the number of patients (“p>n problems”), although some modifications may be considered.374 Disadvantages of bootstrap validation, and other resampling methods such as cross-validation, include that only automated modelling strategies can be used, such as fitting a full model without selection, or following an automated stepwise selection approach. In many analyses, intermediate steps are made, such as collapsing categories of variables, truncation of outliers or omission of influential observations, assessing linearity visually in a plot, testing some interaction terms, studying both univariate and multivariable p values, or assessing proportionality of hazards for a Cox regression model. It may be difficult to repeat all these steps in a bootstrap procedure.

In such situations, it may be reasonable to at least validate the full model containing all predictors to obtain a first impression of the optimism. For example, when we consider 30 candidate predictors, and build a final model with predictors that have multivariable p<0.20 in a backward stepwise selection procedure, but after univariate screening with e.g. p<0.50, the optimism can be estimated by validating the full 30 predictor model. Another reasonable approximation for the optimism in this example may be to simply perform backward stepwise selection with p<0.20, ignoring the univariate screening. We would definitely be cheating if we validated the finally selected model and ignored all selection steps. In one study we found an optimism estimate of 0.07 for the c statistic when we replayed all modelling steps (based on univariate and multivariable p values), in contrast to 0.01 when we considered the final model as pre-defined.401

17.3 External Validation Studies

External validation of models is essential to support general applicability of a prediction model. Where internal validation techniques are all characterized by random splitting of development and test samples, external validation considers patients that differ in some respect from the development patients (Fig. 17.1). External validation studies may address aspects of historic (or temporal), geographical (or spatial), methodological, and spectrum transportability.222 Historic transportability refers to performance when a model is tested in different historical periods. Especially relevant is validity in more recently treated patients. Geographic transportability refers to testing in patients from other places, e.g. other hospitals or other regions, see e.g. a recent study in stroke patients.240 Methodological transportability refers to testing with data collected by using alternative methods, e.g. when comorbidity data are collected from claims data rather than from patients’ charts. Spectrum transportability refers to testing in patients who are, on average, more (or less) advanced in their disease process, or who have a somewhat different disease.222 Spectrum transportability is relevant when models are developed in secondary care and validated in primary care, or models developed in randomized trials are validated in a broader, less-selected sample.

In addition to these aspects, we may consider whether external validation was performed by the same investigators who developed the model, or by investigators not involved at the development stage. If model performance is found adequate by fully independent investigators, in their specific setting, this is more convincing than when this result was found by investigators who also developed the model.

A simple distinction in types of external validation studies is shown in Table 17.2. We distinguish temporal validation (validation in more recent patients), geographic validation (validation in other places), and fully independent validation (by other investigators at other sites). Mixed forms of these types can occur in practice. For example, we validated a testicular cancer prediction model in 172 patients: 100 more recently treated patients from hospitals that participated in the model development phase and 72 from a hospital not included among the development centres.412
Table 17.2

Summary of types of external validation studies (based on Justice et al.222)




Type of validation             Characteristics
Temporal validation            Prospective testing, more recent patients
Geographic validation          Multi-site testing
Fully independent validation   Other investigators at another site


17.3.1 Temporal Validation

With temporal validation, we typically validate a model in more recently treated patients. A straightforward approach is to split the development data into two parts: one part containing early treated patients to develop the model and another part containing the most recently treated patients to assess the performance.
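Such a temporal split can be sketched as follows (the field names are illustrative); note that ISO-formatted date strings sort chronologically, so string comparison suffices:

```python
def temporal_split(rows, date_key, cutoff):
    """Non-random, temporal split: develop the model on patients
    treated before the cutoff date, validate on those treated later."""
    dev = [r for r in rows if r[date_key] < cutoff]
    val = [r for r in rows if r[date_key] >= cutoff]
    return dev, val

patients = [{"id": 1, "treated": "2001-06"},
            {"id": 2, "treated": "2002-11"},
            {"id": 3, "treated": "2003-04"},
            {"id": 4, "treated": "2004-01"}]
dev, val = temporal_split(patients, "treated", "2003-01")
print(len(dev), len(val))  # 2 2
```

Unlike split-sample validation, the split here is deliberately non-random: it mimics the real question of whether a model built on earlier patients holds for later ones.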

Also, we may aim for a prospective application of the model in a specifically collected cohort. An example is from a study in patients suspected of Lynch syndrome (see Chap. 10).

17.3.2 Example: Development and Validation of a Model for Lynch Syndrome

We aimed to predict the prevalence of Lynch-syndrome related genetic defects (MLH1 or MSH2 mutations) based on proband and relative characteristics (“family history”). Predictors included type of cancer diagnosis, age, and number of affected relatives. We developed a model with 898 patients who were tested at Myriad Genetics between 2000 and 2003. This model was tested in a validation sample containing 1,016 patients who were tested between 2003 and 2004 (Table 17.3).
Table 17.3

Multivariable analysis of Lynch syndrome prediction model


Characteristic                         Development        Validation         Combined
                                       OR [95% CI]        OR [95% CI]        OR [95% CI]
Proband diagnosis
  CRC                                  2.2 [1.9–2.5]      7.0 [6.0–8.1]      3.8 [3.6–4.1]
  CRC 2+                               8.2 [5.6–12]       37 [25–55]         16 [14–20]
  Adenoma                              1.8 [1.5–2.2]      1.5 [1.2–1.7]      1.5 [1.4–1.6]
  Endometrial cancer                   2.5 [2.1–3.1]      7.1 [6.1–8.2]      4.2 [3.9–4.6]
  Other HNPCC cancer                   2.1 [1.7–2.5]      1.4 [1.1–1.8]      1.8 [1.6–2.0]
Family history
  CRC in 1st/2nd degreea               2.3 [2.1–2.5]      3.0 [2.8–3.3]      2.6 [2.5–2.7]
  CRC 2 in 1st degree                  3.1 [2.6–3.6]      4.2 [3.6–4.8]      3.6 [3.4–3.8]
  Endometrial cancer 1st/2nd degreea   2.7 [2.4–3.2]      2.7 [2.3–3.1]      2.6 [2.4–2.8]
  Endometrial cancer 2 in 1st degree   6.5 [1.8–24]       26 [6.0–113]       12 [6.3–23]
  Other HNPCC cancer                   1.5 [1.4–1.7]      1.4 [1.4–1.6]      1.5 [1.4–1.6]
Age at diagnosis
  CRC/adenomab                         1.5 [1.5–1.5]      1.4 [1.4–1.4]      1.4 [1.4–1.4]
  Endometrial cancerc                  1.3 [1.2–1.4]      1.4 [1.3–1.4]      1.3 [1.3–1.4]
Model performance
  c statistic                          0.79 [0.76–0.83]d  0.80 [0.76–0.84]e  0.80 [0.77–0.83]d
  Mean observed vs. predicted          14% vs. 14%        15% vs. 13%e       15% vs. 15%
  Calibration slope                                       1.26 [1.03–1.49]e

Effects of predictors are shown for the development (n=898) and validation (n=1,016) patients, as well as in the combined data set (n=1,914) used for estimation of the final prediction model. Model performance includes assessment of discrimination and calibration

a Family history coded as first-degree + 0.5 second-degree relatives, with first-degree relatives coded as 0 or 1 and second-degree relatives coded as 0, 1, 2+

b Age effect for colorectal cancer and/or adenoma in probands, and colorectal cancer in first- and second-degree relatives

c Age effect for endometrial cancer in probands, in 1st degree, and in second-degree relatives

d Internal validation by bootstrapping for c statistic and calibration slope

e External validation for c statistic, mean observed and predicted probabilities, and calibration slope

In the validation sample, the outcome definition was slightly different, since not only mutations but also deletions of genes were assessed. This led to a slightly higher prevalence of mutations (15% at validation vs. 14% at development), while the case-mix remained similar (mean predicted probability in the validation sample, 13%). This difference in prevalence of the outcome could easily be adjusted for by using a slightly higher intercept in the logistic regression model (+0.25, corresponding to e^0.25 ≈ 1.28, i.e. roughly 28% higher odds). The effects of the predictors were similar in the development and validation samples. Also, the discriminative ability remained at a similar level as at development, with a c statistic around 0.80.
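An intercept update of this kind can be sketched as a one-parameter maximum-likelihood fit (Newton-Raphson on the intercept only, keeping all regression coefficients fixed); the function and toy data below are illustrative, not the actual study code:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def update_intercept(lp, y, tol=1e-10, max_iter=50):
    """Find the shift delta that recalibrates a logistic model's
    intercept to a new sample: maximize the likelihood of outcomes y
    given fixed linear predictors lp, varying only the intercept."""
    delta = 0.0
    for _ in range(max_iter):
        p = [1 / (1 + math.exp(-(x + delta))) for x in lp]
        grad = sum(y) - sum(p)                 # score for delta
        hess = sum(pi * (1 - pi) for pi in p)  # information
        step = grad / hess
        delta += step
        if abs(step) < tol:
            break
    return delta

# toy check: all predictions at 13%, observed prevalence 15%
lp = [logit(0.13)] * 1000
y = [1] * 150 + [0] * 850
delta = update_intercept(lp, y)
print(round(delta, 3))  # 0.166, i.e. logit(0.15) - logit(0.13)
```

With identical linear predictors the solution equals the difference in marginal logits; with spread-out predictions, as in the case study, the maximum-likelihood shift can differ somewhat (here +0.25 was used).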

The good performance at external validation may not be too surprising given that definitions of predictors were exactly the same. For the final model, both data sets were combined, such that 1,914 patients were analyzed, leading to smaller confidence intervals for the effects of the predictors and the c statistic.

17.3.3 Geographic Validation

With geographic validation, we evaluate a predictive model according to site or hospital. Geographic validation can be seen as a variant of cross-validation. It could be labelled “leave-one-centre-out cross-validation.” Importantly, standard cross-validation makes splits in the data at random; with geographic validation the splits are not at random. Some examples are shown in Table 17.4. Geographic validation is often possible with collaborative studies, and more meaningful than a standard cross-validation.
Table 17.4

Examples of studies with external validation according to site (“leave-one-centre-out cross-validation”)






Topic                   Development     Validation     Characteristics of groups
Testicular cancer       6×5 groups      6×1 group      A group consisted of a single hospital or a previously published patient series
Chlamydia trachomatis   4×3 regions     4×1 region     Municipality health centres organizing regional case finding
                        5×4 hospitals   5×1 hospital   Large hospitals participating in an RCT + a category “other”
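The leave-one-centre-out scheme can be sketched generically; `fit` and `score` stand in for the modelling strategy and performance measure, and the toy example (predicting the training event rate, scored by the Brier score) is purely illustrative:

```python
def leave_one_centre_out(rows, centre_of, fit, score):
    """Geographic cross-validation: for each centre, develop the
    model on all other centres and evaluate it on the held-out
    centre; the splits follow the real, non-random grouping."""
    results = {}
    for c in sorted({centre_of(r) for r in rows}):
        train = [r for r in rows if centre_of(r) != c]
        test = [r for r in rows if centre_of(r) == c]
        results[c] = score(fit(train), test)
    return results

# toy data: (centre, outcome) pairs
rows = [("A", 1), ("A", 0), ("B", 0), ("B", 0), ("C", 1), ("C", 1)]
fit = lambda tr: sum(y for _, y in tr) / len(tr)
brier = lambda p, te: sum((y - p) ** 2 for _, y in te) / len(te)
results = leave_one_centre_out(rows, lambda r: r[0], fit, brier)
print(results)  # {'A': 0.25, 'B': 0.5625, 'C': 0.5625}
```

The per-centre results make the small-sample caveat of the surrounding text concrete: each held-out centre contributes only a handful of patients, so centre-level scores are noisy.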

A drawback of such geographical validations is that validation samples may become quite small, leading to unreliable results. Results may easily be overinterpreted, for example as showing that “the model was not valid for hospital X.” For example, in the testicular cancer case study, we found a systematic difference in calibration for patients treated in one centre (Fig. 17.6).467 In effect, we are performing multiple small subgroup analyses. Emphasis should be on general consistency (if this is observed) rather than on differences that will always occur with small numbers. On the other hand, remarkable findings for a specific setting may indicate a need for further validation before applying the model in that setting, and may trigger further research.
Fig. 17.6

Results of external validation by centre for the testicular cancer prediction model. We note c statistics around 0.8 for all sites, and non-significant miscalibration according to the Hosmer-Lemeshow test (H-L test), except in graph (f). B, benign tissue; T, tumor467

17.3.4 Fully Independent Validation

Finally, we mention external validation by independent investigators (“fully independent validation”). Other investigators may use slightly different definitions of predictors and outcome, and study patients who were selected differently compared with the development setting. An example is a prostate cancer model developed for clinically seen patients and validated in patients selected by a systematic screening program (European Randomized Study of Screening for Prostate Cancer, ERSPC).424 Here, case-mix seemed similar, but a severe underestimation of the probability of relatively innocent (“indolent”) cancer was found (Table 17.5). This was addressed with a new intercept for the logistic model, while keeping the regression coefficients close to their original values.
Table 17.5

Prediction accuracy of three previous nomograms for indolent prostate cancer developed by Kattan et al.227 for 247 ERSPC patients424




Performance parameter       Base nomogram      Medium nomogram    Full nomogram
Area under the ROC curve    0.61 [0.54–0.68]   0.72 [0.66–0.78]   0.76 [0.70–0.82]
                            49% [43–55%]       49% [43–55%]       49% [43–55%]
Calibration slope           0.78 [0.32–1.24]   0.87 [0.55–1.19]   1.07 [0.74–1.40]

Base: Serum PSA + clinical stage + biopsy Gleason grade 1 and 2
Medium: Base + US volume + %positive cores
Full: Base + US volume + mm cancerous tissue + mm non-cancerous tissue

Similarly it was found that a prediction model for the selection of patients undergoing in vitro fertilization for single embryo transfer needed an adjustment when a model developed at one hospital was applied in another centre. Again, a systematic difference remained even after adjustment for well-known and important predictors.206 This difference in average outcome (“calibration-in-the-large”) is an important motivation for recalibration of model predictions as a simple but important updating technique (see Chap. 19).

Some examples of fully independent validation studies with their main conclusions are listed in Table 17.6. Fully independent validation studies often yield less favourable results than temporal or geographic external validation. This is also illustrated by other examples of fully independent validation, showing generally poor results, in a review by Altman and Royston.13
Table 17.6

Examples of studies with fully independent external validation






Topic                   Development                       Validation                                          Conclusion
Prostate cancer         Two hospitals227                  Screening setting (ERSPC)424                        Intercept problem
Aneurysm mortality      One hospital + meta-analysis421   UK small aneurysm trial54 and another hospital231   Missing predictors; poor/moderate performance
Renal artery stenosis                                     One French hospital278                              “Reasonably valid”


17.3.5 Reasons for Poor Validation

Unfavourable results at validation may often be explained by inadequate model development. The sample size may have been relatively small, or patients were selected from a single centre. This was for example noted in a review of over 25 models in traumatic brain injury.333 Also, the statistical analysis may have been suboptimal, e.g. with stepwise selection in relatively small samples with many potential predictors, and no shrinkage of regression coefficients to compensate for overfitting.

Other explanations include true differences between development and validation settings, especially in the coding of predictors and outcome. The problem of transportability of models that incorporate laboratory test results was already recognized in the 1980s for a prediction rule for jaundice, where units were not consistent.379 Indeed, validating a model that was previously developed by others is often more difficult than anticipated. If a nomogram is presented with some non-linear terms, it is not easy to derive a formula to calculate outcome predictions for new patients. So, it is quite likely that errors are made in such external validation studies. Units of measurement and the intercept value require special attention. Contacting the authors may help to prevent mistakes.

Moreover, variables required for a model may not be available at validation. A constant value can be filled in (e.g. the mean or median), but obviously this limits the external performance of a model. For example, a Dutch model for abdominal aneurysm mortality was validated in the UK small aneurysm study, while two of the seven predictors were not available.54 In a validation study with patients from Rotterdam, all predictors except one were available and a better external performance was found.231

17.4 Concluding Remarks

We considered several approaches to internal and external validation. For internal validation, bootstrapping appears most attractive, provided that we can replay all modelling steps. This may sometimes be difficult, e.g. when decisions on coding of predictors, and selection of predictors are made in the modelling process. Several variants of bootstrapping are under study, which may be more efficient than the procedure described here. Also, the optimism may in fact be larger than estimated by bootstrapping when the ratio of predictors considered to the sample size is very unfavourable, such as in genetic marker research.292,220,221

Any internal validation technique should be seen as validating the modelling process rather than a specific model.181 For example, when a split-sample validation is followed, e.g. to convince physicians who are skeptical of modern developments, the final model should still be derived from the full sample. It would be a waste of precious information if the final model were only based on a random part of the sample. Differences in regression coefficients will generally be small, since the split was at random, and the data have overlap, but the estimates of the full sample will be more stable. If a stepwise selection procedure was followed in the random sample, it should be repeated in the full sample. This may result in a different model specification, but this is preferable to sticking to results from only part of the available data.

The same reasoning holds for cross-validation and bootstrap validation. Especially with bootstrap validation we may well illustrate the instability of stepwise selection procedures (see Chap. 11). The final model may only be selected in a few of the bootstrap samples. This model uncertainty has to be taken into account in the optimism estimate for the final model.

If external validation has been performed, we may similarly define the final model from the combined data set. This was for example done in the Lynch syndrome case study (Table 17.3).25 The regression coefficients in the final model are a compromise between the estimates in the development and validation sample. This combination of data implies that the two samples represent the same population, which is not necessarily the case (Chap. 18). If relevant differences are found, a setting-specific intercept or setting-specific interaction terms may be included (see Chaps. 19–21).

17.5 Questions

  1. 17.1
     Stability of internal validation techniques (Table 17.1)
     (a) Split-sample validation is notoriously unstable. In contrast, apparent validation and bootstrap validation share stability in the estimation of model performance. Do you agree?
     (b) Cross-validation eventually uses 100% of the sample for validation; why might multi-fold cross-validation help?

  2. 17.2
     Interpretation of external validation (Fig. 17.6)
     Fig. 17.6 can be interpreted in different ways. One perspective is to emphasize the similarity in performance between settings. Alternatively, we might focus on the graph showing systematic miscalibration (graph (f)). What would be your view on the performance of this centre? Consider a fixed-effect and a random-effect perspective (see also Chap. 20).

  3. 17.3
     Problems with internal validation50
     Interpret the published results on “internal validation” in Table 2 of an Ann Int Med paper.
     (a) What do you think went wrong?
     (b) What do you think of the interpretation provided in the text?
     (c) What do you think about the “corrected Table 2,” published as an erratum?


Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • E.W. Steyerberg
    Department of Public Health, Erasmus MC, Rotterdam, The Netherlands
