Benchmarking can be used to identify opportunities for quality improvement.1 Performance or benchmarks can be monitored over time within a single practice, or compared across different practices. These methods for performance measurement and improvement require careful interpretation of the results and awareness of limitations.2 In complex systems, such as intensive care units (ICUs), it can be difficult to compare measures of quality since patients present with heterogeneous illnesses and varied disease severity. Methods have been proposed to account for this heterogeneity, most commonly regression techniques to risk-adjust the measure of interest.3,4,5

An ideal benchmarking system will use data that are readily available and simple to interpret.6 Ontario is the most populous province in Canada. In 2007, the Critical Care Information System (CCIS) was implemented by the provincial health ministry as part of a strategy to improve the quality and efficiency of the critical care system.7 The CCIS includes a measure of organ dysfunction on ICU admission (Multiple Organ Dysfunction Score [MODS])8 and daily nursing workload measures (Nine Equivalents Nursing Manpower Use Score [NEMS])9; however, these data have not been used to perform risk-adjustment, likely because validated models for this purpose are lacking. The ability of MODS to predict mortality has been reported in small, single-centre studies from Canada, Finland, and other countries.10,11 We used CCIS data from the two medical-surgical ICUs in our hospital to develop and internally validate a prediction model for ICU mortality.12 None of these models have been externally validated.

External validation of a prediction model’s performance is an important and necessary process prior to clinical implementation.13,14,15,16 Access to “big data” is increasing, as evidenced by analyses of registry databases that contain electronic health records for thousands or even millions of patients from multiple practices and hospitals.17 The CCIS is an example of a large e-health database that includes data from different types of ICUs, and thus provides an opportunity to assess both reproducibility (similar case-mix) and transportability (different but related populations) within the same study.18 The objective of this study was to conduct and report a methodologically sound external validation using guidelines and referenced statistical articles from the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) explanation and elaboration document.19


Approval for this study was granted by the Western University Research Ethics Board on 15 February 2017. Requirement for consent was waived.

Study design

We used an independent population-based cohort to perform a validation study on a previously published ICU mortality prediction model.12

Data source

We used data from the Ontario CCIS for this study. The CCIS is a web-based data application that uses a combination of methods to capture data. Demographic data can be auto-populated directly from the hospital electronic admission, discharge, and transfer system, but most of the data are manually entered by clerical and clinical staff as appropriate. Data elements used in this study, a subset of those captured in CCIS, are shown in Electronic Supplementary Material (ESM), eTable 1. All ICUs in Ontario are required to enter data into the CCIS for all admissions.

Data were obtained for all level-3 ICU admissions between 1 July 2015 and 31 December 2016. Level-3 ICUs are defined as those providing life support and mechanical ventilation for more than 48 hours. Critical Care Services Ontario has organized the ICUs into groups based on ICU subtype (Table 1). The eligibility criteria, conditions, definitions, and measurements in this validation study were identical to those used in the original development study.

Table 1 CCIS level-3 ICU subtype groups and number of critical care units

The minimum effective sample size for external validation has been reported as 100 outcome events.20 The data set included over 13,500 deaths. All ICU subtype groups had well over 100 deaths except burn ICUs, which were excluded from the subgroup analyses.

The validation data set was first subject to administrative cleaning. We excluded admissions to pediatric and labour and delivery level-3 ICUs. Also excluded were records where patient age was reported as < 18 yr or > 115 yr, length of ICU stay was reported as 0 days (entry errors), or where duplicate MODS and/or NEMS entries were reported. For duplicate records, the record with the later time stamp was selected for linkage with the admission and discharge data. Finally, any records with missing predictor data were omitted from the analyses.
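The exclusion rules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' SAS code: the field names (`age`, `icu_los_days`, `admission_id`, `timestamp`, and the predictor names) are hypothetical, since the CCIS schema is not given here, and the pediatric and labour and delivery unit exclusions are omitted.

```python
# Hypothetical predictor field names; the real CCIS field names are not stated in the text.
PREDICTORS = ("age", "sex", "nems", "mods", "admission_source", "diagnosis", "readmission")

def latest_score_per_admission(score_rows):
    """For duplicate MODS/NEMS entries, keep the row with the later time stamp."""
    latest = {}
    for row in score_rows:
        key = row["admission_id"]
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return latest

def clean_admissions(admissions, min_age=18, max_age=115):
    """Apply the age, length-of-stay, and missing-predictor exclusions."""
    kept = []
    for rec in admissions:
        age = rec.get("age")
        if age is not None and (age < min_age or age > max_age):
            continue  # age outside the plausible adult range
        if rec.get("icu_los_days") == 0:
            continue  # zero-day stays are treated as entry errors
        if any(rec.get(name) is None for name in PREDICTORS):
            continue  # complete-case analysis: drop records missing a predictor
        kept.append(rec)
    return kept
```

Timestamps only need to be mutually comparable (for example `datetime` objects or ISO 8601 strings); the age limits are parameters so that the thresholds can be adjusted without changing the logic.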

Complete case analyses were used to assess model performance. Records with missing data represented approximately 5% of all cases and exclusion of these cases was not considered a threat to the validity of the results.21 The outcome of interest was ICU mortality. Predictor variables, available within the first 24 hr of critical care admission, were defined as follows: 1) age group (18–39, 40–79, ≥ 80 yr); 2) sex (M or F); 3) NEMS group (0–22, 23–29, ≥ 30); 4) MODS group (0, 1–4, 5–8, 9–12, ≥ 13); 5) admission source (operating room/postanesthesia care unit, emergency department, unit/ward, other hospital and other); 6) admitting diagnosis (cardiovascular/cardiac/vascular, respiratory, gastrointestinal, neurologic, trauma, other); and 7) readmission to critical care during the same hospital stay.12
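As an illustration of how the continuous inputs map onto the categories listed above, the following Python sketch bins age, NEMS, and MODS using the cut-points stated in the text. The function names and the string labels are assumptions made for this example; they are not taken from the published model.

```python
def age_group(age):
    """Age categories used as a predictor: 18-39, 40-79, >=80 yr."""
    if age < 18:
        raise ValueError("patients under 18 yr are excluded from the cohort")
    if age <= 39:
        return "18-39"
    if age <= 79:
        return "40-79"
    return ">=80"

def nems_group(nems):
    """NEMS categories: 0-22, 23-29, >=30."""
    if nems < 0:
        raise ValueError("NEMS cannot be negative")
    if nems <= 22:
        return "0-22"
    if nems <= 29:
        return "23-29"
    return ">=30"

def mods_group(mods):
    """MODS categories: 0, 1-4, 5-8, 9-12, >=13."""
    if mods < 0:
        raise ValueError("MODS cannot be negative")
    if mods == 0:
        return "0"
    if mods <= 4:
        return "1-4"
    if mods <= 8:
        return "5-8"
    if mods <= 12:
        return "9-12"
    return ">=13"
```

For example, a 45-year-old patient with a NEMS of 30 and a MODS of 6 would fall into the groups "40-79", ">=30", and "5-8".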

Since we chose to restrict our analyses to variables contained within the CCIS data set, we modified our previously published model 12 by excluding the Charlson Comorbidity Index. eTable 1 (available as ESM) is provided as supplemental digital content and shows the equation for Logit [ICU Mortality].

Statistical analyses

The relatedness of the development and validation data sets was reviewed using two approaches. First, the distribution of context-important patient characteristics, including predictors and outcomes, was compared. Descriptive analyses of these characteristics were performed for the development and validation data sets and, for the latter, were also stratified by CCIS ICU subtype. Continuous data elements are expressed as mean (standard deviation [SD]) or median [interquartile range (IQR)] as appropriate. Categorical data elements are reported as proportions. Second, to quantify the extent of the relatedness in case-mix between the development and validation samples, a binary logistic regression model (membership model) was created to predict the probability that an individual record belonged to either sample.22 Independent variables were the predictors and outcome from the prediction model. The discriminative ability of the model was quantified using its C-statistic, with lower values indicating similarity between the data sets.
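To make the interpretation of the membership model concrete, the following Python sketch computes a C-statistic from membership probabilities. It is an illustration under stated assumptions rather than the analysis code used in the study (which was run in SAS): `scores` stands for the fitted probability that a record belongs to the validation sample, and `labels` is 1 for validation records and 0 for development records.

```python
def c_statistic(scores, labels):
    """Concordance: probability that a randomly chosen record with label 1
    has a higher score than a randomly chosen record with label 0.
    Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))
```

A value near 0.5 means the membership model cannot tell the two samples apart (similar case-mix), whereas a value near 1 means the samples are easily distinguished. The pairwise loop is quadratic in the number of records, so a rank-based formulation would be preferable for a cohort of this size.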

Three measures were used to assess the performance of the model in the validation data set: 1) calibration-in-the-large, 2) calibration slope, and 3) discrimination. Calibration-in-the-large represents the level of agreement between observed and predicted mortality. It was calculated as the logistic regression model intercept given that the calibration slope equals 1 ($\operatorname{logit}(y) = a + \operatorname{logit}(\hat{y})$).22,23 Where calibration-in-the-large was significantly different from 0, intercept recalibration was performed by fitting a new logistic regression model with an intercept only and an offset term for the linear predictor. Calibration slope reflects whether predicted risks are appropriately scaled with respect to each other over the entire range of possible values. It was estimated from the recalibration model equation $\operatorname{logit}(y) = a + b_{\text{overall}}\,\operatorname{logit}(\hat{y})$.22,24 Loess-based calibration plots were created with predicted risk on the x-axis and observed mortality on the y-axis to illustrate the agreement across the range of predicted risks.23 Discrimination refers to the ability of the prediction model to separate individuals that died and those that survived. The concordance statistic was used to evaluate the discriminative value of the prediction model.
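The two calibration measures can be illustrated with a small Python sketch that fits the corresponding logistic models by Newton-Raphson. This is a minimal illustration, not the SAS procedure used in the study: `lp` is assumed to be the list of linear predictors (the logit of each predicted risk), `y` the list of observed outcomes coded 0 or 1, and the fixed iteration count is a simplification with no convergence check.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_in_the_large(lp, y, iters=25):
    """Intercept a in logit(y) = a + lp, with lp entered as an offset (slope fixed at 1)."""
    a = 0.0
    for _ in range(iters):
        p = [_sigmoid(a + x) for x in lp]
        grad = sum(yi - pi for yi, pi in zip(y, p))
        hess = sum(pi * (1.0 - pi) for pi in p)
        a += grad / hess
    return a

def calibration_slope(lp, y, iters=25):
    """Intercept a and slope b in logit(y) = a + b * lp."""
    a, b = 0.0, 1.0
    for _ in range(iters):
        p = [_sigmoid(a + b * x) for x in lp]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * x for yi, pi, x in zip(y, p, lp))
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b
```

A negative intercept from `calibration_in_the_large` corresponds to over-prediction of mortality, and a slope below 1 from `calibration_slope` corresponds to predicted risks that are too extreme. Standard errors, which are needed to judge whether the intercept differs from 0 or the slope from 1, are not computed in this sketch.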

For those observations excluded from the analyses because of missing predictors, comparisons with the observations used in the validation were also made. All analyses were performed using SAS 9.4 (SAS Institute Inc., Cary, NC, USA).


After applying the exclusion criteria, 121,201 records were available for external validation (Fig. 1). The demographic and clinical characteristics (predictors) and ICU mortality of the patient populations included in the development and external validation data sets are shown in Table 2. The C-statistic for the membership model comparing the development data set to the entire CCIS cohort was 0.764. Values between 0.7 and 0.8 are generally considered to reflect acceptable discrimination25 and, in the case of this membership model, indicate a data set that is somewhat, but not strongly, related to the development data set; a strongly related data set would be expected to yield a C-statistic of < 0.7. This is confirmed by some key differences illustrated in Table 2. Specifically, the development population was younger, had a different admission source distribution (fewer admissions from the operating room and emergency department, more from the ward and referrals from other hospitals), and had higher levels of organ dysfunction upon admission, daily nurse workload, readmission, and ICU mortality. Admitting diagnosis also differed between the data sets, with the development sample having a higher proportion of admissions for respiratory issues and a lower proportion of cardiovascular-related admissions.

Fig. 1

Flow chart of patient records included in the external validation. Administrative cleaning includes the following: n = 1,609 (duplicates), n = 427 (admitted in error), n = 88 (ICU LOS = 0), n = 511 (age < 18 yr), n = 6 (age > 105 yr). ICU = intensive care unit; LOS = length of stay

Table 2 Baseline and clinical characteristics and outcomes of patients in the development and external validation data sets

These same analyses were performed for each ICU subtype group. The discrimination of the membership models indicated varying degrees of relatedness to the development sample. Relatedness to the development sample was found in teaching hospital medical-surgical units (C-statistic = 0.660) and community hospital medical-surgical units with high rates of mechanical ventilation (C-statistic = 0.740), but discordance was found in community hospital medical-surgical units with low rates of mechanical ventilation (C-statistic = 0.836), cardiac/cardiovascular units (C-statistic = 0.969), and coronary care units (C-statistic = 0.974). eTable 2 (available as ESM) shows the characteristics and outcomes for each individual ICU subtype group compared with those for the entire cohort. The demographic and clinical profile of cases excluded from the analyses because of missing data was similar to that of cases included in the external validation (Table 2), and as such, data were considered to be missing completely at random.

Calibration-in-the-large represents overall calibration of the model. Perfect agreement between observed and predicted values has an intercept value of 0. For all data combined and also for all ICU subtype groups except medical-surgical units in teaching hospitals, the intercept value was less than 0 indicating that the model over-predicted ICU mortality.22 This over-estimation was greatest in cardiac/cardiovascular and coronary care units. In the medical-surgical units in teaching hospitals, the intercept value was greater than 0 showing a slight under-estimation of mortality (Table 3). Given the differences between actual ICU mortality and predicted risk, an intercept recalibration was performed for all models resulting in calibration-in-the-large values that are essentially 0.
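The effect of an intercept update on individual predictions can be sketched as follows. The Python function below shifts each predicted risk on the logit scale by a calibration-in-the-large estimate; the function name and the example intercept are hypothetical and do not correspond to any value in Table 3.

```python
import math

def recalibrate_intercept(predicted_risks, intercept):
    """Shift each predicted risk by `intercept` on the logit scale.

    `intercept` is the calibration-in-the-large estimate; a negative value
    (the model over-predicts) lowers every predicted risk.
    """
    updated = []
    for p in predicted_risks:
        lp = math.log(p / (1.0 - p))  # logit of the original predicted risk
        updated.append(1.0 / (1.0 + math.exp(-(lp + intercept))))
    return updated

# Hypothetical intercept of -0.5: every updated risk is lower than the original.
original = [0.05, 0.20, 0.60]
updated = recalibrate_intercept(original, -0.5)
```

Because the shift is constant on the logit scale, the ranking of patients is unchanged; the update affects calibration but not discrimination.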

Table 3 Predicted risk and model performance statistics for external validation of the entire CCIS cohort and for ICU subtype groups

The calibration plots in Figs 2a and 2b show that some over-prediction remains following intercept recalibration, specifically when the risk of death is higher. The extent of over-prediction varies across ICU subtype groups but represents a small proportion of patients.

Fig. 2

a Loess-based calibration plots for validation of entire CCIS cohort. CCIS = Critical Care Information System. b Loess-based calibration plots for validation of individual ICU subtype groups. ICU = intensive care unit; TH = teaching hospitals; CH = community hospitals

The calibration slope reflects whether the predicted risks are scaled appropriately to each other over the complete range of predicted probabilities and was another measure used to evaluate the model’s predictive performance in the validation samples. Calibration slopes not significantly different from 1 were found for all CCIS data, as well as for community hospital medical-surgical units and cardiac/cardiovascular units. The calibration slope for teaching hospital medical-surgical units was significantly less than 1, showing higher variation in predicted probabilities (Table 3). Specifically, predicted risks are too low for patients at low risk and too high for patients at high risk. The coronary care unit data set has a calibration slope significantly above 1, indicating too little variation in the predicted risks; predicted risks are systematically too high.

Discrimination for all CCIS data and the individual ICU subtype groups ranged from acceptable to very good (Table 3). The validation data sets with the lowest area under the curve (AUC) [IQR] were teaching hospital medical-surgical units (C = 0.781 [0.774–0.788]) and cardiovascular/cardiac units (C = 0.768 [0.747–0.789]). The data sets including all CCIS data and all other ICU subtype groups had areas under the curve greater than 0.80.


We used a prospectively collected, population-based cohort to perform external validation on a risk prediction model for ICU mortality. We found that an intercept update was required, which greatly improved the calibration-in-the-large for the entire cohort as well as for all ICU subtype groups. Over-estimation for higher predicted risk groups remains, but this population represents relatively few patients. Since the intention of the model is for performance measurement and not individual patient prognosis, the model fit is acceptable for the entire cohort of ICUs.

The development and application of robust prognostic models are essential for valid performance measurement. Many existing prognostic models have a limited life span because changes in clinical practice and healthcare over time can alter the risk of mortality for a given clinical situation; prognostic models therefore require periodic updating. Current prognostic models for mortality were published between 2005 and 2007, including Acute Physiology and Chronic Health Evaluation (APACHE) IV (AUC = 0.88),5 Simplified Acute Physiology Score (AUC = 0.848),26 and Mortality Probability Admission Model (MPM0)-III (AUC = 0.823).27 The organ dysfunction scores that assess the presence and severity of organ dysfunction include MODS (AUC = 0.695), Sequential Organ Failure Assessment (SOFA) (AUC = 0.776), and Logistic Organ Dysfunction Score (AUC = 0.805).11 The AUC we report here for the entire cohort and for ICU subtype groups compares favourably with these other models.

The development model showed strong agreement between observed and expected mortality as assessed using the Hosmer-Lemeshow goodness-of-fit test. Limitations of this decile-based analysis include the influence of sample size and the arbitrary selection of the risk categories.28,29,30 In this external validation, calibration was assessed using loess-based calibration plots, calibration-in-the-large, and calibration slope.23 Although the results are not directly comparable, the underlying conclusions are that the model has acceptable calibration in both the development and validation data sets, indicating good overall agreement between observed and expected ICU mortality.

Discriminative ability increased slightly in this external validation and the membership model did indicate some case-mix differences. We anticipated that a data set containing over 120,000 patients would include a more diverse case-mix than the development data set. Differences in case-mix can include the distribution of predictor values, varied participant or setting characteristics, and incidence of the outcome.18 This increase in heterogeneity would enhance discriminative ability in the validation cohort, and has several effects on model performance across different settings and populations.31,32 In fact, case-mix variation can lead to differences in the performance of a prediction model, even when the true predictors’ effects are consistent.31

Benchmarking is an approach to identify and implement best practices.1,33 Indicators selected for benchmarking can be compared over time within a single unit or practice, across units or practices, or against a predetermined goal. Many potential indicators will not require risk or case-mix adjustment, while this will be needed for most patient-related outcomes such as mortality and length of stay. We caution against use of simple rank ordering or comparisons of one unit to another since regression models, such as the one we report, provide an estimated risk based on the average of the entire cohort. While our recalibration has reduced the bias across this cohort, estimates for subgroups or individual ICUs will remain biased. As can be seen in our data, it appears that teaching hospitals perform worse than average, community hospitals with high ventilator usage perform better than average, and cardiac units perform much better than average. Nevertheless, this would be a false conclusion since the differences across subgroups must cancel out across the entire cohort. At most, results for subgroups or individual ICUs should be compared only with the average estimated performance and should include confidence intervals.3 Models could be recalibrated for specific ICU subtypes but this involves subjective categorization of units and will not resolve the bias for individual ICUs. One randomized trial used quantiles to identify achievable performance levels for groups of units and reported improved performance in individual units.34 Ultimately, we believe that models such as these should be used to monitor performance over time only within individual ICUs. One such approach incorporates risk-adjusted measures into statistical process control methods.35,36
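As a rough illustration of monitoring within a single ICU over time, the Python sketch below tabulates observed deaths against the sum of predicted risks (expected deaths) for each reporting period. It is not the statistical process control method of the cited references; it only computes the risk-adjusted quantity that such a chart would track, and the input layout and period labels are assumptions made for this example.

```python
def observed_vs_expected(periods):
    """Summarize observed and expected deaths per period for one ICU.

    `periods` maps a period label to a list of (died, predicted_risk) pairs,
    where `died` is 0 or 1 and `predicted_risk` is the model's estimated
    probability of ICU death for that admission.
    """
    summary = {}
    for label, records in periods.items():
        observed = sum(died for died, _ in records)
        expected = sum(risk for _, risk in records)
        ratio = observed / expected if expected > 0 else None
        summary[label] = {"observed": observed, "expected": expected, "ratio": ratio}
    return summary

# Hypothetical data for two months in one unit.
example = observed_vs_expected({
    "2016-01": [(1, 0.5), (0, 0.5)],
    "2016-02": [(0, 0.25), (0, 0.25), (1, 0.5)],
})
```

A ratio drifting away from the unit's own historical level, rather than away from another unit's ratio, is the signal of interest under this approach; control limits or confidence intervals would be needed before acting on any single period.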

There are numerous strengths to this study. First, the breadth of the units that submit data to the CCIS allows for testing of both reproducibility (similar ICU subtype groups) and transportability (different ICU subtype groups), and the size of the CCIS data set provided ample statistical power for the required analyses. The TRIPOD framework indicates that a model’s predictive performance should be evaluated in relation to subgroups of interest, such as age or sex, specific settings or population rather than just across all individuals combined, which can mask any deficiencies in the model.19 It is increasingly recognized that the predictive performance of a model tends to vary across settings, populations, and periods,22,31,37,38 which implies there is often heterogeneity in model performance and that multiple external validation studies are needed to fully appreciate the generalizability of a prediction model.22 In this study, we have conducted subgroup analyses for each ICU subtype to evaluate performance in specific ICU patient populations. Another strength is adherence to the TRIPOD guidelines, which include references to appropriate analytic methods and complete reporting of the results.15,22,39,40 Next, both MODS and NEMS are relatively easy to collect, making this prediction tool more apt for risk-adjustment compared with more complex scoring systems. MODS requires only eight routinely collected variables and, in contrast to SOFA, is not dependent on treatment.41 NEMS assesses ICU resource utilization and efficiency and has been validated as a nurse workload measure in large cohorts of ICU patients.42 It is easy to use with minimal inter-observer variability,9,42 but has not been evaluated as a mortality or risk prediction tool.

Limitations of this study include our inability to adjust for chronic health status as these data are not captured in the CCIS. Linkage to other data sets containing comorbidity data such as the Canadian Institute for Health Information Discharge Abstract Database could resolve this limitation, but we did not have access to identifiable patient information and such linkage was not possible. Another limitation is that, although ICU mortality is a proximal metric that can be used to evaluate quality of care in the ICU and ultimately improve patient outcomes, ICU survival is not a patient-centred goal. We found a low frequency of patients within the range of severity where mortality is over-predicted; however, this would need to be monitored regularly to ensure that results are interpreted correctly. Also, we could not evaluate the burn ICU subtype group accurately because of the low number of deaths. Finally, although there are no published studies on the accuracy of the CCIS data, we previously reported that inter-observer variability in data collection appears to be randomly distributed.43


Following an intercept update to adjust for the difference in mortality between the development and validation data sets, our ICU mortality prediction model performs well and shows both reproducibility and transportability. Some ICU subtype groups show inferior model fit compared with others, but the over-estimation of mortality occurs primarily in risk groups with low prevalence and thus has a minimal impact on overall calibration. These models could be used to provide risk-adjusted mortality rates to support performance measurement over time within individual ICUs using data that are easy and feasible to collect. Since the model represents an average of all the patients in the cohort, we recommend that it not be used for simple comparisons between ICUs or ICU subtypes.