INTRODUCTION

The United States Department of Veterans Affairs (VA) healthcare system is the nation’s largest integrated healthcare delivery system, with approximately 550,000 acute care hospitalizations annually across 140 acute care hospitals.1 Starting in 2005, the VA began to measure and report risk-adjusted mortality for patients admitted to intensive care units (ICUs) for the purpose of performance assessment and improvement.2,3 Tracking risk-adjusted mortality is useful for evaluating changes over time, assessing the response to specific policies or performance improvement initiatives, and identifying hospitals with greater-than-predicted mortality for further review.

The VA’s risk-adjustment model includes data on patients’ demographics, chronic health conditions, admitting diagnosis, and physiology within the first 24 h of admission, similar to the Acute Physiology and Chronic Health Evaluation (APACHE)4 measure.3 The development, validation, and first re-calibration of the ICU mortality model were published previously.2,3 Over the past 15 years, however, the mortality model has been adapted for risk adjustment of all inpatient hospitalizations, updated to incorporate additional variables, and re-calibrated annually to account for temporal changes in diagnosis, coding, medical management, and outcomes. Periodic re-fitting of risk-adjustment models is necessary to prevent model performance from degrading over time.5,6 Consistent with the approach of the Centers for Medicare and Medicaid Services, the VA mortality models are re-calibrated annually.

Given the expansion and revision of the VA mortality models since publication of the original VA ICU mortality model, we sought to evaluate their performance in a recent sample of hospitalizations. Specifically, we tested the models’ discrimination, assessed their calibration, and examined the stability of model performance across quarters. While we examined all four mortality models in operational use, we focused on the acute care 30-day mortality model because it is the most comprehensive (it includes both ward and ICU patients and captures both in-hospital and post-discharge mortality) and is therefore the most important mortality model for overall performance assessment.

METHODS

Setting

The VA healthcare system is an integrated healthcare delivery system that provides comprehensive healthcare to Veterans. The VA was among the first healthcare delivery systems to adopt a universal electronic health record and to measure and report risk-adjusted mortality.1

Mortality Models

As part of routine performance assessment, the VA measures and reports four standardized mortality ratios (SMRs) for each VA hospital on a quarterly basis: (1) acute care 30-day mortality (acute care SMR-30); (2) ICU 30-day mortality (ICU SMR-30); (3) acute care in-hospital mortality (acute care SMR); and (4) ICU in-hospital mortality (ICU SMR). The mortality models are each developed on a rolling 2-year look-back of VA hospitalizations, then applied to the current fiscal year. The inclusion criteria, definitions, and key differences of each SMR model are presented in Appendix 1 and Supplemental Table 1. A summary of key changes to the models since their last description is presented in Supplemental Table 2.
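Although the text does not state the formula explicitly, each SMR follows the standard definition of a standardized mortality ratio, which may help fix ideas:

$$\mathrm{SMR} \;=\; \frac{O}{E} \;=\; \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} \hat{p}_i},$$

where, for the $n$ eligible hospitalizations at a given hospital in the reporting period, $y_i$ indicates whether hospitalization $i$ ended in death and $\hat{p}_i$ is its model-predicted probability of death. An SMR greater than 1 indicates more deaths than the model predicts.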

For the acute care models, predicted mortality is estimated using a logistic regression model that includes the following predictors: age, admitting diagnosis category, major surgical procedure category, 29 comorbid conditions, physiologic variables (sodium, BUN, creatinine, glucose, albumin, bilirubin, white blood cell count, and hematocrit), immunosuppressant status, ICU stay during hospitalization, medical or surgical diagnosis-related group (DRG), source of admission (e.g., inter-hospital transfer, nursing facility), and marital status. For physiologic variables, the most deranged value within a specified time frame is included in the model. For non-operative patients, this time frame is between 24 h prior to and 24 h after hospital admission. For operative patients, it is between 14 days prior to and 24 h after hospital admission. Normal values are imputed for missing physiologic variables, as is conventional for risk adjustment.7 The admitting diagnosis category assigns all possible admitting diagnoses to one of 51 mutually exclusive groupings, which were consolidated from the Healthcare Cost and Utilization Project’s Clinical Classifications Software categories8 based on clinical similarity and on the observed mortality rate. Similarly, the major surgical procedure category comprises 24 mutually exclusive groupings based on major surgical procedures within 24 h of presentation. Comorbid conditions are identified from diagnostic codes during hospitalization, using the methods of Elixhauser et al., adapted for ICD-10 coding.9,10 Immunosuppressant status is defined based on use of immunosuppressive medications in the 90 days prior to hospitalization.11 The ICU models include additional physiologic variables (PaO2, PaCO2, and pH) as well as hospital length of stay prior to ICU admission.
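To make the time-window and imputation logic concrete, the following is a minimal sketch in Python. The variable names, reference values, and the "farthest-from-normal" rule for identifying the most deranged value are illustrative assumptions for demonstration, not the VA’s operational specification.

```python
import pandas as pd

# Hypothetical reference values used both for imputation and as the
# "normal" anchor when judging derangement (illustrative only).
NORMAL = {"sodium": 140.0, "creatinine": 1.0}

def worst_value(labs: pd.DataFrame, var: str, admit: pd.Timestamp, operative: bool) -> float:
    """Return the most deranged value of `var` within the model's time window:
    24 h before to 24 h after admission for non-operative patients, or
    14 days before to 24 h after admission for operative patients.
    A normal value is imputed when no measurement exists in the window."""
    lookback = pd.Timedelta(days=14) if operative else pd.Timedelta(hours=24)
    window = labs[(labs["test"] == var)
                  & (labs["time"] >= admit - lookback)
                  & (labs["time"] <= admit + pd.Timedelta(hours=24))]
    if window.empty:
        return NORMAL[var]  # impute normal for a missing physiologic variable
    # Treat "most deranged" as farthest from the reference value (an assumption).
    return window.loc[(window["value"] - NORMAL[var]).abs().idxmax(), "value"]

labs = pd.DataFrame({"test": ["sodium", "sodium"],
                     "value": [128.0, 136.0],
                     "time": pd.to_datetime(["2019-01-01 02:00", "2019-01-01 20:00"])})
print(worst_value(labs, "sodium", pd.Timestamp("2019-01-01 06:00"), operative=False))  # 128.0
```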

Model Performance

For this study, the SMR models were developed using hospitalizations from fiscal years (FY) 2017–2018, and model performance was assessed using hospitalizations in FY 2019. Thus, the study examines a recent, but pre-pandemic, cohort of hospitalizations. We evaluated model performance using c-statistics to assess discrimination and comparison of observed versus predicted deaths by decile of predicted risk to assess calibration (i.e., the agreement between observed outcomes and predictions).6,12,13,14 The c-statistic is a measure of goodness of fit for the binary outcome of a logistic regression model, and equals the probability that a randomly selected hospitalization that ended in death had a higher predicted risk than a randomly selected hospitalization that did not.15,16 Additionally, we report Hosmer-Lemeshow goodness-of-fit chi-square statistics and Brier scores (to harmonize with a prior study of the VA’s mortality model3), as well as the mean and maximum difference in observed versus predicted percent mortality across deciles of risk to summarize model calibration.14 We considered model discrimination to be strong when the c-statistic was >0.8, consistent with standard practice.15,16 We are not aware of any generally accepted threshold for grading model calibration,12,13,17 but considered overall and mean calibration errors of <1.0% to reflect good model calibration.
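As an illustration of these metrics, the sketch below computes a c-statistic, Brier score, and decile-level calibration errors on synthetic data. The analysis itself was performed in SAS, so this Python sketch is purely expository; `y` and `p` stand in for observed 30-day mortality (0/1) and model-predicted risk.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
p = rng.beta(0.5, 10.0, size=100_000)  # synthetic predicted risks, skewed low
y = rng.binomial(1, p)                 # synthetic observed outcomes

c_statistic = roc_auc_score(y, p)      # discrimination
brier = brier_score_loss(y, p)         # Brier score

# Calibration: observed vs predicted mortality by decile of predicted risk.
df = pd.DataFrame({"y": y, "p": p})
df["decile"] = pd.qcut(df["p"], 10, labels=False, duplicates="drop")
cal = df.groupby("decile").agg(observed=("y", "mean"), predicted=("p", "mean"))
cal["abs_error_pct"] = 100 * (cal["observed"] - cal["predicted"]).abs()

print(f"c-statistic {c_statistic:.3f}, Brier score {brier:.4f}")
print(f"mean decile error {cal['abs_error_pct'].mean():.2f}%, "
      f"max {cal['abs_error_pct'].max():.2f}%")
```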

We assessed model performance in the derivation cohort, the validation cohort, and by quarter for the validation cohort. For the ICU models, we also assessed model performance by level of intensive care, as defined by availability of subspecialty services.18 Finally, for the acute care SMR-30 model, we evaluated c-statistics in a series of nested models to understand the incremental impact of administrative and clinical data on model discrimination.
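The nested-model comparison can be sketched as follows, again on synthetic data; the point is the mechanics (fit increasingly rich predictor sets, compare c-statistics), not the VA’s actual predictor definitions, which are given in Supplemental Table 6.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 50_000
X_admin = rng.normal(size=(n, 5))   # stand-ins for administrative predictors
X_physio = rng.normal(size=(n, 8))  # stand-ins for physiologic predictors
logit = 0.4 * (X_admin @ rng.normal(size=5)) + 0.6 * (X_physio @ rng.normal(size=8)) - 3.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Fit nested models and report the incremental gain in discrimination.
for label, X in [("administrative only", X_admin),
                 ("administrative + physiologic", np.hstack([X_admin, X_physio]))]:
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"{label}: c-statistic {roc_auc_score(y, fit.predict_proba(X)[:, 1]):.3f}")
```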

Data management and analysis were completed in SAS Enterprise Guide 8.3 (SAS Institute Inc., Cary, NC). Figures were produced in R. This study was approved by the Ann Arbor VA Institutional Review Board with a waiver of informed consent.

RESULTS

Cohort Characteristics

Among 1,996,645 inpatient stays during fiscal years 2017–2019, there were 1,143,351 acute care hospitalizations meeting criteria for the acute care SMR-30. Of the 1,996,645 inpatient stays, 673,813 (33.7%) were excluded due to a non-acute care treating specialty (e.g., nursing, psychiatry, and rehabilitation care), 114,068 (5.7%) because they occurred within 30 days of a prior hospitalization, 1,280 (0.1%) because they involved specialized treatments (organ transplantation or left ventricular assist device), 415 (0.02%) because the patient died within 4 h of arrival, and 20,011 (1.0%) because of hospice care during the calendar day of admission or the preceding year. Study flow diagrams showing the application of model exclusions for each SMR model are presented in Supplemental Tables 3 and 4, while Supplemental Table 5 shows the number of unique patients in the acute care and ICU SMR-30 models.

Acute care SMR-30 cohort characteristics and outcomes are presented in Table 1. Hospitalized patients had a median age of 68 years (IQR 61–74), were 94.4% male and 70.8% White, and had a median of 3 comorbid conditions (IQR 2–5). The majority of hospitalizations were admitted via the emergency department or directly, while 15.1% were admitted from the operating room, 2.4% were transferred from another hospital, and 1.9% were admitted from nursing facilities. The most common admission diagnosis categories were musculoskeletal injuries (7.5%), congestive heart failure (5.7%), non-sepsis/non-pneumonia infections (5.6%), neurological diseases (5.2%), and sepsis (4.5%). In-hospital mortality was 1.8%, and 30-day mortality was 4.3%. Patient characteristics were similar between the derivation (FY 2017–2018) and validation (FY 2019) cohorts. For the acute care SMR-30 validation cohort, the predicted risk of 30-day mortality was a median of 1.6% (IQR 0.6–4.4%) and a mean of 4.7% (Fig. 1).

Table 1 Patients and Hospitalization Characteristics for Derivation, Validation, and Full Cohort for the Acute Care SMR-30 Model
Figure 1

Histogram showing the distribution of predicted risk of 30-day mortality for the Acute Care SMR-30 Validation Cohort. Predicted 30-day mortality for the SMR-30 derivation cohort was a median of 1.5% (IQR 0.5–4.1%), mean 4.3%; for the validation cohort, a median of 1.6% (IQR 0.6–4.4%), mean 4.7%. The y-axis uses a pseudo-log transformation with a smooth transition to a linear scale around 0.

Model Performance

In total, across the four SMR models, we assessed model performance for 24 different scenarios in the validation data, as defined by the model of interest, the time period of interest, and (for the ICU models only) the level of intensive care available (Table 2). Overall, the c-statistic ranged from 0.848 to 0.918 across the 24 scenarios, indicating that model performance was consistently strong. When examining nested models for the SMR-30 model, the c-statistic was 0.840 in a basic administrative model, 0.853 in an enhanced administrative model, and 0.870 in the full model, showing the added benefit of including physiologic data (Supplemental Table 6).

Table 2 Performance of the SMR Models in Derivation and Validation Cohorts

The calibration plot (Fig. 2) and Table 3 show that the acute care SMR-30 model was well-calibrated in the validation cohort. There were 16,036 deaths (4.29% mortality) in the SMR-30 validation cohort versus 17,458 predicted deaths (4.67%), reflecting 0.38% over-prediction. Across deciles of predicted risk, the absolute difference in observed versus predicted percent mortality was a mean of 0.38%, with a maximum error of 1.81% in the highest-risk decile. Calibration plots and tables for the acute care SMR, ICU SMR, and ICU SMR-30 models are presented in Supplemental Figures 1–3 and Supplemental Tables 7–9. As with the acute care SMR-30 model, observed versus predicted mortality was within 1.0% for the acute care SMR, ICU SMR, and ICU SMR-30 validation cohorts. Additionally, the mean error across risk deciles was <1.0%, and error greater than 1.0% was seen only in the highest-risk decile of each model.

Figure 2

Observed vs predicted mortality in the SMR-30 Validation Cohort using 10 equally sized bins defined by decile of predicted risk. This figure depicts the number of predicted and observed deaths in the validation cohort, stratified by decile of predicted risk of mortality. The number of hospitalizations per decile, as well as the observed and predicted number and proportion of deaths by decile, are presented in Table 3.

Table 3 Observed vs Predicted Mortality in the SMR-30 Validation Cohort Using 10 Equally Sized Bins Defined by Decile of Predicted Risk

DISCUSSION

The VA was among the first healthcare systems to measure and report risk-adjusted ICU mortality, and over the past 15 years the VA’s mortality model has been updated, re-calibrated annually, and adapted for risk adjustment of all VA acute care hospitalizations. In this study, we show that the VA’s mortality models (acute care SMR-30, acute care SMR, ICU SMR-30, and ICU SMR) strongly discriminate in-hospital and 30-day mortality. Furthermore, the models are well-calibrated, with observed versus predicted mortality within 1% for all but the highest-risk decile. Overall, the performance of each of the four VA mortality models is similar to that of the initial VA ICU mortality model,11,19 similar to other physiology-based mortality models such as APACHE,4,7,20,21 and superior to risk models using administrative data alone.22,23 Likewise, the relatively lower calibration in the top risk decile is consistent with other physiologic risk-adjustment models.7

A second major finding of our study is that the rates of inpatient and 30-day mortality for eligible acute care hospitalizations are relatively low (1.8% and 4.3%), which limits the ability to differentiate hospitals statistically based on mortality.24 Nonetheless, mortality monitoring is a critical component of quality measurement: it can identify hospitals with statistically greater-than-predicted mortality, and it can surface numeric differences that trigger further review to identify and remediate problems before statistically significant differences in mortality arise. The strong performance of the VA mortality models lends credibility to their use in hospital evaluation and their ability to account for differences in patient case-mix across hospitals. However, mortality does not equate to quality. Greater-than-predicted mortality may occur for a number of reasons, not all of which reflect poor care. Thus, these mortality models serve as a warning tool to trigger deeper review, but are not a stand-alone marker of hospital quality; the results must be contextualized and evaluated alongside other metrics.

Several aspects of the modeling approach warrant further discussion. First, hospitalizations were assigned to one of 51 mutually exclusive admission diagnosis categories based on their admitting diagnosis, similar to the approach taken in Kaiser Permanente Northern California’s risk-adjustment model.7 By contrast, other models have used hierarchical approaches to classifying admission diagnoses, which are not based on diagnostic codes and must therefore be captured separately within the clinical workflow. For example, the UK’s Intensive Care National Audit and Research Centre’s coding method classifies ICU admissions by type (surgical, medical), system (e.g., respiratory), site (e.g., lungs), process (e.g., infection), and condition (e.g., bacterial pneumonia).25 While there are 741 unique conditions in this approach, five conditions accounted for 19.4% of all admissions,25 and the majority of unique conditions were ultimately excluded from the model due to imprecision in estimating the association between the condition and mortality (in which case hospitalizations are classified by body system).21 The VA admission diagnosis categories each include one or more Clinical Classifications Software (CCS)8 diagnosis categories. The merging of CCS categories into admission diagnosis categories was informed by clinical rationale, as well as by the observed mortality rates for CCS categories. For this reason, for example, upper and lower extremity fractures were merged together, while hip fracture was kept as a separate diagnosis category due to its higher associated mortality. Mapping individual admission diagnoses to admission diagnosis categories via the CCS categories facilitates assignment of any new ICD-10-CM codes to an admission diagnosis category, since the Agency for Healthcare Research and Quality updates the CCS on an ongoing basis. While some clinicians may prefer more granular admission diagnosis groupings, each group must have a sufficient number of observations to estimate its association with mortality, limiting the number of discrete diagnosis groupings that can be used in practice. Instead, the physiologic variables serve to further differentiate hospitalizations within the same diagnosis category, and consistently provide far more prognostic information than the diagnosis category.7,11
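The two-step mapping can be pictured as a pair of lookup tables, as in the sketch below. The specific codes and category names are hypothetical stand-ins chosen to mirror the hip-fracture example above, not the VA’s actual crosswalk.

```python
# Hypothetical sketch: ICD-10-CM code -> CCS category -> one of the
# 51 admission diagnosis categories. All entries are illustrative.
ICD10_TO_CCS = {
    "S72.001A": "Fracture of neck of femur (hip)",
    "S52.501A": "Fracture of upper limb",
    "J18.9": "Pneumonia",
}
CCS_TO_ADMIT_CATEGORY = {
    "Fracture of neck of femur (hip)": "Hip fracture",   # kept separate: higher mortality
    "Fracture of upper limb": "Extremity fracture",      # merged with similar CCS groups
    "Pneumonia": "Pneumonia",
}

def admit_category(icd10_code: str) -> str:
    """Map an admitting ICD-10-CM code to its admission diagnosis category.
    New codes are covered as soon as AHRQ maps them to a CCS category."""
    ccs = ICD10_TO_CCS.get(icd10_code, "Unclassified")
    return CCS_TO_ADMIT_CATEGORY.get(ccs, "Other")

print(admit_category("S72.001A"))  # -> "Hip fracture"
```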

Second, hospitalizations were excluded from the VA model if the patient had a hospice encounter in the year preceding or on the calendar day of admission. Only 1.5% and 0.1% of otherwise eligible hospitalizations were excluded due to hospice encounters prior to or on the day of admission, respectively. Furthermore, in exploratory analyses without this exclusion, the mortality models performed similarly, since the model consistently identifies patients referred to hospice as having a high risk of mortality. Some clinicians may argue for expanding the hospice exclusion to also exclude patients who transition to hospice at later points in the hospitalization. However, a majority of patients who die during inpatient hospitalization are transitioned to comfort-only measures or have treatment limitations initiated prior to death, such that broad exclusions of patients with hospice care could substantially limit the ability to differentiate mortality outcomes across hospitals. Excluding hospitalizations with hospice care initiated on or before the calendar day of admission was felt to be the fairest approach. However, the best approach to incorporating treatment limitations into hospital performance assessment remains an area of ongoing study, and best practices are yet to be defined.26 Through the VA’s Life-Sustaining Treatment Decisions Initiative, there is a national effort to elicit, honor, and document Veterans’ values, goals, and healthcare treatment preferences. The initiative’s harmonized approach to documenting treatment preferences across VA hospitals may allow treatment preferences documented at hospital admission to be incorporated into performance measurement in the future.

Third, physiologic variables are currently incorporated into the VA’s mortality models as categorical variables, which allows for ready interpretation of the association between physiologic derangements and the risk of mortality. By contrast, some other models (and the VA’s initial ICU mortality model) use cubic splines,3,7,11 which allow for more flexible parameterization of the physiologic variables but come at the cost of decreased transparency, since the model output is not readily interpretable. The opacity of regression models has been cited as a key drawback of regression-based performance assessment, as it may reduce credibility and motivation to act on the assessment results.27 Thus, given the trade-offs between statistical precision and interpretability, there is no single “best approach” to the incorporation of physiologic variables. The current VA mortality models using categorical physiologic variables perform similarly to the prior VA ICU mortality model using cubic splines, indicating that the loss of performance is minimal; the added statistical precision of splines may therefore not be worth the added complexity of interpretation.
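A brief sketch of the categorical parameterization follows; the cut points are illustrative assumptions, not the VA’s operational thresholds.

```python
import pandas as pd

# Bin a continuous lab value into clinically interpretable ranges, each of
# which enters the logistic model as an indicator variable (cut points are
# hypothetical).
creatinine = pd.Series([0.7, 1.3, 2.4, 5.1])
bins = [0, 1.2, 2.0, 4.0, float("inf")]
labels = ["normal", "mild", "moderate", "severe"]
categories = pd.cut(creatinine, bins=bins, labels=labels)
indicators = pd.get_dummies(categories, prefix="creatinine")
print(indicators)
# Each indicator's fitted coefficient reads directly as the mortality odds
# associated with that derangement range, unlike a spline basis.
```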

There are some limitations to acknowledge. First, there are many drawbacks to the use of risk-adjusted mortality for measuring hospital quality, which are discussed in detail elsewhere, including low power, inability to differentiate preventable versus unpreventable deaths, and the imperfect correlation between process and outcome measures.28,29 Despite these limitations, monitoring risk-adjusted mortality is an important component of quality improvement, as discussed above. Second, the VA’s acute care mortality models incorporate 8 physiologic variables (sodium, BUN, creatinine, glucose, albumin, bilirubin, white blood cell count, and hematocrit), with an additional three (PaO2, PaCO2, and pH) included in the ICU models. These physiologic variables are commonly included in other physiologic risk-adjustment models and have high clinical face validity, but they are not fully comprehensive. Additional physiologic measurements such as vital signs (heart rate, blood pressure, respiratory rate, pulse oximetry), mental status, and blood lactate may provide additional prognostic information.30 Vital signs and mental status cannot be readily incorporated into the VA’s mortality model at present because, in many units, they are recorded outside the electronic health record (e.g., in ICU-specific programs), leading to systematic missingness that could bias risk adjustment. Lactate measurements, however, are available in the electronic health record and are currently being considered for incorporation into the VA mortality models. Finally, the VA patient population has a unique demographic, risk factor, and comorbidity profile, so the models may not generalize to other settings. Indeed, model performance often degrades when models are applied to new settings, underscoring the need for periodic model evaluation and recalibration and the benefit of developing context-specific models rather than simply applying “off-the-shelf” risk tools.5,21

CONCLUSIONS

We have shown that the VA’s mortality models, which incorporate patient physiology and are recalibrated annually using hospitalizations from the prior 2 years, are highly predictive and well-calibrated both overall and across risk deciles. The strong model performance underscores the benefit of incorporating physiologic data and of developing models in the population and setting in which they will be used.