Can a 5-to-90-day Mortality Predictor Perform Consistently Across Time and Equitably Across Populations?

Advance care planning (ACP) facilitates end-of-life care, yet many die without it. Timely and accurate mortality prediction may encourage ACP. However, performance of predictors typically differs among sub-populations (e.g., rural vs. urban) and worsens over time (“concept drift”). Therefore, we assessed performance equity and consistency for a novel 5-to-90-day mortality predictor across various demographies, geographies, and timeframes (n = 76,812 total encounters). Predictions were made for the first day of included adult inpatient admissions on a retrospective dataset. AUC-PR remained at 29% both pre-COVID (throughout 2018) and during COVID (8 months in 2021). Pre-COVID-19 recall and precision were 58% and 25% respectively at the 12.5% certainty cutoff, and 12% and 44% at the 37.5% cutoff. During COVID-19, recall and precision were 59% and 26% at the 12.5% cutoff, and 11% and 43% at the 37.5% cutoff. Pre-COVID, compared to the overall population, recall was lower at the 12.5% cutoff in the White, non-Hispanic subgroup and at both cutoffs in the rural subgroup. During COVID-19, precision at the 12.5% cutoff was lower than that of the overall population for the non-White and non-White female subgroups. No other significant differences were seen between subgroups and the corresponding overall population. Overall performance during COVID was unchanged from pre-pandemic performance. Although some comparisons (especially precision at the 37.5% cutoff) were underpowered, precision at the 12.5% cutoff was equitable across most demographies, regardless of the pandemic. Mortality prediction to prioritize ACP conversations can be provided consistently and equitably across many studied timeframes and sub-populations. Supplementary Information: The online version contains supplementary material available at 10.1007/s10916-023-01962-z.


Introduction
Advance care planning (ACP, which may also refer to the resulting advance care plan) is a process to discuss and document patients' preferences for end-of-life care [1]. Patients and clinicians agree that ACP enables each patient to receive their desired life-extending care while avoiding the pain, discomfort, social separation, and cost of end-of-life procedures that the patient does not want [2,3]. Demonstrated ACP benefits include respecting end-of-life wishes, decreasing the burden on loved ones, stress reduction, improved patient satisfaction, and fewer in-hospital deaths [3].
Although experts agree on the importance of ACPs, clinicians cite time constraints and poor communication with other providers as barriers to having end-of-life discussions [3,4]. Reduced access to healthcare in mixed-rurality populations may make ACP even more unlikely [5]. Due to these barriers, many patients do not have documented preferences at the end of life and therefore do not achieve what has been termed an "ideal death" [6][7][8].
Some algorithms predict mortality too early to create urgency or too late for meaningful ACP discussion. For example, the Charlson comorbidity index predicts mortality within the next ten years and may not create a sense of urgency [9], while the APACHE II and IV scores predict mortality risk for ICU patients during the current inpatient stay [10], when the ability to have meaningful discussion may be compromised (e.g., due to obtundation or mechanical ventilation) [11,12].
Accordingly, NYU Langone Health developed an algorithm to predict mortality within 60 days after the start of an inpatient admission using data from their three medical centers in New York City. Their aim was to support identification of palliative care candidates. Their model utilized 9614 features and achieved 0.28 area under the precision-recall curve (AUC-PR) [13].
We sought a model to predict post-inpatient mortality to meet a different need: to help prioritize and encourage timely ACP conversations during an inpatient stay. Although our system aims for ACPs with every patient, time constraints and other factors can make this infeasible. Our system serves a mixed-rurality population, and rurality constraints (e.g., gaps in palliative care availability and longer travel distances for care) may further reduce ACP feasibility [7]. Predicting mortality using clinician gestalt alone may have limited accuracy, but combining gestalt with a predictive model may be synergistic [14]. Therefore, to help prioritize ACPs when resources are limited, and to encourage clinicians to have ACPs in those more likely to benefit, we developed a model to predict mortality occurring 5-to-90 days after the start of an inpatient admission. For more information about the model, see Supplement.
The model's intended use is to predict mortality soon after the length of an average inpatient stay. Therefore, the 5-to-90-day window was chosen to: 1) begin after the average 4-day length of an inpatient stay [15], 2) allow at least 4 days for an ACP if the inpatient stay is longer than average, and 3) create enough urgency to stimulate the ACP. Since much of the data feeding the model may come from outpatient care prior to the admission, and most of the prediction window covers a period when most patients will have been discharged after the admission, the effects of home geography on mortality and its prediction, access to care, and likelihood of an ACP are highly relevant. Initial efforts to build a predictor inspired by many of the Langone model's strongest reported features did not lead to adequate performance, so a new model had to be created for our mixed-rurality population. The model appears to be novel because it was trained on a mixed-rurality population, utilizes a 5-to-90-day prediction window, and requires only 13 input features (easing implementation and the ability to explain predictions; see Table 1).
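The 5-to-90-day window described above can be sketched as a labeling rule for the target variable. This is a minimal illustration, not the authors' implementation: the function name `died_in_window` is hypothetical, and treating both ends of the window as inclusive and counting days from the admission date are assumptions.

```python
from datetime import date
from typing import Optional

# Window bounds taken from the text; inclusivity of both ends is an assumption.
WINDOW_START_DAYS = 5
WINDOW_END_DAYS = 90


def died_in_window(admit_date: date, death_date: Optional[date]) -> bool:
    """Return True if death occurred 5 to 90 days after the admission date."""
    if death_date is None:
        # No recorded death: the target is negative.
        return False
    days_after_admission = (death_date - admit_date).days
    return WINDOW_START_DAYS <= days_after_admission <= WINDOW_END_DAYS
```

Under this rule, a death on day 4 (within the average stay) or on day 91 is labeled negative, which matches the stated intent of predicting deaths soon after a typical stay rather than during it.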
Algorithms can experience performance degradation over time due to "concept drift" [16] and may perform differently across demographic groups [17]. This can lead to mistrust of the model and loss of its benefits, while varying performance across demographic groups can lead to healthcare inequities [18]. Therefore, this study assesses whether the model retains predictive performance over time (especially during a global pandemic) and performs equitably across patient subgroups.
Features listed in Table 1 include:
• … prior to the encounter to 1 month prior to the encounter to the period of 1 month prior to the encounter to the time of the encounter
• Range of BNP from the time of the encounter to one month prior
• Change in the count of abnormal labs per day comparing the period of 12 months prior to the encounter to 1 month prior to the encounter to the period of 1 month prior to the encounter to the time of the encounter
• Count per day of outpatient visits per day from 12 months to 1 month prior to the encounter
• Average red blood cell count (RBC) from the time of the encounter to one month prior
• Range of total bilirubin from the time of the encounter to one month prior
• Count per day of inpatient visits per day from 12 months to 1 month prior to the encounter
• Count per day of emergency department (ED) visits per day from …

Objective
We sought to retrospectively evaluate the model's performance across different timeframes and demographic subgroups, and to compare the consistency and equity of its performance in those contexts.

Declarations
This study was approved with exemption determination by the University of Illinois College of Medicine at Peoria Institutional Review Board.

Model assessment
We assessed the model on datasets retrospectively extracted from the health system's enterprise data warehouse (EDW), which contains data from a variety of sources, including the health system's electronic health record and another source [19] of death records. The pre-COVID dataset included visits throughout 2018 and the during-COVID dataset included visits during 8 months of 2021. Datasets contained one row per inpatient visit during the selected timeframe, including visits for patients >= 18 years of age at the time of admission whose resuscitation status at the time the model was assessed (a proxy for status on admission) was either "Full Code" or null. Since multiple health systems service the geography and different patients have different data elements available, we also required at least one lab test available in the EDW in the 31-365 days prior to the visit for its inclusion (to ensure at least minimal data available on which to predict). Since the model automatically adjusts for missing data and makes a "best effort" prediction despite it (described in the Supplement), predictions were made on every included patient. No visit used to originally develop or assess the model was used in this analysis. Although the model uses significantly engineered input features, all features are generated from a single query against the database.
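The inclusion criteria above can be expressed as a simple per-visit filter. The sketch below is illustrative only: the `Visit` record and its field names are hypothetical and do not reflect the EDW schema, and the 31-365-day bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Visit:
    """Hypothetical per-visit record; field names are illustrative only."""
    age_at_admission: int
    code_status: Optional[str]        # e.g. "Full Code", another status, or None
    lab_days_before_visit: List[int]  # days before admission for each lab result


def is_included(visit: Visit) -> bool:
    """Apply the three inclusion criteria described in the text."""
    is_adult = visit.age_at_admission >= 18
    eligible_status = visit.code_status is None or visit.code_status == "Full Code"
    # At least one lab test 31-365 days before the visit (bounds assumed inclusive).
    has_prior_lab = any(31 <= d <= 365 for d in visit.lab_days_before_visit)
    return is_adult and eligible_status and has_prior_lab
```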
Model performance was assessed by populating datasets with the input features and target variable (5-to-90-day mortality), generating a prediction using the features, and assessing performance in different timeframes, for different patient subsets, and at different certainty cutoffs. Boolean predictors produce a certainty value between 0 and 100%. Implementation teams select a certainty cutoff value to divide "yes" from "no" predictions, seeking the best tradeoff between false positives and false negatives given the intended use. To assess performance, we calculated precision (positive predictive value) and recall (sensitivity) at certainty cutoffs of 12.5% (for greater recall) and 37.5% (for greater precision), area under the receiver operating characteristic curve (AUC-ROC), and AUC-PR. Those cutoffs were chosen by clinicians as having appropriate false positive vs. false negative tradeoffs for our intended use (based on the model's prior performance on the development test set). All datasets ended at least 6 months prior to analysis to ensure at least 90 days had passed after the visit to populate the target variable plus another 90 days to account for death reporting delays.
Performance was assessed on various demographic subgroups. Since White non-Hispanic patients represent a majority of the studied population, other race/ethnicity subgroups were combined to reduce the likelihood of overly small subgroups. Socioeconomic disadvantage was estimated using the Area Deprivation Index (ADI) [20]. A within-state ADI decile was assigned using each patient's recorded home zip code. Since multiple ADI values could be associated with a single 5-digit zip code, when the ADI was mapped using a 5-digit zip code, the average of all ADI values for each 5-digit zip code was used. To reduce the likelihood of overly small subgroups, patients were grouped into ADI deciles of <= 5 and > 5. Patients were excluded from those subgroups if an ADI could not be assigned (e.g., no matching zip code). Performance by level of rurality was assessed using Rural-Urban Continuum Codes (RUCC) [21], mapped using the patient's home zip code and applying the suggested categorizations of codes 1-3 as metropolitan ("metro") and codes 4-9 as non-metropolitan ("non-metro"). Patients were excluded from those subgroups if an RUCC code could not be assigned. See Fig. 1 for a graphical representation of inclusion/exclusion among groups.
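To make the role of the certainty cutoff concrete, the following sketch computes precision and recall at a given cutoff from certainty values and observed outcomes. The function name is hypothetical, and treating a certainty equal to the cutoff as a "yes" prediction is an assumption; the text does not specify how ties are handled.

```python
import math
from typing import Sequence, Tuple


def precision_recall_at_cutoff(
    certainties: Sequence[float],
    outcomes: Sequence[bool],
    cutoff: float,
) -> Tuple[float, float]:
    """Return (precision, recall) when certainty >= cutoff counts as "yes"."""
    tp = fp = fn = 0
    for certainty, died in zip(certainties, outcomes):
        predicted_yes = certainty >= cutoff  # tie handling is an assumption
        if predicted_yes and died:
            tp += 1
        elif predicted_yes and not died:
            fp += 1
        elif died:
            fn += 1
    # Undefined ratios (no positive predictions or no deaths) are returned as NaN.
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    return precision, recall
```

Calling this once with `cutoff=0.125` and once with `cutoff=0.375` on the same predictions shows the tradeoff described above: raising the cutoff flags fewer patients, which generally lowers recall and tends to raise precision.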
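The subgroup assignment described above can be sketched as follows. All function names and the input layout (a list of zip code and ADI decile pairs) are hypothetical; only the averaging rule, the <= 5 versus > 5 split, and the RUCC 1-3 versus 4-9 split come from the text.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def average_adi_by_zip(adi_rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average all ADI values that map to the same 5-digit zip code."""
    grouped: Dict[str, List[float]] = {}
    for zip5, adi in adi_rows:
        grouped.setdefault(zip5, []).append(adi)
    return {zip5: mean(values) for zip5, values in grouped.items()}


def adi_group(zip5: str, zip_to_adi: Dict[str, float]) -> Optional[str]:
    """Return the ADI subgroup, or None when no ADI can be assigned."""
    adi = zip_to_adi.get(zip5)
    if adi is None:
        return None  # patient is excluded from the ADI subgroups
    return "ADI <= 5" if adi <= 5 else "ADI > 5"


def rucc_group(rucc_code: Optional[int]) -> Optional[str]:
    """Map a Rural-Urban Continuum Code to metro / non-metro."""
    if rucc_code is None:
        return None  # patient is excluded from the rurality subgroups
    if 1 <= rucc_code <= 3:
        return "metro"
    if 4 <= rucc_code <= 9:
        return "non-metro"
    return None  # codes outside 1-9 are treated as unassignable (assumption)
```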

Statistical methods
Statistical comparisons were performed using R (version 4.2.0). Precision and recall were compared between the total population and the population stratified by demographic variables using two-proportion z-tests with unequal sample sizes with a two-sided alternative hypothesis at 5% significance (alpha = 0.05). A Bonferroni correction for 24 tests for the pre-COVID dataset and 24 tests for the during-COVID dataset (the numbers of population and subset pairings) was used to adjust p-values for multiple comparisons within each performance metric (precision and recall). Post-hoc power analysis was done to determine the sample size required to detect a small Cohen's h effect size (0.2) [22] for a two-proportion z-test with unequal sample sizes with a power of 0.80. Correlation coefficients were calculated using Pearson r correlations.
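The study's analysis was performed in R; the Python sketch below only illustrates the named techniques and is not the authors' code. It assumes the standard pooled-variance form of the two-proportion z-test without continuity correction, which may differ from the exact R routine used. All function names are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

With 24 comparisons per dataset, an unadjusted p-value must fall below roughly 0.05 / 24 ≈ 0.0021 to remain significant after adjustment, which is one reason small subgroup differences may not reach significance.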

Results
The datasets included 76,812 distinct inpatient visits, 47,750 prior to the COVID-19 pandemic and 29,062 during the pandemic.
AUC-ROC and AUC-PR for the pre-COVID dataset were 82% and 29% respectively, and 81% and 29% for the during-COVID dataset. No significant differences were found in precision or recall at either cutoff when comparing predictor performance on the full pre-COVID and during-COVID datasets (Table 2).
Model performance on each demographic subset of the pre-COVID dataset was compared to its overall performance on that dataset (Table 3). The only significant differences in precision or recall between a subgroup and the overall population were lower recall in the White non-Hispanic population at the 12.5% cutoff and lower recall in the non-metro population at both cutoffs. While a majority of the comparisons were adequately powered, a substantial minority were underpowered.
For the during-COVID dataset (Table 4), compared to the overall population, the only significant differences among subgroups were lower precision in the Other Race/Ethnicity and the Other Race/Ethnicity female-only subgroups, but again, a substantial minority of comparisons (including all precision comparisons at the 37.5% cutoff) were underpowered.
AUC-PR was also calculated for the subgroups (Fig. 2).

Discussion
ACP informs end-of-life care to respect patient preferences, ensure quality of life, and avoid costly, unnecessary, and unwanted interventions [2,24]. Mortality prediction may help spur ACP conversations. Timely predictions may help strike a balance between sufficient clinical urgency and an adequate lead time to allow for these often time-consuming discussions [4,25]. These predictions may be especially useful in mixed-rurality populations due to relatively reduced access to healthcare compared to urban populations.
This work was inspired by studies out of NYU Langone demonstrating the performance and impact of their 60-day mortality prediction model, originally intended to encourage appropriate patient referrals to supportive and palliative care [13]. Their model's performance, with an AUC-PR of 28%, was also sufficient to achieve good rates of physician agreement with the alerts and greater use of ACPs [14]. Therefore, we sought similar performance for our model in our mixed-rurality population and to maintain that performance over time despite changing conditions. COVID-19 created significant systemic change in healthcare. Systemic change often causes performance degradation in machine learned models [16]. Our predictor demonstrated resistance to this concept drift, achieving an AUC-PR of 29% on both pre-COVID and during-COVID datasets.
NYU Langone selected a certainty cutoff providing 75% precision to identify likely-appropriate referrals to supportive and palliative care. The tradeoff for high precision was a recall of just 4.6% [13]. Since our intended use was solely to encourage ACP discussions, we evaluated two cutoffs designed to provide higher recall despite reduced precision. On the full pre-COVID dataset at a 12.5% certainty cutoff, our model achieved 58% recall and 25% precision; at a 37.5% cutoff the model achieved 12% recall and 44% precision. Model performance on the full during-COVID dataset did not significantly differ from that of the full pre-COVID dataset for any of those measures, demonstrating resistance to concept drift and performance degradation.
Previous work found racial differences in the relationship between physiologic and socioeconomic parameters and mortality prediction [26]. Many recommend accounting for potentially differing model performance among demographic groups [27][28][29]. The COVID-19 pandemic has disrupted healthcare, particularly affecting patients with low socioeconomic status [30,31]. The timing and effectiveness of ACPs can be affected by socioeconomic circumstances, race, and geographic location [32,33]. Given these considerations, we assessed model performance in different subgroups including rurality, level of socioeconomic disadvantage, gender, ethnicity, and race. Significant performance differences were not seen for most comparisons, with notable exceptions and caveats. Recall was significantly lower than that of the overall pre-COVID population for White non-Hispanic patients and patients from non-metro areas. The reason for this is uncertain, but as discussed below, equity in precision may be more important than equity in recall for this use. Also, as the largest subpopulation, small relative performance differences for White non-Hispanic patients will more easily achieve statistical significance. During COVID, the Other Race/Ethnicity subgroup and its female-only subset had lower precision than the overall population (likely affected by the low 5-to-90-day mortality prevalence).
Conclusions cannot be drawn and further research is warranted for a substantial minority of comparisons that were neither significantly different nor adequately powered. However, for the majority of comparisons, model performance was comparable to that of the overall population.
As expected, precision tended to be lower in subgroups having a lower 5-to-90-day mortality prevalence (Fig. 3).
In the two instances for which precision was statistically significantly lower than the overall group, 5-to-90-day mortality prevalence was among the lowest of any subgroup. Since most precision comparisons were underpowered at the 37.5% cutoff, the 0.64 prevalence-to-precision correlation at that cutoff may be underestimated. This analysis shows that differences among subgroups in predicted risk at a particular cutoff are associated with actual differences in risk.
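The prevalence-to-precision relationship discussed above is a Pearson correlation computed over subgroups, with one (prevalence, precision) pair per subgroup. A minimal sketch of that calculation follows; the function name is hypothetical and no values from the study's tables are reproduced here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Here `xs` would hold each subgroup's 5-to-90-day mortality prevalence and `ys` the corresponding precision at a chosen cutoff.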
For subgroups having significant differences in model performance, the cutoffs for those subgroups could be adjusted to equalize performance. However, changing the cutoff typically improves either precision or recall while worsening the other, so one must select a metric to equalize. In our scenario, selecting cutoffs that equalize precision across subgroups would increase the likelihood that all who receive an alert will have a similar risk of near-term death. However, this means that subgroups with a lower prevalence of 5-to-90-day mortality will be less likely to receive an alert and therefore may be less likely to have an ACP. Instead, cutoffs could be selected to equalize recall across subgroups so that an equal fraction of patients who actually suffer a near-term death receive an alert. However, subgroups with a lower prevalence of 5-to-90-day mortality will be more likely to get an alert when they have a lower risk of death. This may lead to alert fatigue and/or mistrust of the predictor [18], and the magnitude of variation in cutoffs among demographic groups that would lead to predictor distrust in this context is not known. In addition, if clinician capacity for ACPs is limited, patients with lower 5-to-90-day mortality risk may get ACPs at the expense of those with greater urgency and need. Cutoffs could be selected to equalize the frequency of positive alerts across subgroups to equalize the predictor's impact on ACPs across subgroups. As with equalizing on recall, however, this outcome may be lost if alerts on lower risk patients lead to alert fatigue and/or mistrust of the predictor. Also, those in greatest need of an ACP may be less likely to get one if clinician bandwidth to have ACPs is constrained. Other approaches may be taken, but all involve tradeoffs. Existing literature suggests that equalizing the performance of a Boolean predictor among different subgroups is use-case dependent [17,18]. For our use case, we suspect that equalizing precision across subgroups may best serve the clinical need by reducing the risk of alert fatigue and mistrust and prioritizing alerts to those with the greatest predicted need. However, since only a few statistically significant performance differences were seen among subgroups, and the statistical significance of those differences was inconsistent across the studied time periods, it may be wisest not to draw firm conclusions about whether or how to adjust cutoffs until the pandemic further stabilizes and the study can be repeated.
Our use of ADI to assess predictive model equity across levels of economic disadvantage along with the assessment of equity across different levels of rurality may be unique. A PubMed search on "ADI prediction equity" or "area deprivation index prediction equity" [34,35] returned only one relevant result looking at the equity of a prediction model for various levels of ADI, and that study did not assess equity across levels of rurality [36].
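One way to picture equalizing precision across subgroups is to choose, for each subgroup separately, the lowest candidate cutoff whose precision meets a shared target. The sketch below illustrates that idea only; it is not a procedure the study performed, and the function name, the candidate cutoff list, and the tie rule (certainty >= cutoff counts as an alert) are all assumptions.

```python
from typing import Optional, Sequence


def lowest_cutoff_for_precision(
    certainties: Sequence[float],
    outcomes: Sequence[bool],
    target_precision: float,
    candidate_cutoffs: Sequence[float],
) -> Optional[float]:
    """Return the lowest candidate cutoff whose precision meets the target."""
    for cutoff in sorted(candidate_cutoffs):
        # Outcomes of the patients who would receive an alert at this cutoff.
        alerted = [died for c, died in zip(certainties, outcomes) if c >= cutoff]
        if alerted and sum(alerted) / len(alerted) >= target_precision:
            return cutoff
    return None  # no candidate cutoff reaches the target in this subgroup
```

Running this per subgroup makes the tradeoff in the text visible: a subgroup with lower mortality prevalence will usually need a higher cutoff to reach the same precision, so fewer of its patients receive alerts.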

Limitations
Although assessments were designed to avoid use of data that will not be available at the time of prediction, complete avoidance cannot be guaranteed in this retrospective study. Other confounders related to the study's retrospective nature may have affected results. This work was performed at one multi-hospital health system serving a predominantly White and Midwestern population, potentially limiting generalizability. Some demographic data may be inaccurate, affecting results. The ADI may not accurately represent the patient's socioeconomic status, and our use of an average ADI for five-digit zip codes may not represent the patient's census block ADI. Some demographies were aggregated to avoid small group sizes, and the predictor may perform differently across the aggregated demographies. Use of current code status as a proxy for status on admission may have affected results, but we believe patients are more likely to change from null or full code status to something else than the reverse. Our study was limited to model performance analysis, not its impact on clinical care. These limitations represent fruitful areas of future research.

Conclusion
The predictor resisted concept drift and performance degradation from before to during the pandemic. Using precision to assess performance equity, precision at the 12.5% cutoff was equitable across most demographies, regardless of the pandemic, although some precision comparisons (especially at the 37.5% cutoff) were underpowered and warrant further study.
For time-constrained clinicians unable to have ACP discussions with every inpatient, this model may consistently and equitably help prioritize patients likely to benefit in the near-term from these crucial conversations.

Fig. 2
Area under the precision-recall curve (AUC-PR) for subgroups in the pre-COVID and during-COVID periods

Table 1
Available and selected features included in the model

Table 2
Predictor validation pre- and during-COVID at selected cutoffs

Table 3
Predictor performance for subgroups pre-COVID. Bold values represent p-values < 0.05 and are therefore statistically significantly different from the base comparison of "All". ADI: Area Deprivation Index; higher ADI values suggest greater levels of disadvantage. *Asterisked items had power < 80%. A Bonferroni correction for multiple comparisons was applied to the p-values. Prevalence is the fraction of patients in that group that died 5-90 days after the day of admission.

Table 4
Predictor performance for subgroups during-COVID. Bold values represent p-values < 0.05 and are therefore statistically significantly different from the base comparison of "All". ADI: Area Deprivation Index; higher ADI values suggest greater levels of disadvantage. *Asterisked items had power < 80%. A Bonferroni correction for multiple comparisons was applied to the p-values. Prevalence is the fraction of patients in that group that died 5-90 days after the day of admission.