Development and Validation of a Machine Learning Algorithm Using Clinical Pages to Predict Imminent Clinical Deterioration

Background: Early detection of clinical deterioration among hospitalized patients is a clinical priority for patient safety and quality of care. Current automated approaches perform poorly at identifying imminent events.
Objective: To develop a machine learning algorithm using pager messages sent between clinical team members to predict imminent clinical deterioration.
Design: A large observational study using long short-term memory machine learning models on the content and frequency of clinical pages.
Participants: We included all hospitalizations between January 1, 2018 and December 31, 2020 at Vanderbilt University Medical Center that included at least one page message to physicians. Exclusion criteria included patients receiving palliative care, hospitalizations with a planned intensive care stay, and hospitalizations in the top 2% longest length of stay.
Main Measures: Model classification performance in identifying in-hospital cardiac arrest, transfer to intensive care, or Rapid Response activation in the next 3-, 6-, and 12-hours. We compared model performance against three common early warning scores: the Modified Early Warning Score, the National Early Warning Score, and the Epic Deterioration Index.
Key Results: There were 87,783 patients (mean [SD] age 54.0 [18.8] years; 45,835 [52.2%] women) who experienced 136,778 hospitalizations; 6,214 hospitalized patients experienced a deterioration event. The machine learning model accurately identified 62% of deterioration events within 3-hours prior to the event and 47% of events within 12-hours. Across each time horizon, the model surpassed the performance of the best early warning score, including area under the receiver operating characteristic curve at 6-hours (0.856 vs. 0.781), sensitivity at 6-hours (0.590 vs. 0.505), specificity at 6-hours (0.900 vs. 0.878), and F-score at 6-hours (0.291 vs. 0.220).
Conclusions: Machine learning applied to the content and frequency of clinical pages improves prediction of imminent deterioration. Using clinical pages to monitor patient acuity supports improved detection of imminent deterioration without requiring changes to clinical workflow or nursing documentation.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11606-023-08349-3.


INTRODUCTION
Many organizations have adopted processes, such as Rapid Response Systems, to identify and intervene on patients likely to experience deterioration.7,8 Detecting patients at risk of clinical deterioration commonly relies on data contained within the electronic health record (EHR) to monitor physiological features.10-13,22,23 Incorporating features of clinical concern with structured EHR data can improve early warning score (EWS) performance.21,24 Many rapid response systems incorporate intuition as a calling criterion for activation, which provides clinicians an opportunity to request assistance at an early stage. However, barriers to calling for rapid response support, including lack of confidence and feelings of uncertainty, often lead to delayed rapid response calls or delayed escalation in care.25 Few EWS incorporate features that allow experts to include subjective assessments. Measures of worry are not directly captured in the EHR, and mentions of concern are often documented only in free-text notes or comments. Many healthcare institutions use pager messages, or brief unidirectional text-based messages from a healthcare worker to an individual's pager, as an approach to indicate clinical needs and concerns, and healthcare workers communicate clinical concerns through these electronic messages.26 The content and frequency of pager messages therefore represent a rich source of detail about a patient's condition. We examined the efficacy of machine learning on pager messages sent by nurses to physicians to detect imminent clinical deterioration events in hospitalized patients.

METHODS
We conducted this study at Vanderbilt University Medical Center (VUMC). VUMC is a large academic medical center located in middle Tennessee and provides referral care across the southeastern United States. VUMC includes an 864-bed adult hospital and sees nearly 2 million annual ambulatory visits. Clinicians at VUMC used an Epic EHR for all clinical functions. At VUMC, clinical pages are a primary mode of communication between healthcare workers. Most commonly, clinicians send pages about patients through the EHR by selecting the integrated care team paging activity. Clinicians can also send pages through a personnel and schedule management mobile phone application that is external to the EHR. Paging through the mobile application is used for administrative tasks, including coordinating personnel and general bed management. Epic Secure Chat is not currently implemented at VUMC. This study was approved and granted a waiver of consent by the Vanderbilt University Institutional Review Board. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.

Study Design and Population
In this retrospective study, we predicted clinical deterioration events among inpatients receiving care in hospital wards outside ICUs at VUMC. Our study included all patients who were at least 18 years old at the time of admission and admitted between January 1, 2018 and December 31, 2020. We excluded patients receiving hospice or palliative care since these patients have different treatment plans and care processes. We labeled each encounter for the first of three deterioration events: Rapid Response activation, unplanned transfer to intensive care, and in-hospital cardiac arrest. The process for rapid response team (RRT) activation requires a hospital worker to call the VUMC emergency medical service (EMS) activation team in response to early warning criteria. EMS activation then calls a dedicated rapid response team to assess the patient and intervene as necessary. Unplanned transfers to intensive care included any transfer from a hospital ward to an ICU that did not include an intermediate surgery. In-hospital cardiac arrest was defined as any cardiac arrest on a hospital ward. All deterioration events were validated through retrospective chart review as part of ongoing rapid response quality improvement by an expert clinician (KS) on the VUMC RRT. Hospitalizations with a planned ICU stay or that did not include at least one sent page were excluded from our analysis. We also excluded encounters that were in the top 2% longest length of stay and did not experience a deterioration event, as prior research has shown that excessively long hospitalizations are most often caused by non-clinical factors.27

Data Collection and Preprocessing
We collected data on all pages sent during our study. Page data included a unique identifier, page timestamp, patient name, medical record number (MRN), and message text. To ensure pages were sent about hospitalized patients, we matched pages to patient encounters by MRN and timestamp. If an MRN was not available, we mapped the page to an encounter using last name, timestamp, and room. Thirty-three percent of pages could not be mapped and were excluded from our study. We manually reviewed a random subset of 250 unmatched pages and found that these discussed operating room availability, bed management, and personnel management. None of the pages included details about patient-specific care.

Feature Selection
Model features included word embeddings, or numerical representations of the text from each page. We generated word embeddings using a clinical Bidirectional Encoder Representations from Transformers (BERT) model, which represents the content of clinical text.28,29 Clinical BERT has been shown to develop meaningful representations of clinical text, including messages between healthcare team members.26,29 To extract word embeddings from Clinical BERT, we first fine-tuned the model using the corpus of pages to ensure that the model accurately captures corpus-specific features. We used the fine-tuned model to process each page and extract word embeddings from the last layer. We implemented Clinical BERT using the HuggingFace Transformers library.30 We combined page-level features into an encounter-level feature set for input into our models (Fig. 1). To create encounter-level features, we sequentially combined page-level features in ascending chronological order. Encounter-level features contained only clinical pages sent during an encounter. We normalized timestamps into a single numeric value representing the number of elapsed hours between hospital admission and the time the page was sent. We created a washout period by truncating encounter-level features at a 3-, 6-, or 12-hour time horizon before the first deterioration event. All pages sent during an encounter, before the respective time horizon, were included in the feature set. We maintained all pages for hospitalizations that did not contain a deterioration event.
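The encounter-level feature assembly described above (page embeddings ordered chronologically, timestamps normalized to elapsed hours, and the sequence truncated at the washout horizon before the first event) can be sketched as follows. The function name and data layout are illustrative assumptions, not the study's actual code:

```python
import numpy as np

def build_encounter_features(pages, event_hour, horizon_hours):
    """Assemble an encounter-level feature matrix from page-level features.

    pages: list of (elapsed_hours_since_admission, embedding_vector) tuples.
    event_hour: elapsed hours of the first deterioration event, or None.
    horizon_hours: washout horizon (e.g., 3, 6, or 12).
    """
    # Sort pages in ascending chronological order.
    pages = sorted(pages, key=lambda p: p[0])
    if event_hour is not None:
        # Truncate the sequence at the washout horizon before the first event.
        cutoff = event_hour - horizon_hours
        pages = [p for p in pages if p[0] <= cutoff]
    # Each row: normalized timestamp followed by the page embedding.
    return np.array([[t, *emb] for t, emb in pages])
```

Encounters without a deterioration event pass `event_hour=None`, so all of their pages are retained, matching the paper's handling of non-event hospitalizations.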

Model Development and Evaluation
We trained two-layer long short-term memory (LSTM) machine learning models to predict clinical deterioration. Many machine learning models consider data at a single point in time. In contrast, LSTM models learn features and patterns from sequential data and are commonly used to make predictions over time-series data, such as the clinical pages sent during a hospital encounter.31 The feature embedding and LSTM model pipeline are presented in Fig. 1. We split the encounter-level data by year into training (70%) and testing (30%) datasets. We randomly split 20% of the training dataset to use during hyperparameter tuning and to enable early stopping during final model training to avoid overfitting. We tuned hyperparameters using random search with preset hyperparameter ranges (Appendix A). The testing dataset was held out from parameter tuning and used only to measure final results. We calculated validation loss after each iteration of hyperparameter tuning and model training to enable early stopping when validation loss plateaued or increased for three subsequent iterations. Following hyperparameter optimization, we developed models using the entire training dataset. We implemented our machine learning models and evaluated their performance using the TensorFlow (version 2.6.2)32 and scikit-learn (version 1.1.3)33 packages in Python 3.6.9.
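The tuning procedure described above, random search over preset hyperparameter ranges with early stopping once validation loss fails to improve for three consecutive iterations, can be sketched as follows. Here `evaluate` is a hypothetical stand-in for training an LSTM with one hyperparameter draw and returning its validation loss; this is an illustration of the procedure, not the study's code:

```python
import random

def early_stopping_search(evaluate, search_space, max_iters=50, patience=3, seed=0):
    """Random search with early stopping on validation loss.

    evaluate: callable taking a hyperparameter dict, returning validation loss.
    search_space: dict mapping hyperparameter name -> list of candidate values.
    """
    rng = random.Random(seed)
    best_loss, best_params, stale = float("inf"), None, 0
    for _ in range(max_iters):
        # Draw one candidate configuration from the preset ranges.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        loss = evaluate(params)
        if loss < best_loss:
            best_loss, best_params, stale = loss, params, 0
        else:
            stale += 1
            # Stop once loss has plateaued or increased for three iterations.
            if stale >= patience:
                break
    return best_params, best_loss
```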

Statistical Analysis
We compared cohorts of patients who experienced deterioration events versus those who did not using Welch t-tests for numerical features and Chi-square tests for categorical features. We considered a p value less than 0.05 to be statistically significant. Statistical analyses were performed using R version 4.1.2.
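The study performed these comparisons in R; an equivalent sketch in Python using SciPy is shown below. The function and the alpha default mirror the analysis described above, but the implementation is illustrative, not the study's code:

```python
from scipy.stats import ttest_ind, chi2_contingency

def compare_cohorts(numeric_event, numeric_no_event, contingency_table, alpha=0.05):
    """Return (numeric significant?, categorical significant?) at level alpha."""
    # Welch t-test: equal_var=False drops the equal-variance assumption.
    _, p_num = ttest_ind(numeric_event, numeric_no_event, equal_var=False)
    # Chi-square test of independence on a contingency table of counts.
    _, p_cat, _, _ = chi2_contingency(contingency_table)
    return p_num < alpha, p_cat < alpha
```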
We measured classification model performance in predicting clinical deterioration 3-, 6-, and 12-hours before the deterioration event using a held-out set of 30% of hospital encounters. We also compared classification performance stratified by type of deterioration event as a secondary analysis. Measured outcomes included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, specificity, F-score, and positive predictive value (PPV). We set thresholds for our models to achieve high sensitivity while maintaining acceptable PPV.35-38 As a sensitivity analysis, we reported classification metrics for every one-tenth change in predicted probability between 0 and 1.
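The threshold-dependent metrics listed above derive from the confusion matrix at a given decision threshold. A minimal sketch, assuming binary labels and binary predictions (pure Python for clarity; not the study's code):

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, and F-score from binary lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0           # precision
    f_score = (2 * ppv * sensitivity / (ppv + sensitivity)
               if ppv + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f_score": f_score}
```

AUROC and AUPRC are threshold-free and would instead be computed from predicted probabilities, e.g. with scikit-learn's `roc_auc_score` and `average_precision_score`.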
We compared our model's performance against recommended prediction thresholds from three common early warning scores: Epic Deterioration Index (EDI) [score ≥ 60],39 Modified Early Warning Score (MEWS) [score ≥ 5],40 and National Early Warning Score (NEWS) [score ≥ 7].41 Each EWS calculates a new score at regular intervals. The EDI automatically calculates a score every 15 minutes; we re-calculated MEWS and NEWS scores each time a parameter was newly documented. We calculated encounter-level performance for each EWS. For encounters in which a deterioration event occurred, we obtained the highest EWS score in the time window from the start of hospitalization until k hours before a deterioration event. If a deterioration event did not occur, we obtained the maximum EWS score from the entire hospitalization. We indicated a predicted deterioration event when the maximum EWS score surpassed the prediction threshold. We also evaluated performance of EWS scores calculated in the k hours immediately before a deterioration event. We include this analysis across all deterioration events (Appendix B) and stratified by type of event (Appendix D).

RESULTS

The model using clinical pages outperformed the early warning scores on the hold-out testing dataset at each time horizon. Our model to predict deterioration in the next 3-hours yielded the best performance with an AUROC of 0.866 (95% CI [0.865-0.867]) and F-score of 0.295 (95% CI [0.293-0.297]). We compare discrimination of all four predictive models in Fig. 2. We observed that the Modified Early Warning Score (MEWS) yielded the lowest performance of all models, with an AUROC of 0.655 (95% CI [0.655-0.655]) at 3-hours before a deterioration event and an AUROC of 0.635 (95% CI [0.635-0.635]) 12-hours before an event. Classification metrics for all models are available in Table 2. Classification metrics stratified by deterioration event are available in Appendix C.

Using the pre-defined prediction threshold, our model accurately identified 61.9% of deterioration events within 3-hours and 46.9% of events within 12-hours. Table 3 presents results from our sensitivity analysis across prediction thresholds. At the 6-hour horizon, the lowest prediction threshold of 0.1 accurately identified 74% of patients experiencing a deterioration event, with a PPV of 13%. Increasing the PPV to 41% would identify 29% of deterioration events. The best performing early warning score, the Epic Deterioration Index, accurately identified 51% of events within 6-hours with a PPV of 14%.
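The threshold sweep underlying Table 3, sensitivity and PPV recomputed at every one-tenth change in predicted probability, can be sketched as follows (an illustration of the procedure, not the study's code):

```python
def threshold_sweep(probs, labels, step=0.1):
    """Sweep the decision threshold and report (threshold, sensitivity, PPV)."""
    rows = []
    thresholds = [round(i * step, 1) for i in range(int(1 / step) + 1)]
    for thr in thresholds:
        preds = [1 if p >= thr else 0 for p in probs]
        tp = sum(p and t for p, t in zip(preds, labels))
        fn = sum((not p) and t for p, t in zip(preds, labels))
        fp = sum(p and (not t) for p, t in zip(preds, labels))
        sens = tp / (tp + fn) if tp + fn else 0.0
        ppv = tp / (tp + fp) if tp + fp else 0.0
        rows.append((thr, sens, ppv))
    return rows
```

Lowering the threshold raises sensitivity at the cost of PPV, which is the trade-off the sensitivity analysis above quantifies.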

DISCUSSION
We developed a deep learning algorithm to classify imminent clinical deterioration among hospitalized patients. Using text from the sequence of clinical pages sent during routine care, our retrospective analysis found that our models accurately predicted 62% of deterioration events within 3-hours and 47% of deterioration events within 12-hours, with good discrimination (AUROC, 0.87 at 3-hours to 0.82 at 12-hours). These results significantly improved upon the best existing, commonly implemented EWS, which yielded AUROCs of 0.781 at 3-hours and 0.782 at 12-hours before a deterioration event.
Clinical pages offer insight into decision making and intuition around key clinical findings. In reviewing pages predicted to demonstrate a high probability of deterioration, we found that messages included a mix of expressions of direct concern (e.g., "Please call for critical findings in arterial study"; "Can you come see pt? Significant full body tremors, idk if it's from anxiety, Valium given 25 mins ago") and mentions of specific, potentially concerning, findings (e.g., "Pt SBP >140, IV hydralazine given x2, IV pain meds x2"; "BP 90/63 [MAP 72]"). Few studies have highlighted the importance of clinical intuition in recognizing clinical deterioration. Romero-Brufau et al. found that a nurse-recorded indicator of worry significantly improved the prediction of ICU transfer within 24-hours.21 Nursing worry and clinical concern provide important context combining subjective and objective impressions of patient condition that are not easily recorded as objective findings.22,24,42 Common EWS measure a median of 12 variables.16 We hypothesize that our model evaluates a wider array of findings and concerns, which contributes to its improved performance.
It is possible that the combination of clinical pages and structured EHR data may offer improved performance,44,45 which we will investigate in future work.
Predicting clinical deterioration must balance adequate time to meaningfully intervene in patient care with a time horizon in which clinically meaningful changes to predictors can be observed. EWS predict clinical deterioration at lengthy time horizons, most commonly exceeding 24-hours.16,47,48 In comparing predictions using clinical pages with common EWS at the encounter level, we found that performance of our clinical pages algorithm improved closer to the deterioration event, suggesting that pages continue to indicate worrisome trends in the time leading to an event. Common EWS stayed relatively consistent across all time horizons. When evaluating EWS performance only during the time horizon (Appendix B), we note substantially poorer performance that gradually increases with longer time horizons. This reflects prior findings that EWS identify the sickest patients rather than individuals likely to imminently deteriorate.17,20 EWS performance differs by type of event.5,49 Our analysis stratified by event found that predicting cardiac arrest yielded the highest performance, which echoes prior work.5,46,47 Interestingly, EWS yielded better performance than clinical pages when predicting cardiac arrest, both across the entire encounter and immediately before an event. This suggests that these patients maintain high scores throughout the encounter and that the clinical team may already be aware of the patient's condition. Our clinical pages model demonstrated marked improvement in predicting ICU transfer or Rapid Responses, suggesting clinical intuition is an important predictor.
Predicting imminent deterioration supports workflows for clinical response and intervention. Hospital quality and safety leaders could incorporate these findings into existing Rapid Response processes by providing a list of high-risk patients to support outreach and rounding. When trends are detected, urgent messages could communicate findings to the charge nurse and clinical team for assessment and intervention. Highlighting concerning trends can help providers prioritize urgent needs. The algorithm could support automatic calls for Rapid Response support when a patient has a high likelihood of deterioration. Enabling data-driven response to increased patient acuity can improve upon common barriers to calling rapid response based on intuition alone, including feelings of uncertainty.50,51 Using clinical pages to monitor patient acuity integrates key data points without changes to clinical workflow or nursing documentation. Few studies have incorporated nontraditional data sources as artifacts of clinical care.34 Fu and colleagues measured frequency of documentation in the EHR to predict clinical deterioration in intensive care units with modest performance. Extracting clinical impressions or concerns from clinical documentation has shown promise in some clinical scenarios,37,52 but limitations to timely documentation and frequency of nursing assessments limit the utility of these approaches.
Our findings have limitations. We performed this research at a single academic medical center which uses an Epic EHR and relies extensively on pages for communication between nurses, physicians, and other clinicians. Results may not generalize to other organizations. Our patient population also included a disproportionate number of White patients. These demographic characteristics closely reflect the broad demographics of middle Tennessee but nonetheless introduce potential racial bias in our sample. Future work should seek to better understand how implicit biases affect clinical paging behavior. While our findings suggest significant improvement in detection of imminent deterioration, this research was conducted as a retrospective analysis. Additional study as a prospective randomized controlled trial should validate the impact of our model on clinical care. Our corpus of clinical deterioration events was based on retrospective chart review by a single reviewer. Despite cross-referencing annotated events with data from admission-discharge-transfer feeds, Rapid Response activations, and STAT activations, it is possible that a subset of events may have been incorrectly annotated or some events may have been missed during the annotation process. Finally, it is inconclusive whether our findings highlight new clinical concerns versus existing concerns of which the clinical team is already aware. We will test the extent to which our machine learning approach highlights unrecognized instances of clinical urgency or concern in future work.

CONCLUSION
Our findings suggest that machine learning applied to the content and frequency of clinical pages improves prediction of imminent clinical deterioration. Our models provided improved discrimination at each time interval and outperformed the best performing common early warning scores across all classification metrics. Quantitative clinical measures are integral to patient monitoring but are not a substitute for experience and intuition. Using clinical pages to monitor patient acuity integrates both expert intuition and clinical decision making around key data to improve detection of imminent clinical deterioration without changing clinical workflow or nursing documentation.

Figure 1 Page embedding and LSTM pipeline.

Table 1 Encounter Population Statistics
1 P-value of difference in encounters with deterioration event versus encounters without deterioration event
2 Statistics calculated for the cohort experiencing a deterioration event are measured to the time of first event per encounter