INTRODUCTION

Unexpected clinical deterioration in hospitalized patients is a significant patient safety concern which can result in cardiac arrest, transfer to intensive care units (ICU), and preventable death.1 Patients who experience unanticipated deterioration often display signs of clinical instability in the preceding hours.2,3,4,5,6 Many organizations have adopted processes, such as Rapid Response Systems, to identify and intervene on patients likely to experience deterioration.7, 8 Evidence for the efficacy of Rapid Response Systems is promising, with prior research demonstrating significant reductions in in-hospital mortality and cardiac arrest.8,9,10,11,12,13

Detecting patients at risk of clinical deterioration commonly relies on data contained within the electronic health record (EHR) to monitor physiological features. Early warning scores (EWS) use structured data, including patient demographics, vital signs, and nursing assessments, to stratify patients by risk of deterioration.5, 14,15,16 EWS detect the sickest patients at risk of poor clinical outcomes but often suffer from low discriminatory power and poor sensitivity for detecting imminent events within the next 12 hours.17,18,19,20

Experienced clinicians accurately recognize clinical deterioration through intuition and knowledge about the patient before objective evidence is available.21,22,23 Incorporating features of clinical concern with structured EHR data can improve EWS performance.21, 24 Many rapid response systems incorporate intuition as a calling criterion for activation, which gives clinicians the opportunity to request assistance at an early stage. However, barriers to calling for rapid response support, including lack of confidence and feelings of uncertainty, often lead to delayed rapid response calls or escalation in care.25

Few EWS incorporate features that allow experts to include subjective assessments. Measures of worry are not directly captured in the EHR, and mentions of concern are often documented only in free-text notes or comments. Many healthcare institutions use pager messages, or brief unidirectional text-based messages from a healthcare worker to an individual’s pager, to indicate clinical needs and concerns. Because healthcare workers communicate clinical concerns through these electronic messages,26 the content and frequency of pager messages represent a rich source of detail about a patient’s condition. We examined the efficacy of machine learning applied to pager messages sent by nurses to physicians to detect imminent clinical deterioration events in hospitalized patients.

METHODS

We conducted this study at Vanderbilt University Medical Center (VUMC). VUMC is a large academic medical center located in middle Tennessee and provides referral care across the southeastern United States. VUMC includes an 864-bed adult hospital and sees nearly 2 million annual ambulatory visits. Clinicians at VUMC used an Epic EHR for all clinical functions. At VUMC, clinical pages are a primary mode of communication between healthcare workers. Most commonly, clinicians send pages about patients through the EHR by selecting the integrated care team paging activity. Clinicians can also send pages through a personnel and schedule management mobile phone application that is external to the EHR. Paging through the mobile application is used for administrative tasks, such as coordinating personnel and general bed management. Epic Secure Chat is not currently implemented at VUMC.

This study was approved and granted a waiver of consent by the Vanderbilt University Institutional Review Board. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.

Study Design and Population

In this retrospective study, we predicted clinical deterioration events among inpatients receiving care in hospital wards outside ICUs at VUMC. Our study included all patients who were at least 18 years old at the time of admission and admitted between January 1, 2018, and December 31, 2020. We excluded patients receiving hospice or palliative care since these patients have different treatment plans and care processes. We labeled each encounter for the first of three deterioration events: Rapid Response activation, unplanned transfer to intensive care, and in-hospital cardiac arrest. The process for rapid response team (RRT) activation requires a hospital worker to call the VUMC emergency medical service (EMS) activation team in response to early warning criteria. EMS activation then calls a dedicated rapid response team to assess the patient and intervene as necessary. Unplanned transfers to intensive care included any transfer from a hospital ward to an ICU that did not include an intermediate surgery. In-hospital cardiac arrest was defined as any cardiac arrest on a hospital ward. All deterioration events were validated through retrospective chart review as part of ongoing rapid response quality improvement by an expert clinician (KS) on the VUMC RRT. Hospitalizations with a planned ICU stay or that did not include at least one sent page were excluded from our analysis. We also excluded encounters that were in the top 2% of length of stay and did not experience a deterioration event, as prior research has shown that excessively long hospitalizations are most often caused by non-clinical factors.27

Data Collection and Preprocessing

We collected data on all pages sent during our study. Page data included a unique identifier, page timestamp, patient name, medical record number (MRN), and message text. To ensure pages were sent about hospitalized patients, we matched pages to patient encounters by MRN and timestamp. If an MRN was not available, we mapped the page to an encounter using last name, timestamp, and room. A total of 33% of pages could not be mapped and were excluded from our study. We manually reviewed a random subset of 250 unmatched pages and found that they discussed operating room availability, bed management, and personnel management. None of the pages included details about patient-specific care.

Feature Selection

Model features included word embeddings, or numerical representations of the text from each page. We generated word embeddings using a clinical Bidirectional Encoder Representations from Transformers (BERT) model, which represents the content of clinical text.28, 29 Clinical BERT has been shown to develop meaningful representations of clinical text, including messages between healthcare team members.26, 29 To extract word embeddings from Clinical BERT, we first fine-tuned the model using the corpus of pages to ensure that the model accurately captures corpus-specific features. We used the fine-tuned model to process each page and extract word embeddings from the last layer. We implemented Clinical BERT using the HuggingFace Transformers library.30
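As a rough illustration of this step, the sketch below extracts a fixed-length page embedding from a Clinical BERT checkpoint with the HuggingFace Transformers library. The checkpoint name and the mean-pooling strategy are our assumptions for illustration; the study fine-tuned Clinical BERT on its own page corpus and took embeddings from the last layer.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions.

    last_hidden_state: (seq_len, dim) array of token embeddings.
    attention_mask:    (seq_len,) array of 0/1 padding flags.
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    summed = (np.asarray(last_hidden_state) * mask).sum(axis=0)
    return summed / mask.sum()

def embed_page(text, model_name="emilyalsentzer/Bio_ClinicalBERT"):
    """Embed one page with a Clinical BERT checkpoint (assumed name).

    In the study, a model fine-tuned on the page corpus would be loaded here.
    """
    # Imports kept local so the pure-numpy helper above works without
    # the transformers/torch stack installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0].numpy()  # (seq_len, dim)
    return mean_pool(hidden, enc["attention_mask"][0].numpy())
```

Each page thus maps to one dense vector, which downstream models consume as a single timestep.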

We combined page-level features into an encounter-level feature set for input into our models (Fig. 1). To create encounter-level features, we sequentially combined page-level features in ascending chronological order. Encounter-level features contained only clinical pages sent during an encounter. We normalized timestamps into a single numeric value representing the number of elapsed hours between hospital admission and the time the page was sent. We created a washout period by truncating encounter-level features at a 3-, 6-, or 12-hour time horizon before the first deterioration event. All pages sent during an encounter, before the respective time horizon, were included in the feature set. We retained all pages for hospitalizations that did not contain a deterioration event.
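The assembly rule above (chronological ordering, hours-since-admission normalization, and washout truncation) can be sketched in a few lines. Field names and data layout are illustrative, not the authors' code.

```python
from datetime import timedelta

def build_encounter_sequence(pages, admit_time, event_time=None, horizon_hours=6):
    """Assemble an encounter-level feature sequence from page-level features.

    pages:      list of (timestamp, embedding) tuples for one encounter.
    admit_time: hospital admission datetime.
    event_time: datetime of the first deterioration event, or None.
    Returns a chronologically ordered list of (hours_since_admission, embedding).
    """
    cutoff = None
    if event_time is not None:
        # Washout: drop pages inside the horizon before the first event.
        cutoff = event_time - timedelta(hours=horizon_hours)
    sequence = []
    for ts, emb in sorted(pages, key=lambda p: p[0]):
        if cutoff is not None and ts >= cutoff:
            break
        hours = (ts - admit_time).total_seconds() / 3600.0
        sequence.append((hours, emb))
    return sequence
```

Encounters without a deterioration event pass `event_time=None`, so every page is kept, matching the text above.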

Figure 1
figure 1

Page embedding and LSTM pipeline.

Model Development and Evaluation

We trained two-layer long short-term memory (LSTM) machine learning models to predict clinical deterioration. Many machine learning models consider data at a single point in time. In contrast, LSTM models learn features and patterns from sequential data and are commonly used to make predictions over timeseries data, such as clinical pages during a hospital encounter.31 The feature embedding and LSTM model pipeline is presented in Fig. 1. We split the encounter-level data by year into training (70%) and testing (30%) datasets. We randomly held out 20% of the training dataset for hyperparameter tuning and for early stopping during final model training to avoid overfitting. We tuned hyperparameters using random search with preset hyperparameter ranges (Appendix A). The testing dataset was held out from parameter tuning and used only to measure final results. We calculated validation loss after each iteration of hyperparameter tuning and model training to enable early stopping when validation loss plateaued or increased for three consecutive iterations. Following hyperparameter optimization, we developed models using the entire training dataset. We implemented our machine learning models and evaluated their performance using the Tensorflow (version 2.6.2)32 and scikit-learn (version 1.1.3)33 packages in Python 3.6.9.
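The early-stopping rule described above (halt when validation loss plateaus or increases for three consecutive iterations) can be expressed as a small helper. This is a pure-Python sketch of the rule, not the authors' training code; the class name and tolerance are illustrative.

```python
class EarlyStopping:
    """Stop training after `patience` iterations without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_iters = 0

    def should_stop(self, val_loss):
        """Call once per iteration; returns True when training should halt."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # loss improved: reset the counter
            self.bad_iters = 0
        else:
            self.bad_iters += 1    # plateau or increase
        return self.bad_iters >= self.patience
```

In a TensorFlow pipeline like the one described, the equivalent behavior is typically obtained with the built-in `tf.keras.callbacks.EarlyStopping(patience=3)` callback.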

Statistical Analysis

We compared cohorts of patients who experienced deterioration events versus those who did not experience deterioration events using Welch t-tests for numerical features and Chi-square tests for categorical features. We considered a p value less than 0.05 to be statistically significant. Statistical analyses were performed using R version 4.1.2.
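The cohort comparisons above were performed in R; an equivalent sketch in Python using scipy, on made-up illustrative data rather than the study cohort, is:

```python
from scipy import stats

def compare_numeric(group_a, group_b):
    """Welch t-test (unequal variances) for a numeric feature; returns p."""
    return stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

def compare_categorical(contingency_table):
    """Chi-square test of independence on a contingency table; returns p."""
    return stats.chi2_contingency(contingency_table)[1]
```

A p-value below 0.05 from either test would be reported as statistically significant under the threshold stated above.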

We measured classification model performance to predict clinical deterioration 3, 6, and 12 hours before the deterioration event using a held-out set of 30% of hospital encounters. We also compared classification performance stratified by type of deterioration event as a secondary analysis. Measured outcomes included area under the receiver-operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, specificity, F-score, and positive predictive value (PPV). We set thresholds for our models to achieve high sensitivity while maintaining acceptable PPV. Existing literature on EWS has targeted an acceptable PPV between 10% and 20%.34,35,36,37,38 As a sensitivity analysis, we reported classification metrics for every one-tenth change in predicted probability between 0 and 1.
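For reference, the threshold-dependent metrics named above follow directly from confusion-matrix counts. This sketch shows the formulas only; it is not the authors' evaluation code (AUROC and AUPRC are threshold-free and would come from scikit-learn's `roc_auc_score` and `average_precision_score`).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: fraction of events detected
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f_score": f_score}
```

Sweeping the prediction threshold trades sensitivity against PPV, which is what the one-tenth-increment sensitivity analysis reports.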

We compared our model’s performance against recommended prediction thresholds from three common early warning scores: Epic Deterioration Index (EDI) [score ≥ 60],39 Modified Early Warning Score (MEWS) [score ≥ 5],40 and National Early Warning Score (NEWS) [score ≥ 7].41 Each EWS calculates a new score at regular intervals. The EDI automatically calculates a score every 15 minutes; we re-calculated MEWS and NEWS scores each time a parameter was newly documented. We calculated encounter-level performance for each EWS. For encounters in which a deterioration event occurred, we obtained the highest EWS score in the time window from the start of hospitalization until k hours before the deterioration event. If a deterioration event did not occur, we obtained the maximum EWS score from the entire hospitalization. We indicated a predicted deterioration event when the maximum EWS score surpassed the prediction threshold. We also evaluated performance of EWS scores calculated in the k hours immediately before a deterioration event. We included this analysis across all deterioration events (Appendix B) and stratified by type of event (Appendix D).
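The encounter-level EWS comparison above reduces to a simple rule: take the maximum score up to k hours before the event (or over the whole stay when no event occurred) and flag a predicted deterioration when it crosses the recommended threshold. A sketch, with an illustrative data layout:

```python
def ews_predicts_event(scores, threshold, event_hour=None, k=6):
    """Encounter-level EWS prediction as described in the text.

    scores:     list of (hours_since_admission, score) pairs.
    threshold:  recommended cut-off for this EWS (e.g., 5 for MEWS).
    event_hour: hours from admission to the deterioration event, or None.
    """
    if event_hour is not None:
        # Only scores from admission until k hours before the event count.
        window = [s for h, s in scores if h <= event_hour - k]
    else:
        # No event: use the maximum score over the entire hospitalization.
        window = [s for _, s in scores]
    return bool(window) and max(window) >= threshold
```

Model predictions are compared against this encounter-level flag when computing the classification metrics reported in the Results.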

RESULTS

There were 87,783 patients (mean [SD] age at hospital encounter, 54.0 [18.8] years; 45,835 women [52.2%]) who experienced 136,778 hospitalizations and were the subject of 1,869,928 pages (mean [SD] 12.8 [13.9] pages per encounter). Deterioration events were recorded in 6214 encounters (4.5%). Rapid Response activations were the most common deterioration event (4121 [66.3%]), followed by ICU transfers (1753 [28.2%]) and in-hospital cardiac arrest (340 [5.5%]). Deterioration events occurred a mean [SD] 3.9 [3.9] days into the hospitalization. Table 1 details encounter-level population statistics.

Table 1 Encounter Population Statistics

The model using clinical pages outperformed the early warning scores on the hold-out testing dataset at each time horizon. Our model to predict deterioration in the next 3 hours yielded the best performance with an AUROC of 0.866 (95% CI [0.865–0.867]) and F-Score of 0.295 (95% CI [0.293–0.297]). We compared discrimination of all four predictive models in Fig. 2. We observed that the Modified Early Warning Score (MEWS) yielded the lowest performance of all models with an AUROC of 0.655 (95% CI [0.655–0.655]) 3 hours before a deterioration event and an AUROC of 0.635 (95% CI [0.635–0.635]) 12 hours before an event. Classification metrics for all models are available in Table 2. Classification metrics stratified by deterioration event are available in Appendix C.

Figure 2
figure 2

Model discrimination.

Table 2 Performance of Clinical Page Prediction Model Compared to Commonly Implemented Early Warning Scores

Using the pre-defined prediction threshold, our model accurately identified 61.9% of deterioration events within 3 hours and 46.9% of events within 12 hours. Table 3 presents results from our sensitivity analysis across prediction thresholds. The lowest prediction threshold of 0.1 would accurately identify 74% of patients experiencing a deterioration event within 6 hours with a PPV of 13%. Increasing the PPV to 41% would identify 29% of deterioration events. The best performing early warning score, the Epic Deterioration Index, accurately identified 51% of events within 6 hours with a PPV of 14%.

Table 3 Comparison of Classification Metrics by Predicted Probability Cutpoint

DISCUSSION

We developed a deep learning algorithm to classify imminent clinical deterioration among hospitalized patients. Using text from the sequence of clinical pages sent during routine care, our retrospective analysis found that our models accurately predicted 62% of deterioration events within 3 hours and 47% of deterioration events within 12 hours with good discrimination (AUROC, 0.87 at 3 hours to 0.82 at 12 hours). These results significantly improved upon the best existing, commonly implemented EWS, which yielded AUROCs of 0.781 at 3 hours and 0.782 at 12 hours before a deterioration event.

Clinical pages offer insight into decision making and intuition around key clinical findings. In reviewing pages predicted to demonstrate a high probability of deterioration, we found that messages indicated a mix of expressions of direct concern (i.e., “Please call for critical findings in arterial study”; “Can you come see pt? Significant full body tremors, idk if it’s from anxiety, Valium given 25 mins ago”) and mentions of specific, potentially concerning, findings (i.e., “Pt SBP >140, IV hydralazine given x2, IV pain meds x2”; “BP 90/63 [MAP 72]”). Few studies have highlighted the importance of clinical intuition in recognizing clinical deterioration. Romero-Brufau and colleagues found that a nurse-recorded indicator of worry significantly improved the prediction of ICU transfer within 24 hours.21 Nursing worry and clinical concern provide important context that combines both subjective and objective impressions of patient condition.22, 24, 42 Douw and colleagues found that descriptions of nurse worry encompass over 170 unique clinical concerns, including impressions that are not easily recorded as objective findings.24 Common EWS measure a median of 12 variables.16 We hypothesize that our model evaluates a wider array of findings and concerns, which contributes to its improved performance. Pages also provide specific insight into immediate concerns without relying on documentation in the EHR, which is often delayed.43,44,45 Nonetheless, it is possible that the combination of clinical pages and structured EHR data may offer improved performance, which we will investigate in future work.

Predicting clinical deterioration must balance adequate time to meaningfully intervene in patient care with a time horizon in which clinically meaningful changes to predictors can be observed. EWS predict clinical deterioration at lengthy time horizons, most commonly exceeding 24 hours.16 However, patients who experience deterioration events begin to show signs of clinical instability in the preceding 8 to 12 hours.46,47,48 In comparing predictions using clinical pages with common EWS at the encounter level, we found that performance of our clinical pages algorithm improved closer to the deterioration event, suggesting that pages continue to indicate worrisome trends in the time leading up to an event. Common EWS stayed relatively consistent across all time horizons. When evaluating EWS performance only during the time horizon (Appendix B), we note substantially poorer performance that gradually increases with longer time horizons. This reflects prior findings that EWS identify the sickest patients rather than individuals likely to imminently deteriorate.17, 20 EWS performance differs by type of event.5, 49 Our analysis stratified by event found that predicting cardiac arrest yielded the highest performance, which echoes prior work.5, 46, 47 Interestingly, EWS yielded better performance than clinical pages when predicting cardiac arrest, both across the entire encounter and immediately before an event. This suggests that these patients maintain high scores throughout the encounter and that the clinical team may already be aware of the patient’s condition. Our clinical pages model demonstrated marked improvement in predicting ICU transfer or Rapid Response activation, suggesting clinical intuition is an important predictor.

Predicting imminent deterioration supports workflows for clinical response and intervention. Hospital quality and safety leaders could incorporate these findings into existing Rapid Response processes by providing a list of high-risk patients to support outreach and rounding. When trends are detected, urgent messages could communicate findings to the charge nurse and clinical team for assessment and intervention. Highlighting concerning trends can help providers prioritize urgent needs. The algorithm could support automatic calls for Rapid Response support when a patient has a high likelihood of deterioration. Enabling data-driven response to increased patient acuity can improve upon common barriers to calling rapid response based on intuition alone, including feelings of uncertainty.50, 51

Using clinical pages to monitor patient acuity integrates key data points without changes to clinical workflow or nursing documentation. Few studies have incorporated non-traditional data sources as artifacts of clinical care.34 Fu and colleagues measured frequency of documentation in the EHR to predict clinical deterioration in intensive care units with modest performance. Extracting clinical impressions or concerns from clinical documentation has shown promise in some clinical scenarios,37, 52 but limitations to timely documentation and the frequency of nursing assessments limit the utility of these approaches.

Our findings have limitations. We performed this research at a single academic medical center which uses an Epic EHR and relies extensively on pages to communicate between nurses, physicians, and other clinicians. Results may not generalize to other organizations. Our patient population also included a disproportionate number of White patients. These demographic characteristics closely reflect the broad demographics of middle Tennessee but nonetheless introduce potential racial bias into our sample. Future work should seek to better understand how implicit biases affect clinical paging behavior. While our findings suggest significant improvement in detection of imminent deterioration, this research was conducted as a retrospective analysis. A prospective randomized controlled trial should validate the impact of our model on clinical care. Our corpus of clinical deterioration events was based on retrospective chart review by a single reviewer. Despite cross-referencing annotated events with data from admission-discharge-transfer feeds, Rapid Response activations, and STAT activations, it is possible that a subset of events may have been incorrectly annotated or some events may have been missed during the annotation process. Finally, it is inconclusive whether our findings highlight new clinical concerns versus existing concerns of which the clinical team is already aware. We will test the extent to which our machine learning approach highlights unrecognized instances of clinical urgency or concern in future work.

CONCLUSION

Our findings suggest that machine learning applied to the content and frequency of clinical pages improves prediction of imminent clinical deterioration. Our models provided improved discrimination at each time interval and outperformed the best performing common early warning scores across all classification metrics. Quantitative clinical measures are integral to patient monitoring but are not a substitute for experience and intuition. Using clinical pages to monitor patient acuity integrates both expert intuition and clinical decision making around key data to improve detection of imminent clinical deterioration without changing clinical workflow or nursing documentation.