Introduction

Hypotension is known to be the most consistent manifestation of decompensated shock leading to major organ failure and death [1]. Hypotension, along with other chronic risk factors, is associated with an increased chance of acute kidney injury, myocardial ischemia, and mortality [2, 3]. However, the underlying signatures from hemodynamic monitoring variables that portend impending hypotension are not clearly identified [4]. Identifying that a patient is on a trajectory to a hypotensive episode with sufficient lead time could lead to effective mitigation of hypotension and possibly improved outcomes. Moreover, current treatment protocols for hypotension may themselves be associated with unwanted consequences such as excessive resuscitation and worsening of acute lung injury [5, 6].

Several early warning scores have been introduced to identify patients at risk for decompensation and trigger escalation of care [7,8,9]. However, most are manually calculated and often require additional data to be entered beyond what is readily available from monitors or electronic health records, limiting their utility [10]. More importantly, current metrics are unable to provide reliable, continuous feedback to clinicians who need to make time-sensitive decisions for rapidly fluctuating conditions. Even recent publications on data-driven prediction models lack clinically applicable implementation strategies [11,12,13]. Therefore, a real-time, continuous, translationally relevant forecasting system, which goes beyond a simple prediction model to include an additional enrichment layer, would favor successful implementation at the bedside by enhancing alerting reliability and reducing false alarms.

With parsimonious use of multi-granular features and application of machine learning (ML) algorithms, we previously demonstrated the value of an early warning system to predict cardiorespiratory insufficiency (CRI) with high accuracy in step-down units [14] as well as tachycardia prediction in the ICU [15]. We have also demonstrated that the risk of CRI evolves along heterogeneous but repeatable trajectories, enabling early forecasting of the onset of crisis [16]. We hypothesized that clinically significant hypotension, a frequent form of CRI, could also be predicted and that this prediction could be translated into a practical alert system for critically ill patients.

Materials and methods

Study population

A publicly available retrospective multigranular dataset, the Medical Information Mart for Intensive Care III (MIMIC-III), collected between 2000 and 2014 from a tertiary care hospital in Boston, MA, was used as the data source [17]. Subjects aged ≥ 18 years with complete vital sign and clinical records were selected. Algorithms were applied to identify hypotension events, as described below, to classify the source population into a ‘hypotension group’ (subjects experiencing at least one hypotension event) and a ‘non-hypotension group’ (subjects with no hypotension events). To enhance the specificity of identification of the first recorded hypotension event, we further excluded from the hypotension group subjects who had received vasopressors, any amount of crystalloid bolus, or packed red blood cell transfusion within the two hours prior to the subject’s initial (first) hypotension event. Only the first hypotension event was targeted for prediction. Subjects admitted to the ICU prior to the median date of hospital admission were used for model selection and training, while those admitted after the median admission date were used for validation.
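As a minimal illustrative sketch, the temporal split could be implemented as follows; the DataFrame layout and the column name `hosp_admit_time` are assumptions for illustration, not the authors' code.

```python
# Minimal illustrative sketch (assumed DataFrame layout, not the authors' code)
# of the temporal split: subjects admitted before the median hospital admission
# date form the development cohort, the remainder the validation cohort.
import pandas as pd

def split_by_median_admission(subjects: pd.DataFrame):
    """Split on a hypothetical 'hosp_admit_time' column."""
    ordered = subjects["hosp_admit_time"].sort_values()
    median_admit = ordered.iloc[len(ordered) // 2]
    development = subjects[subjects["hosp_admit_time"] < median_admit]
    validation = subjects[subjects["hosp_admit_time"] >= median_admit]
    return development, validation
```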

Defining hypotension, preprocessing of data, and feature engineering

To identify clinically relevant hypotension, the following steps were taken. First, the threshold for hypotension was determined as systolic blood pressure (SBP) ≤ 90 mmHg and mean arterial pressure (MAP) ≤ 60 mmHg [18]. Second, at least 5 of 10 consecutive blood pressure readings (5 out of 10 min, with discrete data points) had to be below the thresholds. Third, if there was a gap of 2 min or less between two periods under the thresholds, these two periods were combined into a single hypotension event (Additional file 1: Figure S1).
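To illustrate how this event definition could be operationalized, the sketch below is a hypothetical implementation: column names and the trailing-window reading of the 5-of-10-minute rule (flagging each minute whose trailing 10-minute window contains at least five below-threshold readings) are assumptions, not the authors' released code.

```python
# A hypothetical sketch of the hypotension event definition described above.
# Column names ('sbp', 'map') and the trailing-window interpretation of the
# 5-of-10-minute rule are illustrative assumptions.
import pandas as pd

SBP_THRESHOLD = 90.0   # mmHg
MAP_THRESHOLD = 60.0   # mmHg

def detect_hypotension_events(vitals: pd.DataFrame) -> list:
    """Return (start_minute, end_minute) pairs of hypotension events from a
    minute-by-minute vitals frame with 'sbp' and 'map' columns."""
    below = (vitals["sbp"] <= SBP_THRESHOLD) & (vitals["map"] <= MAP_THRESHOLD)

    # Flag minutes whose trailing 10-minute window holds >= 5 below-threshold readings.
    counts = below.astype(int).rolling(10, min_periods=10).sum()
    candidate = (counts >= 5).to_numpy()

    # Collapse flagged minutes into contiguous below-threshold periods.
    periods, start = [], None
    for i, flag in enumerate(candidate):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            periods.append((start, i - 1))
            start = None
    if start is not None:
        periods.append((start, len(candidate) - 1))

    # Merge periods separated by gaps of <= 2 minutes into single events.
    events = []
    for p in periods:
        if events and p[0] - events[-1][1] <= 2:
            events[-1] = (events[-1][0], p[1])
        else:
            events.append(p)
    return events
```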

In preprocessing, physiologically implausible values were removed, including SBP, diastolic blood pressure (DBP), or MAP < 10 or > 400 mmHg; respiratory rate < 1 or > 100/min; heart rate < 10 or > 400/min; and SpO2 < 10%. Missing values in gaps shorter than 10 min were imputed using a moving average of the three previous values, assuming signal stability during the missing period. Interpolation using future data was avoided. The performance of this imputation method was examined by comparing statistics of imputed data windows to ground truth provided by data segments without missing data.
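A minimal sketch of this preprocessing step is shown below, assuming minute-by-minute vitals in a pandas DataFrame with hypothetical column names; the SpO2 upper bound of 100% is an added physical ceiling not stated in the text.

```python
# Illustrative sketch of artifact removal and forward-only imputation, assuming
# minute-by-minute vitals in a DataFrame with hypothetical column names.
import numpy as np
import pandas as pd

PLAUSIBLE_RANGES = {
    "sbp": (10, 400), "dbp": (10, 400), "map": (10, 400),   # mmHg
    "resp_rate": (1, 100), "heart_rate": (10, 400),          # per minute
    "spo2": (10, 100),  # text gives only the lower bound; 100% is a physical ceiling
}

def remove_implausible(vitals: pd.DataFrame) -> pd.DataFrame:
    """Set physiologically implausible readings to NaN."""
    cleaned = vitals.copy()
    for col, (lo, hi) in PLAUSIBLE_RANGES.items():
        cleaned.loc[(cleaned[col] < lo) | (cleaned[col] > hi), col] = np.nan
    return cleaned

def impute_short_gaps(series: pd.Series, max_gap: int = 10) -> pd.Series:
    """Fill gaps shorter than `max_gap` minutes with the mean of the three
    previous values; future samples are never used (no interpolation)."""
    values = series.copy().astype(float)
    gap_len = 0
    for i in range(len(values)):
        if pd.isna(values.iloc[i]):
            gap_len += 1
            if gap_len < max_gap and i >= 3:
                values.iloc[i] = values.iloc[i - 3:i].mean()
        else:
            gap_len = 0
    return values
```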

We computed features from the raw vital sign time series, including statistical values (variance, quartiles, mean, median, min, max), frequency-domain features using the discrete Fourier transform [19], and the exponentially weighted moving average (EWMA, a moving average with increased weighting of recent samples) [20]. Features were computed on windows of different durations (5, 10, 30, and 60 min), rolling every minute. To label true hypotension events, all windows from hypotension subjects in the training set were assigned a positive label (hypotension) for the last 15 min prior to the hypotension event. For the non-hypotension group, we assigned negative labels to windows whose ‘onset’ time was defined as within 15 min of the mean time from ICU admission to the first hypotension onset in the hypotension subjects (Fig. 1). Matching the elapsed time in the non-hypotension group to the time to the first hypotension event in the hypotension group was done to generate negative labels with minimal potential bias, as overall trajectories during the first few hours or the last few hours of an ICU stay could introduce greater non-physiologic confounders. The decision to use a 15-min timeframe was made considering practical clinical response time in anticipation of a hypotensive event and was also objectively supported by a t-distributed stochastic neighbor embedding (t-SNE) analysis [21], an unsupervised ML technique used here to estimate the time point at which the two groups diverge before hypotension (Additional file 2: Figure S2).
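The feature computation can be illustrated with the simplified sketch below, which derives statistical, Fourier, and EWMA features for one vital sign series at a single time point; the number of retained Fourier coefficients and the EWMA span are illustrative choices, not the authors' exact settings.

```python
# Simplified sketch of the multi-window feature computation for one vital sign
# series at a single time point.
import numpy as np
import pandas as pd

WINDOWS_MIN = (5, 10, 30, 60)

def window_features(series: pd.Series) -> dict:
    """Statistical, frequency-domain, and EWMA features over trailing windows."""
    feats = {}
    for w in WINDOWS_MIN:
        seg = series.iloc[-w:].dropna()
        if seg.empty:
            continue
        feats[f"mean_{w}"] = seg.mean()
        feats[f"median_{w}"] = seg.median()
        feats[f"var_{w}"] = seg.var()
        feats[f"q25_{w}"], feats[f"q75_{w}"] = seg.quantile([0.25, 0.75])
        feats[f"min_{w}"], feats[f"max_{w}"] = seg.min(), seg.max()
        # Magnitudes of the first few non-DC discrete Fourier coefficients.
        spectrum = np.abs(np.fft.rfft(seg.to_numpy()))
        for k, mag in enumerate(spectrum[1:4], start=1):
            feats[f"fft{k}_{w}"] = mag
        # Exponentially weighted moving average, weighting recent samples more.
        feats[f"ewma_{w}"] = seg.ewm(span=w).mean().iloc[-1]
    return feats
```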

Fig. 1

Schematic illustration to assess the performance of the hypotension prediction model on a finite time horizon (2 h). From the development cohort, the last 5 min before the hypotension episode in the hypotension group were labeled and trained as positive, and the 15-min data segment ending 2 h and 15 min before the episode was labeled and trained as negative

Model training and validation

We used a random forest (RF) classifier, a K-nearest neighbor (KNN) classifier, gradient boosted trees, and logistic regression with L2 regularization [22, 23], with a tenfold cross-validation process on the training set (development cohort) (Additional file 3: Figure S3). The best performing model in the training set was applied to the validation set. The performance of these supervised ML models was compared using the area under the receiver operating characteristic curve (AUROC) and its evolution over time. Since the numbers of hypotension and non-hypotension subjects were unequal (imbalanced), we also computed the area under the precision–recall curve (AUPRC), in line with recent recommendations following the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement [24, 25]. Model calibration was evaluated using the Brier score [26, 27]. A brief explanation of the models used, training methods, and performance evaluation techniques can be found in the supplementary glossary (Additional file 4: Glossary).
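A condensed sketch of such a model comparison is given below, assuming a precomputed feature matrix X and label vector y; the hyperparameters shown are placeholders rather than the tuned values used in the study.

```python
# Condensed sketch of the model comparison with tenfold cross-validation and
# the three reported metric families (AUROC, AUPRC, Brier score).
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

MODELS = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "gradient_boosted_trees": GradientBoostingClassifier(random_state=0),
    "logistic_l2": LogisticRegression(penalty="l2", max_iter=1000),
}

SCORING = {"auroc": "roc_auc", "auprc": "average_precision", "brier": "neg_brier_score"}

def compare_models(X, y) -> dict:
    """Mean cross-validated AUROC, AUPRC, and Brier score for each model."""
    results = {}
    for name, model in MODELS.items():
        cv = cross_validate(model, X, y, cv=10, scoring=SCORING)
        results[name] = {
            "auroc": cv["test_auroc"].mean(),
            "auprc": cv["test_auprc"].mean(),
            # 'neg_brier_score' is negated so larger is better; flip the sign back.
            "brier": -cv["test_brier"].mean(),
        }
    return results
```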

Hypotension prediction risk score trajectories

A risk score (a number between 0 and 1, representing the relative probability of future hypotension at the end of the observation window) was generated every minute for each subject in the validation cohort, generating individualized risk trajectories for the entire ICU stay of a subject. Due to missing data, not all subjects had risk scores computed every minute before hypotension or the end of the monitoring window.
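Conceptually, the trajectory is simply the sequence of positive-class probabilities produced by the trained classifier on consecutive one-minute feature windows, as in this minimal sketch (assuming a scikit-learn-style model and a precomputed per-minute feature matrix).

```python
# Minimal sketch: the per-minute risk trajectory is the positive-class
# probability of the trained classifier on consecutive one-minute feature rows.
import numpy as np

def risk_trajectory(model, feature_rows: np.ndarray) -> np.ndarray:
    """Probability of future hypotension for each one-minute window of a stay."""
    # For scikit-learn classifiers, column 1 of predict_proba is the
    # probability of the positive (hypotension) class.
    return model.predict_proba(feature_rows)[:, 1]
```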

Operational performance of the hypotension alert system

To better understand how a hypotension forecasting model would perform prospectively when deployed as an alerting system, we identified exceedances per hour per subject of the predicted hypotension risk score beyond various risk score thresholds. To further determine whether an exceedance of the risk score qualifies as a system alert, we employed a stacked RF model. This two-step (stacked) model (Additional file 5: Figure S4) was first trained with tenfold cross-validation and then underwent out-of-sample subject validation, using additional features including time since admission and the average, minimum, maximum, and standard deviation of the risk scores produced by the first model over the last 5, 10, and 30 min prior to the current time on a moving window. Lastly, a lockout period of 15 min (an alert is not generated if it would occur within 15 min of the previous alert) was used to prevent an excessive number of alerts, whether true or false. The alert-level performance was evaluated with the true positive alert rate and total alerts per subject per hour, to assess whether alerts could be trusted and how many alerts clinicians would receive at the bedside. The subject (patient)-level performance was assessed with the probability of a future hypotension event following an alert (positive predictive value, PPV) and the miss rate (1 − sensitivity), to demonstrate how likely the alert system was to predict, and to fail to predict, hypotension events, respectively.
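The alerting layer can be illustrated with the sketch below, covering hypothetical meta-feature computation for the stacked second-stage model and the 15-minute lockout filter on threshold exceedances; the 5/10/30-min summary windows follow the text, while function names and data layout are assumptions.

```python
# Illustrative sketch of the alerting layer: meta-features summarizing the
# recent risk score history (inputs to the stacked second-stage model) and a
# 15-minute lockout filter applied to threshold exceedances.
import pandas as pd

def stacked_features(risk: pd.Series, minutes_since_admission: int) -> dict:
    """Summaries of the first-stage risk score over the last 5/10/30 minutes."""
    feats = {"time_since_admission": minutes_since_admission}
    for w in (5, 10, 30):
        recent = risk.iloc[-w:]
        feats[f"risk_mean_{w}"] = recent.mean()
        feats[f"risk_min_{w}"] = recent.min()
        feats[f"risk_max_{w}"] = recent.max()
        feats[f"risk_std_{w}"] = recent.std()
    return feats

def apply_lockout(alert_minutes: list, lockout: int = 15) -> list:
    """Suppress any alert occurring within `lockout` minutes of the last delivered alert."""
    delivered = []
    for t in alert_minutes:
        if not delivered or t - delivered[-1] >= lockout:
            delivered.append(t)
    return delivered

# Example: raw threshold exceedances at these minutes are thinned by the
# 15-minute lockout to alerts at minutes 12, 45, and 120.
print(apply_lockout([12, 14, 20, 26, 45, 120, 128]))  # -> [12, 45, 120]
```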

Results

From a source population of 10,269 subjects with 22,246 ICU stays, we identified 1532 subjects (1946 ICU stays) as the hypotension group and 1707 subjects (2585 ICU stays) as the non-hypotension group (Fig. 2). The development cohort included 641 hypotension subjects (781 ICU stays) and 826 non-hypotension subjects (1148 ICU stays). The validation cohort included 666 hypotension subjects (799 ICU stays) and 793 non-hypotension subjects (1131 ICU stays). The development and validation cohorts were similar in size, gender, types of ICU (i.e., Medical ICU, Cardiothoracic ICU, or Surgical ICU), and in-hospital mortality. Notably, age was slightly lower and length of hospital stay slightly shorter in the development cohort (Additional file 6: Table S1). Comparing the hypotension and non-hypotension groups, the hypotension group subjects were older (age 66.4 years vs. 60.8 years) and had longer hospital stays (11 days vs. 5.7 days); otherwise, the two groups were similar in the distribution of sex and ICU types (Additional file 7: Table S2). The mean time to the first hypotension event was 102 h and 16 min (standard deviation: 164 h and 23 min). The median time was 34 h and 28 min, with an interquartile range of 107 h and 45 min.

Fig. 2

Data extraction pipeline. From the initial MIMIC-III database, inclusion and exclusion criteria were applied, along with the definition of a hypotension event. Then, feature selection was performed to derive the hypotension (subjects who experienced hypotension events during the ICU stay) and non-hypotension (those without a hypotension event) groups

Performance of machine learning algorithms

In selecting the best risk model, RF showed performance equivalent to the KNN, logistic regression, and gradient boosted tree models, as visualized by the AUROC on the validation cohort. The AUROCs during the last 15 min and 1 h prior to the hypotension event were 0.93 and 0.88, respectively, on the validation cohort (Additional file 8: Figure S5, Left), with a good calibration score (Brier score 0.09) (Additional file 8: Figure S5, Center) and AUPRCs of 0.90 and 0.83 (Additional file 8: Figure S5, Right). Given the equivalent predictive performance but superior calibration score, we chose the RF algorithm to build the prediction model. The performance of the RF model was then verified on the validation cohort with the AUROC, calibration plot, and AUPRC (Fig. 3).

Fig. 3

Performance evaluation for various supervised machine learning algorithms, with the evolution of the area under the receiver operating characteristic curve (AUROC) over time (left), the calibration plot with Brier scores (center), and the evolution of the area under the precision–recall curve (AUPRC) over time (right)

Trajectory analysis

After accounting for missing values due to incomplete data, 793 hypotension subjects and 1131 non-hypotension subjects had risk score trajectories. The trajectories (Fig. 4), drawn with the validation cohort, exhibited a clear separation in mean risk between hypotension and non-hypotension subjects from 3 h prior to hypotension. The separation widened as the hypotension subjects approached the hypotension event, and their risk escalated rapidly from approximately 30 min prior to hypotension.

Fig. 4

Average evolution of the risk scores for hypotension (red) and non-hypotension (blue) groups is projected as trajectories, extending to 4 h preceding hypotension events. Shaded areas represent 95% confidence intervals. Dotted lines indicate the number of hypotension and non-hypotension subjects used to derive the risk score trajectory points at a given horizon

Alert identification and operational usefulness

We investigated the relationship between different risk score thresholds and the probability of future hypotension (alert). We first chose a threshold of 0.5, aiming to detect at least 90% of actual future hypotension events, which yielded 7.39 alerts/subject/hour on average with a PPV of 57.7%. Using the stacked RF model, the average alert frequency was reduced to 4.93 alerts/subject/hour at the same threshold, with the PPV improved to 65.2%. Adding a 15-min lockout period to the output of the stacked model further decreased alerts to 0.79 alerts/subject/hour. That is, with the lockout period, the probability of future hypotension following a single alert (PPV) is not negatively affected, but a clinician could expect far fewer alerts (one every 75 min on average). Using the same threshold, our model had a sensitivity of 92.4% (failing to predict hypotension for 7.6% of actual hypotension subjects) (Fig. 5). At the patient (subject) level, using the whole ICU data of a given subject, the AUPRC was 0.68; when a random 1-h data segment from the hypotension and non-hypotension groups was used, the AUPRC was 0.91 in the validation cohort.

Fig. 5

Relationship between the detected hypotension subjects (%) and the probability of hypotension after an alert, for the stacked model (orange line) and the single random forest model (blue line). Detected cases indicate the percentage of hypotension subjects our model successfully predicted as an alert before hypotension. At the risk score threshold of 0.5, the probability of hypotension in the future (positive predictive value) was approximately 0.65 (red dashed arrow), with 92.37% of hypotension events captured (vertical blue dashed arrow)

Discussion

This study is one of the few to demonstrate data-driven prediction of clinically significant hypotension events using a supervised ML algorithm applied to streaming bedside vital sign monitoring data, and it is unique in providing an actionable roadmap for designing a clinically useful alerting system from a predictive model developed on retrospective data. In this study, we focused on designing a ‘data-driven’ risk prediction system with operational usefulness. Our approach is notable for a modeling component, in which we assess model performance using multiple methods and express dynamic changes in risk, and an implementation component, in which we design a two-step pipeline to increase the overall reliability of alerts and decrease alarm fatigue.

There are few studies using ML-based approaches to predict hypotension. A recent report describes intraoperative prediction of hypotension 10–15 min in advance, using data from a specialized, commercially available noninvasive continuous arterial waveform sensor [11]. Our research demonstrates that this prediction task can be extended to ICU patients using relatively sparse (minute-by-minute) data, resulting in a longer prediction horizon (AUROC of 0.88 at 60 min prior to the event) linked with clinically relevant alerts (an average of 0.79 alerts/subject/hour with a 15-min lockout time). Our strategy has practical implementation benefits, securing time to prepare action plans, and can be useful in relatively resource-limited environments where continuous care by trained critical care providers is not available for immediate action.

Performance of the prediction model needs to be assessed with multiple methods. A previous study on the MIMIC-II database used vital signs and medication data to predict hypotension 1 h prior to the event and reported an AUROC of 0.934 [12]. Despite the seemingly high AUROC, however, the study included many negative samples (low pretest probability) and reported a deceptively low PPV of 0.151, limiting the real-life feasibility of the resulting model. To bolster the potential feasibility of implementation, we performed a multi-faceted performance evaluation employing not only the AUROC trajectory but also the AUPRC and calibration assessed with the Brier score. Methodologically, model analysis relying only on the cumulative assessment of ROC curves could mislead the interpretation of such patterns in vital sign physiology, as the AUROC itself does not address misclassification cost [28]. The addition of the PRC allowed assessment of the models’ ability to identify true positives when the groups are imbalanced, providing a fuller picture of the capacity of the different models. The calibration analysis helped select the model type whose predictions were best aligned with the posterior probability distributions.

Building risk score trajectories provides conceptual and practical advantages. First, instantaneous scores may be subject to stochastic variations, erroneous entries, and artifacts, whereas trajectories provide historical context to a risk, which translates to the clinical concept of a worsening or improving health state, perhaps in response to an intervention [29,30,31]. Second, studies suggest that the prognosis of critically ill patients is associated with early recognition of, and timely intervention for, abrupt changes [32]. Early identification of dynamic risk changes in critically ill patients can be highly informative, as sudden unexpected physiologic deteriorations are common in the ICU (e.g., septic patients developing gastrointestinal bleeding from stress ulcers, or renal failure patients exhibiting arrhythmias from severe hyperkalemia). In our study, prediction visualized with the risk score trajectory allowed early differentiation in the mean risk scores from at least 3 h prior to hypotension events, an objective metric that does not rely on practitioners’ skill in interpretation. While this finding illustrates the power of our model, it needs to be interpreted with caution: the mean risk does not apply directly to an individual patient, and given the high interpersonal variance of risk scores in real life, some estimate of the probability that an individual lies outside the common band should be provided.

Prediction followed by a reliable alerting strategy for any critical event in real life would be one of the holy grails of ICU care, because of highly variable and heterogeneous individual clinical pictures. A recent study confirmed this, as various cardiovascular states, reserves, or responses were observed when a standardized resuscitation protocol was employed during septic shock [33]. Delivering the prediction to the bedside is also challenging, as the alert should be actionable and linked to a meaningful management strategy. A recent study developed an intraoperative hypotension prediction index with an AUROC of 0.88 at 15 min, with excellent PPV and NPV [34]; when linked with an action plan in the operating room to treat predicted hypotension, it resulted in less hypotension and fewer post-operative complications [35]. In our study, we conceptualized a predictive alert system that performs well in the ICU environment, where monitoring is less frequent than in the operating room, along with a lockout design to minimize alarm fatigue. First, our two-step model demonstrated the utility of potential implementation, with a further increased true positive rate and decreased false positive rate. Then, a 15-min lockout period made the model more actionable by reducing its contribution to alarm fatigue. Mitigating alarm fatigue is important because it can be associated with failure to rescue if clinicians ignore excessively frequent alarms that may carry critical information [36]. The 15-min lockout period was chosen to decrease repetitive alerts with similar clinical meaning, assuming that alerts more frequent than every 15 min would not alter the rationale of the management strategy. In a hypothetical 20-bed ICU, our model would alarm 16 times per hour, while without the lockout period it would yield 159 alerts (7.93 × 20) per hour. A good example of this approach is a recent study that used a rule-based model in cardiothoracic ICU patients to decrease alerts by 55% with a lockout, while capturing almost all true clinical deterioration events [37]. Finally, with the alert-level and subject-level analyses of the alert system, we showed that the vast majority of future hypotension events could be captured with high sensitivity and a dependable post-alert probability, suggesting potential real-life utility.

Our work has several limitations. First, an external validation cohort using multicenter data was not used to confirm the performance of our model. Instead, we used an a priori separated out-of-sample validation set. We are currently collecting large-scale multigranular ICU data in our institution and plan further external cohort validation. Second, our operational definition of hypotension was based on conventional cutoff values, not specifically designed to meet the characteristics of individual subjects. In addition, despite our preprocessing to minimize non-physiologic artifacts, there could still be artifacts that fell within the physiologic normal range and might have interfered with the analysis. However, we postulate that the effect of the remaining artifacts is likely minimal if their distribution across vital sign samples is not systematic (i.e., random). Third, we developed a risk model for the first hypotension event in the ICU to simplify the prediction task. An alerting system that would apply across the span of an admission would have to go beyond a first episode of hypotension. An alerting system for all hypotension episodes would need to integrate a different risk model for subsequent episodes, and this model would likely use clinical interventions and the number of prior episodes as additional risk factors. We are currently developing such a model. We also excluded subjects who had received a fluid bolus or vasopressor treatment within 2 h of a first hypotension event. This modeling choice was justified by the notion that such interventions were possibly linked to unwitnessed hypotension (prior to data availability), and thus that the first recorded episode of hypotension in this subpopulation was more likely to be a subsequent episode. These choices narrowed the potential utility of our alert system if deployed. Although our proposed alert system could be tested in these circumstances, we would expect a decrease in performance. Fourth, we did not use temporal correlations between features imposed by physiological constraints as additional features. Expanding the set of variables to include more sophisticated and physiologically inspired features, and including higher-granularity monitor data (e.g., waveforms), might achieve better algorithm performance. Lastly, our selection of hypotension events might have missed real events (false negatives), as acutely profound hypotension is usually treated within minutes with fluid bolus resuscitation or vasopressor therapy. With a higher-granularity dataset, we argue that individual trajectories and their triggering factors for future hypotension could be identified in a more sophisticated manner.

Conclusions

Clinically relevant hypotension events can be predicted from minute-by-minute vital sign data using machine learning approaches. This prediction can be integrated into a highly sensitive alert delivery system with a low false alert rate, minimizing alarm fatigue and offering potential real-life utility.