1 Introduction

Worldwide, more than 300 million surgeries are performed annually, a number expected to increase with the aging population and the consequent rise in comorbid conditions [1, 2]. In patients having non-cardiac surgery, intraoperative hypotension (IOH) under general anesthesia is a frequent physiological derangement [3]. In those undergoing liver resections, IOH has been reported in nearly half of cases because of their complexity, positioning, substantial blood loss, and fluid shifts [4,5,6]. Brief episodes of IOH, ranging from moderate to profound, have been shown to carry clinically significant consequences [7]. Studies have shown that IOH is associated with a higher risk of postoperative morbidity, including acute renal failure, delirium, and myocardial injury after non-cardiac surgery [3, 4, 8,9,10,11,12,13,14]. Therefore, avoiding IOH is important in procedures such as liver resections.

Machine learning (ML) in healthcare has emerged in recent decades as a tool for prognostication, clinical decision support, and predicting complications. Machine learning strategies provide a framework from which to estimate intraoperative events of interest from observed time-series data (e.g., blood pressure or oxygen saturation) [15,16,17,18]. In that regard, significant research efforts are underway to apply ML to the complex interplay among surgery, patient demographics, and anaesthetic factors, and its relationship with perioperative outcomes [15,16,17].

For example, several recent ML approaches have been applied to predict IOH events [15,16,17,18,19,20,21,22]. In a randomized clinical trial, an ML-derived early-warning system showed a decrease in the incidence and duration of IOH, suggesting that predicting IOH is superior to merely managing the event promptly once it occurs [20]. This study used a commercially available ML-derived tool called the Hypotension Prediction Index (HPI, Edwards Lifesciences, Irvine, CA) [20]. The index estimates the occurrence of IOH using an algorithm derived from the dynamic variations of an arterial line waveform. Potential limitations in previous work with HPI include the need for high-frequency arterial waveform data (which adds cost to patient care and is not used in most anaesthetics delivered), selection bias, selection of prediction-outcome pairs, reduced performance between backwards and forward case analysis as recently indicated by Davies et al., and complicated or inaccurate treatment recommendations [20,21,22,23,24,25]. It has also been suggested that the HPI overestimates the prediction of IOH [26]. Another notable gap in previous work using HPI is that liver resections were excluded from some HPI analyses due to possibly lower IOH prediction performance in that context, providing a strong point of motivation for the work presented below, which focuses on liver surgery.

Furthermore, in the clinical setting, anaesthesiologists routinely rely on additional factors to predict the individual thresholds of hypotension in the operating room. For example, the amount of anaesthetic being administered could affect the patient’s blood pressure [27]. Several factors are often considered to determine the optimal blood pressure range, such as age and amount of anaesthetic being delivered.

To more closely mirror the clinician’s complex decision-making process at the bedside, we investigated the performance of a novel prediction algorithm that incorporates multiple dimensions of physiological monitoring data (cardiac, respiratory, and neuro monitoring) in combination with patient-specific factors (age, sex, and BMI) to forecast the occurrence of IOH. Initial work specifically tackled the challenging scenario of major liver surgeries, which are hemodynamically complex and carry strong potential for substantial blood loss and fluid shifts that expose patients to the risk of IOH [5, 6]. We hypothesized that our multivariate ML algorithm would yield high performance for IOH prediction in liver surgery. Features of this work distinct from previous reports in the literature include: (i) the incorporation of multivariate physiological time series data and patient-specific demographic information; (ii) a multi-model approach comprising an ensemble of ensembles hypothesized to improve performance and practicality; and (iii) the ability to control the sensitivity and specificity of the prediction in real-time.

2 Methods

2.1 Data curation, pre-processing, and truth definition

The study drew from retrospective data under Institutional Review Board approval (IRB# 2023-0656) for patients undergoing hepatobiliary surgery at our institution. Data were extracted from a database of physiological monitoring variables gathered from the electronic medical record for surgeries between 3/22/2016 and 2/18/2023. In accordance with guidelines reported in a multidisciplinary review of ML-based predictive models, the work was retrospective in nature with internal validation [28]. Inclusion criteria were open partial or total liver lobectomy, age > 18 years at the time of surgery, availability of demographic data, and intraoperative physiological signals, as noted below. Patients undergoing emergency or minimally invasive surgeries and those having multiple surgical procedures involving other surgical specialties were excluded.

A key aspect of the model described below is the combination of demographic and multivariate physiological time-series data incorporated into the prediction model. Demographic data included the subject age at the time of surgery, body mass index (BMI), and sex. Physiological data drawn from anaesthesia monitoring included 13 time-series signals recorded at 1-min intervals: diastolic arterial line (Art Line D), systolic arterial line (Art Line S), mean arterial pressure (MAP), heart rate as measured by pulse oximeter (HR, Oximeter), inspired oxygen fraction (FiO2), delivered oxygen flow (O2 flow), bispectral index (BIS Monitor), inspired desflurane, expired desflurane, minute volume ventilation (Minute Volume), respiratory rate (Resp), peak inspiratory pressure (PIP), and tidal volume (Vt). The saturated haemoglobin concentration from the pulse oximeter reading (SpO2) was originally included, but analysis of feature importance (Sect. 2.2.2) showed little influence on IOH prediction, and it was subsequently excluded.

Time series data were considered beginning 30 min after arterial line placement, when all physiological signals mentioned above were present, and were low-pass filtered with a 3-min averaging window to reduce noise. Signals exhibiting missing data of 1–2 min were median filtered to impute minor gaps in the data. Training cases were classified as positive for IOH if they exhibited a drop in MAP below 65 mm Hg sustained for at least 2 min. Only the first IOH event was considered for each case, beyond which the subject was assumed to be under medical intervention and was not further considered in training or testing. Negative cases were those for which MAP remained above 65 mm Hg throughout the observation period. Selection bias was minimized by forward analysis of the physiological variables with respect to future IOH (up to 8 min in advance, as detailed below), rather than by the manner in which the data were assembled [29].
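To make the pre-processing and truth definition concrete, the gap imputation and IOH labelling steps might be sketched as follows (a minimal illustration; the function names and the pandas-based filtering are ours, not the authors' implementation):

```python
import numpy as np
import pandas as pd

def preprocess_signal(series: pd.Series) -> pd.Series:
    """Impute 1-2 min gaps with a median filter, then smooth with a
    3-min averaging window (a simple low-pass filter), as described above."""
    # A centred 3-sample rolling median fills isolated 1-2 min gaps.
    imputed = series.fillna(series.rolling(3, center=True, min_periods=1).median())
    # 3-min averaging window to reduce noise.
    return imputed.rolling(3, center=True, min_periods=1).mean()

def first_ioh_onset(map_mmhg: np.ndarray, threshold: float = 65.0, sustain_min: int = 2):
    """Return the index (minute) of the first IOH event, defined as
    MAP < 65 mm Hg sustained for at least 2 consecutive minutes; None otherwise."""
    below = map_mmhg < threshold
    run = 0
    for t, b in enumerate(below):
        run = run + 1 if b else 0
        if run >= sustain_min:
            return t - sustain_min + 1  # onset minute of the sustained drop
    return None
```

Per the truth definition above, only this first event would be used; subsequent minutes are assumed to occur under medical intervention and are discarded.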

Figure 1 illustrates the segmentation of time series data into “data intervals” (i.e., the input data from which a prediction of IOH is made) and “forecast intervals” (i.e., the time period between the data interval and a possible IOH event). For illustration purposes, the time stamp in Fig. 1 denotes time-shifted series relative to the start of the data interval and does not describe the actual time during surgery. Throughout this work, the data interval was taken to be a 10-min interval prior to a possible IOH event. As a starting point, the current work focused on a relatively short forecast interval up to 8 min in duration, selected in part due to the relatively fast hemodynamic swings that have been described in liver surgery. Thus, the proposed interval along with the sliding window approach described below is fairly realistic with respect to clinical decision making in the operating room and could in turn provide flexibility to monitor, react, and promptly intervene to prevent IOH. Shorter (~ 5 min) and longer (~ 15 min) intervals are possible subjects of future work. The forecast interval duration was randomly varied from 3 to 8 min according to a uniform distribution, and as shown in Fig. 1, for a data interval spanning t = 1–10 min, the IOH event could therefore occur anywhere from t = 13 min to t = 18 min.

Fig. 1

Illustration of time series data segmentation. The “data interval” refers to a 10-min interval of data ingested by the predictive model. The “forecast interval” refers to the time between the end of the data interval and the instance of a possible IOH event
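The segmentation of Fig. 1 can be sketched as follows (an illustrative sketch under the stated 10-min data interval and uniformly distributed 3–8 min forecast interval; the function name and array layout are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_positive_case(signals: np.ndarray, ioh_minute: int,
                          data_len: int = 10, min_forecast: int = 3,
                          max_forecast: int = 8):
    """Extract one (data interval, label) pair for a positive case.

    signals: array of shape (n_minutes, n_features), sampled at 1-min intervals.
    ioh_minute: index of the IOH event.
    The forecast interval is drawn uniformly from 3-8 min, so the 10-min
    data interval ends 3-8 min before the event (cf. Fig. 1).
    """
    forecast = int(rng.integers(min_forecast, max_forecast + 1))
    end = ioh_minute - forecast        # last minute of the data interval
    start = end - data_len
    if start < 0:
        return None  # insufficient history before the event
    return signals[start:end], 1  # label 1: IOH occurs within the forecast interval
```

For a data interval spanning t = 1–10 min, this reproduces the possible event times of t = 13–18 min described in the text.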

2.2 Predictive models

A number of model architectures were considered for initial investigation of IOH prediction, including artificial neural networks (ANN), gradient boosted trees (GBT), and random forest (RF) classifiers. While ANNs are a potentially powerful approach (as in, for example [16]), they typically require large training sets and can limit interpretability. Simpler ML approaches such as GBTs and RFs can yield a reasonable degree of predictive performance, a high degree of interpretability and explainability, and are supported by widely available software libraries that facilitate implementation and reproducibility (e.g., the sktime ML Python library for time series data). For these reasons, the initial studies reported below were based on RF supervised classifiers with 100 decision trees combined using the ColumnEnsembleClassifier() function, and future work will consider alternative architectures.
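As a rough illustration of the supervised problem, the sketch below flattens each 10-min multivariate window, appends the demographics, and fits a 100-tree random forest with scikit-learn. This is only a stand-in for the paper's sktime ColumnEnsembleClassifier() pipeline, which instead fits a per-column time-series classifier and ensembles across columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_rf(windows: np.ndarray, demographics: np.ndarray, labels: np.ndarray):
    """windows: (n_samples, 10, 13) time-series windows at 1-min resolution.
    demographics: (n_samples, 3) age, sex, BMI.
    labels: (n_samples,) 0/1 for IOH within the forecast interval."""
    # Flatten each window and append the static demographic features.
    X = np.hstack([windows.reshape(len(windows), -1), demographics])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf
```

Flattening discards the columnar structure that the sktime ensemble preserves; it is used here only to keep the sketch self-contained.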

Several variations in model training and testing were investigated to rigorously assess the potential for bias and to evaluate performance under a broad variety of conditions. Differences in accuracy were analyzed using a non-parametric Mann–Whitney U test, with p-value < 0.05 interpreted as evidence of statistical significance. Models were trained with and without Z-score signal normalization. Two variations in train:test proportion were considered (90:10 and 80:20) to investigate the tradeoffs between a larger training set and fewer test samples, hypothesizing slight improvement for the former due to a larger volume of training data. As an alternative to the 8-min forecast interval (randomly varied from 3–8 min), models with a fixed 5-min interval were investigated. Finally, three variations in class balance were trained and tested: (i) the natural imbalance of the data (as shown below, ~ 6× in favor of positive cases); (ii) balancing negative and positive datasets by sampling three negative datasets (non-overlapping, separated by at least 20 min) from each negative case plus three negative datasets (similarly non-overlapping) from positive cases sampled at least 30 min prior to the IOH event, referred to as “3× sampling”; and (iii) balancing achieved by sampling six non-overlapping datasets from negative cases, referred to as “6× sampling.” Variations (ii) and (iii) therefore involved patient-level combined with segment-level splitting to achieve class balance and are recognized to carry potential bias. The bias is partly mitigated by sampling non-overlapping data intervals separated by at least 20 min and assumed to be independent, recognizing that age, sex, and BMI are shared across segments from the same patient and therefore not strictly partitioned.
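The non-overlapping sampling used for the 3×/6× class-balancing schemes might be sketched as follows (an illustration under our reading of "separated by at least 20 min" as a 20-min separation between window start times, which also guarantees the 10-min windows do not overlap; function names are hypothetical):

```python
import numpy as np

def sample_negative_windows(n_minutes: int, n_windows: int,
                            data_len: int = 10, min_gap: int = 20, seed: int = 0):
    """Draw up to n_windows non-overlapping 10-min windows whose start times
    are separated by at least min_gap minutes, as in the 3x/6x sampling
    schemes described above. Returns a sorted list of (start, end) pairs;
    fewer pairs are returned if the record is too short."""
    rng = np.random.default_rng(seed)
    candidates = list(range(0, n_minutes - data_len + 1))
    rng.shuffle(candidates)
    starts = []
    for s in candidates:
        # Accept a candidate only if it keeps all starts >= min_gap apart.
        if all(abs(s - t) >= min_gap for t in starts):
            starts.append(s)
        if len(starts) == n_windows:
            break
    return [(s, s + data_len) for s in sorted(starts)]
```

Randomizing the candidate order avoids always sampling from the beginning of the record; each accepted window can then be treated as an approximately independent negative dataset, per the assumption stated above.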

2.2.1 Application scenarios: static and dynamic

Two application scenarios were considered for analysis of model predictions. The first was a relatively simple “static” scenario in which a single 10-min data interval was taken as input for each case, ending up to 8 min prior to a possible IOH event, as illustrated in Fig. 1. The static scenario was hypothesized to represent an optimistic upper bound on model performance under the idealized condition that the validation data were evenly split between cases positive and negative for IOH.

A second, more realistic, and challenging scenario was considered in which the 10-min data interval advanced in 1-min steps through an extended period of truly negative data prior to a possible IOH event. With the forecast interval up to 8 min and a possible IOH event time-shifted to t = 40 min, the duration of the truly negative period ranged from 32 min (for an 8-min forecast interval) to 37 min (for a 3-min forecast interval). Sliding the 10-min data window in 1-min steps through a preponderance of truly negative instances and forming a prediction at each step is referred to below as the “dynamic” sliding window scenario, illustrated in Fig. 2. We hypothesized that model performance would be challenged (specifically, a reduction in PPV) in the dynamic scenario due to the high prevalence of negative instances—analogous to the more realistic clinical scenario of an early warning system for which most of the samples are truly negative, and a high level of PPV with few false alarms is required for the system to be useful. Testing in the “sliding window” scenario involved 20 cases (10 positive + 10 negative) that were held out and previously unseen in training at either the patient-level or segment-level.

Fig. 2

Representation of the dynamic scenario. Illustration of the dynamic scenario in which the data interval advances in 1-min steps through a preponderance of truly negative data prior to a possible IOH event. Regions for which IOH prediction is truly negative or truly positive are illustrated in green and red, respectively. The MAP is plotted in red on the left axis for a positive case time-shifted such that the IOH event occurs at t = 40 min. The right axis shows the Number of Votes (among 11 models in the MMV approach) for a positive prediction, and the resulting MMV predictions at each 1-min step are labelled TN (green), FP (yellow), TP (green), and FN (red)
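The dynamic scenario of Fig. 2 amounts to advancing the 10-min window in 1-min steps and forming a prediction at each step, which might be sketched as follows (an illustration; the callable classifier interface is our own simplification):

```python
import numpy as np

def sliding_predictions(clf_predict, signals: np.ndarray, demographics: np.ndarray,
                        data_len: int = 10):
    """Advance the 10-min data interval in 1-min steps through the record and
    form a prediction at each step (the 'dynamic' scenario of Fig. 2).

    clf_predict: callable mapping a flat feature vector to a 0/1 prediction.
    signals: (n_minutes, n_features) time series; demographics: (3,) age, sex, BMI.
    Returns one prediction per 1-min step."""
    preds = []
    for end in range(data_len, signals.shape[0] + 1):
        window = signals[end - data_len:end]
        x = np.concatenate([window.ravel(), demographics])
        preds.append(int(clf_predict(x)))
    return preds
```

For a 40-min record this yields 31 predictions, most of which fall in the truly negative region, reproducing the prevalence challenge (and the pressure on PPV) described above.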

2.2.2 Model variations: MAP-only, single-model, and multi-model

Three main variations in the predictive model were developed and tested. The first was an RF model trained as detailed above but with only MAP as an input feature—referred to as the univariate “MAP-only” model, analogous to previously reported models [15].

To evaluate the importance of multi-dimensional inputs that more closely reflect the actual clinical considerations of an anaesthesiologist, a multivariate RF model was developed that takes the 3 demographic variables and 13 physiologic time series signals described above as input—referred to as the “single” multivariate model (in contrast to “multi-model,” described below). The importance of individual features contributing to multivariate model predictions was evaluated via ablation—i.e., removing a given feature, retraining, and retesting in the absence of the ablated feature. The reduction in model performance yields a surrogate for feature importance, and the features were rank-ordered accordingly.
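The ablation-based surrogate for feature importance might be sketched as follows (a minimal illustration using a single train/test split in place of the paper's 11-fold cross validation; names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ablation_importance(X: np.ndarray, y: np.ndarray, feature_names):
    """Rank features by the drop in test accuracy after removing each one,
    retraining, and retesting, per the ablation procedure described above."""
    n = len(X) // 2  # simple half/half split for illustration

    def accuracy(cols):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[:n][:, cols], y[:n])
        return clf.score(X[n:][:, cols], y[n:])

    baseline = accuracy(list(range(X.shape[1])))
    drops = {}
    for i, name in enumerate(feature_names):
        kept = [j for j in range(X.shape[1]) if j != i]
        drops[name] = baseline - accuracy(kept)  # larger drop -> more important
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```

The returned ranking corresponds to the rank-ordering of features reported in Sect. 3.3 (cf. Fig. 5c, where the horizontal axis denotes the ablated feature).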

In light of the challenge presented by a large prevalence of truly negative instances in the “dynamic” scenario, a “multi-model voting” (MMV) approach was developed that runs multiple, separately trained RFs in parallel, each contributing a vote. Whereas the conventional “single” multivariate model represents a single RF resulting from an ensemble of decision trees, the MMV approach comprises multiple RFs. Given that an RF is itself an ensemble approach, the MMV approach can be considered an “ensemble of ensembles.” With MMV, each of the (11) RF models from 11-fold cross validation described above contributes a prediction that is taken as a “vote” in forming a final prediction—e.g., by simple majority (Nvotes ≥ 6). Figure 2 illustrates the MMV approach for a single case in the dynamic scenario, with the number of votes (Nvotes) shown on the right axis and True-Negative (TN), True-Positive (TP), False-Negative (FN), and False-Positive (FP) predictions marked. The MMV approach was further investigated with Nvotes taken as an adjustable parameter, allowing the anaesthesiologist to control the sensitivity and specificity of the model by dialling the Nvotes threshold lower (for greater sensitivity) or higher (for greater specificity).
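The voting step of the MMV approach reduces to thresholding the vote count from the 11 parallel RFs, sketched below (an illustration; the array layout is our own):

```python
import numpy as np

def mmv_predict(votes: np.ndarray, n_votes: int = 6) -> np.ndarray:
    """Combine per-model predictions into a final MMV prediction.

    votes: (n_models, n_steps) array of 0/1 votes from separately trained RFs
    (n_models = 11 in this work).
    n_votes: the adjustable threshold -- lower it for greater sensitivity,
    raise it for greater specificity. The default (6 of 11) is a simple majority."""
    return (votes.sum(axis=0) >= n_votes).astype(int)
```

Because the threshold enters only at this final step, the anaesthesiologist can dial Nvotes in real time without retraining any of the underlying models.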

2.2.2.1 Performance evaluation

Predictive performance was evaluated in terms of standard binary hypothesis-testing metrics of TP predictions (correctly predicting an IOH event within the forecast interval), TN predictions (correctly predicting the absence of an IOH event), FP predictions (incorrectly predicting an IOH event), and FN predictions (incorrectly predicting that an IOH event will not occur). Corresponding metrics of Accuracy, Sensitivity, Specificity, positive predictive value (PPV), and negative predictive value (NPV) were computed. Receiver operating characteristic (ROC) curves were evaluated by varying the probability threshold (from 0 to 1) above which model predictions were considered positive, and the area under the ROC curve (AUC) was evaluated by numerical integration. The resulting distributions were compared using a non-parametric Mann–Whitney U test computed in Python, with p-value < 0.05 taken as statistically significant.
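The metrics above follow directly from the TP/TN/FP/FN counts, as sketched below (a minimal illustration assuming each denominator is non-zero):

```python
import numpy as np

def binary_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Standard binary hypothesis-testing metrics from TP/TN/FP/FN counts.
    Assumes both classes and both prediction outcomes are represented,
    so that no denominator is zero."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "Accuracy": (tp + tn) / len(y_true),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```

Note that Sensitivity and PPV require true positives by definition, which is why (as noted in Fig. 7) they can only be computed on positive cases.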

3 Results

3.1 Study cohort

The study cohort is summarized in Fig. 3. Inclusion criteria yielded a total of 918 cases undergoing open liver lobectomy. Of these, 723 were partial lobectomy, 98 were left lobectomy, and 97 were right lobectomy. Median age at the time of surgery was 55.8 years (range 20–90, Fig. 3a), median BMI was 27.7 (range 20.0–35.2, Fig. 3b), and males constituted 70% of cases (642/918, Fig. 3c). At least one IOH event was observed (at least 30 min after arterial line placement and sustained for at least 2 min) in 85% of cases (783/918, Fig. 3d), and the remaining 15% (135/918) exhibited MAP > 65 mm Hg for the duration of their case. Among the 783 positive cases, 73% (570/783) were male, approximately consistent with the male:female proportion in cases overall.

Fig. 3

Study demographics. a A large fraction of positive cases was evident in the data, consistent with clinical experience. b A high propensity of males was evident in the study population, with c a similar fraction of male positive cases. d Negative cases were more evenly balanced between male and female subjects. e BMI and f age were consistent with the study population of patients receiving liver lobectomy at our institution

3.2 Parameter selection and model variants

Among the basic model variations investigated, there was no statistically significant difference in performance with and without signal normalization (p = 0.132), consistent with the scale invariance of the underlying RF approach; therefore, models were trained without signal normalization. As anticipated, a statistically significant improvement was observed for the 90:10 over the 80:20 train:test split (p = 0.002), attributable to the larger training set, and the former was used throughout. There was no evidence of a statistically significant difference in performance between the 8-min forecast interval (i.e., variable 3–8 min) and the fixed 5-min interval (p = 0.242), and the former was used throughout in the interest of increased variety in the training set. The three variations in class balance resulted in: (i) 6× imbalance in favor of positive cases (783) vs negative cases (135); (ii) balanced datasets via “3× sampling” of 135 negative cases (giving 270 negative datasets) plus 513 negative datasets drawn from positive cases sampled at least 30 min prior to the IOH event; and (iii) balanced datasets via “6× sampling” of negative cases to yield 783 negative and positive datasets. As expected, the imbalanced dataset exhibited a statistically significant reduction in performance compared to the 3× and 6× sampling (average AUC = 0.84 compared to 0.91 (p = 0.002) and 0.97 (p = 0.003), respectively); the 6× sampling dataset was used throughout to achieve class balance while mitigating bias in multiple sampling of negative datasets assumed to be independent.

3.3 Single-model performance

As a starting point, the performance of the MAP-only univariate model (i.e., a single RF model with MAP as the sole input feature in the relatively simple static scenario) is shown in Fig. 4. Performance overall is modest, as previously reported, exhibiting average AUC = 0.83 (over 11-fold) and median Accuracy = 0.73, Sensitivity = 0.69, Specificity = 0.79, PPV = 0.77, and NPV = 0.71 [30].

Fig. 4

Performance of the MAP-only univariate RF model in the static prediction scenario. a ROC curve and average AUC over 11-fold. b Boxplots showing median (horizontal bar), interquartile range (shaded rectangle), range (vertical bars), and swarm (11 points corresponding to 11-fold in cross validation) in AUC, Accuracy, Sensitivity, Specificity, PPV, and NPV

By way of comparison, the performance of the multivariate RF model is shown in Fig. 5, also in the static scenario. Performance overall is markedly improved, demonstrating AUC = 0.97 (average over 11-fold) and median Accuracy = 0.95, Sensitivity = 0.86, Specificity = 0.93, PPV = 0.94, and NPV = 0.88. The importance of individual features in the model is shown in Fig. 5c, where the horizontal axis denotes the ablated feature. Age and BMI showed the greatest importance in the prediction, followed by a combination of hemodynamic features (Art Line S, Art Line D, and MAP), respiratory features (Tidal Volume and Resp), and anaesthetic delivery (Inspired Desflurane). Other features individually exhibited less influence on the model but were maintained, since they may contribute in aggregate.

Fig. 5

Performance of the multivariate RF model in the static prediction scenario. a ROC curve and average AUC over 11-fold of the model. b Boxplots showing median (horizontal bar), interquartile range (shaded rectangle), range (vertical bars), and swarm (11 points corresponding to 11-fold in cross validation) in AUC, Accuracy, Sensitivity, Specificity, PPV, and NPV. c Single-feature ablation studies, showing the Accuracy following elimination of the feature shown on the horizontal axis. “NONE” corresponds to the full model (all features)

While the accuracy exhibited in Fig. 5 is promising, an IOH prediction model deployed as a real-time early-warning system must operate with a high degree of PPV to avoid false alarms. Performance of the multivariate RF model in the more challenging, dynamic sliding window scenario is shown in Fig. 6, where testing spanned a prolonged period, t = 1–40 min, within which the first 30 min was truly negative, prior to a forecast interval up to 8 min and a possible instance of IOH at t = 40 min. Analysis was performed on ten positive test cases, amounting to 11 × 30 = 330 negative samples and 11 × 10 = 110 positive samples. A substantial drop in overall performance is evident: average AUC dropped from 0.97 to 0.84, median accuracy from 0.95 to 0.86, specificity from 0.93 to 0.88, and PPV from 0.94 to 0.63. The large drop in PPV in the dynamic scenario particularly motivated development of the MMV approach, below.

Fig. 6

Multivariate RF model prediction models. Performance of multivariate RF model predictions in the dynamic, sliding window scenario in which the preponderance of instances is truly negative (t = 1–30 min) prior to a possibly positive event at t = 40 min. The reduced median performance and broader range in PPV compared to Fig. 5 motivated the MMV approach. Boxplots depict the median, IQR, and range, with the 110-point swarm overlay corresponding to (11 models) × (10 positive test cases)

3.4 Multi-model (MMV) performance

The MMV approach runs 11 separately trained RF models in parallel and evaluates the resulting 11 votes to yield a prediction. Moreover, the sensitivity and specificity of the MMV approach can be controlled by adjusting (“dialing”) the Nvotes threshold for positive prediction. Figure 7 summarizes the MMV performance and influence of the adjustable threshold in the dynamic sliding window scenario. The 11 ROC curves in Fig. 7a show the dependence on Nvotes—largely unaffected over the range Nvotes = 1–5 and decreasing for Nvotes > 6, illustrated further in terms of AUC in Fig. 7b. Figure 7c and d show the anticipated trade-offs in sensitivity and specificity with adjustment of Nvotes. In the dynamic scenario, for which negative samples outnumber positives by at least a factor of 3, Fig. 7f–g show PPV and NPV to be optimal in the range Nvotes = 6–8.

Fig. 7

Performance of the MMV approach. The figure illustrates the performance of the MMV approach in the dynamic sliding window scenario with variable Nvotes (the number of votes required for a positive prediction). a ROC curves and b AUC for 11 settings of Nvotes. Trade-offs in c sensitivity and d specificity measured as a function of Nvotes. e Accuracy, f PPV, and g NPV measured as a function of Nvotes. The boxplots depict median, IQR, and range, with swarm overlay corresponding to ten positive cases (noting that Sensitivity and PPV can only be computed for positive cases, since they require true-positives by definition)

Finally, the performance of the MMV approach was evaluated in the dynamic scenario with a nominal value of Nvotes = 6 (out of 11—i.e., a simple majority), as summarized in Fig. 8. Compared to Fig. 5, this scenario presents a more challenging, realistic context wherein the preponderance of instances is truly negative. Compared to the single-model approach in Fig. 6, MMV demonstrated improved AUC (0.96) and median Accuracy (0.98), Sensitivity (1.0), Specificity (0.96), PPV (0.89), and NPV (0.98). The improvements are statistically significant (p < 0.05) compared to the single-model predictions (Fig. 6).

Fig. 8

Performance of the MMV approach (with Nvotes = 6; simple majority) in the dynamic sliding window scenario. a ROC curve. b Boxplots of AUC, Accuracy, Sensitivity, Specificity, PPV, and NPV, showing median, IQR, and range, with swarm overlays corresponding to the ten positive cases

4 Discussion

Liver resections have a high incidence of hemodynamic disturbances, including IOH, as indicated by our work and others, likely related to large fluid shifts, extreme positioning, and high use of vasopressors [4, 31,32,33]. Machine learning methods have been successfully employed to predict IOH in various clinical settings outside of liver surgery. Although a recent study demonstrated that the HPI system offers potential benefits to reduce IOH compared to goal-directed hemodynamic therapy, liver surgeries were excluded from the analysis [34]. The research reported above represents a novel use of multivariate ML models employing forward analysis of variables (as recommended [22]) to predict the first episode of IOH during open liver resections, integrating demographic data (age, BMI, and sex) with 13 physiological time series signals recorded at 1-min intervals with the aid of invasive blood pressure monitoring.

In this research, the multivariate model outperformed a univariate (MAP-only) model, and a fairly wide variety of hemodynamic and respiratory time-series signals were important to reliable prediction. This finding contrasts with a recent study indicating that MAP may perform as well as the HPI at set thresholds of 72 or 73 mmHg [30]. The dynamic sliding window scenario presented the real-world prevalence challenge of sequential data for which a preponderance of the data is truly negative for IOH, and a novel MMV technique was developed to maintain PPV under such conditions. The MMV approach also presents the intriguing capability for the anaesthesiologist to dial the sensitivity and specificity of the model according to their judgment with respect to factors related to the patient or phase of the procedure—a subject of future work.

Based on AUC, the HPI has shown the highest performance in internal validation cohorts at 5 min, with the lowest being Jacquet-Lagrèze’s model [16, 17, 20, 35]. Compared to those models, our MMV strategy demonstrated a relatively high performance (AUC = 0.96). However, a fair comparison between available models and our strategy is difficult because of differences in the clinical data used as predictors, the type of surgical patient population, and ML strategies [19, 36].

Davies et al. used a cohort of patients from nine previous studies and compared data using either a backward approach with a gray zone or a forward approach without a gray zone [22]. The latter strategy (forward without a gray zone) is similar to the approach taken in our work; however, there are important distinctions. First, we used only data relating to an intraoperative context, whereas Davies et al. used a mixed population of operative and ICU patients. Second, Davies et al. included data from invasive and non-invasive arterial waveform analysis using HPI, whereas our work only used recording from invasive arterial blood pressure without waveform patterns. Lastly, Davies et al. observed in their forward approach without a gray zone a low PPV (0.52), whereas the MMV approach reported above maintains high PPV and suggests that our results could be more clinically actionable in predicting whether a particular patient will truly develop IOH.

Additionally, previous work used intraoperative arterial waveform analysis, such as the HPI model, to predict IOH [17, 20]. More recently, Hwang et al. used local trends of arterial blood pressures to show how certain waveform shapes were associated with IOH. The study showed good predictive performance based on the reported AUC (> 0.9) [37]. The rationale for using arterial waveform contour lies in identifying abnormal compensatory mechanisms, via changes in the waveform, before IOH develops. However, pulse contour analysis technology needs frequent calibration in patients with low systemic vascular resistance or after changes in vasopressor dosage [38]. We have not used waveform contour analysis in the current work. Rather, the multivariate ML model described above incorporated demographic data in combination with arterial blood pressure values, respiratory parameters, and anaesthetic time series data for prediction, instead of a single physiological parameter [17, 20].

Using a deep learning convolutional neural network, Lee et al. constructed several prediction models of IOH in a conglomerate of non-cardiac surgical patients employing (a) an arterial-pressure-only model, (b) an invasive multichannel model (arterial blood pressure, electrocardiograph, photoplethysmography, and capnography), (c) a photoplethysmography-only model, (d) a non-invasive multichannel model, and (e) a non-invasive hybrid model [16]. Lee’s invasive multichannel model shares some of the physiological parameters used in the prediction model reported above, demonstrating AUC = 0.91, sensitivity = 0.86, and specificity = 0.86 [16]. The MMV model reported here yielded a somewhat higher AUC (0.96), recognizing the need for external validation. Hwang et al. [37] constructed a model with an interpretable predictor and interpretable methods. The model showed high performance in the internal validation phase, which was somewhat reduced in external validation. Significant differences between the work of Hwang et al. and that reported above include the use of a mixed population of patients, waveform shape analysis and Fourier transform for each blood pressure cycle, and use of a “gray zone” between MAP = 65 mmHg and 75 mmHg [37].

Feature ablation analysis shed light on the importance of the multiple intraoperative variables contributing to the model developed in this work. The three highest-ranking parameters in the prediction model were age, BMI, and diastolic pressure. Age and BMI variables have been previously shown to provide value in predicting postinduction hypotension; of course, they do not lend themselves to real-time monitoring/early warning systems in the operating room [39].

Accurate alerts on potentially imminent IOH should allow anaesthesiologists to intervene and improve postsurgical clinical outcomes. It could also be of value to report the relative uncertainty associated with the prediction (e.g., a score from 1 to 11 corresponding to Nvotes in the MMV approach). The ability to control the sensitivity and specificity of the model by adjusting the Nvotes threshold opens an interesting possibility that warrants further investigation. In clinical application, the anaesthesiologist could "dial up" the number of votes required for a relatively healthy surgical patient, accepting some tolerance of hypotension, while "dialing down" the threshold for a patient with multiple comorbidities, in whom earlier intervention is warranted.
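The adjustable-threshold voting described above can be sketched as follows, assuming an ensemble of 11 binary classifiers as in the MMV approach (the function and variable names are illustrative):

```python
# Minimal sketch of threshold-adjustable majority voting (Nvotes),
# assuming 11 ensemble members each casting a 0/1 vote for imminent IOH.
from typing import Sequence

def mmv_predict(votes: Sequence[int], n_votes_threshold: int = 6) -> int:
    """Raise an IOH alarm (return 1) if at least n_votes_threshold of
    the ensemble members vote positive; the raw vote count (1-11) can
    itself be reported as a relative confidence score."""
    return int(sum(votes) >= n_votes_threshold)

# Lowering the threshold favors sensitivity; raising it favors specificity.
votes = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 8 of 11 members vote positive
assert mmv_predict(votes, n_votes_threshold=6) == 1  # sensitive setting
assert mmv_predict(votes, n_votes_threshold=9) == 0  # specific setting
```

The same vote count supports both the alarm decision and the uncertainty report, since it is a direct measure of ensemble agreement.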

Limitations of the work reported above include (a) the retrospective study design, which may lead to predictive bias, (b) the arbitrary but well-accepted definition of IOH (< 65 mm Hg sustained for at least 2 min), (c) ML methods that can be subject to bias introduced in training data, (d) the prediction of the first IOH event only, and (e) limitation to internal validation, with testing on external datasets recognized as an essential area of future work. Class imbalance intrinsic to the data (6× in favor of positive cases, consistent with the prevalence of IOH in these surgeries) necessitated a combination of patient-level and segment-level splitting, with 6× sampling of negative data intervals (non-overlapping, separated by at least 20 min) to achieve class balance. Segment-level splitting in training for the "static" scenario is acknowledged to carry potential bias, and larger datasets are warranted in future work to enable a larger volume of training data balanced at the patient level. However, the "sliding window" scenario involved testing with strict patient-level splitting in 20 unseen cases, confirming performance from the "static" scenario and suggesting that possible leakage effects were minimal.
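The negative-interval sampling constraint (non-overlapping segments separated by at least 20 min) can be sketched as a simple greedy selection; the segment length and function name below are assumptions for illustration, not the study's actual parameters:

```python
# Illustrative sketch (not the authors' code) of picking non-overlapping
# negative segments separated by at least gap_min minutes, as used to
# balance classes 6:1 in training.
def sample_negative_starts(record_len_min: int, n_segments: int,
                           seg_len_min: int = 5, gap_min: int = 20) -> list:
    """Greedily pick up to n_segments start times (minutes from record
    start) so consecutive segments do not overlap and are separated by
    at least gap_min minutes."""
    starts, t = [], 0
    while len(starts) < n_segments and t + seg_len_min <= record_len_min:
        starts.append(t)
        t += seg_len_min + gap_min  # skip segment length plus required gap
    return starts

starts = sample_negative_starts(record_len_min=180, n_segments=6)
assert starts == [0, 25, 50, 75, 100, 125]
assert all(b - a >= 20 for a, b in zip(starts, starts[1:]))
```

A record too short to accommodate all six spaced segments would simply yield fewer negatives, one reason larger datasets would ease patient-level balancing.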

Previous work on perioperative outcomes suggests an interaction between IOH, deep levels of hypnosis, and low minimum alveolar concentration (MAC) [27]. Our work included “anaesthetic dose” and depth of hypnosis as variables; however, we did not have MAC data available in our data registry. Furthermore, we only included patients receiving desflurane as the main volatile anaesthetic for the maintenance of hypnosis. In addition, despite including heart rate as an input variable in the model, our algorithm does not differentiate between endotypes of IOH [40].

Our model has shown that it is capable of predicting an IOH event up to 8 min before the event, which we acknowledge to be a somewhat shorter window overall than that used by the HPI (5, 10, and 15 min) [15]. Recognizing the flexibility of the MMV approach described above, future work could consider developing short-, medium-, and long-range predictors (e.g., short-range models optimized for sensitivity via a reduced Nvotes threshold, and long-range models optimized for specificity via a higher Nvotes threshold). Future work will also address how the strategy can be adapted to multiple instances of IOH, to response to treatment, and to specific patient populations.

In conclusion, the rate of IOH is high in patients undergoing liver resection surgery: 85% in this cohort. The overall performance of the MMV predictive model was high, achieving AUC = 0.96, median PPV = 0.89, and median NPV = 0.98 within the challenging context of a dynamic sliding window, in which the preponderance of data is normal (truly negative for IOH). Further research is needed to compare the model with proprietary algorithms, such as that used by the HPI, and to apply external validation in a larger, more complex population of patients. Furthermore, research is needed to determine whether our IOH prediction model would be superior to prompt management of IOH.