1 Introduction

Operating rooms (ORs) are some of the most expensive units of a hospital, so careful management is essential for efficient utilisation of these rooms (Cardoen et al. 2010). We focus on a particular operational aspect of managing the use of ORs – the OR scheduling problem, i.e., determining the allocation of each elective surgery to a surgical list of an OR session. This allocation requires an estimate of the surgery duration, which may be highly variable. A surgery duration is influenced by many factors such as the surgical procedure, the patient’s physical condition, the surgeon’s experience, the number of supporting staff available, and the type of anaesthesia administered. In addition to this inherent variability, a lack of information at the time of OR scheduling can contribute to a surgical list being scheduled unintentionally with too few surgeries (an under-booked list) or too many surgeries (an over-booked list). Under- or over-booked lists may have undesirable consequences for room outcomes, including reduced patient throughput, cancellation of surgeries, and overtime.

In this paper, we introduce a new performance measure referred to as the OR scheduling metric for evaluating surgical lists within OR scheduling. This measure accounts for the variability of surgery durations and is applicable across different session durations. This metric can be useful as a tool for OR scheduling by users (e.g., booking clerks) or for future incorporation within sophisticated OR scheduling algorithms.

1.1 Literature review

The research question considered here is: “Can a single metric evaluating surgical lists balance the probability of list overrun with the magnitude of list overrun?” To answer this question, we first explore in this literature review previous work related to surgery scheduling, surgery durations and surgical list metrics.

Guerriero and Guido (2011) present a structured literature review on the use of operational research for surgical planning and scheduling processes. In their paper, they categorise three hierarchical decision levels for the management of ORs. The first level is “strategic” – which concerns the overall distribution of room times among different surgical specialities. The second level is “tactical” – which concerns the development of a surgery schedule that rosters surgeons and allocates durations for each OR session (see Footnote 1). Finally, the third level is “operational” – which concerns the scheduling of patients requiring an elective surgery to sessions, i.e., OR scheduling.

Several approaches in the literature use estimates of surgery durations to create surgical lists during OR scheduling. These approaches regard the surgery durations as either deterministic or stochastic. As an example of approaches using deterministic durations, Vijayakumar et al. (2013) model the surgical scheduling problem as a dual bin-packing problem, which can be formulated as a mixed integer programming model, with the items being surgeries with deterministic durations and the bins being OR sessions. Vijayakumar et al. report increases in room utilisation rates and the number of scheduled surgeries. An example of approaches using stochastic surgery durations is Pandit and Tavare (2011) – together with the subsequent correction by Proudlove et al. (2013) – which uses the means and pooled standard deviations of the 10 most recent and relevant surgery durations to calculate the probability that a surgical list (within an OR session) will not overrun. Pandit and Tavare demonstrate an improvement in OR scheduling using their proposed approach as compared to the ad hoc method (i.e., no fixed algorithm) used by the hospital being studied.

After surgeries are performed, the room outcomes are evaluated. Samudra et al. (2016) identify several performance measures, such as room utilisation rates and waiting times for surgeries, for evaluating these room outcomes. The choice of performance measures for evaluation depends on the priorities agreed upon by the various hospital stakeholders. These performance measures have been broadly classified by Oh et al. (2011) as either hospital- or patient-centric. The stochastic treatment of surgery durations, also demonstrated in Marques and Captivo (2017) and Kroer et al. (2018), is preferred to improve the planning of surgical lists. However, the use of a probability-based measure to evaluate surgical lists, such as in Pandit and Tavare (2011), has the limitation that the probabilities of list overruns do not reflect the severity of the overrun whenever it occurs. To illustrate this, we consider two hypothetical scenarios.

  1. Scenario A.

    An overrun is almost guaranteed to occur but the excess duration is capped at a maximum of 1 minute.

  2. Scenario B.

    An overrun is predicted to occur with a 50% probability but the excess duration is guaranteed to be at least 1 hour when an overrun occurs.

Readers may agree that Scenario A is preferable to Scenario B despite the higher probability of overrun for Scenario A, and others may additionally suggest that an expectation-based measure be used concurrently with a probability-based measure for evaluations. While the use of multiple measures is a feasible suggestion, the subsequent evaluations are more complicated. For example, it is more difficult to rank a variety of scenarios similar to A and B in order of preference when more measures are involved.

The underlying form of the OR scheduling metric resembles that used within the context of healthcare (see Xie et al. 2017; Zhang et al. 2020) and also beyond (see Aumann and Serrano 2008; Hall et al. 2015; Jaillet et al. 2016; Zhang and Tang 2018).

The OR scheduling metric addresses a gap in the way surgical schedules are planned operationally. It combines stochastic estimates of surgical schedules into a single metric that balances probability- and expectation-based considerations. This single practical metric will be used when allocating surgeries to surgical lists.

1.2 Paper structure

We provide further information and a formal definition of the OR scheduling metric in Sect. 2. In Sect. 3, we apply the OR scheduling metric to evaluate historical surgical lists and show how the metric relates to the utilisation of an OR session. After that, in Sect. 4, we perform a simulation of OR scheduling based on (a subset of) the same historical surgical lists in order to compare the OR utilisation outcomes when using a traditional approach or the OR scheduling metric. Section 5 provides information about the benchmarking of the OR scheduling metric in practice. Finally, we provide a few concluding remarks in Sect. 6.

Readers may wish to skip Sects. 2 and 3 – which explore the mathematics behind the OR scheduling metric – and proceed directly to Sects. 4 to 6 if they are interested in the practical aspects of the OR scheduling metric for managing ORs and the managerial insights on this metric.

2 The OR scheduling metric

Booking clerks have a challenging task in OR scheduling to create surgical lists that achieve desirable outcomes. These outcomes include minimising the number and length of session overruns, maximising room throughput, and adhering to policies such as waiting time for elective surgery. The booking clerks set their criteria for deciding when a surgical list is acceptable. In this section, we describe an approach that evaluates a surgical list plan based on its expected duration. For a surgical list of an OR session, let L be the random variable for the surgical list duration, which is the cumulative duration of all events, including turnovers (see Footnote 2), that are scheduled to occur. A typical surgical list duration comprises surgery durations and turnover durations. Also, let d be the session duration, which is the allocated duration of the OR session. Suppose that the ideal outcomes for room utilisation lie in the interval \(d - \epsilon _{l} \le L \le d + \epsilon _{u}\), where \(-\epsilon _{l}\) and \(\epsilon _{u}\) are, respectively, the lower and upper limits for the deviations from d that are deemed acceptable. We shall neglect the lower bound \(d - \epsilon _{l} \le L\) as we are considering the scenario where there is limited session time available to meet the demands for elective surgery services, i.e., high demand or limited OR sessions or both, so an OR should not frequently experience under-utilisations.

During the evaluation of the surgical lists, a realisation of L may be estimated by \(\mathbb {E}[L]\), the mean duration of L. The mean duration is readily available as it can be estimated using the durations of related historical events. Furthermore, this estimation does not require any knowledge about the population distribution of L. The mean duration is easy to use as a decision tool, as the booking clerks only need to ensure that \(\mathbb {E}[L] \le d + \epsilon _{u}\) for the evaluation of surgical lists. However, these estimates do not take into account the variability of L, which could lead to a high proportion of session overruns. A few possible remedies that account for the variability of L are

  1. to slightly inflate the values of \(\mathbb {E}[L]\), i.e., \(\mathbb {E}[L] + k\), where a positive k is chosen at the discretion of the booking clerks, and/or

  2. to use various measures that consider the variability of L, such as \(\text {Pr}(L > d + \epsilon _{u})\), which is the probability of an unacceptable session overrun, and \(\mathbb {E}[L - (d + \epsilon _{u}) \;\vert \; L > d + \epsilon _{u} \;]\), which is the expected duration of an unacceptable session overrun given that one has occurred.

These remedies may also be considered during OR scheduling in the evaluation of surgical lists, such as ensuring that \(\mathbb {E}[L] + k \le d + \epsilon _{u}\) for the first remedy and that the values of \(\text {Pr}(L > d + \epsilon _{u})\) and \(\mathbb {E}[L - (d + \epsilon _{u}) \;\vert \; L > d + \epsilon _{u} \;]\) are within acceptable thresholds for the second remedy. These remedies are used, for example, in Jebali et al. (2006) and Hans et al. (2008). To forecast desirable room outcomes, several of these performance measures may be used simultaneously. However, the values from multiple performance measures are usually more complicated to interpret and analyse as compared to values calculated from a single performance measure.
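As an illustration only, the two remedies can be evaluated on a sample of list durations, e.g., historical or simulated realisations of L. The following Python sketch is not taken from the cited papers; the function name `evaluate_list`, the sample values and the choices of k and \(\epsilon _{u}\) are all hypothetical.

```python
from statistics import mean

def evaluate_list(durations, d, eps_u, k=0.0):
    """Evaluate a surgical list from sampled list durations (in minutes).

    durations: sample of realisations of L (hypothetical data)
    d: session duration; eps_u: acceptable overrun; k: inflation used in remedy 1
    """
    limit = d + eps_u
    # Overrun amounts beyond the acceptable limit, for the samples that overrun.
    excess = [x - limit for x in durations if x > limit]
    return {
        # Remedy 1: inflated mean duration must fit within d + eps_u.
        "remedy1_ok": mean(durations) + k <= limit,
        # Remedy 2a: empirical estimate of Pr(L > d + eps_u).
        "pr_overrun": len(excess) / len(durations),
        # Remedy 2b: empirical estimate of E[L - (d + eps_u) | L > d + eps_u].
        "mean_excess": mean(excess) if excess else 0.0,
    }

sample = [200, 220, 235, 250, 270]  # hypothetical list durations
result = evaluate_list(sample, d=240, eps_u=5, k=10)
print(result)
```

A booking clerk using the second remedy would still need to decide on separate thresholds for `pr_overrun` and `mean_excess`, which is the complication that motivates a single measure.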

We propose the OR scheduling metric that incorporates expectation and probability in a single measure for surgical lists. Within the context of healthcare, the OR scheduling metric resembles other decision criteria in the literature such as the “entropic bed shortage metric” in Xie et al. (2017) – see Eq. (1) – and the “maximum risk aversion level” in Zhang et al. (2020) – see Eq. (2). We present their metric structures to demonstrate their similarities to the metric presented here and direct interested readers to the referenced papers for full definitions and explanations of each of the metrics. The two metrics are as follows.

$$\begin{aligned}&\rho (\tilde{z}) = \text {inf}\left\{ \alpha > 0 \mid \alpha \text {ln}\mathop {{}\mathbb {E}}[\text {exp}(\tilde{z}/\alpha )] \le 0 \right\} \end{aligned}$$
(1)
$$\begin{aligned} \begin{array}{r} f_{R}\bigl (\tilde{\xi }(\varvec{x})\bigr ) = \text {sup} \bigl \{ \alpha > 0 \mid (1/\alpha )\text {ln}\mathop {{}\mathbb {E}}[\text {exp}(\alpha \,\tilde{\xi }_{r,t}(\varvec{x}))] \le 0, \\ \forall r \in R, t \in T \bigr \} \end{array} \end{aligned}$$
(2)

Xie et al. (2017) demonstrate how their proposed metric can be used to study the risk of bed shortages in a variety of approaches (descriptive, predictive and prescriptive analytical) and scenarios (e.g., presence of long stayers and changing demographics of the aged population). Zhang et al. (2020) incorporate their decision criterion in a stochastic programming model that studies the tactical surgical scheduling problem in order to maximise the ability to resist overtime risk for each time period with a fixed duration (i.e., mixed surgical sessions are not considered). For the two papers, both the theory and the numerical experiments based on real surgery data are used to demonstrate the potential of their proposed metrics. Similar to these two decision criteria, there are multiple advantages to our proposed OR scheduling metric:

  1. it is computationally easy and inexpensive to implement;

  2. it allows for side-by-side evaluations of surgical lists from different OR session durations; and

  3. it eliminates the need to simultaneously manage multiple (probability and expectation) measures.

The remainder of this section is organised as follows. In Sect. 2.1, we define the OR scheduling metric. We demonstrate the OR scheduling metric under normality assumptions in Sect. 2.2. After that, we state several desirable properties of the OR scheduling metric in Sect. 2.3 and demonstrate how our proposed metric may simplify the evaluation of surgical lists in Sect. 2.4.

2.1 Definition of the OR scheduling metric \(\rho (S)\)

For a surgical list, let L and d be the random variable for the surgical list duration and the parameter for the OR session duration respectively. We denote the random variable that represents the shortage position of an OR session by S, where \(S = L - d\). Let \(\mathbb {S}\) be the set of random variables that represent the shortage positions of all OR sessions. The OR scheduling metric \(\rho : \mathbb {S} \rightarrow [0, \infty ]\) quantifies the risk associated with the planned surgeries for a surgical list, or more precisely the risk of session overruns under uncertainty at the time of planning, and is defined by

$$\begin{aligned} \rho (S)&= \text {inf}\{\alpha > 0 \mid \mu _{\alpha }(S) \le 0 \}, \end{aligned}$$
(3)

where \(\mu _{\alpha }\) is given by

$$\begin{aligned} \mu _{\alpha }(S)&= \alpha \ln \mathbb {E}[ \exp (S/\alpha )]. \end{aligned}$$
(4)

Note that the convention inf \(\emptyset = \infty\) is used for the OR scheduling metric. The two extremes where a session overrun is expected (or respectively, unlikely) to occur are \(\rho (S) = \infty\) (or respectively, \(\rho (S) = 0\)).

We remark that the expression for \(\mu _{\alpha }\) is almost identical to that used in Eq. (1). In fact, the only difference is that the random variable S in Eq. (4) is continuous, whereas the random variable \(\tilde{z}\) in their paper is discrete.

Next, we give an intuition for the expressions of \(\mu _{\alpha }\) and \(\rho\). In the introduction of Sect. 2, we define a realisation of the random variable L (the surgical list duration) and note that it may be estimated by \(\mathbb {E}[L]\). For the OR scheduling metric, we instead estimate a realisation of the transformed scaled shortage position for an OR session \(\exp (S/\alpha )\) by \(\mathbb {E}[\text {exp}(S/\alpha )]\). By setting

$$\begin{aligned} \text {exp}\left( \frac{\mu _\alpha (S)}{\alpha }\right) = \mathbb {E}\left[ \text {exp}\left( \frac{S}{\alpha }\right) \right] , \end{aligned}$$

we observe that \(\mu _\alpha (S)\) is an estimate of S via the transformation and scaling that depends on \(\alpha\). With a little manipulation, we can get the definition of \(\mu _\alpha (S)\) given in Eq. (4). The OR scheduling metric \(\rho\) finds the smallest positive scaling factor \(\alpha\) with \(\mu _\alpha (S) \le 0\), i.e., the smallest scaling that provides an estimate of S in which the OR session is not expected to experience an overrun. Thus, the “scaled” cumulative duration estimate of scheduled events fits within the session duration even though the scaling \(\alpha\) is not dependent on the OR session duration. Hence, it enables the comparison of surgical lists for OR sessions of different durations.
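For completeness, the manipulation can be written out using only the definitions above: taking natural logarithms of both sides and then multiplying by \(\alpha > 0\) gives

$$\begin{aligned} \text {exp}\left( \frac{\mu _\alpha (S)}{\alpha }\right) = \mathbb {E}\left[ \text {exp}\left( \frac{S}{\alpha }\right) \right] \;&\Longleftrightarrow \; \frac{\mu _\alpha (S)}{\alpha } = \ln \mathbb {E}\left[ \text {exp}\left( \frac{S}{\alpha }\right) \right] \\&\Longleftrightarrow \; \mu _\alpha (S) = \alpha \ln \mathbb {E}\left[ \text {exp}\left( \frac{S}{\alpha }\right) \right] , \end{aligned}$$

which is the expression in Eq. (4).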

Values of \(\mu _{\alpha }(S)\) and \(\rho (S)\) can be obtained even if the population distribution of S is unknown; we can use the values from a dataset that forms an empirical distribution of S and apply the definition of expected values. If the population distribution of S follows the normal distribution, then we can determine the expressions for \(\mu _{\alpha }(S)\) and \(\rho (S)\) analytically. We shall derive these expressions in the next subsection.

2.2 Computing \(\rho (S)\) under normality assumptions

In the special case where S and the turnovers follow the normal distribution, we use the moment generating functions to derive a closed-form expression for \(\rho (S)\). We consider three alternatives for the definition of \(\rho (S)\) with the derivation of the formulae given in supplementary information. The alternatives are:

  1. Turnovers excluded

    Let \(T_{i} \sim \text {N}\left( \mu _{i},\sigma _{i}^{2}\right)\), \(i \in \{1,2,\ldots , I\}\) be the duration for surgery i. Assume that any two surgery durations are mutually independent. Then, letting \(\overline{l}\) be the mean duration of L, where \(L = \sum _{i=1}^{I} T_{i}\) in this case, we obtain:

    $$\begin{aligned} \overline{l}&= \sum _{i=1}^{I} \mu _{i}, \\ \rho (S)&= {\left\{ \begin{array}{ll} \dfrac{\sum _{i=1}^{I} \sigma _{i}^{2} }{2\bigl ( d - \overline{l} \bigr )} \;&{}\text {if } d - \overline{l} > 0,\\ \infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
    (5)
  2. Turnovers normally distributed and depend only on preceding surgery

    This is an alternative where only clean up from the previous surgery is considered. The additional assumptions/definitions are:

    • \(C_i, i \in \{1,2,\ldots , I-1\}\) is the turnover cleaning duration after surgery i;

    • \(C_{i} \sim \text {N}\left( \mu _{Ci},\sigma _{Ci}^{2}\right)\);

    • \(T_{i} + C_{i}\) is normal for each i;

    • the covariance between \(T_{i}\) and \(C_{i}\) is \(\delta _{Ci}\), and the variance of \(T_{i} + C_{i}\) is \(\sigma _{i}^{2} + \sigma _{Ci}^{2} + 2\delta _{Ci}\).

    It follows that

    $$\begin{aligned} \begin{array}{l} \quad \;\;\overline{l} = \sum\limits_{i=1}^{I}\mu _{i} + \sum\limits_{i=1}^{I-1}\mu _{Ci}, \\ \rho (S) = \left\{ \begin{array}{l} \frac{\sum _{i=1}^{I-1} \bigl ( \sigma _{i}^{2} + \sigma _{Ci}^{2} + 2 \delta _{Ci} \bigr ) + \sigma _{I}^{2}}{ 2 \bigl ( d - \overline{l} \bigr )} \\ \qquad\text { if } d - \overline{l} > 0, \\ \infty \quad\text { otherwise.} \end{array} \right. \end{array} \end{aligned}$$
    (6)
  3. Turnovers normally distributed and depend on both preceding and upcoming surgery

    This alternative considers both clean up (from the previous surgery) and preparation (from the upcoming surgery). The additional assumptions/definitions are:

    • \(P_{i} \sim \text {N}\left( \mu _{Pi},\sigma _{Pi}^{2}\right) , i \in \{2,3,\ldots , I\}\) is the preparation time for surgery i;

    • assume for each i that \(C_{i-1}\) and \(P_{i}\) are independent, and that \(P_{i} + T_{i}\) is normal;

    • \(\delta _{Pi}\) is the covariance between \(P_{i}\) and \(T_{i}\).

In this case,

$$\begin{aligned} \begin{array}{l} \quad\;\;\overline{l} = \sum\limits_{i=1}^{I}\mu _{i} + \sum\limits_{i=1}^{I-1}\mu _{Ci} + \sum\limits_{i=2}^{I}\mu _{Pi}, \\ \rho (S) = \left\{ \begin{array}{lll} \frac{\sum _{i=1}^{I} \sigma _{i}^{2} + \sum _{i=1}^{I-1} \bigl ( \sigma _{Ci}^{2} + 2 \delta _{Ci} \bigr )}{2 \bigl ( d - \overline{l} \bigr )} \\ \quad + \frac{\sum _{i=2}^{I} \bigl ( \sigma _{Pi}^{2} + 2 \delta _{Pi} \bigr ) }{2 \bigl ( d - \overline{l} \bigr )} \quad \text {if } d > \overline{l} \\ \infty \qquad\qquad\qquad\quad\;\; \text {otherwise.} \end{array} \right. \end{array} \end{aligned}$$
(7)

Note that, for all alternatives, the condition \(\overline{l} \ge d\) corresponds to the over-booking of surgeries for an OR session and results in \(\rho (S) = \infty\). Beaulieu et al. (2012) observe from their simulations that such plans frequently lead to the cancellation of at least one surgery. This is an undesirable situation which justifies \(\rho (S) = \infty\).
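The three closed forms share one structure: for normal S with mean \(\overline{l} - d\) and variance \(\sigma _{L}^{2}\), the moment generating function gives \(\mu _{\alpha }(S) = (\overline{l} - d) + \sigma _{L}^{2}/(2\alpha )\), so \(\rho (S) = \sigma _{L}^{2}/\bigl (2(d - \overline{l})\bigr )\) whenever \(d > \overline{l}\). The Python sketch below is a minimal illustration of Eqs. (5)–(7) under that reading; the function name `rho_normal` and all numerical inputs are hypothetical and are not taken from the dataset.

```python
import math

def rho_normal(d, means, variances, covariances=()):
    """Closed-form OR scheduling metric when the list duration is normal.

    means, variances: per-event means and variances (surgeries, and optionally
        clean-up and preparation durations)
    covariances: the delta_Ci and delta_Pi terms; each enters with a factor of 2
    """
    l_bar = sum(means)                                   # mean list duration
    var_l = sum(variances) + 2.0 * sum(covariances)      # variance of the list duration
    slack = d - l_bar
    if slack <= 0:
        return math.inf                                  # over-booked list
    return var_l / (2.0 * slack)

# Eq. (5): three surgeries, turnovers excluded (hypothetical values, minutes).
rho_5 = rho_normal(240, means=[60, 70, 80], variances=[100, 225, 400])

# Eq. (6): the same surgeries plus two clean-up turnovers with covariances.
rho_6 = rho_normal(240, means=[60, 70, 80, 10, 10],
                   variances=[100, 225, 400, 4, 4], covariances=[1, 1])

# The same surgeries planned into a 200-minute session are over-booked.
rho_over = rho_normal(200, means=[60, 70, 80], variances=[100, 225, 400])
print(rho_5, rho_6, rho_over)
```

Adding turnovers reduces the slack \(d - \overline{l}\) and increases the variance, so the metric for the second list is larger than for the first.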

In the next subsection, we present properties that are desirable for the OR scheduling metric.

2.3 Properties of \(\rho (S)\)

The OR scheduling metric evaluates a surgical list by providing a non-negative or infinite score before the list is executed in an OR. A score is useful only if it is an accurate representation of the risks associated with the execution. In this subsection, we make explicit this representation by giving the desirable properties of the OR scheduling metric from Xie et al. (2017). The authors also show these properties are valid for any distribution. Therefore, the properties hold for our OR scheduling metric.

In each of the following, suppose that \(S, S_{1}, S_{2} \in \mathbb {S}\). The desirable properties are summarised as follows.

  • Property 1: Monotonicity.

    If \(\text {Pr}\left( S_{1} \le S_{2}\right) = 1\), then \(\rho (S_{1}) \le \rho (S_{2})\). Smaller values of the metric are associated with lower risks.

  • Property 2: Satisficing.

    If \(\text {Pr}\left( S \le 0\right) = 1\), then \(\rho (S)=0.\) If a session overrun is an impossibility, then the corresponding metric is zero.

  • Property 3: Overloading avoidance.

    If \(\mathop {{}\mathbb {E}}[S] > 0\), then \(\rho (S) = \infty\). If a session overrun is expected, then the corresponding metric is infinite.

  • Property 4: Positive homogeneity.

    For all \(k \ge 0\), \(\rho (kS) = k\rho (S)\). The metric is proportionally increased when the exposure to a given shortage position is magnified by k times.

  • Property 5: Subadditivity.

    \(\rho (S_{1}+S_{2}) \le \rho (S_{1}) + \rho (S_{2})\). The risk of combining two sessions is never larger than the sum of the risk of each session.

  • Property 6: Risk pooling under independence.

    Suppose that \(S_{1}\) and \(S_{2}\) are independent. Then \(\rho (S_{1}+S_{2}) \le \text {max}\{ \rho (S_{1}), \rho (S_{2})\}\). Combining two independent sessions is preferred over managing them separately.

These properties enable the OR scheduling metric to give an accurate reflection of the risks associated with the execution of surgical lists. In the next subsection, we relate the OR scheduling metric to a few established measures for evaluating surgical lists.
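Some of these properties can be checked numerically in the normal case, where \(\rho (S)\) has the closed form of Sect. 2.2. The Python sketch below is a sanity check under that assumption, not a proof; the helper `rho_normal` and the parameter values are hypothetical, and the variance is assumed to be positive.

```python
import math

def rho_normal(mean_s, var_s):
    """rho(S) for S ~ N(mean_s, var_s) with var_s > 0 (normal case only)."""
    if mean_s >= 0:
        return math.inf          # Property 3 covers mean_s > 0; no finite alpha exists at 0 either
    return var_s / (2.0 * -mean_s)

# Two independent shortage positions (hypothetical means and variances).
m1, v1 = -30.0, 400.0
m2, v2 = -10.0, 100.0
rho1, rho2 = rho_normal(m1, v1), rho_normal(m2, v2)

# Property 4 (positive homogeneity) for k > 0: kS ~ N(k * m, k^2 * v).
k = 3.0
rho_scaled = rho_normal(k * m1, k * k * v1)

# Properties 5 and 6: for independent normals, means and variances add.
rho_sum = rho_normal(m1 + m2, v1 + v2)

print(rho1, rho2, rho_scaled, rho_sum)
```

In this example the pooled value lies between the two individual values, which is consistent with both subadditivity and risk pooling under independence.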

2.4 Comparing \(\rho (S)\) with other measures for surgical lists

A surgical list faces greater undesirable risks as the value of \(\rho (S)\) gets larger. While the OR scheduling metric has a meaningful interpretation that quantifies the risks of session overrun under uncertainty, its values cannot be easily related to real-world or physical quantities such as time. Nevertheless, we observe next that \(\rho (S)\) somewhat balances two key established measures of performance, namely: 1) the probability of session overrun; and 2) the conditional expectation of overrun duration given that overrun has occurred. Having an understanding of the relationship between the OR scheduling metric and these key existing measures of performance is important for users of the OR scheduling metric. For example, users may need to justify why a surgical list of a certain value of \(\rho (S)\) is deemed acceptable (or unacceptable) before the surgical list is realised.

Table 1 explores the relationship between the OR scheduling metric, the probability of session overrun Pr(\(S > 0\)), and the conditional expectation of overrun duration given that overrun has occurred \(\mathbb {E}\left[ S^{+} \vert S > 0\right]\). We consider surgical lists of the same session duration, i.e., 240 minutes, and assume that the cumulative list durations are normally distributed. The results of Table 1 are computed using Eq. (5). From the results of Table 1, we observe for a fixed \(\rho (S)\) that the two measures of performance have an inverse relationship, i.e., the higher the probability of an overrun the shorter the expected duration of an overrun. This is an important and intuitive observation as the OR scheduling metric should strike a good balance between these related performance measures. We also observe that as \(\rho (S)\) increases, both measures indicate a decline in performance due to the higher risks of session overrun. This is consistent with the monotonicity property of the OR scheduling metric.

Table 1 We consider surgical lists whose durations are normally distributed and \(\rho (S)\) is fixed at 1 (top) and 100 (bottom). Several instances of the statistical parameters \(\mu\) (mean) and \(\sigma\) (standard deviation) are listed. For each set of parameters, the values of Pr(\(S > 0\)), the probability of session overrun, and \(\mathbb {E}\left[ S^{+} \vert S > 0\right]\), the conditional expectation of overrun duration given that overrun has occurred, are calculated
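The kind of calculation summarised in Table 1 can be sketched as follows. For a fixed \(\rho (S)\) and a chosen mean \(\mu\), Eq. (5) is inverted to obtain \(\sigma\), and the two measures then follow from standard results for the normal and truncated normal distributions. This Python sketch is illustrative only: the function names and the means used are hypothetical and are not the entries of Table 1.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def overrun_measures(mu, rho, d=240.0):
    """Return (sigma, Pr(S > 0), E[S | S > 0]) for L ~ N(mu, sigma^2), mu < d."""
    slack = d - mu                            # positive for a finite rho
    sigma = math.sqrt(2.0 * rho * slack)      # inverts Eq. (5)
    a = slack / sigma                         # standardised distance from E[S] to 0
    tail = 1.0 - normal_cdf(a)                # Pr(S > 0)
    # Mean of a normal truncated below at 0 (S has mean -slack).
    cond_mean = -slack + sigma * normal_pdf(a) / tail
    return sigma, tail, cond_mean

# Hypothetical mean list durations (minutes) for a 240-minute session.
rows = [overrun_measures(mu, rho=100.0) for mu in (230.0, 200.0, 150.0)]
for mu, (sigma, p, e) in zip((230.0, 200.0, 150.0), rows):
    print(f"mu={mu:.0f} sigma={sigma:.1f} Pr(S>0)={p:.3f} E[S|S>0]={e:.1f}")
```

For these inputs, the probability of overrun falls and the conditional expected overrun rises as the mean decreases, which is the inverse relationship described above for a fixed \(\rho (S)\).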

So far, we have applied the normality assumption to compute the values of the OR scheduling metric for exemplars of surgical lists. In the next section, we use the OR scheduling metric to evaluate surgical lists from a real-world dataset. Note that the normality assumption for the cumulative duration of a surgical list (or even the duration of a surgery in the surgical list) may not apply. We shall see that empirical distributions may be used in place of the normal distribution to compute \(\rho (S)\).

3 Evaluation of historical surgical lists

In this section, we study the values of the OR scheduling metric computed on historical surgical lists. The objectives for this evaluation are: 1) to establish the credibility of the OR scheduling metric in making inferences about the room utilisation rates; and 2) to demonstrate the utility of the OR scheduling metric in achieving desired room utilisation rates.

We consider a surgery dataset which is provided by the Waitemata District Health Board (DHB) and studied by Soh et al. (2020) as part of a research collaboration with the University of Auckland. The Waitemata DHB is the largest in New Zealand by catchment population, and provides both acute and elective state-funded surgical services across three sites: North Shore Hospital – both within the main building and in the Elective Surgery Centre – and Waitakere Hospital.

The raw dataset comprises 6,210 ear, nose, and throat (ENT) elective surgeries performed within an approximate 3.5-year period between 1 February 2015 and 14 May 2018. After data cleaning – elaborated in the supplementary information – the cleaned dataset comprises 6,106 surgeries.

For the evaluation of historical lists, we shall use the following broadly categorised information.

  1. Booking information: the timestamps of the scheduled start and end times of surgery from which the booked duration is obtained.

  2. Surgery-related information: the surgery type which is described by at least one descriptor (or procedure), performing (or primary) surgeon, and the timestamps of the patient entering and leaving the OR from which the surgery duration is obtained.

  3. Session-related information: the allocated session duration, allocated surgeon – which is denoted by the session owner, and the sequence of surgeries planned in a surgical list.

A surgical list for an OR session requires information about all of the planned surgeries for that session. This presents an issue for surgeries that have been removed from the cleaned dataset. In these cases, we use the raw dataset to provide the missing information, e.g., surgery duration and procedure, for our evaluation of the historical surgical lists.

The OR scheduling metric requires the population distributions of event durations in a surgical list. These population distributions may be approximated using empirical distributions. The procedures for obtaining empirical distributions – of surgery and surgical list durations – require the cleaned dataset and, in cases as described previously, the raw dataset. The supplementary information provides a detailed elaboration. Note that these procedures are unique to our surgery dataset. In general, we expect variations of these procedures to be implemented to obtain estimates of population distributions for other datasets.

The remainder of this section is organised as follows. In Sect. 3.1, we explain how the OR scheduling metric is applied using these estimates of the population distributions to assess each surgical list. After that, the credibility of the OR scheduling metric is established in Sect. 3.2. Finally, we discuss the evaluation results of the historical surgical lists in Sect. 3.3.

3.1 Computation of OR scheduling metric

The value of the OR scheduling metric for a surgical list depends on the empirical distribution of the list’s duration. If the mean of the empirical distribution is greater than the session duration, then \(\rho (S) = \infty\). Otherwise, we compute \(\rho (S)\) numerically by using its definition as given in Sect. 2.1. For our ENT surgery dataset and throughout this research, the computations are performed using R (version 3.4.4) (R Core Team 2018). Since \(\mu _{\alpha }(S)\) is continuous in \(\alpha\), the “uniroot.all” function from the “rootSolve” package may be used on \(\mu _{\alpha }(S)\) to numerically find \(\rho (S)\), the smallest positive root within a bounded interval, such as \([10^{-4}, 10^7]\). If such a root does not exist, then \(\rho (S) = 0\). We emphasise that the inputs required to compute a value of the OR scheduling metric for a given surgical list involve only distributions of surgery durations and turnover durations as well as the allocated session duration; the actual surgical list duration is not an input. Therefore, readers should be aware that this computation is unrelated to model training, which typically requires parameters to be estimated and/or learning from (a subset of) paired input-output data.
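A minimal Python sketch of this computation is given below. It is not the R implementation used in this research: it replaces “uniroot.all” with plain bisection, which relies on the assumption that \(\mu _{\alpha }(S)\) is non-increasing in \(\alpha\). The function names, the number of iterations and the sample shortage values are hypothetical.

```python
import math

def mu_alpha(shortages, alpha):
    """alpha * ln E[exp(S / alpha)] under the empirical distribution of S."""
    m = max(shortages)
    # Log-sum-exp shift keeps exp() from overflowing when alpha is tiny.
    total = sum(math.exp((s - m) / alpha) for s in shortages)
    return m + alpha * math.log(total / len(shortages))

def rho_empirical(shortages, lo=1e-4, hi=1e7, iterations=200):
    """Smallest alpha in [lo, hi] with mu_alpha(S) <= 0, found by bisection."""
    if sum(shortages) / len(shortages) > 0:
        return math.inf      # mean list duration exceeds the session duration
    if mu_alpha(shortages, lo) <= 0:
        return 0.0           # no root in the interval, so rho is taken to be 0
    if mu_alpha(shortages, hi) > 0:
        return math.inf      # assumption of this sketch: inf of the empty set
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mu_alpha(shortages, mid) <= 0:
            hi = mid         # mid is feasible, so the root is at or below mid
        else:
            lo = mid
    return hi

# Hypothetical shortage positions S = L - d (minutes), equally likely.
rho = rho_empirical([-30.0, 10.0])
print(rho, mu_alpha([-30.0, 10.0], rho))
```

The only inputs are sampled shortage positions, i.e., sampled list durations minus the allocated session duration; no realised list duration is needed.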

We note that \(\rho (S)\) may vary when the normal approximation criterion for the duration of a surgery is used or the empirical distribution for the duration of the list requires an approximation (see supplementary information). To account for the sampling variability, we repeat the steps given in the supplementary information to obtain 30 approximated empirical distributions that give 30 estimates of the OR scheduling metric for a given surgical list. These values will be used to compute the mean. Note that the mean of values that include an infinity is taken to be infinite. Finally, for the purpose of graphing results, we apply the base-10 logarithm after restricting the values of \(\rho (S)\) to our chosen interval \([10^{-4},10^{4}]\); values of \(\rho (S)\) below \(10^{-4}\) or above \(10^{4}\) are presented as \(\log _{10}\rho (S) \le -4\) (under-booked lists) or \(\log _{10}\rho (S) \ge 4\) (over-booked lists) respectively.
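The averaging and graphing conventions described above can be sketched in Python as follows. The function names `mean_rho` and `log10_for_plot` and the example values are hypothetical; only the conventions themselves (an infinite mean when any estimate is infinite, and restriction to \([10^{-4},10^{4}]\) before taking logarithms) come from the text.

```python
import math

def mean_rho(estimates):
    """Mean of repeated estimates of rho(S); infinite if any estimate is infinite."""
    if any(math.isinf(r) for r in estimates):
        return math.inf
    return sum(estimates) / len(estimates)

def log10_for_plot(rho, lo_exp=-4, hi_exp=4):
    """Base-10 logarithm of rho after restricting rho to [10^lo_exp, 10^hi_exp]."""
    if rho <= 10.0 ** lo_exp:
        return float(lo_exp)     # plotted as log10(rho) <= -4 (under-booked list)
    if rho >= 10.0 ** hi_exp:
        return float(hi_exp)     # plotted as log10(rho) >= 4 (over-booked list)
    return math.log10(rho)

# Hypothetical estimates for two surgical lists (fewer than 30 shown for brevity).
finite_list = mean_rho([90.0, 100.0, 110.0])
overbooked_list = mean_rho([250.0, math.inf, 300.0])
print(log10_for_plot(finite_list), log10_for_plot(overbooked_list))
```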

3.2 Establishing credibility of the OR scheduling metric

Given that the OR scheduling metric is designed to evaluate the surgical lists before the realisations of events, the first objective for evaluating historical surgical lists is to establish the credibility of the OR scheduling metric in making inferences about the room utilisation rates. More formally, by defining the running duration to be the time interval between the first patient that enters the room and the last patient that leaves the room within realisations of each session, the actual room utilisation rate of a session is the ratio between the running duration and the allocated session duration. The ideal ratio is 1, that is, the room utilisation is maximised without an overrun. In this subsection we present our comparison of the OR scheduling metric values for historical surgical lists, using only information available before the list eventuated, with actual room utilisation rates for the lists. This comparison establishes the credibility of the metric’s inferences.
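As a small illustration of this definition, the actual room utilisation rate can be computed from the session timestamps. The Python sketch below uses a hypothetical function name and hypothetical timestamps rather than values from the dataset.

```python
from datetime import datetime

def utilisation_rate(first_patient_in, last_patient_out, session_minutes):
    """Ratio of the running duration to the allocated session duration."""
    running_minutes = (last_patient_out - first_patient_in).total_seconds() / 60.0
    return running_minutes / session_minutes

# Hypothetical half-day session of 240 minutes that runs for 255 minutes.
rate = utilisation_rate(datetime(2016, 3, 1, 8, 5),
                        datetime(2016, 3, 1, 12, 20),
                        session_minutes=240)
print(rate)
```

A value above 1, as in this example, corresponds to a running duration that is longer than the allocated session duration.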

Scatter plots of actual historical room utilisation rates and values of \(\text {log}_{10}\rho (S)\) are given in Fig. 1. Note that the surgical lists used in the scatter plots are first categorised by the year number (starting from 1 February 2015) in order to declutter plots. Then, for each category, the surgical lists are further partitioned by the durations of the allocated sessions. Note that this partition is not necessary for the OR scheduling metric which can be used to compare surgical lists across different session durations. However, the same cannot be said for the room utilisation rates. Since a difference of 0.1 in the room utilisation rate represents a possibly undesirable change of 6 minutes in the running duration of a session per allocated hour, it is more meaningful to separately analyse the room utilisation rates of half-day and full-day sessions. We also present scatter plots of actual room utilisation rates and planned room utilisation rates (using surgeons’ estimates and assuming a turnover duration of 15 minutes between two consecutive surgeries) in Fig. 2 using the same layout as that in Fig. 1. Comparisons of the relationships between the OR scheduling metric and actual utilisation versus planned utilisation and actual utilisation are made next.

Fig. 1

Scatter plots of \(\text {log}_{10}\rho (S)\) and actual historical room utilisation rates that are segregated by the type of session: half-day (top) and full-day (bottom), and by the year number, i.e., within 1 year (left), between 1 and 2 years (middle), and between 2 and 3 years (right) from 1 February 2015

Fig. 2

Scatter plots of planned room utilisation rates and actual historical room utilisation rates that are segregated by the type of session: half-day (top) and full-day (bottom), and by the year number, i.e., within 1 year (left), between 1 and 2 years (middle), and between 2 and 3 years (right) from 1 February 2015

A positive trend between the values of the OR scheduling metric and the actual room utilisation rates establishes the credibility of the OR scheduling metric. Indeed, each scatter plot in Fig. 1 shows such a positive trend. In contrast, the trends between the planned and the actual room utilisation rates in Fig. 2 (full-day sessions in particular) are not obvious, especially after omitting several potential high-leverage points (i.e., extreme values of the planned room utilisation rate) that could unduly influence these trends. We remark that there is wide variability in the room utilisation rates for each value of the OR scheduling metric. This variability is unsurprising because the OR scheduling metric measures the risk of session overrun, which is not the same as predicting room utilisation rates. A surgical list with a high risk of session overrun, e.g., one whose list duration has a large standard deviation, may still have all of its surgeries completed extremely quickly on the surgery day, in which case a low room utilisation rate is observed. If the OR scheduling metric did yield accurate predictions, i.e., if we were able to determine precisely which surgical lists will overrun, then the values of \(\rho (S)\) would be either 0 (for lists that have no risk of session overrun) or infinite (for lists that lead to session overruns). In short, the OR scheduling metric makes inferences about the room utilisation rates but does not predict these rates.

In the next subsection, we evaluate the historical surgical lists from the results of the OR scheduling metric.

3.3 Evaluation of historical surgical lists

In Sect. 3.1, we compute the value of the OR scheduling metric \(\rho (S)\) for each historical surgical list. Figure 3 shows the values of \(\text {log}_{10}\rho (S)\) for surgical lists occurring between 1 February 2015 and 31 January 2018 in a chronological order. From the figure, we make the following four observations.

  1. There appears to be no seasonal pattern based on the values of the OR scheduling metric.

  2. There is a clear band of values of \(\rho (S)\) that lie approximately between 1 and \(10^{2} (= 100)\) (between 0 and 2 on the plots in the right column).

  3. The discrete probability density shifts in the positive direction in each subsequent year (2016 and 2017), as more surgical lists are over-booked and fewer surgical lists are under-booked.

  4. A significant number of surgical lists have \(\text {log}_{10}\rho (S) \ge 4\), i.e., over-booking is more common than under-booking.

Fig. 3

Evaluation of surgical lists using the surgical list durations provided in the raw dataset. The values of the OR scheduling metric are restricted in the plots to \([10^{-4},10^{4}]\). Values below \(10^{-4}\) or above \(10^{4}\) are presented as part of the \(\le -4\) or \(\ge 4\) collection of points. The base-10 logarithm of the resulting values of the OR scheduling metric is then taken. The lists are sorted in chronological order for 1 year starting in February 2015 (top left), February 2016 (middle left) and February 2017 (bottom left). The charts on the right are the discrete probability densities
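The transformation described in the caption of Fig. 3 (restrict the metric to \([10^{-4},10^{4}]\), then take the base-10 logarithm) can be sketched as follows. This is a minimal illustration; the function and constant names are our own. Clipping before taking the logarithm matters because \(\rho (S)\) may be 0 or infinite.

```python
import math

LOWER, UPPER = 1e-4, 1e4  # plotting interval used in the figures

def log10_clipped_metric(rho):
    """Clip rho to [1e-4, 1e4] and return its base-10 logarithm.

    Values of 0 map to -4 and infinite values map to 4, matching the
    "<= -4" and ">= 4" collections of points in the plots.
    """
    clipped = min(max(rho, LOWER), UPPER)
    return math.log10(clipped)
```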

We remark that there are several possible reasons that contribute to the fourth observation: a) booking clerks are compelled to plan risky surgical lists in order to reduce the length of waiting lists; b) there are surgeries with a duration that definitely exceeds the duration of any full-day session but booking clerks must schedule the surgery; and c) booking clerks are intentionally booking more surgeries for several sessions because the surgical teams are paid for a full session regardless of whether the session finishes early, so utilising the full session duration reduces waste. In addition, surgical teams may be willing or able to work beyond the stipulated session end time, so over-booking is possible although with an extra expense.

We also comment that there may be particular sessions where surgeons are willing to accommodate short durations of overruns (which is a change in preference). For these sessions, the definition of the OR scheduling metric can be modified by replacing the allocated session duration d with \(d + \Delta\), where \(\Delta\) is the overrun duration that is acceptable to a surgeon. The value of \(\Delta\) should vary for each session, since it may also depend on resource-related factors. We are unable to perform a further analysis because we do not have data to determine a value of \(\Delta\) for each session in the ENT surgery dataset. Nonetheless, the OR scheduling metric can be implemented as part of the evaluation tool for surgical lists, i.e., by aiming for metric values to fall within an acceptable range, in order to achieve the desired room utilisation rates.

In the next section, we provide a simulation for traditional OR scheduling and compare it to simulations of OR scheduling using a variety of OR scheduling metric scenarios. We observe the associated room outcomes for each OR scheduling approach.

4 Simulation

In this section, we perform a simulation of OR scheduling that uses a subset of the actual surgical data from Sect. 3. Our objective is to demonstrate how the OR scheduling metric may be used to improve the room outcomes when the metric is incorporated in OR scheduling to create surgical lists. There are two key steps to our demonstration. The first step builds and validates a simulation of OR scheduling (simulated base scenario) that mimics the actual surgery scheduling processes by the hospitals at Waitemata DHB. In particular, the surgeons’ estimates of surgery durations are used to plan the surgical lists. After verifying and validating the simulation, we simulate OR scheduling using a number of simulated alternative scenarios of the OR scheduling metric for the second step. The patient throughput, actual room utilisation rates, and the waiting times for surgery will be used as key performance indicators to evaluate the eventuated surgical lists.

We describe here a simplified version of the actual elective surgery booking procedure that is followed by the ENT surgeons. In reality, the booking clerks do not follow a standard procedure for elective surgery booking; instead, each clerk uses individual customisations of the simplified procedure presented here, which are based on the idiosyncrasies of the specific booking clerk. Each surgery request for a patient must be initiated by the attending specialist, that is, the patient owner. The patient owner specifies the procedure(s) to be performed, gives an estimate of the surgery duration, and assigns a priority score (Footnote 3) that takes the value 1, 2, or 3. Next, the anaesthetic department determines whether a patient is required to attend a pre-admission clinic and adjusts the estimated surgery duration accordingly before transferring the surgery request to a booking clerk. Upon receiving a booking request, the booking clerk books a slot in one of the allocated sessions listed in the surgical roster and specifies the booked start time and end time for the surgery.

Generally, a surgery should be performed within 14 days (priority 1), 60 days (priority 2), or 120 days (priority 3) from the surgery request date. There are occasional exceptions to this guideline for waiting times such as a mandatory reduced waiting time for a surgery if it is the patient’s first cancer treatment. Also, a patient could be temporarily suspended on the waiting list due to the patient’s request (e.g., patient is unavailable for surgery within a period of time) or a clinical reason (e.g., patient is not cleared for surgery till further investigations, treatments, or procedures are completed). To simplify the simulation, we shall assume that no surgery in the simulation is the patient’s first cancer treatment. We also assume that no patient will be temporarily suspended on the waiting list, so each day on the waiting list contributes to the waiting time for surgery.
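As a small illustration of the waiting-time guideline, the following sketch checks whether a surgery date falls within the target for its priority score. The names are hypothetical, and the sketch follows the simplifying assumptions stated above: no cancer-treatment exceptions and no suspensions, so every day on the waiting list counts.

```python
from datetime import date

# Target waiting times in days for priority scores 1, 2 and 3.
TARGET_DAYS = {1: 14, 2: 60, 3: 120}

def waiting_days(request_date, surgery_date):
    """Waiting time in days, counting every day on the waiting list."""
    return (surgery_date - request_date).days

def within_target(request_date, surgery_date, priority):
    """True if the surgery is performed within the guideline for its priority."""
    return waiting_days(request_date, surgery_date) <= TARGET_DAYS[priority]
```

For example, a priority 2 surgery requested on 1 September 2016 and performed on 31 October 2016 waits exactly 60 days and is therefore within the guideline.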

In order to consider the waiting times in our simulation, we have acquired a supplementary dataset that captures the waiting list information for ENT patients whose surgeries (in the raw dataset) are performed between 1 September 2016 and 31 August 2017. The extent of this dataset determines the planning horizon of our simulation. The following points summarise how the (raw and cleaned) ENT surgical datasets and the supplementary dataset will be used in our simulation.

  1. The raw dataset is used to obtain the relevant information about the surgical lists (such as the allocated rooms, session owners, and session durations), surgery information (such as the surgery types and performing surgeons), and booked durations.

  2. The cleaned dataset provides the surgery durations that give the appropriate empirical distributions which are required for the computation of the OR scheduling metric.

  3. The supplementary dataset is used to obtain, for each surgery, the priority score and the placement date which is the date that a surgery is placed on the waiting list. The placement dates lie between 5 May 2016 and 16 August 2017. The waiting times for surgeries can be computed from this dataset.

We emphasise that not all data from the ENT surgical datasets (e.g., actual order of surgeries in surgical lists, actual surgeons performing particular surgeries, and actual waiting times of particular surgeries) are used during the construction of our simulation. Instead, this data will be used to evaluate the surgical lists generated by OR scheduling during the simulation.

We give an outline of the remainder of this section. The various simulation details are given in Sect. 4.1. Section 4.2 gives the results of the verification and validation of our simulation. Finally, we implement and evaluate simulated alternative scenarios that incorporate the OR scheduling metric in OR scheduling in Sects. 4.3 and 4.4 respectively.

4.1 Simulation details

To model OR scheduling that creates surgical lists in the simulation, we will have to determine scheduling dates (Sect. 4.1.1), find alternative surgeons whenever necessary (Sect. 4.1.2), and allocate surgeries to OR sessions, thus creating surgical lists (Sect. 4.1.3). The following subsections describe these procedures in greater detail.

4.1.1 Scheduling date

Recall that the supplementary dataset provides the placement date (on the waiting list) and the priority score for each surgery. This information shall be used in our simulation. In addition, we introduce the scheduling date which is the date that the booking clerk allocates the surgery to a surgical list. Note that the scheduling date is not the surgery date. We remark that the scheduling date must be after the placement date and that a confirmed surgery date cannot be altered subsequently. If there are multiple surgeries on a scheduling date, then their priority scores decide the sequence of surgeries to be scheduled. Therefore, the effect of introducing the scheduling date is a combined queue from three surgery queues (for each priority score).

In the simulation, the scheduling date is the same as the placement date for priority 1 surgeries. For priority 2 and priority 3 surgeries, the scheduling date is at least 28 days and 35 days from the placement date respectively so that higher priority surgeries will be scheduled first and all scheduled surgeries will not be subsequently rescheduled. The exact number of days and accompanying justifications are provided in the supplementary information.
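The scheduling-date mechanism can be sketched as follows. This is an illustration only: it assumes that the stated minimum offsets (0, 28, and 35 days) are used as the actual offsets, whereas the exact numbers of days are given in the supplementary information. The dictionary keys and function names are hypothetical.

```python
from datetime import date, timedelta

# ASSUMPTION: minimum offsets used as the actual offsets (days after placement).
OFFSET_DAYS = {1: 0, 2: 28, 3: 35}

def scheduling_date(placement_date, priority):
    """Date on which the booking clerk allocates the surgery to a list."""
    return placement_date + timedelta(days=OFFSET_DAYS[priority])

def scheduling_order(surgeries):
    """Merge the three priority queues into one scheduling sequence.

    Surgeries are ordered by scheduling date, then by priority score
    (priority 1 first), then by placement date (first come, first served).
    """
    return sorted(
        surgeries,
        key=lambda s: (
            scheduling_date(s["placement_date"], s["priority"]),
            s["priority"],
            s["placement_date"],
        ),
    )

# Hypothetical requests: ids and dates are illustrative only.
queue = scheduling_order([
    {"id": "A", "placement_date": date(2016, 8, 1), "priority": 3},
    {"id": "B", "placement_date": date(2016, 9, 5), "priority": 1},
    {"id": "C", "placement_date": date(2016, 8, 1), "priority": 2},
])
```

Here surgery C is scheduled first (29 August 2016), while A and B share the scheduling date 5 September 2016, so the priority 1 surgery B precedes A.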

4.1.2 Alternative surgeons

Recall from the simplified elective surgery booking procedure (described in the introduction of Sect. 4) that each surgery is requested by a patient owner. Ideally, the patient owner should be the performing surgeon. However, there are situations where the patient owner is unable to take up the role of the performing surgeon. For example, the patient owner is physically away for extended periods of time or the OR sessions allocated to a patient owner are fully booked. These situations require a surgeon other than the patient owner to perform the surgery. The procedure for determining alternative surgeons is provided in the supplementary information.

4.1.3 OR scheduling

We examine the actual processes that are followed by the booking clerks before describing a simulated version of these processes. The master surgical schedule (MSS) is a timetable that provides information about the room, session owner, and allocated session duration for each OR session. The MSS is usually determined in a cyclic manner which could be in the order of weeks or months. We assume that the booking clerks have complete information about the OR sessions before and on any potential surgery date for a surgery; otherwise, a surgery date could not be confirmed.

OR scheduling to create surgical lists requires the booking clerk to decide on the allocation of new surgeries to OR sessions based on several criteria. Examples of criteria include the surgeons’ estimates of surgery durations, the availability of surgeons, and the waiting times for surgeries. There may be other criteria that may be specified by the surgeon or the patient as special requests, such as preferences for surgery dates. The booking clerks do not frequently encounter special requests, so these other criteria shall not be considered in our simulation. We remark that the use of surgeons’ estimates of surgery durations for allocating surgeries to OR sessions can be viewed as an expectation-based measure that is mentioned in Sect. 2.

After the booking clerk confirms a surgery date, both the tentative booked start and end times are specified within the corresponding surgical list. Note that these two booked times are mentioned in the introduction of Sect. 3 as the booking information. The time interval between the two timestamps gives the booked duration of the surgery. We observe from the cleaned dataset that the booked durations (or surgeons’ estimates) are always rounded to the nearest 5 or 10 minutes and have high tendencies to overestimate the actual surgery durations by a multiplicative factor of approximately 1.2. The booked durations may include a safety factor for robustness in the planning of surgical lists, but we shall interpret each booked duration as the estimated upper bound of the surgery duration in our simulation.

Before we describe the procedure used in our simulation to allocate surgeries to OR sessions, we make a few assumptions. Consider a (new) surgery whose estimated upper bound of the predicted surgery duration by a surgeon in minutes is denoted by u. Assume that the estimate of the surgery duration provided is the same regardless of the surgeon performing the surgery. Also, assume that 20 minutes is the maximum duration of session overrun that is acceptable. This value is determined while calibrating our simulation against the historical data. While we aim for the simulation to be as realistic as possible, this assumption is necessary because the maximum duration of session overrun that is acceptable in reality depends on many factors (e.g., staffing and resources).

Now, we state the following scheduling rules that allocate a new surgery to an OR session (and, hence, the resulting surgical list for that session).

  1. On each scheduling date, the order of surgeries for scheduling is based on the priority score, with priority 1 surgeries scheduled first. Surgeries are scheduled on a first-come, first-served basis. No surgery is removed from a surgical list after it is scheduled.

  2. For each new surgery, the booking clerk attempts to allocate the surgery to one of the patient owner’s OR sessions that is within the acceptable waiting time. If an allocation is not possible, then the booking clerk finds alternative surgeon(s) using the procedure described in Sect. 4.1.2. When multiple alternative surgeons are available, the surgery is scheduled to the alternative surgeon’s OR session that corresponds to the minimum waiting time, subject to the remaining scheduling rules; in addition, the surgery date must be no less than 7 days after the placement date.

  3. We require \(u \le d + 20\) for the first surgery scheduled to a half-day session of duration d minutes.

  4. There is no restriction for the first surgery scheduled to a full-day session of duration d minutes. However, if \(u \ge d + 20\), then no further new surgeries can be scheduled to this session.

  5. Suppose that a surgical list has U minutes currently scheduled. Recall that the turnover duration is 15 minutes. A new surgery may be added to the surgical list for the OR session of duration d minutes if \(U + 15 + u \le d + 20\).

Note that a minimum of 7 days wait on the waiting list is always imposed for all surgeries, since the patient and the surgical staff require time to prepare for the surgery. The scheduling rules described here are reasonable in the sense that they strive to avoid extended durations of session overruns. We do not claim that these scheduling rules are always followed by all the booking clerks at Waitemata DHB, i.e., there is no hospital-wide policy regarding the scheduling rules. We also emphasise that these scheduling rules do not involve the OR scheduling metric. Note that we will not be determining the order of surgeries in a surgical list.
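Rules 3 to 5 reduce to a simple capacity check, which can be sketched as follows. The function and argument names are hypothetical; the value of U is passed in directly, so the sketch makes no assumption about how the minutes already scheduled are accumulated.

```python
TURNOVER_MIN = 15     # turnover duration between consecutive surgeries
MAX_OVERRUN_MIN = 20  # maximum acceptable session overrun

def can_add_surgery(u, d, scheduled_minutes, is_first, is_half_day):
    """Check rules 3 to 5 for a surgery with estimated upper bound u.

    u: estimated upper bound of the surgery duration (minutes)
    d: allocated session duration (minutes)
    scheduled_minutes: U, the minutes currently scheduled in the list
    """
    limit = d + MAX_OVERRUN_MIN
    if is_first:
        # Rule 3 restricts the first surgery of a half-day session;
        # rule 4 places no restriction on a full-day session.
        return u <= limit if is_half_day else True
    # Rule 5. This also enforces the second part of rule 4: if the first
    # surgery already has u >= d + 20, no further surgery can satisfy it.
    return scheduled_minutes + TURNOVER_MIN + u <= limit
```

For instance, with d = 240 and U = 200, a surgery with u = 45 is accepted (200 + 15 + 45 = 260), whereas one with u = 50 is rejected.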

We remark that cancelled surgeries are not considered in the simulation as such cancellations are extremely rare in our surgery dataset – only 1 cancellation is recorded. Note that a cancellation is not the same as re-scheduling, which has been catered for in the simulation by setting scheduling dates to be at least 28 and 35 days after placement dates for priority 2 and priority 3 surgeries respectively (see Sect. 4.1.1). We also note that any surgery in a surgical list that does not take place as planned (e.g., a power outage in an extreme case leading to the cancellation of most surgeries) will result in a lower room utilisation rate which is not attributable to the OR scheduling metric, so this is irrelevant to the simulation objective.

After the surgeries in a surgical list are confirmed, we can use the OR scheduling metric to evaluate the surgical list. Note that all surgeries performed between 1 February 2015 and 14 May 2018 are used to obtain the empirical distributions of surgery durations. Even though we are only considering surgeries that occur between 1 September 2016 and 31 August 2017 in our simulation, we have included a larger number of historical observations to model as broad a range of surgeries as possible.

4.2 Validation of simulation

In the simulation, we evaluate the surgical lists that occur between 1 September 2016 and 31 August 2017. We report that all the surgeries from the supplementary dataset are successfully scheduled in the simulated base scenario, so we can compare the historical outputs with those from the simulation. Figure 4 compares the average waiting times of the historical surgeries with those of the simulated surgeries. The waiting times are comparable for each of the three surgery priority scores, which is the result of calibrating our simulation against the historical data (see Sect. 4.1.3).

Next, we compare the plots for historical and simulated surgical lists in Figs. 5 and 6, and conclude that the outputs in each figure are rather similar. We therefore determine that our simulation is a sufficient replica of the physical system. We acknowledge the remaining subjective elements in our determination. In particular, we note that the surgeons in practice may be willing to perform more surgeries (or at least compelled to, possibly because of the long waiting times), but this will be at the cost of incurring more session overruns (see Sect. 3.3). However, we do not have the data to quantify this change in preference in the simulation. If we are able to quantify this change, we believe that the simulated outputs will be a closer match to the historical outputs. Nevertheless, after considering the various assumptions and simplifications made for the scheduling process as well as events that we have not considered in our simulation on each day of surgery (e.g., cancellation and rescheduling of surgeries), we are reasonably satisfied with the simulation outputs.

Finally, we validate that the OR scheduling metric values are comparable between the historical and simulated surgical lists, i.e., that the simulation provides a good estimate of the actual OR scheduling metric values. This is important for comparing these values with those from our upcoming simulations of alternative OR scheduling that uses the metric. In Fig. 5, we compare the room utilisation rates of the historical surgical lists with those of the simulated surgical lists using their empirical density plots. For both the half-day sessions and the full-day sessions, we perform two-sample Kolmogorov-Smirnov tests to determine, for each case, if there is evidence to support the alternative hypothesis that the empirical distributions are different. We report that the p-values are 0.522 and 0.276 for the half-day sessions and the full-day sessions respectively, so there is insufficient evidence to reject the null hypothesis that the empirical distributions are the same. We also present the values of the OR scheduling metric evaluated on both the historical and simulated surgical lists for sessions that occur between 1 September 2016 and 31 August 2017 in Fig. 6. Similar to Sect. 3.1, we apply, in Fig. 6, the base-10 logarithm after restricting the values of \(\rho (S)\) to our chosen interval \([10^{-4},10^{4}]\), for which the two extreme values represent under-booked and over-booked lists respectively. Values of \(\rho (S)\) below \(10^{-4}\) or above \(10^{4}\) are represented as \(\le 10^{-4}\) and \(\ge 10^{4}\) respectively.
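For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, a minimal self-contained sketch is given below. It computes the test statistic D (the largest gap between the two empirical distribution functions) and an approximate p-value from the asymptotic Kolmogorov series. The use of the asymptotic approximation is our assumption for illustration; the text does not state how the reported p-values were computed, and a statistics library would normally be used in practice.

```python
import math

def ks_two_sample(x, y):
    """Return (D, p) for a two-sample Kolmogorov-Smirnov test.

    D is the largest absolute difference between the two empirical CDFs.
    p is an ASYMPTOTIC approximation (an assumption of this sketch).
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return d, 1.0  # the series is numerically equal to 1 here
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

A large p-value, as reported for the half-day and full-day sessions, means that the test finds insufficient evidence of a difference between the two samples; it does not prove that the distributions are identical.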

Fig. 4

A comparison of the average waiting times for historical (black/dotted) and simulated base scenario (blue/bold) surgeries for different priority scores: 1 (left), 2 (middle), and 3 (right)

Fig. 5

A comparison of the empirical density plots of room utilisation rates between the historical and the simulated base scenario surgical lists corresponding to half-day sessions (top) and full-day sessions (bottom). The sessions occur within 1 year starting on 1 September 2016

Fig. 6

Evaluation of the historical (top) and simulated base scenario (bottom) surgical lists for sessions that occur within 1 year starting on 1 September 2016. The lists shown in the plots on the left are sorted in chronological order, and the values of the OR scheduling metric are restricted in the plots to \([10^{-4},10^{4}]\). Values below \(10^{-4}\) or above \(10^{4}\) are presented as part of the \(\le -4\) or \(\ge 4\) collection of points. The base-10 logarithm of the resulting values of the OR scheduling metric is then taken. The charts on the right are the discrete probability densities

We shall proceed to consider simulated alternative scenarios where the OR scheduling metric plays a significant role in the scheduling of surgeries.

4.3 Using the OR scheduling metric in simulated alternatives

This subsection provides three alternative scenarios to the approach described in Sect. 4.1.3. All of these scenarios use the OR scheduling metric to perform OR scheduling, but the scenarios differ in their scheduling rules that make use of the OR scheduling metric. We use our developed and validated simulation to perform scenario analysis. In particular, we apply ceteris paribus, i.e., “holding other things constant”, to examine the room outcomes when the OR scheduling metric is considered within OR scheduling to determine high quality surgical lists. Since we have established the credibility of the OR scheduling metric in making inferences about the room utilisation rates in Sect. 3.2, we now explore whether it will prove beneficial if it is used for OR scheduling.

As with the simulation described in Sect. 4.1, we begin from the supplementary dataset. Each new surgery will be allocated to an OR session based on several criteria that include the OR scheduling metric. Three different alternative scenarios are considered; the updated scheduling rules for each scenario are stated as follows. Note that the first two rules are the same as for the traditional OR scheduling approach in Sect. 4.1.3.

  1. On each scheduling date, the order of surgeries for scheduling is based on the priority score, with priority 1 surgeries scheduled first. Surgeries are scheduled on a first-come, first-served basis. No surgery is removed from a surgical list after it is scheduled.

  2. For each new surgery, the booking clerk attempts to allocate the surgery to one of the patient owner’s OR sessions that is within the acceptable waiting time. If an allocation is not possible, then the booking clerk finds alternative surgeon(s) using the procedure described in Sect. 4.1.2. When multiple alternative surgeons are available, the surgery is scheduled to the alternative surgeon’s OR session that corresponds to the minimum waiting time, subject to the remaining scheduling rules; in addition, the surgery date must be no less than 7 days after the placement date.

  3. A surgery whose mean duration does not exceed 500 minutes may be allocated to an OR session as long as the value of the OR scheduling metric for the resulting surgical list does not exceed 100 (scenario 1), 250 (scenario 2), or 1000 (scenario 3) – cf. rules 3 and 5 from Sect. 4.1.3.

  4. Any surgery whose mean duration exceeds 500 minutes is assigned as the first and only surgery scheduled for a full-day session – cf. rule 4 from Sect. 4.1.3.

In practice, the threshold value in rule 3 of the updated scheduling rules is determined by benchmarking prior to any scheduling of surgeries. We shall elaborate on benchmarking in Sect. 5. As the OR scheduling metric has not been benchmarked or used to plan the historical surgical lists by Waitemata DHB, we instead consider three different thresholds (or potential outcomes from benchmarking) of the OR scheduling metric for our scenario analysis (to provide some insight into the sensitivity of the OR scheduling metric to its threshold value). For each threshold, we classify a surgical list as nicely-booked if it is neither under-booked with too few surgeries nor over-booked with too many surgeries, and use the three categories (under-booked, nicely-booked and over-booked) to describe the booking status of any surgical list. As we shall see later, the choice of threshold, which changes the classification of booking statuses, leads to an observable impact on room outcomes.

In order to determine whether a new surgery (historically scheduled to take place between 1 September 2016 and 31 August 2017) can be allocated to an OR session in each alternative scenario using the OR scheduling metric, we need to calculate the resulting increase in the value of the metric from adding that surgery to the surgical list for the session. To expedite the calculations involved in OR scheduling, we assume that all surgery durations are normally distributed. This assumption allows us to use only the mean \(\mu _{i}\) and the standard deviation \(\sigma _{i}\) of the empirical distribution for scheduling. The parameters \(\mu _{i}\) and \(\sigma _{i}\) are determined using the (approximate) 2.5-year cleaned dataset, i.e., surgeries performed between 1 February 2015 and 31 August 2016, and between 1 September 2017 and 14 May 2018. The following steps are taken for the calculation of \(\mu _{i}\) and \(\sigma _{i}\).

  Step 1. Use all surgeries of the same surgery type and surgeon. The number of surgeries should be at least 5. If not, go to Step 2.

  Step 2. Use all surgeries of the same surgery type only. The number of surgeries should be at least 5. If not, go to Step 3.

  Step 3. Use all surgeries that are related to the identified surgery type. For a surgery type with k descriptors (or procedures), we consider a surgery related to that surgery type if it has \(k-1\) descriptors in common with that type. The number of surgeries should be at least 5. If not, go to Step 4.

  Step 4. Use all surgeries that have the same surgeon’s estimate of surgery duration as that of the identified surgery. Note that this step is required for new surgery types that are not performed previously. For our dataset, the number of identified surgeries always exceeds 5.
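The four steps form a fallback hierarchy, which can be sketched as follows. This is an illustration under stated assumptions: the record layout (dictionaries with hypothetical keys `type`, `surgeon`, `estimate`, `duration`), the use of the sample standard deviation, and the reading of "related" as sharing exactly \(k-1\) descriptors (with \(k \ge 2\)) are ours and are not specified in the text.

```python
from statistics import mean, stdev

MIN_OBS = 5  # minimum number of surgeries required in Steps 1 to 3

def is_related(target_type, other_type):
    # ASSUMPTION: "related" means sharing exactly k - 1 of the k
    # descriptors of the target type; types are frozensets, k >= 2.
    k = len(target_type)
    return k >= 2 and len(target_type & other_type) == k - 1

def duration_parameters(surgery, history):
    """Return (mu, sigma) for a surgery using the Step 1 to 4 fallbacks."""
    filters = [
        # Step 1: same surgery type and same surgeon.
        lambda h: h["type"] == surgery["type"] and h["surgeon"] == surgery["surgeon"],
        # Step 2: same surgery type only.
        lambda h: h["type"] == surgery["type"],
        # Step 3: related surgery types.
        lambda h: is_related(surgery["type"], h["type"]),
    ]
    for keep in filters:
        durations = [h["duration"] for h in history if keep(h)]
        if len(durations) >= MIN_OBS:
            return mean(durations), stdev(durations)
    # Step 4: same surgeon's estimate of the surgery duration.
    durations = [h["duration"] for h in history if h["estimate"] == surgery["estimate"]]
    return mean(durations), stdev(durations)  # ASSUMPTION: sample std. dev.
```

The ordering goes from the most specific to the least specific source of information, so that a surgeon-specific estimate is preferred whenever enough observations exist.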

The implementation for the aforementioned normality assumption for surgery durations is as follows. Suppose that there are \(I-1\) surgeries in a surgical list and the \(I^\text {th}\) surgery is added to this list. Recall that the turnover duration is set to 15 minutes in our earlier simulation. Using Eq. (6) of Sect. 2.2, we have the following expression that approximates the OR scheduling metric for this list:

$$\begin{aligned} \rho (S)&\approx {\left\{ \begin{array}{ll} \dfrac{\sum _{i=1}^{I} \sigma _{i}^{2} }{2 \bigl [ d - 15(I-1) - \sum _{i=1}^{I} \mu _{i} \bigr ]} \\ \qquad\text { if } d - 15(I-1) - \sum _{i=1}^{I} \mu _{i} > 0,\\ \infty \quad\text { otherwise.} \end{array}\right. } \end{aligned}$$
(8)

This approximation eliminates the need to obtain the empirical distribution of the surgical list duration and expedites the calculations for OR scheduling. However, this is at the expense of the accuracy of the OR scheduling metric. We remark that this approximation is not necessary in practice because the metric calculations required on a daily or weekly basis for OR scheduling are far fewer than that required in the simulation (which performs 1 year of OR scheduling within a short period of time). We also emphasise that this approximation is used only for allocating surgeries to sessions in the alternative scenarios, particularly when implementing rule 3 of the updated scheduling rules. After the surgical lists are confirmed, the OR scheduling metric (computed from empirical distributions) is used to evaluate the confirmed lists.
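A minimal sketch of the approximation in Eq. (8), together with the threshold check of rule 3, is given below. The function names and the numbers in the usage lines are hypothetical; only the formula, the 15-minute turnover, the 500-minute limit on the mean duration, and the scenario thresholds come from the text.

```python
import math

TURNOVER_MIN = 15  # turnover duration between consecutive surgeries

def approx_metric(mus, sigmas, d):
    """Approximate rho(S) as in Eq. (8) for a list of I surgeries.

    mus, sigmas: means and standard deviations of the surgery durations
    d: allocated session duration (minutes)
    """
    turnovers = max(len(mus) - 1, 0)
    slack = d - TURNOVER_MIN * turnovers - sum(mus)
    if slack <= 0:
        return math.inf
    return sum(s * s for s in sigmas) / (2.0 * slack)

def may_allocate(mus, sigmas, d, mu_new, sigma_new, threshold):
    """Rule 3: allow the new surgery if the resulting metric <= threshold."""
    if mu_new > 500:
        return False  # handled by rule 4 instead
    return approx_metric(mus + [mu_new], sigmas + [sigma_new], d) <= threshold

# Hypothetical list of two surgeries in a 240-minute session.
rho = approx_metric([60, 90], [10, 20], 240)
```

In this illustrative case the slack is 240 - 15 - 150 = 75 minutes and the summed variance is 500, so the approximation gives 500 / 150, or about 3.33. Adding a third surgery with a mean of 70 minutes makes the slack negative, so the metric becomes infinite and the surgery is rejected under every scenario threshold.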

In the next subsection, we present and discuss the results of the simulated alternative scenarios for OR scheduling.

4.4 Results and discussion

We evaluate the performances of the alternative OR scheduling scenarios in our simulations. For the scheduling of surgeries from the supplementary dataset in the demonstration, we report on the waiting list for surgeries as follows.

  1. For scenario 1, there are 63 unscheduled surgeries. These surgeries have scheduling dates on or after 20 July 2017.

  2. For scenario 2, there are 26 unscheduled surgeries. These surgeries have scheduling dates on or after 7 August 2017.

  3. For scenario 3, all surgeries are successfully scheduled.

For both scenarios 1 and 2 with unscheduled surgeries, we note that these surgeries will be scheduled if the planning horizon of our simulation is extended beyond 31 August 2017, but that more unscheduled surgeries result in a longer waiting list which, in turn, implies reduced patient throughput.

In practice, users could be extremely risk-averse when benchmarking the OR scheduling metric, which would lead to risk-averse OR scheduling – similar to scenario 1 – that will create surgical lists that mostly do not overrun but consequently increase the waiting list for surgeries (i.e., decrease throughput). Clearly, scenario 1 is not desirable, but this is due to the threshold selected for use in OR scheduling metric for scenario 1, not the metric itself. In fact, the successful scheduling of all surgeries in simulated alternative scenario 3 provides evidence that the OR scheduling metric can be used to provide the same throughput as the more traditional approach, indicating that a poorly selected threshold is the cause of the longer waiting list for surgeries in scenario 1. Hence, we emphasise the importance of performing careful benchmarking before using the OR scheduling metric for OR scheduling.

Figure 7 shows that the average waiting times for surgeries are similar in both the simulation of the current OR scheduling approach and simulated alternative scenario 3 that uses the OR scheduling metric with the largest threshold value (of 1000). It is crucial to ensure that the waiting times for surgeries are not sacrificed for the sake of improving the surgical lists and, hence, the room outcomes.

Figure 8 summarises the classification of surgical lists in the demonstration. From the figure, we observe that there is a slight upwards shift in the OR scheduling metric values for the surgical lists. We expect this will result in different room outcomes for the different scenarios. Indeed, as shown in Fig. 9, the room utilisation peak rates for both the half-day and the full-day sessions from the simulated alternative scenario 3 are closer to and do not exceed the optimal value of 1.0. We compare the empirical distributions for utilisation from the simulated base scenario, which performs OR scheduling using a traditional approach, to those from the three scenarios that perform OR scheduling using the OR scheduling metric. Two-sample Kolmogorov-Smirnov tests are performed to determine, for each scenario, if there is evidence to support the alternative hypothesis that the empirical distributions are different. The p-values calculated from the tests are as follows.

  • Scenario 1: \(<10^{-5}\) (half-day), 0.707 (full-day).

  • Scenario 2: \(<10^{-4}\) (half-day), \(1.95 \times 10^{-2}\) (full-day).

  • Scenario 3: \(<10^{-4}\) (half-day), \(6.30 \times 10^{-3}\) (full-day).
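For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. It is not the software used to obtain the p-values above, which is not specified here. The function name `ks_two_sample` is hypothetical, the utilisation samples are invented, and the p-value relies on the asymptotic Kolmogorov distribution, so it is only an approximation for small samples.

```python
import bisect
import math


def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample KS test.

    D is the largest absolute gap between the two empirical distribution
    functions. The p-value uses the asymptotic Kolmogorov series, which is
    an approximation (assumption: samples are large enough for it to apply).
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs are right-continuous step functions, so the largest gap
    # is attained at one of the observed values.
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / n
        fy = bisect.bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0.0:
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


# Invented utilisation rates for a base scenario and an alternative scenario.
base = [0.82, 0.95, 1.10, 1.21, 0.77, 1.05, 0.98, 1.30]
alternative = [0.88, 0.93, 0.97, 0.99, 0.91, 0.96, 0.94, 1.00]
d_stat, p_value = ks_two_sample(base, alternative)
```

A small p-value is read as evidence that the two utilisation distributions differ; the test says nothing about the direction of the difference, which is why the density plots and boxplots are examined alongside it.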

For both the half-day sessions and the full-day sessions in scenario 3, the null hypothesis is rejected in favour of the alternative hypothesis, i.e., there is a difference in the utilisation between the simulated base scenario and the simulated alternative scenario 3. Figure 10 demonstrates that the simulated alternative scenario 3 gives an improvement in the room utilisation rates over both historical and simulated base scenario OR scheduling. For the under-booked full-day sessions (labelled “UB-Full”) in scenario 3 of Fig. 10, we remark that the low room utilisation rates are attributed to a lack of surgeries that can be scheduled to these sessions. This observation is very encouraging as it indicates that the rooms may have the capacity to meet additional demand for surgeries in reality. Finally, the results of Fig. 10 agree with the conclusion of Sect. 3.2 that the OR scheduling metric, which makes inferences about the room utilisation rates, can be a valuable tool for evaluating surgical lists and, hence, OR scheduling.

Fig. 7

A comparison of the average waiting times for simulated base scenario (dotted) and simulated alternative scenarios (bold) for different priority scores: 1 (left), 2 (middle), and 3 (right). Note that an average is not computed if there is at least one surgery that cannot be scheduled in the demonstration for a particular year and month (i.e., “YearMonth”)

Fig. 8

Evaluation of the surgical lists from the sessions in simulated alternative scenarios 1 (top), 2 (middle), and 3 (bottom) that occur within 1 year starting on 1 September 2016. The lists shown in the plot on the left are sorted in chronological order, and the values of the OR scheduling metric are restricted in the plots to \([10^{-4},10^{4}]\). Values below \(10^{-4}\) or above \(10^{4}\) are presented as part of the \(\le -4\) or \(\ge 4\) collection of points. The resulting values for the OR scheduling metric are taken to base-10 log. The chart on the right is the discrete probability density

Fig. 9

A comparison of the empirical density plots of room utilisation rate for surgical lists in the simulated base scenario (dotted) and the simulated alternative scenarios (bold) that correspond to half-day sessions (top) and full-day sessions (bottom). The sessions occur within 1 year starting on 1 September 2016

Fig. 10

Boxplots of room utilisation rates from the historical, simulated base scenario, and simulated alternative scenarios for sessions that occur within 1 year starting on 1 September 2016. The plots are segregated by the type of session (half-day or full-day) and by the booking statuses or values of the OR scheduling metric. The lower threshold for nicely-booked surgical lists is set to 1; the upper threshold is set to 100, 250, and 1000 for simulated alternatives 1, 2, and 3 respectively. The abbreviations for under-, nicely-, and over-booked surgical lists are “UB”, “NB”, and “OB” respectively. A missing boxplot indicates that there are no surgical lists classified under that category

Table 2 shows the key performance indicators for the simulated base scenario (that uses a traditional metric for OR scheduling) and the simulated alternative scenarios (that use the OR scheduling metric). We highlight that scenario 3 maintains patient throughput. It keeps the waiting times of the patients close to those of the simulated base scenario OR scheduling. It utilises half-day sessions less on average, but has significantly fewer session overruns as a result. It utilises full-day sessions more on average, with a small number of extra overruns as a result. In addition, both half-day and full-day session overruns have lower mean durations. Scenario 3 (i.e., using the OR scheduling metric with a higher threshold) therefore appears to be an appealing alternative to the simulated base scenario.

Table 2 Key performance indicators for the simulated base scenario and the simulated alternative scenarios for OR scheduling

The simulations of the three alternative scenarios perform OR scheduling using the OR scheduling metric, while the simulated base scenario uses the estimated upper bounds of surgery durations (see Sect. 4.1.3). The estimated upper bounds may be viewed as values from benchmarking an expectation-based measure (e.g., the expected duration for completing at least a certain percentage of surgeries from a particular surgery type, with the benchmarking used to select an appropriate value for this percentage). This form of expectation-based measure is also considered by Zhou and Dexter (1998) for the purpose of predicting whether a surgery will probably be completed within a specified time period. In the case of our surgery dataset, the benchmarked values are implicit and dependent on the booking clerks. After comparing the surgical lists from the three alternative OR scheduling scenarios with the simulated base scenario surgical lists, we conclude that scenario 3 enhances the credibility of the OR scheduling metric, which appears to be superior to this particular expectation-based, benchmarked measure. In general, there is no single standardised set of benchmarked measures that will work for every hospital. Indeed, Guerriero and Guido (2011) state that even though hospital managers may aim to maximise OR utilisation rates, it is neither easy to define the optimum utilisation rate nor clear how to establish the trade-offs required to achieve this optimal rate.

In summary, we have built a simulation that is a reasonable replica of the actual scheduling processes. The simulation takes the historical elective surgeries as inputs. For the scenario analysis, we change only the scheduling rules of the validated simulation such that the scheduling decisions are based on the (approximated) OR scheduling metric instead of the surgeons’ estimates. The results of the scenario analysis demonstrate improvements in the actual room utilisation rates while maintaining the patient throughput in simulated alternative scenario 3 when an appropriately benchmarked OR scheduling metric is used.

In the next section, we discuss how the OR scheduling metric may be benchmarked with the aid of historical surgical lists in practice by hospitals.

5 Benchmarking the OR scheduling metric

Recall that three potential thresholds for the OR scheduling metric are considered in the alternative scenarios of the simulation (Sect. 4.3). In practice, the OR scheduling metric \(\rho (S)\) should be carefully benchmarked, e.g., against acceptable risks of session overrun, before it can be implemented in the planning of surgical lists.

The process of identifying or fine-tuning a suitable OR scheduling metric benchmark is iterative. To summarise, the key steps for benchmarking using historical surgical lists are as follows.

  Step 1. Determine the distribution of event duration for each event in a surgical list.

  Step 2. Compute the value of OR scheduling metric \(\rho (S)\) for each surgical list.

  Step 3. Construct a frequency table of \(\rho (S)\) and/or a scatter plot (see footnote 4) of \(\log _{10}\rho (S)\) for the historical surgical lists.

  Step 4. Propose candidate OR scheduling metric threshold(s).

  Step 5. Evaluate each candidate threshold by examining boxplots of the actual historical room utilisation rates. Other types of evaluation may be included, such as relating the OR scheduling metric with other performance measures (see Sect. 2.4), adherence to users’ requirements, scenario analysis using simulation (see Sect. 4.3), etc.

  Step 6. Make a decision on whether to adopt a candidate threshold. Stop if a candidate threshold is accepted to be used as the benchmark. Otherwise, construct a new candidate threshold and/or fine-tune an existing candidate threshold. Go to Step 5.
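Step 3 can be sketched as follows, assuming that the values of \(\rho (S)\) from Step 2 are already available. The decade-wide bins, the function names `decade_bin` and `frequency_table`, and the sample values are assumptions made for illustration; the clipping of extreme values to the range \([10^{-4},10^{4}]\) mirrors the convention used in the scatter plots of this paper.

```python
import math
from collections import Counter


def decade_bin(rho, lo=-4, hi=4):
    """Return k such that 10**k <= rho < 10**(k + 1).

    Values below 10**lo (including non-positive values) are pooled into
    bin lo - 1, and values at or above 10**hi (including infinity) are
    pooled into bin hi.
    """
    if rho <= 0:
        return lo - 1
    if math.isinf(rho):
        return hi
    k = math.floor(math.log10(rho))
    return max(lo - 1, min(hi, k))


def frequency_table(rhos):
    """Count surgical lists per decade bin, ordered from low to high."""
    counts = Counter(decade_bin(r) for r in rhos)
    return dict(sorted(counts.items()))


# Invented metric values for seven historical surgical lists.
historical_rhos = [0.05, 2.5, 40.0, 45.0, 730.0, math.inf, 1e-7]
table = frequency_table(historical_rhos)
```

Candidate thresholds for Step 4 can then be proposed by inspecting where most of the historical lists fall in such a table.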

We recognise that the notion of acceptable (or unacceptable) risks, which should be determined prior to any OR scheduling, could differ across users of the OR scheduling metric and may further be influenced by the presence of external factors that are still accounted for by the random variable S, such as an elevated risk of overrun when an experienced surgeon is involved with mentoring during certain surgeries. However, these are matters of users’ preferences that will be present regardless of the performance measure(s) used. In other words, it is not a limitation that is specific to the OR scheduling metric.

Henceforth, for simplicity, we shall take the position that all users at a hospital agree to adopt common thresholds for acceptable risks that are based solely on the results of benchmarking against all possible values of \(\rho (S)\), or equivalently, the outcomes representing all possible risks. Assuming that the preferences do not change over time, the thresholds chosen by users should not vary across surgical lists or when there are significant changes to surgeries (such as the use of new medical equipment) that affect surgery durations and, eventually, S. Both the position on common thresholds and the assumption that the preferences (informing these thresholds) do not change are permissible because of the monotonicity property of the OR scheduling metric, which implies the statement “If Pr(\(S_{1} = S_{2}\)) = 1 for \(S_{1}, S_{2} \in \mathbb {S}\), then \(\rho (S_{1}) = \rho (S_{2})\).” The reliability of the OR scheduling metric is guaranteed as identical surgical lists evaluated using this metric will always yield the same result, so fixed thresholds lead to impartial list evaluation outcomes.

In the remainder of this section, we illustrate the use of historical data to facilitate the benchmarking of the OR scheduling metric. For this illustration to be valid, it must be assumed that these historical surgical lists were determined to be desirable by booking clerks during surgery planning, so that the resulting benchmark from historical data is aligned with the hospital’s requirements.

We consider the use of our historical surgical lists from 1 February 2015 to 31 August 2016 to benchmark the OR scheduling metric. For these surgical lists, Fig. 11 shows the values of \(\log _{10}\rho (S)\) computed both using the steps described in Sect. 3.1 and under the assumption that empirical distributions of durations represent the population distributions adequately.

Fig. 11

Evaluation of surgical lists using the surgical list durations provided in the raw dataset. The values of the OR scheduling metric are restricted in the plots to \([10^{-4},10^{4}]\). Values below \(10^{-4}\) or above \(10^{4}\) are presented as part of the \(\le -4\) or \(\ge 4\) collection of points. The resulting values for the OR scheduling metric are taken to base-10 log. The lists are sorted in chronological order from 1 February 2015 to 31 August 2016. The chart on the right is the discrete probability density

Figure 11 is the result of performing Steps 1 to 3 of the aforementioned benchmarking process using historical surgical lists from 1 February 2015 to 31 August 2016. The same information is shown in Table 3 along with the thresholds selected for simulated alternative scenarios 1-3 (see Sect. 4.3). Based on the frequency of occurrences for each interval, users can identify several candidate thresholds which they believe will lead to desirable surgical lists. The three candidate thresholds used for the scenarios are presented in Table 3 as an illustration.

Table 3 Frequency table for the values of \(\rho (S)\) corresponding to the historical surgical lists from 1 February 2015 to 31 August 2016 as shown in the first two columns. The number of surgical lists classified as under-booked, nicely-booked or over-booked based on a threshold is shown in the last three columns. These thresholds categorise nicely-booked surgical lists using \(0.1 \le \rho (S) < 30\) (third column), \(1 \le \rho (S) < 100\) (fourth column), and \(10 \le \rho (S) < 500\) (last column). Outside the range of values of \(\rho (S)\) for nicely-booked surgical lists, surgical lists with low (resp. high) values of \(\rho (S)\) are categorised as under-booked (resp. over-booked)

Users may also examine boxplots of the actual historical room utilisation rates for each booking status in order to evaluate potential scheduling policies (for adding surgeries to surgical lists). Figure 12 gives an example of such boxplots for the three candidate threshold scenarios that were simulated in Sect. 4.3 and shown in Table 3. By comparing the boxplots, particularly the nicely-booked surgical lists which should reflect surgical lists desired by users, it is clear that scenario 1 is more risk-averse than scenario 2; each of the two scenarios is in turn more risk-averse than scenario 3. The choice of (scheduling) policy depends on the level of session overrun risk users are willing to accept, user requirements (e.g., meeting a target set by the hospital), user preferences (e.g., willingness to work overtime) as well as the situation “on the ground” (e.g., length of waiting lists).
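The classification that underlies such boxplots can be sketched as below. The three ranges are the candidate thresholds stated in the caption of Table 3; the function names, the use of the median as a stand-in for a full boxplot, and the sample data are assumptions made for illustration.

```python
from statistics import median

# Ranges of rho(S) treated as nicely-booked, from the caption of Table 3.
THRESHOLDS = {"T1": (0.1, 30.0), "T2": (1.0, 100.0), "T3": (10.0, 500.0)}


def booking_status(rho, lower, upper):
    """Classify a list as under- (UB), nicely- (NB) or over-booked (OB)."""
    if rho < lower:
        return "UB"
    if rho < upper:
        return "NB"
    return "OB"


def median_utilisation_by_status(lists, lower, upper):
    """Group actual utilisation rates by booking status and take medians.

    lists: iterable of (rho, actual_utilisation) pairs, one per surgical list.
    """
    groups = {}
    for rho, utilisation in lists:
        groups.setdefault(booking_status(rho, lower, upper), []).append(utilisation)
    return {status: median(values) for status, values in groups.items()}


# Invented (rho, actual utilisation) pairs for four historical lists.
history = [(0.05, 0.60), (5.0, 0.90), (20.0, 1.00), (200.0, 1.20)]
summary = {name: median_utilisation_by_status(history, lo, hi)
           for name, (lo, hi) in THRESHOLDS.items()}
```

Comparing the nicely-booked groups across the candidate thresholds shows how raising the range admits lists with higher utilisation, and hence a higher risk of overrun.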

Fig. 12

Boxplots of the actual room utilisation rates using the historical surgical lists from 1 February 2015 to 31 August 2016. The plots are segregated by the type of session: half-day (left) and full-day (right), and by the booking statuses of surgical lists from three different thresholds categorising nicely-booked surgical lists as \(0.1 \le \rho (S) < 30\) (“T1”), \(1 \le \rho (S) < 100\) (“T2”), and \(10 \le \rho (S) < 500\) (“T3”). Outside the range of values of \(\rho (S)\) for nicely-booked surgical lists of each threshold, surgical lists with low (resp. high) values of \(\rho (S)\) are categorised as under-booked (resp. over-booked). The ideal actual room utilisation rate for a session is 1.0

We conclude this section by remarking that simulations, such as those performed in Sect. 4.3, may also be used to comprehensively evaluate candidate thresholds for the OR scheduling metric. The alternative scenarios in Sect. 4.3 study the three potential thresholds we have also explored in this section. The observations from Sect. 4.4 align with the observations from benchmarking, i.e., that the scenarios are progressively less risk-averse. Using simulation within benchmarking to select candidate thresholds will enable these thresholds to be evaluated by considering the performance of the metric over an extended period of time. This is our preferred approach for evaluating thresholds, if such a simulation is available, as it enables the longitudinal effects of the metric to be observed, e.g., its effect on the volume of surgeries during that time period. However, if no simulation is available, then the process described in this section is sufficient.

6 Conclusion

The OR scheduling metric is a single measure for evaluating surgical lists prior to their realisations. This measure is presented as an alternative to the probability- and/or expectation-based measure(s) currently used to evaluate surgical lists. It is suitable as a replacement for such measures within OR scheduling. The use of the OR scheduling metric also simplifies the analysis of the evaluations. We further motivate its utility by presenting the desirable properties of the OR scheduling metric in Sect. 2.3. Even though the OR scheduling metric lacks a physical interpretation, its meaning may be interpreted from the comparison against values from probability- and/or expectation-based measure(s) (see Sect. 2.4).

In Sect. 2.1, we formally define the OR scheduling metric for a surgical list. A closed-form expression can be obtained from this definition and the use of moment-generating functions when the population distributions are normal. These expressions are presented in Sect. 2.2 under different assumptions of the turnovers. If the normality requirement is not satisfied, then a closed-form expression may not exist. For example, moment-generating functions are not defined for heavy-tailed distributions such as the log-normal distribution (Heyde 1963). However, the OR scheduling metric can be easily computed using empirical distributions even when no closed-form expression exists.

From our analysis of a historical surgical dataset (in Sect. 3) and the simulation outputs (in Sect. 4), we observe that the OR scheduling metric, which measures the risks of surgical list overruns, can be used to make inferences on the actual room utilisation rates at the time of surgery planning. Subsequently, after we validate our simulation for OR scheduling, we demonstrate via simulation of alternative scenarios for OR scheduling that desirable changes to the actual room utilisation rates are possible when the OR scheduling metric is used as a decision tool within OR scheduling. This is in spite of the approximation used in the simulated alternatives, where we assume normality of surgery durations to expedite calculations that involve the OR scheduling metric. Even though the approximated metric may not perform as well as the metric computed from the empirical distributions of surgery durations, the simulation results, in particular those of simulated alternative scenario 3, are still very promising. For these reasons, we strongly recommend the use of the OR scheduling metric for OR scheduling and to evaluate surgical lists in hospitals. The OR scheduling metric can be used to avoid and/or minimise over-booked surgical lists, provided that the OR scheduling metric is benchmarked effectively prior to its use.

In adopting the OR scheduling metric, it is assumed that a hospital has the necessary data collected from recent surgeries. From this historical data, OR schedulers can decide on an initial threshold using the steps in Sect. 5. Extra time may be needed to determine the most appropriate threshold if the historical data either does not indicate desirable outcomes or does not clearly show the desirability of one threshold. In these cases, a further study involving simulation is recommended. A periodic re-evaluation of the chosen threshold is subsequently required to ensure that changes in users’ requirements and preferences are accounted for, but this should not be time-consuming given that healthcare organisations have already been through the benchmarking process and should, as a consequence, have existing data analytics tools to support future iterations of benchmarking. We recommend periodic benchmarking to align with the master surgical schedule (MSS) planning that determines the allocated session duration. The possibility of any overrun duration that may be acceptable to a surgeon may be considered at the same time (see Sect. 3.3 with the use of \(\Delta\) in the modified OR scheduling metric).

With regard to OR session under-runs, the OR scheduling metric is able to determine that surgical lists are under-booked and hence predict whether sessions will under-run. This situation happens when there is a high number of planned OR sessions relative to a low demand for surgeries that can be scheduled to these OR sessions. As the OR scheduling metric does not have control over the number of planned OR sessions and the demand for surgeries (these are inputs for calculating the values of the OR scheduling metric), the metric is generally unable to avoid under-booking of surgical lists during OR scheduling. The use of optimisation with the metric may alleviate under-booking as the optimal solution can balance values of the OR scheduling metric across different surgical lists (e.g., by setting appropriate lower bounds in the formulation of an optimisation problem). However, such formulations can also be infeasible, particularly when there is far less demand for surgeries relative to the number of planned OR sessions.

We also remark that booking clerks may use the OR scheduling metric as a calculator during OR scheduling, and possibly within an interactive “drag-and-drop” interface where each modification to a surgical list will immediately provide feedback on the risk of a list overrun (both probability of overrun and expected duration of an overrun) based on the resulting modified value of the OR scheduling metric. As booking clerks have been performing calculations using probability- and/or expectation-based measures, their workflows are not anticipated to change significantly with the introduction of the OR scheduling metric, either as an additional calculator or a replacement calculator (the OR scheduling metric can be used on its own).

More research is required to test the OR scheduling metric extensively for general use in hospitals, but we expect similar results due to the desirable properties of this metric. Further research on the OR scheduling metric (to improve the matching of users’ needs) may study the possibility of including other measures such as the projected financial costs of room underrun and/or overrun, but any revision of the metric should not compromise on its desirable properties (see Sect. 2.3).