Can statisticians beat surgeons at the planning of operations?

The planning of operations in the Academic Medical Center is primarily based on the surgeons' assessments of the length of the operation. We investigate whether duration models that employ the information available at the moment the planning is made offer a better alternative. We conclude that substantial cost reductions can be achieved by employing statistical techniques. This does not imply that the surgeons' predictions do not contain valuable information: this information is a key explanatory variable in our statistical models. What our conclusion does entail is that the predictions of surgeons can be corrected, because surgeons often underestimate the actual length of operations.

All ML-routines used in this paper are either performed by using standard routines from Stata or are carried out using R (free software, for information see http://www.r-project.org/).

Corresponding author. Full address: Department of Quantitative Economics, Faculty of Economics and Econometrics, University of Amsterdam, Roetersstraat 11, 1018 WB Amsterdam, The Netherlands. Email: j.c.m.vanophem@uva.nl. Phone: +31 20 5254222. Fax: +31 20 5254349.


Introduction
Health care expenditures in western economies appear to be ever rising and are becoming a growing concern for both governments and residents. The burden of covering the costs calls on all the inventiveness of policy makers to come up with new ideas intended to decrease the rate of growth of these expenditures or, even better, to reduce them. The reason for the growing consumption of health care is threefold (cf. Chiappori et al (1998), Okunade and Murthy (2002), and Bago d'Uva and Jones (2009) for more thorough discussions): (1) the demographic shift towards the geriatric age groups, (2) the ongoing development in medical care technology, and (3) the existence of large-scale health insurance schemes. Governments have trouble influencing the first two causes, simply because these are largely out of their control or because intervening is unpopular for electoral reasons. In countries with a publicly provided or financed health care system or insurance, governments have some direct control and many attempts have been undertaken to influence the growth rate of health care expenditures. Bago d'Uva and Jones (2009) give an extensive overview of the different methods European governments have used to regulate the demand for health care in order to slow down or even reduce health costs. Influencing the costs through the supply side usually takes the form of increasing efficiency, cf. van Houdenhoven et al (2007) and Wullink et al (2007).
In this paper we will investigate whether it is possible to improve the efficiency of the planning of surgical operations at the Academic Medical Center (AMC) in Amsterdam, The Netherlands. In the present situation and in most hospitals, surgeons determine this planning to a large extent, cf. Dexter et al (2007) and Eijkemans et al (2010). They estimate the expected duration of an operation and based on this information the planning of the operating room (OR) is made. A question that should be raised is whether surgeons make their estimates in the interest of the hospital or on the basis of their own interests.
At the AMC, a large academic hospital in the Netherlands with 1200 beds and a budget of €728 million (2007), over 55,000 surgical operations were carried out in 2007 (Annual Accounts, 2007). The costs involved with operations are high. For example, according to a study by Macario et al (1995), OR costs make up around 33 percent of the Stanford University Medical Center budget. Improvements in the planning of operations might therefore have a substantial impact on the reduction of costs.
The difficulty of OR planning is balancing between schedules that are too wide and schedules that are too tight, while the duration of individual procedures listed in a schedule is often highly volatile and uncertain. If the planning is too wide there is a risk of empty OR time in between operations or at the end of the day. On the other hand, if the planning is too tight, OR cases will often cause overtime of OR personnel or even cancellations.
Cancellations have to be avoided as much as possible in order to maintain a good level of patient satisfaction. On the other hand, the option to let the OR run overtime instead of canceling cases is costly and unpopular with OR personnel. Currently, the amount of overtime and cancellation of operations at the end of the day are a large problem at the AMC.
Approximately 36% of programs ran late and the resulting average overtime was around 50 minutes (Benchmarking OR, 2008). Only 4% of programs finished on time. It is for these reasons that OR management at the AMC seeks to improve the accuracy of daily OR planning, and there appears to be plenty of scope for improvement.
More accurate prediction of individual OR case durations is one of the ways to reduce the current size of the problem of overtime and cancellation of operations. Here an OR case is defined as all that happens between entrance and exit of the OR by a patient. Generally, it consists of a pre-incision period for anesthesia induction and surgical preparations, the surgical procedure (possibly multiple) itself and the postsurgical period for anesthesia 'deduction'. At most departments of the AMC, surgeons currently predict the duration of an operation at the intake of a patient based on their experience and preferences. The first element as such is no problem but the second might be driven by self-interest.
Unfortunately, the surgeons' estimates of the case duration are not very accurate. For example, 18% of the ophthalmologic cases carried out in the AMC between 2003 and 2008 finished more than 15 minutes early and 34% finished more than 15 minutes later than planned. For other clinical specialties with longer procedures, these numbers are even larger.
Since 2008, pilots have been running at the Neurosurgery and Gynecology departments in which the historical averages per procedure per surgeon are used in addition to the personal predictions of surgeons. Previous investigation by Dexter et al (2007) indicates however that the historical average is unlikely to predict the variation in duration better than current predictions.
In our investigation we will predict the duration of operations on the basis of a number of different hazard models and we will compare the results with the predictions provided by surgeons. The predictions will be made on the basis of the ex ante information available, including the estimate of the duration by the surgeon. As such, using more complex statistical techniques is not a new idea, but thus far only the lognormal regression model appears to have been employed (Strum et al (2000a), Strum et al (2000b) and Eijkemans et al (2010)). Here we will use the exponential model, the Weibull model, the loglogistic model, the Burr or Weibull-Gamma mixture model, the generalized Gamma model and the piecewise-constant hazard model as well.
We have data available on all ophthalmologic, neurosurgical and gynecologic operations performed in the last twenty years in the AMC. Because the registration of case characteristics became more complete in 2003, only data from 2003 onwards are used. The remaining period is divided into a 'historical' or 'estimation' period (2003-2007), which is used for the estimation of the econometric models, and a 'prediction' period (January-November 2008). The performance of the different prediction methods is compared within this out-of-sample prediction period. Not only do we consider prediction at the individual case level, we also investigate the performance of the different prediction techniques in terms of overtime, undertime and the number of cancellations for all the operations in the prediction period. [3]

In the next section the general problem of efficient OR planning and the relation with prediction of OR case duration is explained in more detail. Also, some relevant literature on prediction of individual case duration is reviewed. In section 3, we briefly discuss the statistical estimation methods and how the performance of the different methods will be evaluated. Section 4 contains a description of the available data and section 5 presents the empirical results. The conclusions are listed in section 6.

The planning of operations
A daily OR program consists of elective cases and ambulatory cases. In this paper we define elective cases as all those cases that can be planned up to 10.30 am the day before, when the final planning has to be ready for the next day. Ambulatory cases are all cases coming through after that time. For some specialties of the hospital, like general surgery, there are separate emergency rooms for ambulatory cases and these cases do not disturb regular planning. For other specialties however, like Ophthalmology, where cases are usually less urgent, there is no separate emergency room. For this last category of specialties, planning of elective cases is likely to be disturbed and delayed by the ambulatory cases coming through. Usually planners account for the possibility of ambulatory cases by leaving some spare time at the end of a daily program (see figure 1). For this reason we will ignore ambulatory cases. On top of that, for ambulatory cases no expected duration of the operation is recorded. Even though we do not consider ambulatory cases, a completely accurate planning of the OR capacity is impossible due to randomness or unpredictable variability in case duration. For example, unforeseen complications can occur during the surgical procedure. Moreover, the unpredictability of case durations is worse than average for the AMC, due to the academic nature of the hospital, which attracts relatively many of the more rare or complex cases.

[3] We use the term undertime as the counterpart of overtime. Undertime, resulting in an underemployment of an operating room, will be considered to be a negative attribute by hospital managers, just like overtime and cancellations.
Because of the impossibility of completely accurate planning, optimal planning of OR capacity is a matter of balancing between several interrelated interests for the AMC. On the one hand, the hospital is reluctant to plan too tightly or 'offensively', with the consequence that programs are likely to be delayed. As mentioned in the introduction, this means that either cases have to be canceled [4] at the end of the program or that the OR runs overtime. The first result conflicts with the wish of the hospital to satisfy patients and the second result is not only costly but also unpopular among personnel. These problems can be avoided by leaving enough empty space at the end of the program, called 'slack', or by wide or 'defensive' planning (see figure 1), but it is not hard to imagine that planning too defensively is not efficient either. If a case finishes earlier than planned, the next patient has to be prepared in advance in order to continue operating. Assuming that a patient is waiting in the preoperative waiting room no more than half an hour before he or she is scheduled to be operated on, it is likely that no patient is ready to be operated on after several cases have finished earlier than planned. In this case precious OR time is wasted while personnel wait for the next patient. More important even, if the entire program finishes earlier than planned, then there is almost certainly no patient at hand to fill the space remaining at the end of the day. So on the other side of the coin is the risk of planning too defensively and not fully exploiting the OR capacity in between operations or at the end of the day.
Most specialties within the AMC currently tend to plan offensively. This explains the numbers presented in the introduction: 36% of programs ran late and the average overtime resulting was around 50 minutes (Benchmarking OR, 2008).
There are several ways to improve OR efficiency. A first way aims at reducing OR case duration by planning 'straights' of the same procedures. The idea is that surgeons or their assistants gain skill during the straight, resulting in reduced duration per case. This solution would have the positive effect that more procedures can be carried out on a daily basis, but it does not directly address the problem of unpredictable variability in program duration.

A third method to increase OR efficiency is to allow operating schedules to be more flexible. In the AMC the available OR time of a specific department is subdivided among individual surgeons at the beginning of the year and this subdivision is more or less fixed. For example, a surgeon always operates on Monday and Wednesday morning. More flexible schedules could improve daily and weekly planning because planners would be less constrained in finding the optimal daily portfolio of procedures.

[4] In the AMC delays lead to cancellation of operations if the last operation(s) planned cannot be started before 4 pm, the deadline to initiate a non-ambulatory case.
Finally there is the solution of more accurate prediction of individual case duration, which is the central issue of this paper. This solution would first of all reduce the risk of individual cases finishing earlier or later than planned. Additionally, it is likely to reduce the risk, or variability, in an entire daily program as well. This second effect would mean that less final slack is required in daily programs and therefore that the OR can be used more efficiently without an increased risk of overtime and cancellations.
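The trade-off between overtime, undertime and cancellations described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `evaluate_program`, the regular session length and the translation of the 4 pm start deadline into minutes after an assumed 08:00 program start are all hypothetical and not taken from the AMC planning system.

```python
from dataclasses import dataclass


@dataclass
class ProgramOutcome:
    finish_min: float     # minutes after program start when the last performed case ends
    overtime_min: float   # minutes worked beyond the regular session
    undertime_min: float  # unused minutes of the regular session
    cancelled: int        # number of cases that could not be started in time


def evaluate_program(actual_durations, session_length, last_start):
    """Run cases back to back; cancel any case that cannot start before last_start.

    All times are in minutes after the (assumed) start of the program.
    session_length and last_start are hypothetical inputs chosen by the user.
    """
    clock = 0.0
    cancelled = 0
    for duration in actual_durations:
        if clock >= last_start:
            cancelled += 1  # deadline to initiate a case has passed
            continue
        clock += duration
    return ProgramOutcome(
        finish_min=clock,
        overtime_min=max(0.0, clock - session_length),
        undertime_min=max(0.0, session_length - clock),
        cancelled=cancelled,
    )


# Illustrative day: assumed start at 08:00, so 4 pm corresponds to minute 480.
tight_day = evaluate_program([200, 200, 100, 60], session_length=480, last_start=480)
wide_day = evaluate_program([100, 100], session_length=480, last_start=480)
```

In this sketch the tight program ends in overtime and one cancellation, while the wide program leaves most of the session unused, which mirrors the two risks the planners have to balance.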
Currently there are two different methods to predict OR case durations at the AMC: prediction by surgeons and prediction using historical averages. The first method was used by Ophthalmology, Gynecology and Neurosurgery, and is based solely on the experience and preferences of surgeons. For Ophthalmology, the surgeon writes an estimate of the duration of surgery on the intake form of a patient, accompanied by a code for the most important surgical procedure. This estimate is supplemented by the planners of the department with a fixed amount of time for local or total anesthesia to determine the planned duration of an entire case. In 2008, the ophthalmologic surgeons underpredicted the case duration by less than 3 percent on average. The Ophthalmology department has neither an explicitly defensive nor an explicitly offensive planning strategy. The 'imprecision' of planning, measured as the average absolute difference between planned and actual duration, was nearly 29 percent however. Over all departments, most surgeons seem to underpredict case duration to avoid idle OR time, resulting in offensive planning. Apart from an average tendency of underprediction of 17 percent AMC wide, predictions are generally imprecise, with an average absolute difference between planned and actual duration of 36 percent (Benchmarking OR, 2008).
In 2008, the Gynecology and Neurosurgery departments started to plan OR cases using the historical average of the last ten 'similar' cases conducted by the same first surgeon as well. Here an historical case is regarded as similar if the main procedure that characterizes the newly accepted case was at least performed within the historical case. Whether additional procedures are carried out (or other specialties operated simultaneously) does not matter for regarding the case as similar. Since multiple procedures within a case occur quite frequently, approximately 25 percent of neurosurgery cases for example, it is evident that this method of estimation is often quite inaccurate. However, the historical average is only meant as a guiding figure. Ultimately surgeons and planners still decide on the actual time to be reserved for a case. Both Gynecology and Neurosurgery seem to have benefited from the new planning method because the inaccuracy of planning was approximately 16% lower in 2008 than in the five years prior to 2008.
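The historical-average rule described above (the last ten 'similar' cases of the same first surgeon, where a case is similar if it contains the main procedure) can be sketched as follows. This is a minimal illustration, not the AMC implementation: the record layout with keys `surgeon`, `procedures` and `duration`, and the function name, are assumptions made for the example.

```python
def historical_average(history, surgeon, main_procedure, n_last=10):
    """Average duration of the last n_last similar cases by the same first surgeon.

    history is assumed to be ordered from oldest to newest; each record is a
    dict with the hypothetical keys 'surgeon', 'procedures' and 'duration'.
    A case counts as similar if the main procedure was performed within it,
    regardless of any additional procedures.
    """
    similar = [
        case["duration"]
        for case in history
        if case["surgeon"] == surgeon and main_procedure in case["procedures"]
    ]
    recent = similar[-n_last:]
    if not recent:
        return None  # no comparable history available
    return sum(recent) / len(recent)


# Illustrative records with made-up procedure codes and durations (minutes).
history = [
    {"surgeon": "A", "procedures": ["P1"], "duration": 60},
    {"surgeon": "A", "procedures": ["P1", "P2"], "duration": 90},
    {"surgeon": "B", "procedures": ["P1"], "duration": 45},
    {"surgeon": "A", "procedures": ["P3"], "duration": 120},
]
estimate = historical_average(history, surgeon="A", main_procedure="P1")
```

The second record shows why the rule can be inaccurate: a case with an extra procedure enters the average for the main procedure with its full duration.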
The inaccuracy of prediction of OR case duration on the basis of the experience of surgeons or anesthesiologists or historical averages is discussed in Dexter et al (2007). They show that although using historical averages probably reduces underestimation of OR case duration, the larger problem of imprecision remains. In the literature a number of alternative (statistical) methods have been suggested to predict OR case duration more accurately. The statistical distribution of the duration of surgery was investigated as early as 1963, when Rossiter and Reynolds (1963) noted that the distribution of the duration of surgery appears to fit a lognormal distribution well. An improvement of this method can be achieved by subdividing the data into more homogeneous subgroups (Dexter and Zhou (1998)). In Strum et al (2000a) the emphasis is on the appropriateness of the lognormal model (compared to the normal model) to describe case duration. The comparison is made per category, with categories defined by Current Procedural Terminology (CPT) code and anesthesia type (general, local, monitored or total). They use a Friedman test to compare goodness-of-fit of the normal and the lognormal model and find that the lognormal model is preferable in 93 percent of cases.
According to the authors, rejection of the lognormal model occurs if the subsample size is large, if short procedure times are rounded or in case of outliers. The lesson of Strum et al (2000a) is not, however, that the lognormal model is the most appropriate model overall to describe the distribution of case duration. In fact, this question has received little attention in the literature and is therefore the most important topic of this paper.
In Strum et al (2003) earlier findings were supplemented by comparing the normal and the lognormal model for cases consisting of exactly two procedures, resulting in even higher preference of the lognormal model. Like in Strum et al (2003) and Eijkemans et al (2010), discussed below, cases with multiple procedures occur in the dataset of our investigation as well.
In Eijkemans et al (2010) a comparison is made between prediction of surgical duration by surgeons on the basis of historical averages and prediction on the basis of a lognormal regression model. The authors use five basic groups of regressors: operation characteristics, e.g. type of surgical procedure, session characteristics, e.g. the number of procedures, team characteristics such as experience of the team, patient characteristics such as age and Body Mass Index (BMI) and other characteristics such as the estimate of duration by the surgeon (without knowledge of an historical average). They find all categories except patient characteristics to contribute a considerable amount to the explanatory power of the model. Adding all explanatory variables significant at 30% they find an adjusted R-squared of 0.796. More importantly, the authors report a reduction in over-and underprediction of case duration by 19% and 17% respectively. Whereas Eijkemans et al (2010) applies only a lognormal regression model, they have more information on cases and therefore potential explanatory factors. In our investigation we apply several other methods, but less information is available from the information system. Also we have fewer observations available.
In the papers of Dexter and Zhou (1998), Strum et al (2000a) and Strum et al (2000b) it was identified that procedure, surgeon and anesthesia seem to be statistically significant explanatory factors for the duration of OR cases. Strum et al (2000b) and Strum et al (2003) estimate a lognormal regression model that they call 'aggregate' for the entire set of cases, in addition to fitting two-parameter lognormal or 'individual' models to subclasses of the data.
In addition to CPT code and anesthesia technique, they consider the age of the patient, a variable indicating physical status (ASA), emergency and surgical specialty category as explanatory variables. They do not identify any of the additional factors as statistically significant determinants of variability in duration when comparing differences in duration after tabulation with respect to these variables.
In Dexter et al (2008) a summary of articles is provided on explanatory factors for case duration. In this study first of all they explain differences in components of case duration by different medical conditions, different anatomic procedures used for the same medical condition and different approaches to achieve the same anatomic result. They too find that for prediction on the basis of the scheduled procedure(s), the operating personnel and anesthetic(s) considerable inaccuracy remains. Therefore they have searched for studies that use information from outside OR information systems such as medical records of surgeons, radiology pictures and patient demographics. They find little evidence however of these alternative explanatory factors significantly contributing to increased accuracy in prediction.

Statistical methods
The variable of interest is the duration of an operation. The natural method of analysis of durations is hazard models. Lancaster (1990) and Cameron and Trivedi (2005) give an extensive overview of these models. Since our objective is not so much the understanding of the contributing factors to the duration of operations but to get optimal predictions of the duration and since there are no clues to which model to use, we will apply a broad range of hazard models and simply evaluate important sample statistics to see what hazard model is the optimal one and whether we can outperform the predictions of surgeons. As stated before we will estimate the model on part of the available data (about 80% of the data) and make predictions on the remaining part (about 20% of the data). We will consider the following duration models:
• the exponential hazard model
• the Weibull hazard model
• the log-logistic hazard model
• the Burr hazard model
• the lognormal hazard model
• the generalized gamma hazard model
• the piecewise constant hazard model

The Burr-hazard model is a 'mixture' model and contains a number of the other models listed above. Originally the Burr stems from allowing for a gamma distributed unobserved heterogeneity in the Weibull model. The Weibull hazard belongs to the class of proportional hazard specifications and this means that the hazard function can be written as:

$$\lambda(t; x_i, \theta) = \lambda_0(t; \psi)\,\phi(x_i, \beta) \qquad (1)$$

where $t$ denotes the duration, $x_i$ is a vector of explanatory variables and $\theta = (\psi, \beta)$ are unknown parameters. The usual choice for the specification of $\phi(x_i, \beta)$ is $\exp(\beta' x_i)$. Allowing for unobserved heterogeneity means that an error is added to this last specification:

$$\phi(x_i, \beta) = \exp(\beta' x_i + \varepsilon_i)$$

Under the assumption of gamma-distributed unobserved heterogeneity and using the Weibull hazard, the Burr hazard model results. The cumulative distribution function of the Burr is

$$F(t) = 1 - \left[1 + \sigma^2 \exp(\beta' x_i)\, t^{\alpha}\right]^{-1/\sigma^2}$$

where $\alpha > 0$. $\sigma^2$ reflects the variance of the unobserved heterogeneity term. This distribution function contains as special cases the Weibull distribution for $\sigma^2 \to 0$ and the exponential distribution by setting also $\alpha = 1$. The log-logistic distribution is yet another special case that can be obtained by putting $\sigma^2 = 1$.

The lognormal hazard model has already been applied by Strum et al (2000b) and Eijkemans et al (2010). It assumes that the natural logarithm of duration is normally distributed with mean $\beta' x_i$ and variance $\sigma^2$. The model is most intuitively presented as a linear regression model:

$$\ln t_i = \beta' x_i + u_i$$

where $u_i$ is normally distributed with mean 0 and variance $\sigma^2$. This model can be estimated with OLS and this might explain the popularity of this model in the literature.
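As an illustration of how the lognormal regression model turns into a duration prediction, the sketch below fits the model by OLS with a single regressor, the logarithm of the surgeon's planned duration, and then predicts either the median or the mean duration. The restriction to one regressor, the function names and the example numbers are assumptions made to keep the sketch short; the models in this paper use more explanatory variables. The only distributional facts used are that a lognormal variable with log-mean $\mu$ and log-variance $\sigma^2$ has median $\exp(\mu)$ and mean $\exp(\mu + \sigma^2/2)$.

```python
import math


def fit_lognormal(planned, actual):
    """OLS of ln(actual) on a constant and ln(planned); returns (b0, b1, sigma2)."""
    x = [math.log(p) for p in planned]
    y = [math.log(a) for a in actual]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)  # unbiased estimate of the error variance
    return b0, b1, sigma2


def predict_duration(planned_minutes, b0, b1, sigma2, kind="mean"):
    """Predicted duration in minutes: lognormal median or mean."""
    mu = b0 + b1 * math.log(planned_minutes)
    if kind == "median":
        return math.exp(mu)
    return math.exp(mu + 0.5 * sigma2)


# Made-up example in which every case runs exactly 20% longer than planned.
planned = [30, 60, 90, 120, 180]
actual = [1.2 * p for p in planned]
b0, b1, sigma2 = fit_lognormal(planned, actual)
corrected = predict_duration(100, b0, b1, sigma2, kind="median")
```

With noisy data the mean prediction exceeds the median prediction by the factor $\exp(\sigma^2/2)$, so the choice between the two matters for whether the planning ends up neutral, offensive or defensive.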
The generalized gamma family of models belongs to a different class of models than the previous models described, namely the class of Accelerated Failure Time (AFT) models. This means the model can be expressed as follows:

$$\ln t_i = -\ln \lambda_i + u_i$$

where in this case $u_i = w_i/\alpha$, $\exp(w_i)$ is Gamma($m$) distributed and $\lambda_i = \exp(\beta' x_i)$ is the hazard function (Lancaster, 1990, p.38). The $u$ term is a disturbance term that allows for unobserved heterogeneity. The distribution of the disturbance term implies that the generalized gamma family of models is characterized by the following density function:

$$f(t) = \frac{\alpha\, \lambda_i^{\alpha m}\, t^{\alpha m - 1} \exp\{-(\lambda_i t)^{\alpha}\}}{\Gamma(m)}$$

where $\Gamma(m)$ is the gamma function. $\alpha$, $m$, and $\lambda_i > 0$ are the parameters of the model.

The piecewise constant hazard model belongs to the class of proportional hazard models characterized by (1). The main characteristic of the piecewise constant hazard model is that it allows the baseline hazard $\lambda_0(t)$ to be a step function, so that this hazard is constant within prespecified time intervals. In this sense it is a generalization of the standard exponential model, for which the hazard is restricted to be constant across the entire range of $t$. So, in the piecewise constant hazard model we have

$$\lambda_0(t) = \alpha_j \quad \text{for } c_{j-1} \le t < c_j, \quad j = 1, \ldots, M$$

where $c_0 = 0$ and $c_M = \infty$ and the other thresholds are specified, but the $\alpha_j$'s have to be estimated. As before, regressors are brought in by letting $\phi(x_i, \beta) = \exp(\beta' x_i)$ in (1).
Depending on how small the intervals are taken over which the hazard is assumed to be constant, the model can be made as flexible as needed but at the cost of introducing additional parameters that have to be estimated.
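To show how a piecewise constant hazard translates into a predicted duration, the following sketch computes the survivor function and the expected duration for given thresholds and step heights. It is an illustration under assumptions: the function names are hypothetical, `phi` stands for the regressor term $\exp(\beta' x_i)$, all step heights are taken to be strictly positive, and the thresholds and hazards in the example are made up rather than estimated.

```python
import math


def survivor(t, cuts, alphas, phi=1.0):
    """S(t) = exp(-phi * integrated baseline hazard up to t).

    cuts holds the interior thresholds c_1 < ... < c_{M-1};
    alphas holds the M step heights of the baseline hazard.
    """
    cumulative = 0.0
    lower = 0.0
    for j, alpha in enumerate(alphas):
        upper = cuts[j] if j < len(cuts) else math.inf
        if t <= lower:
            break
        cumulative += alpha * (min(t, upper) - lower)
        lower = upper
    return math.exp(-phi * cumulative)


def expected_duration(cuts, alphas, phi=1.0):
    """E[T] as the integral of S(t), evaluated in closed form per interval."""
    total, lower, surv_at_lower = 0.0, 0.0, 1.0
    for j, alpha in enumerate(alphas):
        hazard = phi * alpha
        upper = cuts[j] if j < len(cuts) else math.inf
        if math.isinf(upper):
            total += surv_at_lower / hazard
        else:
            decay = math.exp(-hazard * (upper - lower))
            total += surv_at_lower * (1.0 - decay) / hazard
            surv_at_lower *= decay
            lower = upper
    return total


# With equal step heights the model collapses to the exponential model.
mean_exponential = expected_duration(cuts=[], alphas=[0.01])
mean_two_steps = expected_duration(cuts=[50.0], alphas=[0.01, 0.01])
```

Letting the step heights differ across intervals, for example a low hazard in the first 30 minutes and a higher one afterwards, is what gives the model its flexibility relative to the exponential case.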

Prediction performance measures
To evaluate the predictions for the durations of operations following from the above listed models and stated by the surgeons we will consider the following performance measures:
• TOTAL: the total of the estimated operation time needed to process all operations of the prediction period according to the OR planning [5]
• MEAN: the mean of the estimated operation time
• AD: the average difference between prediction and actual duration
• AAD: the average absolute difference between prediction and actual duration
• rMSE: the root mean squared error
• UPx: the proportion underprediction by more than x = 10, 20 and 30 minutes
• OPx: the proportion overprediction by more than x = 10, 20 and 30 minutes

Performance is optimal when an unknown 'loss function' is minimized. The choice for the symmetric and rather straightforward loss function above, that is quite similar to the Mean Squared Error (MSE) for small losses, is probably not optimal for the AMC. Our loss function is defined this way because no formal research has been done into the actual losses that result when cases end, for example, 10, 20 or 30 minutes late or early. Only when these losses are calculated or estimated can a really sensible loss function be defined. For example, one could imagine that larger penalties are given to delays than to early finishes.
An implication of our choice of loss function, i.e. that we minimize AD in addition to AAD, is that we prefer predictions centered on average around actual duration to predictions centered around some higher or lower percentile of duration. On average, we choose therefore neither for offensive nor for defensive planning but for a neutral planning instead. In fact we are willing to give up some of the accuracy for the sake of minimal AD in absolute terms. The optimal method from our perspective is not necessarily the most accurate in terms of AAD.
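The performance measures listed above are simple to compute once predicted and actual durations are available. The sketch below is one possible reading of the definitions: it assumes that differences are taken as prediction minus actual duration, so that a negative AD indicates underprediction, and the function name and example numbers are hypothetical.

```python
import math


def performance(predicted, actual, thresholds=(10, 20, 30)):
    """Return TOTAL, MEAN, AD, AAD, rMSE and the UPx/OPx proportions (minutes)."""
    n = len(actual)
    # Assumed sign convention: prediction minus actual duration.
    diffs = [p - a for p, a in zip(predicted, actual)]
    measures = {
        "TOTAL": sum(predicted),
        "MEAN": sum(predicted) / n,
        "AD": sum(diffs) / n,
        "AAD": sum(abs(d) for d in diffs) / n,
        "rMSE": math.sqrt(sum(d * d for d in diffs) / n),
    }
    for x in thresholds:
        measures[f"UP{x}"] = sum(d < -x for d in diffs) / n  # underprediction by more than x
        measures[f"OP{x}"] = sum(d > x for d in diffs) / n   # overprediction by more than x
    return measures


# Made-up predictions and actual durations for four cases.
result = performance(predicted=[60, 100, 30, 200], actual=[75, 90, 65, 200])
```

In this example AD is negative while AAD is larger in absolute value, which illustrates the distinction drawn above between being centered on average and being accurate case by case.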

Data
The AMC has started registration of case duration and some characteristics as early as 1988.
In this investigation, however, we have decided to use the data from 2003 onwards. The first reason is that so much has changed in the OR and in operation technology since 1988 that the early information is not likely to be relevant for current case duration prediction. What is more, many case characteristics that are available through the OR information system today were not registered until 2003. We retrieved information on operations performed by three different specialties: Ophthalmology, Neurosurgery and Gynecology. The selection of specialties allows for the investigation of a wide variety of OR cases that is more or less representative for the AMC. Neurosurgical cases are generally very complex and demanding and accordingly have the longest average duration as well as the largest spread in duration.
Many unpredictable complications can occur during a case. Ophthalmologic cases are usually shorter and less unpredictable. Gynecology combines the extremes of Ophthalmology and Neurosurgery, consisting of many very short procedures as well as relatively many of the more complicated and especially long-lasting cases. Together these specialties make up an interesting and wide-ranging collection of cases to investigate statistically.

[5] We decided not to present the deviation of the total time needed and the deviation of the means for the planned and actual durations because we believe that the presented figures are of interest themselves. A comparison with the actual total time and mean can be done by using the information from Table 1.

Because of the unavailability of the surgeon's expected duration of the operation, we had to discard all ambulatory operations. Apart from lacking this information, other information on unexpected or emergency operations is often not available either. As a result, the case duration of ambulatory cases will be much harder to predict. As discussed before, the AMC deals with the occurrence of ambulatory operations by allowing for some slack in the daily operation schedule or by using the emergency room.
Sample statistics on the actual and planned duration of the estimation and prediction samples can be found in Table 1. [...] whereas average duration is 245 minutes. Especially the right tail of the distribution is therefore spread out much more for Neurosurgery than for Ophthalmology. The planned duration appears to systematically underestimate the actual duration. The difference between planned total duration and actual total duration of all operations in the estimation sample is almost 30%. The planned spread is also substantially smaller than the actual spread. The underprediction of the duration of operations appears to be systematic.

Gynecology entails a combination of short procedures and very long procedures, although not as long as the longest neurosurgical procedures. Because of this combination, the average duration of 111 minutes lies somewhere in between. The 95th percentile is near 300 minutes. The spread also lies somewhere in the middle. Also for Gynecology the planned duration differs considerably from the actual duration and again there appears to be an underprediction.

The total number of observations is 4268 and 796 (18.7%) observations lie in the prediction period. Note that the sample statistics differ for the samples distinguished but the conclusions drawn before hold also for the prediction sample. [...] Table 1 illustrate that the surgeons tend to underestimate especially in the neurosurgical and gynecological specialties. Note, however, that in all three cases the performance of the surgeons is better in the prediction than in the estimation period. Whether this is a coincidence or not is not clear, although we know that the AMC has put more emphasis on the importance of good estimation of operation duration in recent years and our prediction sample consists of the more recent observations. [6]

The question is to what extent the surgeons' estimates of the length of operations are biased and whether other information has some explanatory power such that we can improve on the predictions. Note that the information we have in addition to the expectation of the surgeon is available to the surgeon as well.
Unfortunately, we encounter a substantial number of missing values. To solve this problem we replaced the missing values by the average of the variable (where an average is meaningful) or by zero (in the case of, e.g., dummies). In each of these cases a separate binary variable is generated that is equal to 1 if the information is missing. Especially the group of patient characteristics is registered very irregularly and the discrete variables indicating health are nearly constant at zero (no complications). As a result, these particular variables are expected to have limited explanatory power.

[6] As we have stated before, from the beginning of 2008 the departments of Neurosurgery and Gynecology also use information on the historical average duration per surgeon in the planning of operations.

We have experimented with rounding off predictions to a five minute precision level and we concluded that the rounding off does not appear to have a systematic effect.
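The treatment of missing values described above (fill in the average or zero and add a binary indicator for missingness) can be sketched as follows. The record layout, the column names and the function name are hypothetical, and missing entries are assumed to be stored as `None`; the sketch does not reproduce the actual data preparation used for the estimates in this paper.

```python
def impute_with_indicator(rows, column, is_dummy=False):
    """Fill missing values in one column and add a '<column>_missing' indicator.

    Continuous variables are filled with the average of the observed values,
    dummy variables with zero. The input rows are left unchanged.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if is_dummy or not observed:
        fill_value = 0
    else:
        fill_value = sum(observed) / len(observed)
    completed = []
    for row in rows:
        new_row = dict(row)
        is_missing = row[column] is None
        new_row[column] = fill_value if is_missing else row[column]
        new_row[column + "_missing"] = 1 if is_missing else 0
        completed.append(new_row)
    return completed


# Illustrative records with a continuous variable and a dummy.
cases = [
    {"age": 40, "complication": 0},
    {"age": None, "complication": None},
    {"age": 60, "complication": 1},
]
cases = impute_with_indicator(cases, "age")
cases = impute_with_indicator(cases, "complication", is_dummy=True)
```

Including the indicator as a regressor lets the model estimate a separate effect for cases with missing information instead of treating the filled-in value as if it had been observed.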
The second variable that is known to be subject to measurement error is the first surgeon. The first surgeon reported a priori is not always the one who actually performs the surgery. Although the first surgeon is the one responsible for the operation, the second surgeon or an assistant surgeon may perform all or part of it. If this is the case, it is no longer possible to determine the correct effect of a surgeon on duration. Moreover, other parameter estimates might be biased as well. Unfortunately, there is little that can be done about this flaw. Evidently, our predictions as well as the current AMC predictions could have benefited to some extent from correct information concerning the surgeon.
Another complication is the fact that part of the cases consist of multiple procedures. As a rough indication, approximately 29% of ophthalmologic cases, 27% of gynecologic cases and 25% of neurosurgical cases between 2003 and 2008 consisted of 2 to at most 8 procedures. To make the final insight into the applicability of statistical methods as complete as possible, we deliberately consider these cases as well. For the multiple-procedure cases we have chosen to use only the main procedure and the total number of procedures within the case as explanatory variables, instead of using all information and adding each performed procedure. The latter approach is not expected to deliver better results, because the additional time required for extra procedures is usually less than the time required for the procedure if it stands by itself. The most important explanation for this difference is that multiple procedures usually overlap in time. The latter approach would therefore introduce a measurement difficulty that would not be solved easily; at the very least, many more explanatory variables would be required. The former approach, also taken by van Houdenhoven et al (2007), is preferred mainly because the corresponding parsimony is expected to weigh more heavily on prediction performance than the loss of information attached to it.

Empirical results
We estimate the duration of an operation for the three specialties Ophthalmology, Neurosurgery and Gynecology separately, with several hazard specifications and using all information available at the moment operations are scheduled. We do not strive for a model that is capable of explaining the duration; we are interested in the best prediction possible. As a result, we decided to include all information available to us. To investigate the quality of a duration model we split each of our three samples into two parts: (1) an estimation subsample, on which the model is estimated, containing about 80% of the complete sample, and (2) a prediction subsample, on which we predict durations, containing about 20%. [7] The following hazard models are estimated: the exponential (Exp) hazard, the lognormal (Lnorm) hazard, the Weibull (Weibull) hazard, the loglogistic (Loglog) hazard, the Burr (Burr) hazard, the generalized gamma (GenΓ) hazard, and the piecewise constant hazard (PCH5: with five-minute intervals; PCH10: with ten-minute intervals).
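As an illustration of the sample split, the sketch below divides a list of cases into an estimation and a prediction subsample at a cut-off date, so that the prediction subsample holds the most recent operations. The record fields, the example cases and the cut-off date are hypothetical; the text does not report the actual date used.

```python
from datetime import date

def split_by_date(cases, cutoff):
    """Return (estimation, prediction): cases before the cut-off date, and the rest."""
    estimation = [c for c in cases if c["date"] < cutoff]
    prediction = [c for c in cases if c["date"] >= cutoff]
    return estimation, prediction

# Hypothetical records; durations are in minutes.
cases = [
    {"date": date(2006, 5, 2), "planned": 60, "actual": 75},
    {"date": date(2007, 1, 15), "planned": 120, "actual": 150},
    {"date": date(2007, 9, 30), "planned": 45, "actual": 40},
    {"date": date(2008, 2, 11), "planned": 90, "actual": 110},
    {"date": date(2008, 6, 3), "planned": 30, "actual": 35},
]

# Hypothetical cut-off date, chosen only for this example.
estimation, prediction = split_by_date(cases, cutoff=date(2008, 1, 1))
prediction_share = len(prediction) / len(cases)
```

Because the split follows the calendar rather than a random draw, the resulting shares are only approximately 80% and 20%.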
The estimation results will not be discussed in detail. We only present some common features across the three specialties. The surgeons tend to underestimate the length of the operation. This result is stronger within the neurosurgical and gynecological specialties. In all estimations the surgeon's expectation contributes significantly to the model. Other strongly significant variables are the number of surgical procedures performed during the operation, characteristics of the first surgeon and the type of operation. Patient characteristics do not appear to have a strong impact.
For Ophthalmology we conclude that the statistical techniques in most cases do better than the surgeons. The differences are not very large, but apart from the average deviation between the predicted and actual duration of the operations in the prediction sample (AD) and the percentage of predictions with a difference of more than +30 minutes between the predicted and actual duration of the operation [8], the statistical techniques always do better. Note that maximizing a likelihood function does not imply that the best predictions will be found: the results for the Burr hazard are in some instances worse than those of nested models like the Weibull, loglogistic and exponential hazard. The loglogistic model in particular appears to perform well. Furthermore, the often-used lognormal specification is certainly not one of the best statistical methods to use; it does not perform best on any of the measures. Finally, note that, based on the average deviation between prediction and reality, all methods are very accurate: the average error is in all cases less than a minute. In absolute deviations the error is much larger, ranging from about 15.3 minutes (loglogistic hazard) to 18.6 minutes (surgeons).
Surgeons underestimate the duration of neurosurgical operations by more than half an hour on average. Statistical methods overestimate durations. The best result is found for the loglogistic hazard, with an overprediction of 14 minutes. The Weibull and piecewise-constant hazards perform even worse than the surgeons. The absolute average deviations are closer, but still most statistical methods outperform the surgeons. We can also add that in this case the prediction measures are far worse than in the case of Ophthalmology. This conclusion is not surprising: as we have noted before, neurosurgical operations tend to be much longer than ophthalmological operations. The surgeons appear to score well on the overprediction percentages but, of course, this is a result of their strong tendency to underpredict. The Weibull and the loglogistic models seem to obtain the best scores, and their scores are quite similar, but we prefer the loglogistic model. It scores much better on the AD, AAD, MSE and LOSS measures than the Weibull does. The Weibull scores best on the measures reflecting underprediction by more than 10, 20 and 30 minutes, but does a bad job in the prediction of actual durations of operations. Again, the lognormal hazard does not distinguish itself as a particularly good method to use.
Table note: Shaded entries represent the best result across the row. No convergence was achieved for the "-" entries. The predicted duration and actual duration are measured in minutes.
Table 4 presents the results for Gynecology. As we argued before, the durations of operations in this specialty lie somewhere in between those of the previous specialties. In this case again the surgeons are clearly outperformed by the statistical methods. Whichever performance measure is considered, we can always find at least three statistical methods with a better score. The best predictions are found for the Burr hazard. Note the relatively good performance of the semiparametric hazard with the five-minute time interval (PCH5). It scores best on two measures (OP10 and OP20). The loglogistic hazard performs almost as well as the Burr.
[7] The subsample sizes are approximate because the actual division of the sample was based on a date.
[8] In this case the actual duration is smaller than the predicted duration.
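The prediction measures discussed in this section can be made concrete with a short sketch. It assumes that a deviation is defined as predicted minus actual duration, so that a positive value is an overprediction, which is how we read the "+30 minutes" measure. The key names `OP` and `UP` and the example durations are hypothetical, and the LOSS measure is omitted because its definition is not given in this section.

```python
def prediction_measures(predicted, actual, threshold=30):
    """Summary measures for predicted versus actual durations (minutes).

    AD  : average deviation (predicted minus actual)
    AAD : average absolute deviation
    MSE : mean squared error
    OP  : percentage of cases overpredicted by more than `threshold` minutes
    UP  : percentage of cases underpredicted by more than `threshold` minutes
    """
    n = len(predicted)
    dev = [p - a for p, a in zip(predicted, actual)]
    return {
        "AD": sum(dev) / n,
        "AAD": sum(abs(d) for d in dev) / n,
        "MSE": sum(d * d for d in dev) / n,
        "OP": 100 * sum(d > threshold for d in dev) / n,
        "UP": 100 * sum(d < -threshold for d in dev) / n,
    }

# Hypothetical predicted and actual durations in minutes.
predicted = [60, 120, 90, 45]
actual = [75, 100, 130, 45]
measures = prediction_measures(predicted, actual, threshold=30)
```

The example shows why AD alone can be misleading: positive and negative deviations cancel in AD but not in AAD or MSE, which is consistent with the small average errors and much larger absolute errors reported above.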

The planning of operations
Looking at individual operations, as we do in Tables 2, 3 and 4, gives information on the quality of the prediction methods but does not show the full and most interesting picture. In most cases more than one operation is scheduled every day, and mispredictions of the duration of individual operations may partly cancel out or reinforce each other over the entire day. In order to investigate this, it would be optimal to employ the actual planning algorithm of the AMC. Unfortunately, this is far too complex to be employed in our case. For example, in the actual planning the degree of urgency of operations is taken into account, and this information is not entered in the information system and is therefore not available to us. Many other elements of the information necessary to make this planning are not available to us either. To get an idea of the quality of the prediction methods we decided to adopt a very simple planning method. We use the prediction samples with the operations arranged according to the actual operation date and time, and simply plan the operations according to the predicted duration of the operation. After having created a fictitious operation schedule in this way, we confronted the schedule with the actual durations of the operations and calculated some performance measures. As far as we can see this is a straightforward and fair way of evaluating the different planning methods. If it favors any of the methods, it will be the one based on the surgeons' evaluations, since the order of the operations is determined on the basis of these expectations.
We adopt four simple planning strategies. For all strategies we impose that at least one operation is scheduled every day. In this way we allow for operations with an expected duration beyond the operation time available per day. In the first strategy (panel A in Tables 5 to 10) we plan up to eight hours per day and overtime is never allowed, except for the first operation of the day. In the second strategy (panel B) we allow for some slack at the end of the day by planning only six operating hours. Overtime is not allowed. In the third and fourth strategies (panels C and D) we do allow for overtime, but only to a limited degree, either in a relative or in an absolute manner. Under the relative criterion, overtime is allowed if it satisfies the following condition: the expected duration of the marginal operation minus the time left that day, relative to the time left that day, is smaller than 1. This means, e.g., that an operation expected to last 60 minutes will not be scheduled if less than 30 minutes of operating time is left for that day.
The correction factor for Gynecology is 6.4%. So the surgeons' number of planned days has to be increased to 221, 303, 181 and 191 days in order to get a fair comparison. These numbers do not compare favorably with the statistical methods, although we should also make a similar, but smaller, correction for the statistical methods because these under- or overestimate the actual duration of the gynecological operations as well. Note that the Weibull hazard predicts the smallest number of days necessary. Table 3 reveals that this is the only statistical method that severely underestimates the actual duration of the operations.
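The simple planning rule, including the relative overtime condition, can be sketched as follows. This is an illustrative reading of the strategies described above and not the AMC planning algorithm; the function name, its parameters and the example durations are hypothetical. Operations are taken in their given order and assigned to consecutive days, and the first operation of a day is always scheduled.

```python
def plan_days(predicted, capacity=480, relative_overtime=False):
    """Assign operations, in the given order, to consecutive days.

    predicted         : predicted durations in minutes
    capacity          : planned operating time per day in minutes
    relative_overtime : if True, also schedule an operation that does not fit
                        when (duration - time_left) / time_left < 1
    Returns a list of days, each a list of operation indices.
    """
    days, current, used = [], [], 0
    for i, duration in enumerate(predicted):
        left = capacity - used
        if not current:
            fits = True  # at least one operation per day, even if too long
        elif duration <= left:
            fits = True
        elif relative_overtime and left > 0 and (duration - left) / left < 1:
            fits = True
        else:
            fits = False
        if not fits:
            days.append(current)  # close the day and start a new one
            current, used = [], 0
        current.append(i)
        used += duration
    if current:
        days.append(current)
    return days

# Hypothetical predicted durations in minutes.
predicted = [300, 150, 50, 200]
no_overtime = plan_days(predicted, capacity=480)
with_overtime = plan_days(predicted, capacity=480, relative_overtime=True)
six_hours = plan_days(predicted, capacity=360)
```

In the example, the 50-minute operation is pulled into the first day under the relative rule because only 30 minutes are left and (50 - 30) / 30 is below 1; a 60-minute operation would not be, matching the example in the text. Setting `capacity=360` mimics the strategy with six planned operating hours.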
Another consequence of the underestimation of the duration of operations is that the score on vacant operation rooms, measured by undertime, is relatively good, whereas the score on overtime and cancellations is relatively bad. A brief glance at Tables 6 and 7 indeed reveals that the surgeons usually score better on the undertime indicator, but worse on the overtime and cancellation indicators. A surprise is that the same conclusion holds for Ophthalmology, even though the predictions of the duration of the operations are spot on. Note that the undertime results across the three specialties are quite different. This is not explained by a substantial difference in total operation times, but by the nature of the operations. In Ophthalmology the average duration of an operation is much shorter than in Neurosurgery, whereas Gynecology is somewhere in between. The same explanation applies to the relative differences in the overtime and cancellation indicators. The general conclusion has to be that there exists some trade-off between the quality indicators undertime, overtime and cancellations. As such, this is not a surprise, but it can be clearly seen in Tables 5 to 7. In order to evaluate the quality of the planning procedures we need to introduce a cost function that weighs the different quality indicators. This issue will be discussed in due course.
If we compare the different strategies, some foreseeable observations can be made. Allowing for more flexibility, either by having more operating hours available or by accepting overtime to a limited extent, decreases the number of days planned and the undertime. Obviously overtime will be higher if we allow for it, but the impact differs across the methods used. For Ophthalmology, using the absolute criterion creates more overtime than using the relative [...] considerably if overtime is allowed. It is hard to decide which strategy is best. As we saw before, a low amount of undertime is accompanied by a relatively large amount of overtime and many cancellations. To make a decision we need to combine these measures in a cost function.
The differences between the statistical methods considered are large in some instances. The results based on the loglogistic and the Burr hazard are quite similar, but the results of the Weibull hazard are quite different. For Gynecology in particular, the results of the Weibull are very poor. The undertime score is very good, but at the expense of large overtime and many cancellations. The popular lognormal distribution does better but is not as good as the loglogistic or the Burr.
To assess the quality of the prediction methods, a straightforward way to proceed is to define a cost function that combines the quality measures into a single measure. Apart from Pandit and Carey (2006), no attempts in this direction appear to have been made, although Stepaniak et al (2009) and Stepaniak et al (2010) do mention this possibility. The quality measures we consider are undertime, overtime and the number of cancellations. [11] We ignore the number of days necessary to schedule all operations of our prediction sample, since this is heavily influenced by the underestimation of the duration of the operations. As we have shown, if we correct for this underestimation, the number of days necessary is quite similar across the prediction methods. Assuming a linear cost function, we have:
C = undertime + γ₁ × overtime + γ₂ × cancellations,
where γ₁ and γ₂ are positive weights. The problem now is to determine these weights. Ideally, hospital managers would give us the information necessary to determine the weights, allowing an objective comparison of the prediction and planning methods. Unfortunately we do not have such information and we have to rely on our potentially subjective instincts. We propose to use two sets of weights. The first, which we will not justify because of its objective nature, sets γ₁ and γ₂ both equal to 1. In the second cost function we assume that γ₁ = 1.5 and γ₂ = 2. Although, given the information we have, it is impossible to justify the exact magnitude of these weights, we do believe that 1 ≤ γ₁ ≤ γ₂.
[11] Pandit and Carey (2006) only consider overtime and cancellations.
The problem with undertime is that the operating room is possibly vacant for some time, but since there is no time pressure, it is unlikely that there will be repercussions for the quality of the operation. In the case of substantial undertime, fewer operations will be scheduled than in the optimal situation, and this might have financial consequences for the hospital as well. Depending on the demand for operations, the number of operation rooms available and the method of planning, undertime for a particular specialty might also have consequences for the planning of the operations of other specialties. An advantage of undertime is that emergency operations are more easily accommodated. The consequences of overtime are more severe. Since an operation cannot be stopped halfway, there is no other option than to proceed. The result of overtime will be the postponement or even cancellation of other operations, a reduction of the quality of the operations due to time pressure, and additional financial costs because the operation staff has to prolong their working day. The first disadvantage is comparable to the main disadvantage of undertime, but the others are additional. Consequently we believe that γ₁ > 1. Cancellations affect the reputation of hospitals and, more importantly, the mental well-being of the patients. On top of that, if an operation is canceled, it usually has to be rescheduled within a couple of days. This will cause additional strain on the operating schedule that might result in overtime or the necessity to put extra slack in the schedule. It is quite hard to weigh the reputational and mental effects against the more financial consequences, but we believe, although in this case it is basically only belief, that γ₂ is even higher than γ₁. Finally, we measure the costs in minutes. The alternative of measuring in days gives very similar results. If our evaluation of the relative importance of the three arguments of the cost function is correct, the equal-weight cost function will favor the planning based on the surgeons' expectation of the length of the operations for the neurosurgical and gynecological specialties. The alternative specification will favor the statistical methods for these disciplines. Tables 8 (Ophthalmology), 9 (Neurosurgery) and 10 (Gynecology) present the relative costs of the planning methods for the four planning strategies we considered earlier. Since we have a relative measure, we normalize on the costs following from planning according to the surgeons' assessments of the duration of the operations.
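The cost comparison can be sketched in a few lines. The sketch assumes that the linear cost function takes the form undertime + γ₁ × overtime + γ₂ × cancellations, with all three components measured in minutes, which is our reading of the weights described in the text. The function names and the indicator values below are hypothetical and are not results from Tables 8 to 10.

```python
def planning_cost(undertime, overtime, cancellations, gamma1=1.0, gamma2=1.0):
    """Linear cost in minutes: undertime + gamma1*overtime + gamma2*cancellations."""
    return undertime + gamma1 * overtime + gamma2 * cancellations

def relative_costs(indicators, baseline="surgeons", gamma1=1.0, gamma2=1.0):
    """Cost of each method divided by the cost of the baseline method."""
    costs = {
        method: planning_cost(*values, gamma1=gamma1, gamma2=gamma2)
        for method, values in indicators.items()
    }
    return {method: cost / costs[baseline] for method, cost in costs.items()}

# Hypothetical (undertime, overtime, cancellations) totals in minutes.
indicators = {
    "surgeons": (1000, 800, 600),
    "loglogistic": (1400, 500, 300),
}

equal_weights = relative_costs(indicators)
alternative = relative_costs(indicators, gamma1=1.5, gamma2=2.0)
```

A relative cost below 1 means the method is cheaper than planning on the surgeons' assessments. In this made-up example the statistical method gains more under the alternative weights, because it trades undertime for less overtime and fewer cancellations.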
For Ophthalmology (Table 8) we find that for the planning strategies that do not allow for overtime the differences in costs across the methods are small. In some cases the surgeons do better, but in other cases several statistical methods do better. In the planning strategies that do allow for overtime, the statistical methods outperform the surgeons. In that case, a cost reduction of about 15% can be achieved. Note that there are no large differences between the two cost functions we employ. Furthermore, note the relatively good performance of the planning based on the predictions of the lognormal distribution.
For Neurosurgery the choice of the cost function does matter. This is due to the heavy underprediction of the length of operations. Whatever planning strategy and whatever cost function is used, there is always a statistical method with lower costs. For the planning methods not allowing for overtime combined with the equal-weight cost function the differences are really small, but for the other methods the differences are quite large. A cost reduction of 10% or more is not exceptional.
As we experienced earlier, the results for Gynecology lie somewhere in between. Again there is always a statistical method with lower costs than planning on the basis of the surgeons' expectations. The bad performance of the Weibull hazard is again striking. The loglogistic hazard performs very well. It indicates that a cost reduction of 5.7% to 16.4% is possible. Also in this case the choice of the weights in the cost function appears to be inconsequential. Whatever the weights, a cost reduction of 6% or more is possible by applying a statistical method.

Conclusion
We have investigated the planning of operations in the Academic Medical Center for three different specialties. At present, the operations are scheduled according to the surgeon's estimate of the case duration. The average lengths of the operations performed by the Ophthalmology, Neurosurgery and Gynecology departments are quite different, and in general we see that the longer an operation lasts, the more difficult it is for the surgeon to predict its length correctly. Moreover, especially in the Neurosurgery department and to a lesser extent in the Gynecology department, the surgeons seriously underpredict the duration of operations. We have investigated the potential of several statistical methods to see whether they do a better job than the surgeons in predicting the duration of operations correctly. In many cases they do. Moreover, in the future the prediction period can be extended and the statistical estimations will probably be even more accurate.
In the literature the lognormal model is proposed as an adequate method to represent the duration of operations. From our investigation it follows that this choice, especially for longer durations, is not the optimal one. Especially the Burr distribution, or its special case the loglogistic distribution, appears to be more suitable in many situations.
Due to the complexity of the planning algorithm used by the AMC we were unable to apply it directly to our results. We created four alternative planning strategies that we use to quantify the effect of more accurate predictions of case durations on undertime, overtime and cancellations. Whatever strategy is used, significant cost reductions appear to be possible. Also, the specific functional form of the cost function utilized does not appear to be very important.
We did not engage in further fine-tuning of the statistical methods. For instance, it might be worthwhile to define subclasses of expected case durations and to optimize per subclass. We could distinguish short, medium and long expected durations, or form subclasses according to the frequencies of types of operations or the number of procedures in the operation. Dexter and Zhou (1998) indicate that this is a useful way to proceed. A brief investigation of our own data has shown us that there is indeed some potential here.
Finally, we want to state that the surgeons' expectations of the case duration are far from worthless. This expectation is an important explanatory variable in our statistical models. Our recommendation, therefore, is not to use statistical methods exclusively, but only in combination with information supplied by the surgeon.