Recording and transformation of surgical data
The dataset contains 40 cases of laparoscopic hysterectomy (LH), which were recorded between November 2010 and April 2012 in the Bronovo Hospital in The Hague, The Netherlands, for a study on surgical flow disturbances by Blikkendaal et al. [2]. The procedures were recorded with an audiovisual recording system (MPEG Recorder 2.1, Noldus Information Technologies, Wageningen, The Netherlands) that captured three camera feeds and four audio signals. More detailed information about the methods used can be found in a previous publication [13].
The LH surgery was separated into 10 surgical phases and 36 surgical steps based on the method of perioperative analysis of surgeries by Den Boer et al. [2, 13] (see Table 1 for a description). The phases do not necessarily occur in chronological order. The annotated event log was exported to a plain-text file for further analysis and contained the start and end points of all observed surgical steps, together with the 12 instruments used in predefined steps. These events represent the features used in building the surgical phase model (SPM). A single entry in the time-based log does not capture all relevant information that could be used to train the model to distinguish phases. Therefore, extra features, such as surgical time, cumulative usage time of each instrument and total number of instruments currently in use, were derived from the instrument-use indicators to improve the model performance. These additional data transformations and the model generation were performed using the R programming language (R Foundation for Statistical Computing, Vienna, Austria) [14] and RStudio IDE (RStudio Inc., Boston, U.S.A.) [15].
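The derived features described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), and the log layout, instrument names and function name are assumptions, not the format of the actual event log. It also assumes that use intervals of the same instrument do not overlap.

```python
from collections import defaultdict

# Hypothetical log format: one (instrument, start_second, end_second) tuple
# per interval in which an instrument was in use.
def derive_features(intervals, t, surgery_start=0):
    """Return the derived features at time t (in seconds)."""
    cumulative = defaultdict(float)
    in_use = 0
    for instrument, start, end in intervals:
        # Part of this interval that has already elapsed at time t.
        overlap = min(end, t) - start
        if overlap > 0:
            cumulative[instrument] += overlap
        # The instrument counts as "in use" if t falls inside the interval.
        if start <= t < end:
            in_use += 1
    return {
        "elapsed_time": t - surgery_start,
        "cumulative_use": dict(cumulative),
        "instruments_in_use": in_use,
    }

# Made-up example log, not data from the study.
log = [("grasper", 0, 120), ("scissors", 60, 90), ("grasper", 150, 200)]
features = derive_features(log, t=160)
```

Each row of the training data would then combine the raw instrument indicators at time t with these derived values.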
Table 1 Intra-operative surgical phases and steps commonly occurring during a laparoscopic hysterectomy procedure.
Surgical phase modelling
For the purpose of this study, a Random Forest (RF) surgical phase recognition model was used [16]. This is an ensemble model consisting of a collection of decision trees, where each node represents a subset of the data and poses a certain question (e.g. x < 5). The answer to this question is used to further split the dataset and leads to another question at the following node. Finally, at the so-called leaf node, a categorical or numerical prediction of the outcome variable is obtained. Each decision tree is trained on a random subset of the training set and considers a random subset of features at each split. The prediction of each tree counts as a vote for the overall prediction. The modal (in case of classification) or mean (in case of regression) prediction of all trees provides the final prediction of the model.
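The aggregation step can be sketched as follows. This is an illustrative Python fragment, not the implementation used in the study; the phase labels and function names are hypothetical, and the per-tree predictions are simply given as lists.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the modal class among the per-tree predictions."""
    # On a tie, Counter.most_common keeps the class that was seen first.
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numerical predictions."""
    return mean(tree_predictions)

# Hypothetical votes from five trees for a single time point.
votes = ["dissection", "dissection", "closure", "dissection", "preparation"]
phase = aggregate_classification(votes)
estimate = aggregate_regression([10.0, 12.0, 14.0])
```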
Model optimisation
An important aspect of modelling is out-of-sample validation, which involves partitioning the data into training and test sets. The model is generated from the training data and validated on a set of unseen test data. In the current study, we use k-fold cross-validation: the data are split into k folds, each of which acts once as the out-of-sample test set while the model is trained on the remaining folds.
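A minimal sketch of this splitting scheme is given below, in Python and for illustration only. The function name is an assumption, and the sketch assigns items to folds in a fixed round-robin order without shuffling.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    # Round-robin assignment of items to k folds.
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data are all items from the other k - 1 folds.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(list(range(10)), k=5))
```

Every item appears in exactly one test set, so each prediction used for validation comes from a model that never saw that item during training.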
Another important consideration is the choice of performance metric for the out-of-sample validation. For numerical predictions, a commonly reported metric is the mean absolute error (MAE). Furthermore, at each split in a tree, a random subset of features is evaluated to decide the best split. The number of features to consider at each split is one of the most important parameters in RF. Its default value is \(\lfloor \sqrt{D} \rfloor\), with D being the total number of features [17].
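Both quantities are simple to compute. The following Python sketch is illustrative only; the function names are assumptions, and the feature count in the example is a hypothetical value rather than a figure reported for this dataset.

```python
import math
from statistics import mean

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and true values."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))

def default_features_per_split(num_features):
    """Default number of features per split: floor(sqrt(D))."""
    return math.isqrt(num_features)

mae = mean_absolute_error([10, 20], [12, 17])
default_for_99 = default_features_per_split(99)  # hypothetical D = 99
```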
In this paper, model optimisation was performed using 10 mutually exclusive folds, each containing four surgeries. The number of features considered per split was varied with a grid search of 12 log-spaced integers between 1 and 99. During the optimisation, n = 100 trees were grown for each RF model. The model performance was assessed by the out-of-sample accuracy, defined as the fraction of correct predictions on an unseen set of test data.
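The optimisation set-up can be sketched as follows, in Python and for illustration only. Assigning whole surgeries to folds keeps all observations from one procedure on the same side of the split. The function names, the random seed and the way the log-spaced grid is rounded are assumptions; the source does not specify how the 12 integers were generated, and rounding can produce duplicates, which this sketch removes.

```python
import random

def surgery_folds(surgery_ids, n_folds=10, seed=0):
    """Assign whole surgeries to mutually exclusive folds."""
    ids = list(surgery_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def log_spaced_integers(low, high, count):
    """Round `count` log-spaced values in [low, high]; duplicates are dropped."""
    ratio = high / low
    values = {round(low * ratio ** (i / (count - 1))) for i in range(count)}
    return sorted(values)

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

folds = surgery_folds(range(1, 41))      # 40 surgeries, 10 folds of four
grid = log_spaced_integers(1, 99, 12)    # candidate features-per-split values
```

For each value in the grid, a model would be trained on nine folds and scored with `accuracy` on the held-out fold, and the value with the best average out-of-sample accuracy would be retained.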
Surgical end-time prediction
The performance of the RF model is evaluated with respect to a task relevant to clinical practice in the OR: the prediction of surgical end-times. The prediction error is the number of minutes by which the predicted end-time deviates from the actual duration of the surgery. For this purpose, a second model uses the phase predictions to estimate the remaining surgical time. This end-time prediction is given by a multiple linear regression model with the elapsed surgical time, the phase, the number of seconds that the surgery has been in that phase, and the interaction terms between phase and seconds in phase as independent variables. The mean absolute error (MAE) of the end-time prediction was also calculated.
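The structure of the regression inputs can be sketched as follows. This Python fragment is illustrative only: it assumes treatment (dummy) coding of the phase with the first phase as reference level, the phase names are placeholders, and fitting the coefficients is left out.

```python
def design_row(elapsed_s, phase, seconds_in_phase, phases):
    """Encode one observation; the first phase is the reference level."""
    # One dummy variable per non-reference phase.
    dummies = [1.0 if phase == p else 0.0 for p in phases[1:]]
    # Interaction terms: phase dummy multiplied by seconds in phase.
    interactions = [d * seconds_in_phase for d in dummies]
    # Columns: intercept, elapsed time, seconds in phase, dummies, interactions.
    return [1.0, elapsed_s, seconds_in_phase] + dummies + interactions

def predict(row, coefficients):
    """Linear prediction: dot product of the row and the coefficients."""
    return sum(x * b for x, b in zip(row, coefficients))

phases = ["phase_1", "phase_2", "phase_3"]  # placeholder phase names
row = design_row(elapsed_s=1800, phase="phase_2",
                 seconds_in_phase=300, phases=phases)
```

With the interaction terms, the effect of time spent in the current phase on the estimated remaining time is allowed to differ per phase.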