Surgical phase and instrument recognition: how to identify appropriate dataset splits

Purpose: Machine learning approaches can only be reliably evaluated if training, validation, and test data splits are representative and not affected by the absence of classes. Surgical workflow and instrument recognition are two tasks that are complicated in this manner, because of heavy data imbalances resulting from the different lengths of phases and their potentially erratic occurrences. Furthermore, sub-properties like instrument (co-)occurrence are usually not explicitly considered when defining the split.
Methods: We present a publicly available data visualization tool that enables interactive exploration of dataset partitions for surgical phase and instrument recognition. The application focuses on the visualization of the occurrence of phases, phase transitions, instruments, and instrument combinations across sets. In particular, it facilitates the assessment of dataset splits and the identification of sub-optimal ones.
Results: We analyzed the datasets Cholec80, CATARACTS, CaDIS, M2CAI-workflow, and M2CAI-tool using the proposed application. We were able to uncover phase transitions, individual instruments, and combinations of surgical instruments that were not represented in one of the sets. Addressing these issues, we identify possible improvements in the splits using our tool. A user study with ten participants demonstrated that the participants were able to successfully solve a selection of data exploration tasks.
Conclusion: In highly unbalanced class distributions, special care should be taken with respect to the selection of an appropriate dataset split because it can greatly influence the assessment of machine learning approaches. Our interactive tool allows for determination of better splits to improve current practices in the field. The live application is available at https://cardio-ai.github.io/endovis-ml/.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11548-024-03063-9.


Introduction
Technologies that enable next-generation context-aware systems in the operating room are currently intensively researched in the domain of surgical workflow recognition [1]. Recent studies that apply machine learning algorithms to this task have shown the most promising results [2]. To further support advances in this area, academic machine learning competitions are hosted regularly [3,4]. However, despite the progress in surgical workflow recognition, the developers of machine learning algorithms are faced with several challenges that result from the heterogeneous nature and complexity of surgical workflows, and the temporal correlation of sensor data.
Specifically, one of the major challenges of surgical workflow data lies in the unequal distribution of classes (i.e., surgical phases), which is commonly referred to as data imbalance in the machine learning literature. This issue is further exacerbated by the fact that some phases can occur several times during surgery while other phases may not occur at all. This results in an imbalanced representation of classes in the dataset which in turn hinders the ability of machine learning classifiers to accurately predict the underrepresented classes. To ensure that a machine learning model can properly learn to discriminate surgical phases, all dataset splits into train, validation, and test sets must follow similar distributions. Besides, the surgical phases strongly correlate with the instruments that are used during the phase. Therefore, unequal distribution of phases also affects the distribution of sub-properties in the datasets, such as surgical instruments.
In this work, we present an interactive data visualization application that facilitates the assessment of dataset splits for surgical phase and instrument recognition with regard to the aforementioned challenges. The main goal of this work is to provide a data visualization tool that can be used by machine learning practitioners as well as biomedical challenge organizers to gain insights into dataset splits of surgical workflow data.

Related work
With the advent of deep learning, the topics of automatic phase and instrument recognition have gained considerable traction. In one of the earliest studies on this topic, Twinanda et al. [5] fine-tune a convolutional neural network for joint phase and instrument recognition and apply a hidden Markov model to enforce temporal dependencies of phase predictions. Another study by Jin et al. [6] presents an improvement upon the previous work by training a deep convolutional network and a recurrent neural network in an end-to-end manner. Recently, a multi-stage temporal convolutional network has been successfully applied to the task of surgical phase recognition by Czempiel et al. [7]. In the latest studies, the focus has shifted towards transformer architectures [8-12].
Data visualization techniques represent a promising approach that can facilitate the exploration of surgical workflows. Yet, only limited research on visualization techniques for the analysis of surgical workflows has been conducted so far. Previously, Blum et al. [13] used a hidden Markov model to derive a workflow model from a set of procedures and visualize it as a graph. One of the most recent studies by Mayer et al. [14] presents an interactive visualization method that focuses on the analysis of temporal relationships within the surgical workflow data and provides means for comparing sets of procedures (e.g., stratified by surgeon, pathology, etc.).

Visualization Framework
The proposed visualization framework aims to facilitate interactive exploration of dataset splits for surgical workflow recognition. The scope of this work is limited to the analysis of surgical phase and instrument annotations. This data contains information that is crucial for creating representative dataset splits. With this data, information about the duration of surgeries, the occurrence of phases, as well as inter- and intra-phase instrument usage can be directly inferred. Furthermore, transitions between phases, co-occurrence of surgical instruments, and idle segments of surgeries also need to be considered during the preparation of the dataset split.
The framework comprises two main views that focus on the visualization of surgical phases and instruments. Further supplementary views provide a general overview of the dataset. The colors red, green, and blue are used consistently across all views to encode attributes of the training, validation, and test set respectively. In this section, we use eight proctocolectomy surgeries from the "Surgical Workflow Analysis in the sensorOR 2017" challenge dataset [3] and select the frames that are annotated with both phases and instruments. The live application can be accessed at https://cardio-ai.github.io/endovis-ml/.

Phase view
In this view, phases are visualized as nodes along the horizontal axis, ordered according to their conceptual order from left to right (see Figure 1). Each node contains a donut chart that represents the proportion of frames that are assigned to the corresponding dataset split. Furthermore, the center of each node shows the number of surgeries in which the phase occurs. Phase transitions are visualized as arcs between individual nodes, with the number of times a transition between two phases occurs mapped to the arc width. Since transitions can occur in both directions, forward transitions are displayed in the upper half, while backward transitions are placed in the lower half of the chart. The overall distribution of frames across surgical phases is displayed as a bar chart below the phase nodes. Finally, the horizontal bars at the bottom of the view visualize the frequency of instrument occurrences during each phase and can be re-scaled in various ways depending on the analysis goal.
In order to support interactive exploration of the data, several interaction techniques are implemented in the phase view. By selecting individual phase nodes, filtering is applied across other views to display frames for the selected set of phases. Furthermore, surgeries can be filtered by the occurrence of a particular phase transition. The phase view and other views are updated accordingly to display the surgeries that contain the selected transition. Besides, the occurrence of phase transitions in the training, validation, and test sets can be displayed upon selecting the corresponding option in the phase view menu.
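The transition counts that the arcs encode can be derived directly from per-frame phase labels. The following is a minimal sketch, not the application's implementation: the function names, the integer phase labels, and the toy surgeries are all hypothetical. It collapses runs of identical labels, counts each (source, target) transition per split, and reports transitions that occur in some split but are missing from another.

```python
from collections import Counter
from itertools import groupby


def phase_transitions(frame_phases):
    """Count (source, target) phase transitions in one surgery.

    frame_phases holds one phase label per frame. Consecutive duplicates
    are collapsed first so that only actual phase changes are counted.
    """
    visited = [phase for phase, _ in groupby(frame_phases)]
    return Counter(zip(visited, visited[1:]))


def transitions_per_split(surgeries, split):
    """Aggregate transition counts per split.

    surgeries: {surgery_id: [phase label per frame]}
    split:     {split_name: [surgery ids]}
    """
    totals = {}
    for name, ids in split.items():
        counts = Counter()
        for sid in ids:
            counts += phase_transitions(surgeries[sid])
        totals[name] = counts
    return totals


def unrepresented(totals):
    """For each split, list transitions seen in other splits but not in it."""
    seen = set().union(*(set(c) for c in totals.values()))
    return {name: sorted(seen - set(c)) for name, c in totals.items()}


# Hypothetical toy data: phases are integers, one label per frame.
surgeries = {
    "s1": [0, 0, 1, 1, 2, 2],
    "s2": [1, 1, 2, 1, 2],  # skips phase 0 and moves back from 2 to 1
    "s3": [0, 1, 1, 2],
}
split = {"train": ["s1", "s2"], "test": ["s3"]}

totals = transitions_per_split(surgeries, split)
missing = unrepresented(totals)
print(missing)
```

In this toy split, the backward transition from phase 2 to phase 1 only occurs in the training set, which is the kind of case the arcs in the lower half of the chart make visible.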

Instrument view
The instrument view targets the visualization of instruments as well as the combinations of instruments that have been used at the same time, i.e., instrument co-occurrences (see Figure 2a). This visualization approach is based on the work by Alsallakh et al. [15] which targets analysis of set memberships of data elements. The radially arranged bar charts show the total number of frames in which a surgical instrument is visible in each set. Additionally, a bar chart that reflects the number of frames in which no instruments are visible, so-called idle frames, is also included in this view. The combinations of instruments are displayed as nodes in the center of the instrument view. The nodes themselves are represented as pie charts, where each segment of the pie chart shows the prevalence of this instrument combination in the training, validation, and test set. The positioning of the nodes is determined by a force-directed layout algorithm.
To facilitate the exploration of the surgical instrument data, several interaction techniques are implemented in this view. By selecting an instrument, all instrument co-occurrence nodes that involve the selected instrument are highlighted in the instrument view. Besides, co-occurrence nodes can be selected individually, which reveals the proportion of co-occurrence frames in relation to the frames of the involved instruments (see Figure 2b). Upon filtering of instruments or instrument co-occurrences, other views of the visual framework are updated accordingly to show the selected frames.
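The quantities behind the co-occurrence nodes can be sketched in a few lines. This is an illustrative example under assumptions, not the tool's code: the function names and the toy annotations are hypothetical, and each frame is represented simply as the set of instruments visible in it (an empty set standing for an idle frame).

```python
from collections import Counter


def combination_counts(frames):
    """Count frames per instrument combination.

    frames: iterable of collections of instrument names visible in a frame.
    The empty combination corresponds to an idle frame.
    """
    return Counter(frozenset(instruments) for instruments in frames)


def missing_combinations(split_frames):
    """Map each split to the combinations that occur only in other splits."""
    counts = {name: combination_counts(f) for name, f in split_frames.items()}
    seen = set().union(*(set(c) for c in counts.values()))
    return {name: seen - set(c) for name, c in counts.items()}


# Hypothetical toy annotations, one set of instruments per frame.
split_frames = {
    "train": [
        {"grasper"},
        {"grasper", "bipolar"},
        set(),
        {"grasper", "bipolar", "irrigator"},
    ],
    "val": [{"grasper"}, set()],
    "test": [
        {"grasper"},
        {"grasper", "bipolar", "irrigator"},
        {"bipolar", "grasper"},
    ],
}

missing = missing_combinations(split_frames)
for name, combos in missing.items():
    print(name, [sorted(c) for c in combos])
```

Using `frozenset` keys makes the combination independent of the order in which instruments are listed, so `{"bipolar", "grasper"}` and `{"grasper", "bipolar"}` are counted as the same node.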

Supplementary views
The main views are enhanced by two supplementary views which provide a general overview of the dataset. The first supplementary view is a table that shows the partitioning of surgeries into the training, validation, and test sets. The individual surgeries can be interactively re-assigned to a different set via drag and drop. The second supplementary view encompasses two bar charts that display the total number of surgeries and frames for each set. Additionally, a set of bar charts displaying the number of frames for each individual surgery is arranged on the right side of the view. The average number of frames for each set is shown as a dashed line in the bar charts.

Evaluation and results
The proposed visualization framework is evaluated through a user study which is performed on a different dataset, namely Cholec80 [5]. In addition to the user study, we perform analysis of the most commonly used dataset splits of the Cholec80 dataset using the proposed visualization framework.

User study
In total, ten participants with a data science background were recruited to participate in the evaluation study of the proposed visualization framework. After a brief introduction to the domain of surgical phase recognition, the participants were asked to solve ten tasks covering a wide range of possible exploratory analyses that can arise during the preparation of a dataset for surgical phase recognition. Descriptions of the tasks are enclosed in the supplementary information. To measure the results of this study, we used the task completion rate, which takes the value 1 if the participant solves the task correctly and 0 otherwise. Overall, the majority of the tasks were completed successfully by ≥ 80% of participants (see Figure 3). The tasks T2 and T6 are exceptions with the overall worst completion rates, solved correctly by 30% and 40% of participants respectively.
Fig. 3 Overall task completion percentage with the corresponding 95%-confidence intervals.
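As an illustration of how a completion percentage and a 95% confidence interval can be obtained from binary task outcomes, the sketch below uses the Wilson score interval. This is an assumption made for the example: the text does not state which interval method was used for Figure 3, and the function names are hypothetical.

```python
from math import sqrt


def completion_rate(outcomes):
    """Fraction of participants who solved the task (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Assumption: the interval type is chosen for illustration only.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Task T2 in the study: solved correctly by 3 of 10 participants.
t2_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rate = completion_rate(t2_outcomes)
low, high = wilson_interval(sum(t2_outcomes), len(t2_outcomes))
print(f"T2: {rate:.0%} (95% CI {low:.3f} to {high:.3f})")
```

With only ten participants the interval is wide, which is worth keeping in mind when comparing tasks whose completion rates differ by one or two participants.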
After completing the tasks, the participants were asked to fill out the System Usability Scale (SUS) [16] questionnaire. It consists of ten statements that the study participants rated on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The ratings of the statements are then used to calculate the SUS score which expresses the usability of the system. The value of the score ranges between 0 and 100, with higher values expressing better usability. The proposed application reached a SUS score of 81.25.
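For readers unfamiliar with the scoring, the standard SUS computation can be sketched as follows: odd-numbered (positively worded) items contribute their rating minus 1, even-numbered (negatively worded) items contribute 5 minus their rating, and the sum is multiplied by 2.5. The function name and the example ratings are hypothetical, and averaging the per-participant scores is assumed to be how a study-level score such as the one above is obtained.

```python
from statistics import mean


def sus_score(ratings):
    """Compute the SUS score (0 to 100) for one participant.

    ratings: ten Likert ratings (1 to 5) in questionnaire order.
    """
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly ten ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    total = 0
    for item, rating in enumerate(ratings, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


# Hypothetical ratings from two participants.
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral throughout
]
scores = [sus_score(r) for r in participants]
print(scores, mean(scores))
```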

Analysis of common Cholec80 dataset splits
In order to validate the proposed framework, we analyze the most common splits [17] of the Cholec80 dataset [5] using our visualization framework and report our observations. We downsample phase annotations of the Cholec80 dataset to 1 fps to obtain frames with both phase and instrument labels. For simplicity, this analysis does not cover cross-validation splits.
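The downsampling step can be sketched as below. The source rate of 25 fps for the phase annotations is an assumption of this example (the text only states the 1 fps target), and the function names and label values are hypothetical.

```python
def downsample(labels, source_fps=25, target_fps=1):
    """Keep one label per target-rate frame by regular subsampling.

    Assumption: source_fps is an integer multiple of target_fps.
    """
    if source_fps % target_fps != 0:
        raise ValueError("source_fps must be a multiple of target_fps")
    step = source_fps // target_fps
    return labels[::step]


def align(phase_labels, instrument_labels, source_fps=25):
    """Pair downsampled phase labels with 1 fps instrument labels.

    The two sequences are truncated to their common length, because the
    annotation files may differ by a frame at the end of a video.
    """
    phases = downsample(phase_labels, source_fps=source_fps, target_fps=1)
    n = min(len(phases), len(instrument_labels))
    return list(zip(phases[:n], instrument_labels[:n]))


# Hypothetical example: 3 seconds of phase labels at 25 fps.
phase_labels = ["Preparation"] * 50 + ["Calot triangle dissection"] * 25
instrument_labels = [{"grasper"}, {"grasper"}, {"grasper", "hook"}]
frames = align(phase_labels, instrument_labels)
print(frames)
```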

40/-/40 split
In the 40/-/40 split, which is used in the studies [5,18], all surgical phases are represented in both sets. However, a closer inspection of phase transitions unveils a group of nine surgeries (10, 13, 19, 22, 23, 29, 32, 33, 38) that deviate from the standard workflow by skipping the first phase and initiating the surgery directly in the second phase (see Figure 4a). Notably, all of the nine surgeries are assigned to the training set, therefore the evaluation of the model's performance on the test set does not include this special workflow. In addition, another unique workflow that only occurs in three surgeries (12, 14, 32) in the training set can be identified using the proposed visualization (see Figure 4b). After the Gallbladder packaging phase, these three surgeries move on to the Gallbladder retraction phase, thus omitting the Cleaning coagulation phase. Subsequently, the surgeries return to the previously skipped Cleaning coagulation phase, which is also the final phase of the three surgeries. Since this unique sequence of phases only appears in the training set, it is not included in the evaluation of the machine learning model. With this information at hand, the split can be optimized by re-assigning the surgeries 32, 33, and 38 to the test set, as interactively determined in our tool. Accordingly, three randomly selected surgeries 58, 66, and 71 from the test set are assigned to the training set to retain the 40/-/40 split. As a result of this re-partition, the aforementioned cases of phase transitions now also appear in the test set.
Regarding the instrument use, the proposed visualization shows that all of the individual instruments are represented in all sets and also follow similar distributions. Nevertheless, there are several instrument combinations that do not occur in one of the sets (see Figure 4c). However, these instrument combinations mostly represent rare cases, as they account for only a small fraction of the dataset and appear in single surgeries.
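The re-partition described above amounts to an equal-sized swap between two sets. The sketch below reproduces it outside the tool; the function name is hypothetical, and assigning surgeries 1 to 40 to training and 41 to 80 to testing is an assumption about the 40/-/40 split that is consistent with the surgery numbers quoted in the text.

```python
def swap_surgeries(split, src, dst, move_out, move_in):
    """Return a new split with move_out moved from src to dst and
    move_in moved from dst to src, so that both set sizes are preserved."""
    move_out, move_in = set(move_out), set(move_in)
    if len(move_out) != len(move_in):
        raise ValueError("swap must exchange the same number of surgeries")
    new = {name: set(ids) for name, ids in split.items()}
    if not move_out <= new[src] or not move_in <= new[dst]:
        raise ValueError("surgery not found in the expected set")
    new[src] = (new[src] - move_out) | move_in
    new[dst] = (new[dst] - move_in) | move_out
    return new


# Assumption: surgeries 1-40 form the training set, 41-80 the test set.
split = {"train": set(range(1, 41)), "test": set(range(41, 81))}
new_split = swap_surgeries(split, "train", "test", [32, 33, 38], [58, 66, 71])

# Surgery groups reported in the text.
skip_first_phase = {10, 13, 19, 22, 23, 29, 32, 33, 38}
late_cleaning = {12, 14, 32}
for name, ids in new_split.items():
    print(name, len(ids),
          sorted(ids & skip_first_phase), sorted(ids & late_cleaning))
```

After the swap, both special workflows are present in the training and the test set, while the set sizes stay at 40 each.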

32/8/40 split
To perform model selection or hyperparameter search, studies [6,7,11,19] use eight surgeries from the training set for validation, resulting in a 32/8/40 split. This split yields sufficient representation of phases across sets. However, surgeries from the validation set have fewer frames on average (≈ 1,900 frames) than the training and test sets with ≈ 2,200 and ≈ 2,500 frames respectively (see Figure 5a). In particular, the disparity between the average duration of surgeries from the validation and test set (≈ 10 min) might affect the performance estimation on these sets. As the surgery duration can indicate its complexity, the surgeries from the validation set may be easier to infer.
Similar to the 40/-/40 split, the surgeries skipping the first phase are found exclusively in the training and validation sets. This can be solved with our tool by assigning the surgeries 10 and 13 from the training to the test set, and randomly selected surgeries 48 and 64 from the test to the training set. Besides, the 32/8/40 split entails a reduction of the training set size. This becomes especially apparent in the case of the two phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging), as they are reduced from three occurrences to just a single occurrence in the training set, as opposed to two and nine occurrences in the validation and test set respectively (see Figure 5b). This will presumably hinder the generalization of the model. Regarding the instruments, the co-occurrences of surgical instruments that are missing in one of the sets are more prevalent in this split due to the additional validation set. One notable example is the simultaneous use of grasper, bipolar, and irrigator, which occurs in 503 frames in the training set and in 154 frames in the test set but is not represented in the validation set (see Figure 5c).
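The duration disparity follows from the frame counts because the annotations are sampled at 1 fps, so one frame corresponds to one second. A small sketch with hypothetical function names and toy frame counts:

```python
from statistics import mean


def mean_frames(frame_counts, split):
    """Average number of frames per surgery for each set.

    frame_counts: {surgery_id: number of frames}
    split:        {split_name: [surgery ids]}
    """
    return {name: mean(frame_counts[s] for s in ids)
            for name, ids in split.items()}


def gap_minutes(frames_a, frames_b, fps=1.0):
    """Difference between two frame counts expressed in minutes."""
    return abs(frames_a - frames_b) / fps / 60


# Rounded averages reported in the text for the validation and test sets.
gap = gap_minutes(1900, 2500)
print(gap)

# Hypothetical toy data showing the per-set averages.
frame_counts = {"a": 1800, "b": 2000, "c": 2400, "d": 2600}
averages = mean_frames(frame_counts, {"val": ["a", "b"], "test": ["c", "d"]})
print(averages)
```

With the rounded averages from the text, 600 frames at 1 fps correspond to 600 s, i.e., the ≈ 10 min stated above.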

40/8/32 split
Instead of setting aside eight surgeries from the training set, two studies [7,20] select eight surgeries from the test set for validation, thus creating a 40/8/32 split. In this split, all phases as well as single instruments are present in all sets and also follow similar distributions. Similar to the original 40/-/40 split, surgeries starting in the Calot triangle dissection phase are exclusive to the training set. Furthermore, the three surgeries that move on from Gallbladder packaging to Gallbladder retraction and end in the Cleaning coagulation phase are also found only in the training set. This particular issue can be addressed by moving the surgery 32 to the validation set, the surgeries 33 and 38 to the test set, and randomly selected surgeries 46, 58, and 70 to the training set to retain the 40/8/32 split.
Compared to the 32/8/40 split, the validation set holds a larger number of frames, thus resulting in a better coverage of various cases (see Figure 6a). Furthermore, the phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging) now appear three times in the training set, thus providing more samples for the training of the model (see Figure 6b). Considering the co-occurrence of instruments, an improvement over the 32/8/40 split can be observed, as the combination of grasper, bipolar, and irrigator now also appears in 47 frames in the validation set (see Figure 6c).

Summary of unrepresented cases
Table 1 shows common dataset splits of the Cholec80 dataset as well as the number of phase transitions, intra-phase instruments, and instrument combinations that are not represented in one of the sets. As can be seen in the table, the training set is unaffected by the different split variants; however, the number of unrepresented cases in the validation and test sets changes dramatically between splits. Visualizations of the splits are provided in the supplementary information.

Discussion and future work
This work presents a publicly available visualization framework that facilitates interactive assessment of dataset splits for surgical phase and instrument recognition. The motivation for this has been previously outlined in some studies. Zisimopoulos et al. [22] report a high discrepancy of the model's performance on validation and test sets which is attributed to some phases missing in the validation set. Moreover, the effects of imbalances of instrument co-occurrences on the model's performance have been previously highlighted by Sahu et al. [23]. The visualization framework presented in this work is specifically designed to address these cases.
Using our tool, we were able to pinpoint several aspects of the splits that can distort the evaluation of the model's performance. Moreover, the application enabled us to eliminate some of these issues by manually re-partitioning the sets. Furthermore, we discovered that the number of unrepresented cases varies significantly between dataset splits. In future work, algorithms for the generation of optimal dataset splits [24] can be explored. The user study shows promising results, as most of the tasks were correctly solved by at least 80% of the participants. The positive outcome of the user study is further supported by the SUS score of 81.25, which indicates above-average usability of the system.
The scope of this application is limited to the analysis of phase and instrument annotations. However, visual features, such as bad lighting conditions, over- or underexposed instruments, and occlusions have a high influence on the performance of the model [6]. Future work should also include analysis of distributions of various visual features. Correspondingly, the application can also be extended to support adjacent tasks including instrument and pathology detection or segmentation with bounding-box or pixel-level predictions. Finally, we also believe that integration of more fine-grained surgical activity information, such as action triplets [25], can provide a more sophisticated overview of surgical workflows.

Conclusion
In this work, we presented a publicly available application implemented for the research community that aims to facilitate visual exploration of dataset splits for surgical phase and instrument recognition. To validate the design of our application, we conducted a user study with ten participants. Further, we performed an analysis of the most common Cholec80 splits and identified improvements of the splits using our tool. The results indicate that the proposed application can enhance the development process of machine learning models for surgical phase recognition by providing insights into the dataset splits, potentially resulting in more reliable performance evaluations. Furthermore, we believe that organizers of biomedical challenges can also greatly benefit from the proposed framework during the preparation of challenge datasets.

Analysis of common Cholec80 dataset splits
Table 2 shows common Cholec80 dataset splits as well as the allocation of surgeries to sets. Further, we provide visualizations of these splits in Figure 1, Figure 2, Figure 3, and Figure 4.

Fig. 1 Phase view of the proposed application with eight proctocolectomy surgeries from the "Surgical Workflow Analysis in the sensorOR 2017" challenge dataset.

Fig. 2 Instrument view of the proposed application with eight proctocolectomy surgeries from the "Surgical Workflow Analysis in the sensorOR 2017" challenge dataset (A) and selected combination of grasper and ligasure (B).

Fig. 4 Characteristics and shortcomings of the 40/-/40 split. Surgeries starting in the Calot triangle dissection phase are only present in the training set (A). The ending sequence Gallbladder retraction to Cleaning coagulation occurs only in the training set (B). The instruments bipolar and scissors co-occur only in the training set (C).

Fig. 5 Characteristics and shortcomings of the 32/8/40 split. Surgeries from the validation set have fewer frames on average, compared to the training and test sets (A). The phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging) occur only once in the training set (B). The simultaneous occurrence of the instruments grasper, bipolar, and irrigator is not represented in the validation set (C).

Fig. 6 Characteristics of the 40/8/32 split. Surgeries from the validation set contain more frames on average than surgeries from other sets (A). Furthermore, this split provides a better coverage of the phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging) in the training set, compared to the 32/8/40 split (B). The combination of grasper, bipolar, and irrigator appears in all sets (C).

Fig. 1 Screenshot of the application with the 40/-/40 split of the Cholec80 dataset.

Fig. 2 Screenshot of the application with the 32/8/40 split of the Cholec80 dataset.

Fig. 4 Screenshot of the application with the 40/24/16 split of the Cholec80 dataset.

Table 1 Number of unrepresented cases in the training, validation, and test sets that were discovered using the proposed visualization framework.