The aim of Surgical Data Science (SDS) is to analyze data sources acquired during surgical treatment in order to improve patient safety and clinical outcomes [1]. In contrast to traditional clinical research, which centers on preoperative patient characteristics and postoperative outcomes, SDS considers the entire data stream of surgical treatment. To unravel the “black box” of the operating room (OR) and the impact of the understudied intraoperative phase on patient outcomes, SDS particularly analyzes data streams captured in the OR during surgery.

Since the introduction of video technology in minimally invasive surgery, video recordings of surgical interventions have become readily available. However, because manual analysis of surgical videos is time consuming, costly, and often lacks objectivity, the full potential of video analysis was rarely tapped in the past [2]. In recent decades, the evolution of computer vision (CV), the analysis of visual information by computer algorithms, has greatly expanded the potential of surgical video analysis.

One of the most analyzed procedures in SDS is laparoscopic cholecystectomy (LCHE). Classical CV tasks in surgical video analysis, such as phase recognition and tool presence detection, were initially developed for LCHE [3]. Moreover, safety feedback [4, 5] and surgical skill assessment algorithms [6] were trained on LCHE videos. However, the disadvantage of LCHE as a model intervention for SDS is that, given the limited size of available datasets, there are hardly any intraoperative events or postoperative complications to study. Therefore, recent SDS research focuses on longer and more complex procedures such as colorectal [7, 8] or bariatric surgery [9, 10].

Laparoscopic Roux-en-Y gastric bypass (LRYGB) is one of the most frequently performed bariatric procedures worldwide [11, 12]. Reported postoperative complication rates range between 4 and 13% [13,14,15]. Its technical standardization, moderate duration, and frequent postoperative outcome events make it an excellent candidate for CV-assisted video analysis.

Surgical interventions can be hierarchically decomposed into phases (e.g., access, mobilization, resection, reconstruction, disassembling), which consist of more fine-grained steps (e.g., cavity exploration, trocar placement, retractor placement) [16]. In contrast to technical standardization, an ontology defines how to describe a surgical intervention in a structured and generic way [16, 17]. The word ontology is derived from the ancient Greek words ὄν (being, that which is) and λόγος (reason, rationale, principle, logical reasoning) and refers to the ‘study of being’.

However, most datasets in SDS are small, monocentric, and lack external validation. To overcome these limitations, larger and multicentric datasets are warranted. Furthermore, the variability of ontologies and of their application to datasets limits generalization across centers [18]. To ensure data quality and reliable algorithms, procedure-specific ontologies need to be defined and validated multicentrically.

This work is the first to propose an LRYGB ontology for phases and steps and to validate it on a multicentric dataset in terms of inter- and intra-rater reliability. The application of this ontology enables workflow analysis across surgeons and centers. Furthermore, it facilitates downstream SDS applications, such as the training of phase and step recognition algorithms using artificial intelligence for automated surgical video analysis.

Methods

Ontology

The proposed LRYGB ontology was developed by the surgical staff of the Department of Digestive and Endocrine Surgery at Nouvel Hôpital Civil (University Hospital of Strasbourg), France [10]. Based on pre-recorded anonymized surgical videos, the procedure was hierarchically broken down into phases and steps. Phases were defined as the first-level temporal components that must be executed sequentially to achieve the surgical objectives, whereas steps were defined as the sets of actions that must be accomplished within a phase to complete its task. The hierarchical structure of the ontology is displayed in Fig. 1. Subsequently, a temporal annotation framework defining the start and end time of each phase and each step was agreed upon by consensus. This ontology was presented, discussed, and validated by a panel of international faculty attending the Laparoscopic and Endoluminal Bariatric and Metabolic Surgery Course held at IRCAD France from November 28 to December 01, 2018. It was adapted for multicentric use and contains 12 phase and 46 step definitions, as outlined in the Supplementary Material (Tables S1, S2).

Fig. 1

Hierarchical structure of phases and steps in the proposed laparoscopic Roux-en-Y gastric bypass ontology. Facultative phases and steps have a dashed border
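To illustrate the temporal annotation framework, the sketch below shows one possible machine-readable representation of a temporally annotated phase with its nested steps. The field names, timestamps, and layout are our own illustrative assumptions; the authoritative phase and step definitions are those in Tables S1 and S2.

# Illustrative sketch only: one possible encoding of a temporally annotated
# LRYGB video following the proposed hierarchical ontology. Field names and
# timestamps are hypothetical; definitions follow Tables S1 and S2.
annotated_video = {
    "video_id": "case_001",                     # hypothetical identifier
    "phases": [
        {
            "phase": "gastric pouch creation",  # e.g., phase 2 of the ontology
            "start_ms": 412_000,
            "end_ms": 1_986_000,
            "steps": [
                {"step": "<step as defined in Table S2>",
                 "start_ms": 412_000, "end_ms": 530_000},
                # ... further steps of this phase
            ],
        },
        # ... subsequent phases, executed sequentially
    ],
}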

Datasets

Two board-certified visceral surgeons (referred to as raters), each with over 10 years of clinical expertise, applied the proposed ontology to two datasets. (1) The StraBypass40 dataset consists of 40 LRYGB videos recorded at Nouvel Hôpital Civil, University Hospital of Strasbourg, France [10]. (2) The BernBypass70 dataset consists of 70 LRYGB videos recorded at Inselspital, Bern University Hospital, Bern, Switzerland.

Intervention

Inter-rater reliability (inter-RR) defines the extent of agreement among observers, whereas intra-rater reliability (intra-RR) defines the consistency of the observations of a given observer over time. To assess the inter-RR of the proposed LRYGB ontology, ten randomly chosen videos of the StraBypass40 and BernBypass70 datasets were annotated by both raters according to the phase and step definitions provided in the Supplementary Material (Tables S1, S2), using the in-house video annotation tool MOSaiC, and the annotations of the two raters were compared. To assess the intra-RR of the proposed LRYGB ontology, ten randomly chosen videos were annotated a second time by the same rater after a wash-out period of 1 month, and the two sets of annotations were compared.

Evaluation

Inter- and intra-RR were calculated using accuracy, precision, recall, and F1-score. Accuracy is the proportion of correct predictions among the total number of observations. Precision is the proportion of true positives among all (true and false) positives and is also referred to as the positive predictive value. Recall is the proportion of true positives among all relevant observations (true positives and false negatives) and is also referred to as sensitivity. The F1-score is the harmonic mean of precision and recall and serves as a combined measure of accuracy.
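For a single class, with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives, these metrics take their standard form (in the multi-class phase and step setting they are typically computed per class and averaged, an implementation detail not prescribed by the ontology itself):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]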

Furthermore, average transitional delay, noise level, and a coefficient of transitional moments were calculated as application-dependent metrics, as proposed in [19]. Every transition from one phase to the next or from one step to the next is considered a transitional moment. The average transitional delay is the average delay between the annotated and the real transitional moment; it can be positive or negative. The noise level is the proportion of annotated phases or steps that are not part of a real transitional moment among all annotated phases or steps. The coefficient of transitional moments is the ratio of annotated to real transitional moments. The transitional delay threshold was set to 5 s. In addition, Cohen’s kappa was used to calculate inter-rater reliability in order to account for chance agreement between raters [20].
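The following minimal sketch illustrates how these application-dependent metrics could be computed from two frame-level label sequences. It is an assumption-laden illustration, not the exact formulation of [19]: the function names, the sampling-rate parameter fps, and the greedy matching of annotated to real transitions within the 5 s window are ours.

from typing import List, Tuple


def transitions(labels: List[str]) -> List[Tuple[int, str]]:
    """Indices at which the label changes, together with the new label."""
    return [(i, labels[i]) for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


def transition_metrics(real: List[str], annotated: List[str],
                       fps: float = 1.0, threshold_s: float = 5.0):
    """Average transitional delay (s), noise level, and coefficient of transitional moments."""
    real_t, ann_t = transitions(real), transitions(annotated)
    tol = threshold_s * fps
    matched, delays = set(), []
    for a_idx, a_label in ann_t:
        # greedily match each annotated transition to the closest unmatched real
        # transition into the same label, provided it lies within the tolerance window
        candidates = [(abs(a_idx - r_idx), r_idx) for r_idx, r_label in real_t
                      if r_label == a_label and r_idx not in matched
                      and abs(a_idx - r_idx) <= tol]
        if candidates:
            _, r_idx = min(candidates)
            matched.add(r_idx)
            delays.append((a_idx - r_idx) / fps)          # signed delay in seconds
    avg_delay = sum(delays) / len(delays) if delays else 0.0
    noise_level = 1 - len(delays) / len(ann_t) if ann_t else 0.0  # unmatched annotated transitions
    coeff_tm = len(ann_t) / len(real_t) if real_t else float("nan")
    return avg_delay, noise_level, coeff_tm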

The comparison of two sets of annotations is not symmetric. In the validation of computer algorithms, the human annotation always serves as ground truth. Since this study compares two sets of human annotations, each set was treated once as ground truth and the metrics were averaged across both comparisons. All metrics were applied to every video separately, for phases and steps, at millisecond resolution, and averaged across datasets.
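A minimal sketch of this symmetric evaluation, assuming two frame-level label lists (one entry per millisecond) and using scikit-learn with macro-averaging purely for illustration (the class-averaging strategy is our assumption):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def symmetric_scores(rater_a, rater_b):
    """Accuracy, precision, recall, F1, treating each rater once as ground truth."""
    per_direction = []
    for y_true, y_pred in ((rater_a, rater_b), (rater_b, rater_a)):
        acc = accuracy_score(y_true, y_pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0)
        per_direction.append((acc, prec, rec, f1))
    # average every metric over both directions of the comparison
    return tuple(sum(values) / 2 for values in zip(*per_direction))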

Results

For StraBypass40, the mean ± SD video duration was 108 ± 33 min. An average LRYGB video consisted of 10 phases and 33 steps.

For BernBypass70, the mean ± SD video duration was 75 ± 21 min. An average LRYGB video consisted of 8 phases and 27 steps.

Average phase and step durations in the StraBypass40 and BernBypass70 datasets are displayed in Fig. 2.

Fig. 2

Average duration of a phases and b steps. The labels correspond to the respective phase / step as outlined in Fig. 1

Quantitative inter-RR results for StraBypass40 and BernBypass70, as well as overall inter-RR results, are shown in Table 1. Intra-RR results are shown in Table 2. Qualitative results, in the form of a visual comparison of the best and worst matching phase and step annotation pairs for StraBypass40 and BernBypass70, are shown in Fig. 3.

Table 1 Validation of the laparoscopic Roux-en-Y gastric bypass ontology: Inter-rater reliability results
Table 2 Validation of the laparoscopic Roux-en-Y gastric bypass ontology: Intra-rater reliability results
Fig. 3

Visual comparison of annotations. a Phase annotations, b step annotations. The top row shows the best matching annotation pairs and the bottom row the worst matching annotation pairs of the StraBypass40 and BernBypass70 datasets. The width of each phase / step corresponds to its relative duration and the labels correspond to the respective phase / step as outlined in Fig. 1

Across datasets, inter-RR and intra-RR metrics show better results for phase than for step annotation. Furthermore, inter-RR metrics on StraBypass40 show better results than on BernBypass70. The application-dependent metrics show a 0.4–1.2% boost compared to the classical metrics.

Discussion

The excellent inter-RR of 95.9% for phases and 80.8% for steps demonstrates that the proposed ontology can be applied easily and reliably for the annotation of LRYGB phases and steps by multiple raters in multiple institutions. Moreover, the excellent intra-RR of 98.4% for phases and 88.1% for steps shows that the annotations of the same rater are consistent over time.

Given these results, we advocate the routine use of the proposed ontology in LRYGB. This will standardize video analysis of LRYGB surgery and allow the comparison of surgical workflows across surgeons and centers. The routine use of this ontology facilitates standardized video review for educational purposes, performance assessment, and quality improvement programs. Furthermore, it enables downstream applications such as the training of artificial intelligence algorithms to automatically recognize phases and steps and to provide intraoperative feedback or assistance.

For laparoscopic cholecystectomy, one of the most analyzed surgeries in SDS, a systematic review identified 8 different phase definitions in the literature [18]. Multiple phase definitions hinder the comparison of results across datasets and institutions. Therefore, with the definition and multicentric validation of an LRYGB phase and step ontology, we aim to prevent the emergence of multiple competing ontologies. To implement the proposed ontology on a global scale, awareness of surgical video recording in general must be raised and, in particular, a broader consensus among bariatric surgeons must be established using the Delphi method.

Inter-RR and intra-RR metrics are higher for phase than for step annotation. Comprising 12 phases and 46 steps, the proposed ontology is less granular on the phase than on the step level. This leads to lower variability in phase than in step annotations. Therefore, the ontology performs better in terms of inter- and intra-RR on the phase than on the step level.

As the two datasets are from different institutions and therefore represent different surgical techniques, the aim of this study is not to compare them. However, to understand the performance difference of the proposed ontology on StraBypass40 and BernBypass70, it is crucial to elaborate on how they differ. When comparing StraBypass40 with BernBypass70, there is a considerable difference in average video duration (108 vs. 75 min). This is also reflected by the greater average number of phases and steps in StraBypass40 compared to BernBypass70 (10 vs. 8 phases, 33 vs. 27 steps). In StraBypass40, the creation of the gastric pouch (phase 2, 26 vs. 15 min) and the creation of the jejunojejunal anastomosis (phase 8, 24 vs. 18 min) take considerably longer than in BernBypass70. The main differences in surgical technique between the datasets are the routine division of the omentum (phase 3, 95 vs. 36%), Petersen space closure (phase 7, 98 vs. 16%), and mesenteric defect closure (phase 9, 100 vs. 21%) in StraBypass40 compared to BernBypass70.

Inter-RR metrics on StraBypass40 show better results than on BernBypass70. This is likely an effect of the difference in average video duration between the datasets: given the same number of phase and step transitions, the longer a video is, the less the metrics are influenced by a single transitional delay.

Using the application-dependent metrics, a 0.4–1.4% boost in accuracy, precision, recall, and F1-score can be observed. This is due to the relaxation of transitional moments by the extension of the acceptable transitional delay. Setting the transitional delay threshold allows the metrics to be tailored to the desired use case. For estimating the remaining time of an intervention with a phase and step recognition algorithm, a 5 s delay is reasonable. For real-time intraoperative decision support, however, a 5 s delay is too long and would limit the acceptance of the application. Considering the first use case, the transitional delay threshold in this study was set to 5 s to calculate the application-dependent metrics as proposed in [19].

Limitations

Despite being multicentric, this study includes only two raters and two institutions. As surgical video annotation is time consuming and requires domain expertise, it is expensive. Therefore, annotation resources must be allocated carefully.

Conclusion

The proposed ontology shows excellent inter- and intra-RR and should therefore be routinely used for phase and step annotation of LRYGB videos. This will facilitate education, performance assessment, and quality improvement programs. Moreover, the application of the proposed ontology will enable the development of downstream tasks such as automated phase and step recognition, intraoperative feedback, or assistance through the use of artificial intelligence.