Deep learning-based automatic surgical step recognition in intraoperative videos for transanal total mesorectal excision

Background Dividing a surgical procedure into a sequence of identifiable and meaningful steps facilitates intraoperative video data acquisition and storage. These efforts are especially valuable for technically challenging procedures that require intraoperative video analysis, such as transanal total mesorectal excision (TaTME); however, manual video indexing is time-consuming. Thus, in this study, we constructed an annotated video dataset for TaTME with surgical step information and evaluated the performance of a deep learning model in recognizing the surgical steps in TaTME. Methods This was a single-institutional retrospective feasibility study. All TaTME intraoperative videos were divided into frames. Each frame was manually annotated as one of the following major steps: (1) purse-string closure; (2) full thickness transection of the rectal wall; (3) down-to-up dissection; (4) dissection after rendezvous; and (5) purse-string suture for stapled anastomosis. Steps 3 and 4 were each further classified into four sub-steps, specifically, for dissection of the anterior, posterior, right, and left planes. A convolutional neural network-based deep learning model, Xception, was utilized for the surgical step classification task. Results Our dataset containing 50 TaTME videos was randomly divided into two subsets for training and testing with 40 and 10 videos, respectively. The overall accuracy obtained for all classification steps was 93.2%. By contrast, when sub-step classification was included in the performance analysis, a mean accuracy (± standard deviation) of 78% (± 5%), with a maximum accuracy of 85%, was obtained. Conclusions To the best of our knowledge, this is the first study based on automatic surgical step classification for TaTME. Our deep learning model self-learned and recognized the classification steps in TaTME videos with high accuracy after training. 
Thus, our model can be applied to a system for intraoperative guidance or for postoperative video indexing and analysis in TaTME procedures. Supplementary Information The online version contains supplementary material available at 10.1007/s00464-021-08381-6.


However, during TaTME, surgeons have experienced intraoperative technical difficulties in approximately 40% of cases; these difficulties include inaccurate plane dissection, pelvic bleeding, and visceral injuries [6]. Expert surgeons and early adopters of the TaTME procedure have acknowledged that these technical difficulties are partly due to unfamiliar views and the difficulty of interpreting the anatomy from below, which can make it hard to correctly recognize the appropriate tissue planes. This is likely to have been the cause of the early urethral injuries reported in the TaTME international registry data [7], which are complications rarely observed in conventional TME surgery.
Video-based learning for minimally invasive surgery is considered a useful teaching aid [8,9], and it is especially valuable in the case of TaTME with the risk of unexpected complications in patients. Consistent review of intraoperative laparoscopic videos could facilitate understanding of common errors during surgery and increase the awareness of potential injury mechanisms by acknowledging error-event patterns [10,11]. In addition, several studies showed that video-based learning contributed to reducing surgical error and improving surgical skill [12,13]; however, manual video review by humans is a time-consuming task.
Convolutional neural networks (CNNs) [14] are a type of artificial intelligence (AI) tool that can be utilized in the field of computer vision for deep learning-based image analysis [15]. Notably, CNNs could be used to review surgery videos in order to identify specific segments of a surgery [16][17][18][19]. This would make video-based learning for TaTME considerably more efficient by reducing the effort required in manual video indexing.
Thus, in this study, we constructed an annotated video dataset of the surgical steps of TaTME for use with a deep learning model, with the aim of promoting video-based learning for TaTME. Moreover, we evaluated the performance of the proposed deep learning model in analyzing intraoperative videos to identify the different surgical steps of TaTME.

Study design and patient cohort
This was a single-institutional retrospective feasibility study. Intraoperative video data for 50 patients who underwent TaTME at the Department of Colorectal Surgery at National Cancer Center Hospital East (Kashiwa, Japan) between May 2018 and July 2019 were randomly extracted for the study. However, intraoperative video data for cases wherein the perineal procedure was not properly recorded were excluded from this study.

Video dataset
In the video dataset, all perineal procedures of TaTME were performed laparoscopically, instead of robotically, and five attending colorectal surgeons performed the procedures. Among the five surgeons, one was a TaTME expert, three had performed 10-30 TaTME surgeries, and the remaining surgeon had performed fewer than 10 TaTME surgeries.
During preprocessing, the intraoperative TaTME videos were converted to MP4 video format with a display resolution of 1280 × 720 pixels and a frame rate of 30 frames per second (fps). After preprocessing, the video dataset was divided into training and testing sets with 80% and 20% of the data, respectively (i.e., 40 videos were utilized to train models, while 10 videos were utilized to test them). The data were split on a per-video rather than a per-frame basis; thus, frames from a video that were included in the training set were not present in the test set.
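The per-video split described above can be illustrated with a short sketch. This is not the authors' script: the function names, the case identifiers, and the random seed are hypothetical, and only the 80%/20% ratio and the rule that all frames of a video stay in the same subset come from the text.

```python
import random


def split_by_video(video_ids, test_fraction=0.2, seed=0):
    """Randomly assign whole videos (not individual frames) to train or test."""
    ids = sorted(video_ids)
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids = set(ids[:n_test])
    train_ids = set(ids[n_test:])
    return train_ids, test_ids


def frames_for(selected_ids, frames):
    """Keep only (video_id, frame_index) pairs that belong to the selected videos."""
    return [frame for frame in frames if frame[0] in selected_ids]


# Hypothetical toy data: 50 videos with 3 frames each.
video_ids = [f"case_{i:02d}" for i in range(1, 51)]
frames = [(video_id, k) for video_id in video_ids for k in range(3)]

train_ids, test_ids = split_by_video(video_ids)
train_frames = frames_for(train_ids, frames)
test_frames = frames_for(test_ids, frames)
```

Splitting at the video level avoids the leakage that would occur if near-identical neighboring frames from one procedure were placed in both subsets.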

Annotation of surgical steps
The surgical steps of TaTME for annotation in intraoperative videos were determined based on a previous study by Lacy et al. wherein the stepwise procedure for TaTME is described [3,20]. Given the nature of supervised deep learning, it is considered reasonable to define the surgical steps for the automatic classification task based on the stylized stepwise procedure. Each intraoperative video was manually annotated at 30 fps and parts of the video were manually classified into the following major steps: (1) purse-string closure; (2) full thickness transection of rectal wall; (3) down-to-up dissection; (4) dissection after rendezvous; and (5) purse-string suture for single stapling technique (SST). Steps 3 and 4 were each further classified into four sub-steps; specifically, dissection of the anterior, posterior, and both bilateral planes. In this study, the areas of the neurovascular bundle and pelvic splanchnic nerves were considered to be a part of the bilateral planes. Every annotation label was assigned manually and independently by two colorectal surgeons (DK and TI), both of whom had undergone sufficient annotation training and had sufficient knowledge of TaTME. Every discrepancy in the annotation labels was resolved via discussion. Details on each step, including the definitions of the start and end of a step, are summarized in Table 1.
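As an illustration of how discrepancies between two independent annotators might be located before discussion, the following sketch compares two per-frame label sequences and reports the frame ranges where they differ. It is only a sketch under stated assumptions: the function names and the toy labels are hypothetical, and the source does not describe the tooling actually used. Only the 30 fps annotation rate is taken from the text.

```python
FPS = 30  # annotation frame rate reported in the text


def disagreement_segments(labels_a, labels_b):
    """Return inclusive (start, end) frame ranges where the two annotators differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same frames")
    segments = []
    start = None
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        if a != b and start is None:
            start = i  # a disagreement run begins here
        elif a == b and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(labels_a) - 1))
    return segments


def to_seconds(segment, fps=FPS):
    """Convert an inclusive frame range to a (start, end) time span in seconds."""
    start, end = segment
    return (start / fps, (end + 1) / fps)


# Hypothetical toy labels: the annotators place the Step 1 -> Step 2 boundary
# 15 frames (0.5 s) apart.
annotator_1 = ["1"] * 60 + ["2"] * 60
annotator_2 = ["1"] * 45 + ["2"] * 75
segments = disagreement_segments(annotator_1, annotator_2)
```

Each reported range could then be reviewed jointly, which mirrors the discussion-based resolution described above.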

CNN model
In this study, a CNN model, Xception [21], was used for the TaTME surgical step classification task. The model was pre-trained using the ImageNet dataset, which consists of 14 million images of general objects, such as animals, scenes (e.g., beaches, mountains), and food [22]. Data augmentation was not performed.

Computer specifications
All modeling procedures were performed using a script written in Python 3.6. Furthermore, a computer equipped with an NVIDIA Quadro GP100 GPU with 16 GB of VRAM (NVIDIA, Santa Clara, CA) and an Intel® Xeon® CPU E5-1620 v4 @ 3.50 GHz with 32 GB of RAM was utilized for model training and testing.

Evaluation metrics
To evaluate the performance of the CNN model in the surgical step classification task, precision, recall, F1 score, and overall accuracy were measured. The following formulas were used for these metrics:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

\[ \text{Overall accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \]

where TP, FP, FN, and TN denote true-positive, false-positive, false-negative, and true-negative cases, respectively. Notably, precision, recall, and F1 scores were utilized as performance metrics for each surgical step, whereas overall accuracy was utilized as the performance metric for the entire model. Descriptions of the evaluation metrics are provided in Table 2. Cross-validation was not performed.

Table 1 (excerpt)
Purse-string suture for SST. Start: appearance of suture on screen. End: disappearance of suture from screen.

Table 2 Descriptions of evaluation metrics
Evaluation metric: Description
True-positive: Number of frames whose predicted step is Step X when the true step is also Step X. (Correct)
False-positive: Number of frames whose predicted step is Step X when the true step is not Step X. (Misclassification)
False-negative: Number of frames whose predicted step is not Step X when the true step is Step X. (Misclassification)
True-negative: Number of frames whose predicted step is not Step X when the true step is also not Step X.
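The frame-level counts described in Table 2 can be turned into the per-step metrics with a few lines of code. The sketch below is illustrative rather than the authors' implementation. The function names and toy labels are hypothetical, and it assumes that, in this multi-class setting, overall accuracy for the entire model is the fraction of frames whose predicted step equals the true step.

```python
from collections import Counter


def per_step_metrics(true_steps, predicted_steps):
    """Compute TP, FP, FN, TN, precision, recall, and F1 for every step."""
    if len(true_steps) != len(predicted_steps):
        raise ValueError("one prediction is required per annotated frame")
    n_frames = len(true_steps)
    pair_counts = Counter(zip(true_steps, predicted_steps))
    results = {}
    for step in sorted(set(true_steps) | set(predicted_steps)):
        tp = pair_counts[(step, step)]
        fp = sum(c for (t, p), c in pair_counts.items() if p == step and t != step)
        fn = sum(c for (t, p), c in pair_counts.items() if t == step and p != step)
        tn = n_frames - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        results[step] = {
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1,
        }
    return results


def overall_accuracy(true_steps, predicted_steps):
    """Assumed definition: fraction of frames assigned to the correct step."""
    correct = sum(t == p for t, p in zip(true_steps, predicted_steps))
    return correct / len(true_steps)


# Hypothetical toy example: six frames, one Step 1 frame misclassified as Step 2.
true_steps = [1, 1, 1, 2, 2, 3]
predicted_steps = [1, 1, 2, 2, 2, 3]
metrics = per_step_metrics(true_steps, predicted_steps)
accuracy = overall_accuracy(true_steps, predicted_steps)
```

In this toy example, Step 1 has perfect precision but reduced recall, whereas Step 2 shows the opposite pattern, which is why both metrics are reported per step.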

Video dataset
Fifty patients were included in the study cohort, of whom 30 were men. The median age was 64 years (range 33-83 years), and the median body mass index was 22 kg/m² (range 15-30 kg/m²). In terms of preoperative diagnosis, rectal adenocarcinoma was observed in 42 cases, neuroendocrine tumors in five cases, and gastrointestinal stromal tumors in three cases. The most common clinical stage was I (31 out of 42). Furthermore, anastomosis was performed via SST in 43 cases with the median anastomotic height from the anal verge being 5 cm with a range of 1-8 cm.
The overall operative time of TaTME was 188 min (with a standard deviation of 60 min), and the average total time for the five major steps in a TaTME was 71.5 min (with a standard deviation of 20.5 min); however, the duration of the individual surgical steps varied between cases (Fig. 1).
Step 5 (i.e., purse-string suture for SST) was not annotated in six cases, because hand-sewn anastomosis was performed in those cases. In the dissection steps (i.e., Steps 3 and 4), sub-step transitions (i.e., transitions between each dissection plane during TaTME) occurred 27 ± 8 times. Rendezvous occurred 29 and 16 times out of 50 on the anterior and posterior sides, respectively. A trace of the surgical steps during two representative cases is shown in Fig. 2. In the figure, case A has a duration of 80 min with 22 surgical step transitions wherein rendezvous occurs on the anterior side at approximately 52 min. The characteristics of patients whose intraoperative videos formed the training and test sets of the video dataset used in this work are summarized in Table 3. As can be observed in the table, there were no statistically significant differences in patients' characteristics between the data subsets.
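Quantities such as the number of sub-step transitions and the time spent in each step can be derived directly from a per-frame label sequence. The following is a minimal sketch, assuming labels such as "3-1" for sub-steps and the 30 fps frame rate stated in the Methods. The function names and toy sequence are hypothetical and do not reproduce any case in the dataset.

```python
from itertools import groupby


def step_segments(frame_labels):
    """Collapse per-frame labels into consecutive (label, n_frames) runs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]


def count_transitions(frame_labels):
    """Number of times the label changes between consecutive frames."""
    return max(len(step_segments(frame_labels)) - 1, 0)


def minutes_per_step(frame_labels, fps=30):
    """Total time per label in minutes, summed over all of its runs."""
    totals = {}
    for label, n_frames in step_segments(frame_labels):
        totals[label] = totals.get(label, 0) + n_frames
    return {label: n_frames / (fps * 60) for label, n_frames in totals.items()}


# Hypothetical toy trace: anterior, posterior, anterior again, then a
# post-rendezvous anterior dissection.
labels = ["3-1"] * 1800 + ["3-2"] * 3600 + ["3-1"] * 1800 + ["4-1"] * 900
segments = step_segments(labels)
transitions = count_transitions(labels)
durations = minutes_per_step(labels)
```

A surgeon who returns repeatedly to the same plane produces many short runs and therefore a high transition count, even if the total time per plane is unchanged.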

Surgical step classification
Precision, recall, and F1 score for each surgical step and the overall accuracy for the entire model are listed in Table 4. The overall accuracy for classification of the five major steps was 93.2%. However, when sub-step classification was included in the calculation of the performance metrics, the overall accuracy deteriorated to 76.7%, and the mean accuracy of the model for classification of the 11 steps including sub-steps for the 10 cases in the test dataset was 78 ± 5% with a maximum accuracy of 85% (Table 5). The results for surgical step classification in a representative case are shown in Fig. 3, and the confusion matrix of the results for surgical step classification is shown in Supplementary Appendix A.

Fig. 1 A Duration of each surgical step and variation between different cases. B Duration of each dissection sub-step and variation between different cases. The duration for sub-step 3-2 (down-to-up dissection on the posterior plane) was the longest on average (16 ± 6.5 min), whereas that for sub-step 4-2 (posterior dissection after rendezvous) was the shortest (2.5 ± 2 min). (green: surgical step related to purse-string suture; yellow: rectotomy step; blue: dissection step before rendezvous; red: dissection step after rendezvous) (Color figure online)
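The overall accuracy and the per-case mean accuracy reported above need not coincide. One plausible reading, which is an assumption and not stated in the source, is that the overall figure pools all test frames, whereas the mean ± standard deviation averages one accuracy value per test video. The sketch below shows how the two summaries diverge when videos differ in length. The function names and toy data are hypothetical, and the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev


def count_correct(true_steps, predicted_steps):
    """Number of frames whose predicted step matches the annotated step."""
    return sum(t == p for t, p in zip(true_steps, predicted_steps))


def summarize(videos):
    """videos maps video_id -> (true_steps, predicted_steps) for one test case."""
    per_video = [count_correct(t, p) / len(t) for t, p in videos.values()]
    total_correct = sum(count_correct(t, p) for t, p in videos.values())
    total_frames = sum(len(t) for t, _ in videos.values())
    return {
        "pooled": total_correct / total_frames,  # every frame weighted equally
        "mean": mean(per_video),                 # every video weighted equally
        "sd": stdev(per_video),                  # sample SD (assumed convention)
        "max": max(per_video),
    }


# Hypothetical toy data: a short video classified perfectly and a longer
# video in which half of the frames are misclassified.
videos = {
    "short": (["3-1"] * 4, ["3-1"] * 4),
    "long": (["3-2"] * 16, ["3-2"] * 8 + ["3-4"] * 8),
}
summary = summarize(videos)
```

Because longer videos contribute more frames, the pooled value is pulled toward their accuracy, while the per-video mean is not.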

Discussion
In this study, we demonstrated that our deep learning model could recognize the surgical steps of TaTME with a high degree of accuracy (93.2%). This result suggests that an AI-based model can self-learn, analyze, and index TaTME videos on behalf of humans.
In recent years, the use of AI in surgery has attracted significant attention from researchers. Although the use of AI-based methods has its challenges, these methods can improve surgical procedures in the operating room via different approaches [23], including preoperative planning [24,25], intraoperative guidance [26], and their integrated use in surgical robotics [27,28]. Annotated datasets are the foundation for several AI-based approaches; however, the complexity of surgery renders the interpretation and management of large amounts of intraoperative video data difficult. Thus, dividing a surgical procedure into a sequence of identifiable and meaningful steps can aid in data acquisition, storage, and analysis, among others.
Thus far, most studies related to surgical step recognition modeling have focused on laparoscopic cholecystectomy because of its standard and frequent execution [16,29,30,31]. However, recently, to improve step recognition systems and extend their range of applications, increasingly diverse and complex procedures have been subjected to step recognition modeling, including laparoscopic total hysterectomy [32], robot-assisted partial nephrectomy [17], laparoscopic sleeve gastrectomy [18], and laparoscopic colorectal surgery [19]. Nevertheless, to the best of our knowledge, this is the first study based on the automatic surgical step classification task for TaTME.
Because TaTME is a complex procedure and requires specialized knowledge of pelvic anatomy, which is an unfamiliar topic for many surgeons, safe implementation of TaTME requires surgeons to undergo systematic and structured training [33]; therefore, surgical trainers consider video-based learning to be a useful teaching aid to maximize learning. The automatic surgical step classification for TaTME using a CNN-based approach is a challenging task for the following reasons. First, the quality of intraoperative images is often poor because of an unstable pneumopelvis due to excessive surgical smoke. Second, because the intraoperative field and instruments are seldom changed during TaTME compared with those during other laparoscopic abdominal surgeries, it is difficult to distinguish between different steps, especially sub-steps during dissection. However, this challenging task of classifying the plane of dissection (anterior, posterior, or lateral plane) during TaTME should be tackled to develop a quick video dataset indexing system that makes video-based learning for TaTME considerably more efficient.

Fig. 2 Trace of surgical steps for two representative TaTME cases (green: surgical step related to purse-string suture; yellow: rectotomy step; blue: dissection step before rendezvous; red: dissection step after rendezvous; gray: extracorporeal step) (Color figure online)

Table 5 Precision, recall, and F1 score of 11 surgical steps, including sub-steps, and overall accuracy of the entire model when sub-step classification was included in the calculation of the performance metrics
In this study, we constructed the first annotated video dataset for TaTME. The initial purpose of this dataset construction was training and testing of our deep learning model. However, we observed significant differences among different intraoperative videos in terms of step duration, order of sub-steps, and frequency of sub-step transitions by analyzing the annotated dataset. As an example, progressions of surgical steps during two representative TaTME procedures are shown in Fig. 2. In the figure, although the total surgical times in both cases A and B were almost equivalent (80 and 82.5 min, respectively), the duration of each step, order of sub-steps, and frequency of sub-step transitions (17 and 38, respectively) were significantly different. In a future study, we will attempt to obtain correlations between novel parameters, skills, or intraoperative complications, using detailed analyses on a larger dataset, which could then be applied for skill assessment or complication prediction.
This study has several limitations. First, cross-validation was omitted in this study because there were no statistically significant differences in patients' characteristics between the training and test sets (Table 3) and because we considered the number of frames in the dataset to be sufficiently large (> 650,000 frames); however, the number of analyzed procedures (n = 50) and surgeons performing them (n = 5) was limited. Therefore, considering the impact of a possible imbalance between the training and test sets (procedure techniques, anatomy, surgeon skill, and learning curve), cross-validation might have been more appropriate. With regard to validation methods, the most appropriate one for each situation should always be considered. Second, the videos that form our dataset were obtained from one institution; thus, the complexity of the data is limited to case variability. Training a deep learning model with such a dataset can lead to over-fitting, which could subsequently reduce the generalizability of the network. To obtain more generalized networks, videos from other medical institutions should be included to ensure higher variability in the dataset. Third, although the accuracy for classification of the five defined major steps was high, there was still room for improvement in the accuracy of classification when sub-steps were included in performance analysis. The difference between the two results could be attributed to the following: first, the fewer the steps to classify, the easier the task would be, and second, although the image features differed significantly between each major step (e.g., purse-string closure vs down-to-up dissection), the differences in image features between each sub-step (e.g., anterior vs right plane dissection) were too slight to classify accurately.
In the future, verification using saliency mapping is required to determine whether the insufficient accuracy in sub-step classification task was actually due to the similarity of image features between each sub-step.
In conclusion, the results of this study demonstrated that our deep learning model could be utilized to automatically identify steps of TaTME from an intraoperative video with a high degree of accuracy. However, our classification model needs to be trained with a larger dataset of intraoperative videos before it can be applied in practice.

Compliance with ethical standards
Disclosures Drs. Daichi Kitaguchi, Nobuyoshi Takeshita, Hiroki Matsuzaki, Hiro Hasegawa, Takahiro Igaki, Tatsuya Oda, and Masaaki Ito have no conflicts of interest or financial ties to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.