Robot-assisted surgery is an advanced minimally invasive technique that offers greater precision, control, and flexibility than conventional approaches.1,2,3 While robotic technology allows surgeons to perform complex interventions in confined anatomical spaces, the operator’s technical skills remain a key determinant of successful clinical outcomes. In this context, there is growing interest in assessing proficiency among trainees practicing robotic surgery, as such assessment would allow an individual’s position on the learning curve to be estimated.4

The da Vinci Skills Simulator (dVSS) has emerged as a valuable platform for objective evaluation of the abilities required during robot-assisted surgery.5 On the simulator, proficiency can be assessed through various virtual exercises (e.g., “ring and rail”) by means of built-in assessment criteria. While simulation-based training allows trainees to practice a procedure in a safe and controlled environment, it does not always reflect real surgical situations. One potential solution to this issue is the Global Evaluative Assessment of Robotic Skills (GEARS), a standardized and validated qualitative assessment tool for robotic surgical skills.6 However, GEARS is a Likert-scale measure susceptible to response biases. A more traditional alternative is to investigate the improvement in surgical performance over time, commonly described as the learning curve. A learning curve can be defined as the number of cases required and/or the time taken by a surgeon to reach proficiency as judged by key indicators (e.g., operating time or the occurrence of certain index complications).7

In this issue of the Annals of Surgical Oncology, Takeuchi et al.8 describe a novel automated surgical step recognition system for robot-assisted minimally invasive esophagectomy (RAMIE). The tool was developed by applying deep learning algorithms to video analysis of standardized procedures. Specifically, the system was designed to quantitatively analyze the relationship between the duration of each step and the surgeon’s learning curve. While there have been several previous attempts to automatically identify surgical phases through artificial intelligence (AI), the application of this technique to the assessment of surgical proficiency is certainly innovative. By taking each surgical step into account in an automated manner, this tool holds great promise for objective evaluation of performance in robotic surgery. In addition, longitudinal investigation of surgical indicators using AI tools may provide valuable information for improving surgical training.
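To make the underlying idea more concrete, the following is a minimal sketch, not the authors’ implementation, of how frame-level phase predictions from a video model could be converted into per-step durations and followed across consecutive cases to approximate a learning curve; the phase labels, sampling rate, and numbers are purely hypothetical.

```python
# Minimal sketch (hypothetical labels and sampling rate, not the authors' code):
# derive per-step durations from frame-level phase predictions and follow one
# step across consecutive cases to approximate a learning curve.

from collections import Counter
from typing import Dict, List

FPS = 1.0  # assumed analysis rate: one classified frame per second

def step_durations(frame_labels: List[str], fps: float = FPS) -> Dict[str, float]:
    """Minutes spent in each recognized step, given per-frame phase labels."""
    counts = Counter(frame_labels)
    return {phase: n / (fps * 60.0) for phase, n in counts.items()}

def learning_curve(cases: List[List[str]], phase: str) -> List[float]:
    """Duration (minutes) of one surgical step across consecutive cases."""
    return [step_durations(case).get(phase, 0.0) for case in cases]

# Toy example: a hypothetical lymph-node dissection step shortening over three cases.
toy_cases = [
    ["mobilization"] * 1200 + ["LN_dissection"] * 2400,
    ["mobilization"] * 1100 + ["LN_dissection"] * 2100,
    ["mobilization"] * 1000 + ["LN_dissection"] * 1800,
]
print(learning_curve(toy_cases, "LN_dissection"))  # -> [40.0, 35.0, 30.0]
```

In practice, the per-frame labels would come from the trained recognition model, and the resulting duration series could then be inspected for a plateau to estimate when proficiency is reached.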

However, certain hurdles that are likely to impede the rapid and effective translation of the automated system developed by Takeuchi et al. should be briefly discussed. First and foremost, the vast majority of previous learning-curve studies have relied on operative videos obtained from a single surgeon or a limited number of surgeons.9,10,11,12,13 In this regard, the work by Takeuchi and coworkers is no exception: all of the RAMIE video recordings derived from a single surgeon with no previous experience in this procedure but with an extensive record in minimally invasive esophagectomy (> 300 procedures). The lack of standardization of the surgical steps used in the study and the absence of a consensus definition of surgical proficiency are other significant caveats. For example, the definition of proficiency used by Takeuchi et al. (i.e., a minimum of 20 procedures) was based on a limited number of previous studies, which limits the robustness of the analysis. These shortcomings call for international working groups to issue recommendations vetted by the participating organizations.

Second, marked discrepancies in the number of frames across surgical steps may introduce bias in terms of recognition accuracy. For example, the number of frames was as low as 104 in the video depicting azygos vein division but as high as 4230 in the video showing left recurrent laryngeal nerve (RLN) lymph node dissection (LND); the F1 scores for these two videos were 76% and 89%, respectively, further suggesting an effect of imbalanced sampling (see the illustrative sketch at the end of this commentary). Data augmentation techniques, including image flipping, zooming, shifting, and generative adversarial network (GAN) synthesis,14 should also be implemented cautiously, because the resulting data do not always reflect real anatomy.

Third, object delineation and motion tracking may be incorporated into the step recognition task for confirmation purposes. Because step recognition algorithms are based entirely on automated feature extraction from whole images or selected image regions, the extracted features generally pose interpretation problems. Only the incorporation of more contextualized tasks (e.g., object delineation15 and motion tracking16) will allow proper assessment of the intrinsic discriminative ability of an algorithmic system, with the ultimate goals of improving its accuracy and addressing potential pitfalls. By allowing the instrument-handling abilities of trainees to be compared with those of experienced surgeons, information from object delineation and motion tracking will also be paramount in the field of surgical training. Finally, establishing ground-truth labels, i.e., data that accurately represent real-world situations, will be a crucial milestone for surgical training, although this will surely be more demanding than the comparatively simple recognition task. That might sound ambitious, but transformative advances are still expected in the field of AI applied to surgical learning. There is much we can learn from robots, very much like they are currently learning from our own experiences.
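As a purely illustrative sketch of the frame-imbalance concern discussed above, the following toy calculation shows how per-class F1 can diverge between an under-represented step and a frame-rich step; the labels, error counts, and resulting scores are synthetic and do not correspond to the study’s data.

```python
# Toy illustration (synthetic labels, not the study's data): per-class F1 for
# an under-represented step versus a frame-rich step, showing how imbalance
# can translate into a marked gap in recognition accuracy.

from typing import List

def f1_for_class(y_true: List[str], y_pred: List[str], label: str) -> float:
    """Harmonic mean of precision and recall for one phase label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical ground truth: a rare step (100 frames) and a common step (4000 frames).
y_true = ["azygos_division"] * 100 + ["left_RLN_LND"] * 4000

# Hypothetical predictions: 30 rare-step frames and 200 common-step frames are
# confused with the other class.
y_pred = (
    ["azygos_division"] * 70 + ["left_RLN_LND"] * 30
    + ["left_RLN_LND"] * 3800 + ["azygos_division"] * 200
)

print(round(f1_for_class(y_true, y_pred, "azygos_division"), 2))  # ~0.38
print(round(f1_for_class(y_true, y_pred, "left_RLN_LND"), 2))     # ~0.97
```

Such a calculation also suggests why reporting per-step precision and recall, rather than a single aggregate accuracy, helps expose imbalance-related pitfalls.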