Technical skill assessment in minimally invasive surgery using artificial intelligence: a systematic review

Background Technical skill assessment in surgery relies on expert opinion. Therefore, it is time-consuming, costly, and often lacks objectivity. Analysis of intraoperative data by artificial intelligence (AI) has the potential for automated technical skill assessment. The aim of this systematic review was to analyze the performance, external validity, and generalizability of AI models for technical skill assessment in minimally invasive surgery. Methods A systematic search of Medline, Embase, Web of Science, and IEEE Xplore was performed to identify original articles reporting the use of AI in the assessment of technical skill in minimally invasive surgery. Risk of bias (RoB) and quality of the included studies were analyzed according to the Quality Assessment of Diagnostic Accuracy Studies criteria and the modified Joanna Briggs Institute checklists, respectively. Findings were reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. Results In total, 1958 articles were identified, of which 50 met the eligibility criteria and were analyzed. Motion data extracted from surgical videos (n = 25) or kinematic data from robotic systems or sensors (n = 22) were the most frequent input data for AI. Most studies used deep learning (n = 34) and predicted technical skills on an ordinal assessment scale (n = 36) with good accuracy in simulated settings. However, all proposed models were in the development stage; only 4 studies were externally validated and 8 showed a low RoB. Conclusion AI showed good performance in technical skill assessment in minimally invasive surgery. However, models often lacked external validity and generalizability. Therefore, models should be benchmarked using predefined performance metrics and tested in clinical implementation studies. Supplementary Information The online version contains supplementary material available at 10.1007/s00464-023-10335-z.

The assessment of technical skill is of major importance in surgical education and quality improvement programs given the association of technical skills with clinical outcomes [1][2][3][4]. This correlation has been demonstrated, among others, in bariatric [1], upper gastrointestinal [2], and colorectal surgery [3,4]. In addition, data from the American College of Surgeons National Surgical Quality Improvement Program revealed that surgeons' technical skills, as assessed by peers during right hemicolectomy, are correlated with outcomes in colorectal as well as non-colorectal surgeries performed by the same surgeon [3], showing the overarching impact of technical skills on surgical outcomes.
In surgical education, the technical skills of trainees are often assessed by staff surgeons through direct observation in the operating room. These instantaneous assessments by supervisors are frequently unstructured and might only be snapshots of the actual technical performance of a trainee. Furthermore, they often lack objectivity due to peer review bias [5]. Aiming to improve the objectivity and construct validity of technical skill assessment, video-based assessment has been introduced [6]. Video-based assessment allows for retrospective review of full-length procedures or critical phases of an intervention by one or multiple experts. Despite these improvements, video-based assessment is still limited by the need for manual review of procedures by experts. Therefore, technical skill assessment is time-consuming, costly, and not scalable.
Automation of video-based assessment using artificial intelligence (AI) could lead to affordable, objective, and consistent technical skill assessment in real time.
Despite the great potential of AI in technical skill assessment, it remains uncertain how accurate, valid, and generalizable AI models are to date. Therefore, the aim of this systematic review was to analyze the performance, external validity, and generalizability of AI models for technical skill assessment in minimally invasive surgery.

Methods
This systematic review is reported in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [7] and was prospectively registered at PROSPERO (2021 CRD42021267714). The PRISMA checklist can be found in the Supplementary (Table S1).

Literature search
A systematic literature search of the databases Medline/Ovid, Embase/Ovid, Web of Science, and IEEE Xplore was conducted on August 25th, 2021. The first three databases cover biomedical literature and IEEE Xplore covers technical literature. A librarian at the University Library, University of Bern performed the literature search combining the following terms using Boolean operators: (1) minimally invasive surgery, including laparoscopic or robotic surgery, and box model trainer; (2) AI, including machine learning (ML), supervised learning, unsupervised learning, computer vision, and convolutional neural networks; (3) technical skill assessment, including surgical skill assessment, surgical performance assessment, and task performance analysis. The full-text search terms are shown in the Supplementary (Table S2). The literature search was re-run prior to final analysis on February 25th, 2022 and May 31st, 2023.

Eligibility criteria
Studies presenting original research on AI applications for technical skill assessment in minimally invasive surgery, including box model trainers, published in English within the last 5 years (08/2016-08/2021, updated 02/2022 and 05/2023) were included. Review articles, conference abstracts, comments, and letters to the editor were excluded.
Any form of quantitative or qualitative evaluation of manual surgical performance was considered a technical skill assessment.

Study selection
Before screening, the identified records were automatically deduplicated using the reference manager EndNote™ (Clarivate Analytics). After removal of the duplicates, two authors (R.P. & J.L.L.) independently screened the titles and abstracts of the identified records for inclusion using the web tool Rayyan (https://www.rayyan.ai) [8]. Disagreement between the two authors regarding study selection was settled in joint discussion. Full-text articles were acquired for all included records. Articles not fulfilling the inclusion criteria after full-text screening were excluded.

Data extraction
Besides bibliographic data (title, author, publication year, journal name), the following items were extracted from the included studies: the study population; the setting (laparoscopic/robotic simulation or surgery); the task assessed (e.g., peg transfer, cutting, knot-tying); the data input (motion data from video recordings, kinematic data from robotic systems or sensors); the dataset used (a dataset is a defined collection of data either specifically collected for the aim of the study or reused from previous studies); the assessment scale (ordinal scale vs. interval scale); the AI models used [ML or deep learning (DL)]; and the performance and maturity level (development, validation, implementation) of the AI models. Missing or incomplete data was not imputed.

Performance metrics
The performance of AI models in technical skill assessment can be measured as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). This paragraph gives a short definition of the performance metrics used. Accuracy is the proportion of correct predictions among the total number of observations. Precision is the proportion of true positive predictions among all (true and false) positive predictions and is also referred to as the positive predictive value. Recall is the proportion of true positive predictions among all relevant observations (true positives and false negatives) and is also referred to as sensitivity. F1-score is the harmonic mean of precision and recall and is a summary measure of model performance. A ROC curve plots the true positive rate against the false positive rate at various classification thresholds, and the AUC describes the ability of the model to distinguish true positive from false positive predictions.
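The definitions above can be made concrete with a minimal sketch that computes these metrics from raw confusion counts for a binary skill classifier. The "novice"/"expert" labels below are invented for illustration and do not correspond to any dataset in the reviewed studies.

```python
# Minimal sketch: accuracy, precision, recall, and F1-score computed from
# true/false positive and negative counts of a binary skill classifier.
# Example labels are synthetic (illustration only).

def confusion_counts(y_true, y_pred, positive="expert"):
    """Count TP, FP, FN, TN for one class treated as 'positive'."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def metrics(y_true, y_pred, positive="expert"):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all predictions
    precision = tp / (tp + fp)                          # positive predictive value
    recall = tp / (tp + fn)                             # sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

truth = ["expert", "expert", "novice", "novice", "expert", "novice"]
pred  = ["expert", "novice", "novice", "expert", "expert", "novice"]
print(metrics(truth, pred))
```

Note that this sketch assumes at least one positive prediction and one positive observation; production metric libraries additionally define behavior for the zero-division edge cases.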

Risk of bias and quality assessment
The risk of bias (RoB) of the included studies was assessed using a modified version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) criteria [9]. This tool is commonly used for RoB evaluation in diagnostic accuracy studies. The quality of the studies was evaluated using the modified Joanna Briggs Institute critical appraisal checklist for cross-sectional research in ML as used in [10,11].

Results
The literature search retrieved a total of 1958 studies. After removing all duplicates, the remaining 1714 studies were screened by title and abstract. Thereafter, 120 studies remained, of which 70 were excluded after full-text screening. In summary, 50 studies met the eligibility criteria and were thus included in this systematic review (Fig. 1). Two of the 50 included studies [34,61] were found to match the inclusion criteria during full-text screening and were thus included through cross-referencing. Six studies [21,29,37,45,55,58] were obtained during the re-run prior to final analysis six months after the initial literature search and 13 [13, 17, 34, 36, 38, 42, 48, 50, 52-54, 56, 61] during the second update on May 31st, 2023. Table 1 gives an overview of the 50 studies included in this systematic review (for the full information extracted see Supplementary Table S3).

Discussion
This systematic review of AI applications for technical skill assessment in minimally invasive surgery assessed the performance, external validity, and generalizability of the proposed AI models. In general, technical skill assessment involves either classifying skill levels on ordinal scales (e.g., novice, intermediate, and expert) through unstructured observations or assessing performance on interval scales using structured checklists (e.g., the Objective Structured Assessment of Technical Skills (OSATS) [70] or the Global Evaluative Assessment of Robotic Skills (GEARS) [71]) (Fig. 2). OSATS, for example, evaluates technical skills in seven dimensions (respect for tissue, time and motion, instrument handling, knowledge of instruments, use of assistants, flow of operation and forward planning, and knowledge of specific procedure), assigning a 5-point Likert scale from 1 (low skill) to 5 (high skill) to every dimension. Thus, 35 points is the maximum OSATS score, reflecting the highest technical skill. The ideal automated skill assessment model would not just output a skill level or overall score, but rather multiple dimensions of skill to provide actionable feedback to trainees.
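The OSATS scoring scheme described above can be sketched in a few lines: seven dimensions, each rated 1-5, summed into a total of 7-35 points. The dimension names follow the description above; any ratings passed in would be hypothetical.

```python
# Sketch of an OSATS-style total score: seven dimensions, each rated on a
# 5-point Likert scale (1 = low skill, 5 = high skill), summed to 7-35.

OSATS_DIMENSIONS = [
    "respect for tissue",
    "time and motion",
    "instrument handling",
    "knowledge of instruments",
    "use of assistants",
    "flow of operation and forward planning",
    "knowledge of specific procedure",
]

def osats_total(ratings):
    """Sum per-dimension Likert ratings into a total OSATS score."""
    if len(ratings) != len(OSATS_DIMENSIONS):
        raise ValueError("expected one rating per dimension")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be on the 1-5 Likert scale")
    return sum(ratings)

print(osats_total([5] * 7))  # maximum score: 35
```

An automated model emitting one predicted rating per dimension, rather than only the total, would support the kind of actionable, dimension-level feedback described above.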
Two subfields of AI are predominantly used to extract and analyze motion data from surgical videos or robotic systems for technical skill assessment: ML and DL. ML can be defined as computer algorithms that learn distinct features by iterating over data without explicit programming. DL designates computer algorithms that analyze unstructured data using neural networks (NN). NN are computer algorithms designed in analogy to the synaptic network of the human brain. The input data is processed through multiple interconnected layers of artificial neurons, each performing mathematical operations on the input data to predict an output.
The predicted output is compared to the human-labeled output to optimize the operations of the NN, which makes it a self-learning system. From an AI perspective, technical skill assessment is either a classification task (prediction of expert levels) or a regression task (prediction of a score). Figure 3 illustrates how different input data types are processed by AI models to predict technical skills.
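The learning loop described above, comparing predictions to human labels and adjusting the model, can be illustrated with a toy regression example: a single artificial neuron fitted by gradient descent to predict a skill score from two made-up motion features (e.g., path length, number of movements). All data and feature names here are invented; real models in the reviewed studies use far deeper networks on video or kinematic input.

```python
# Toy illustration of the NN learning loop: a single "neuron" (weighted sum
# plus bias) regresses a skill score from two synthetic motion features.
# Each pass compares the prediction to the human-labeled score and nudges
# the weights to reduce the squared error (gradient descent).

def predict(w, b, x):
    # one artificial neuron: weighted sum of inputs plus bias
    return w[0] * x[0] + w[1] * x[1] + b

# synthetic (features, human-labeled score) pairs -- illustration only
data = [((1.0, 2.0), 8.0), ((2.0, 1.0), 7.0), ((3.0, 3.0), 15.0)]

w, b, lr = [0.0, 0.0], 0.0, 0.02
for _ in range(5000):
    for x, y in data:
        err = predict(w, b, x) - y  # prediction vs. human label
        w[0] -= lr * err * x[0]     # adjust each weight against the error
        w[1] -= lr * err * x[1]
        b -= lr * err

print(predict(w, b, (1.0, 2.0)))  # approaches the labeled score of 8.0
```

Swapping the numeric target for a class label (e.g., novice vs. expert) and the squared error for a classification loss turns the same loop into the classification variant of the task.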
The generalizability of the studies included in this systematic review is limited due to several fundamental differences between them. Most studies (56%) used private datasets of different settings, tasks, and sizes. However, 21 studies (42%) included in this systematic review used JIGSAWS, a robotic simulator dataset and the most frequently used dataset in technical skill assessment. The use of simulators for technical skill assessment has advantages and disadvantages. On the one hand, simulators allow control of the experimental setting and enable reproducibility of studies. On the other hand, box model trainers simulate surgical tasks with only a restricted degree of realism. In addition, simulators are well established in surgical training but have limited significance in the assessment of fully trained surgeons. The use of video recordings and motion data of actual surgeries as input data improves the construct validity of technical skill assessment models. However, in actual surgeries the experimental setting cannot be standardized and therefore lacks reproducibility. This brings up the potential of virtual reality (VR) simulation in technical skill assessment [72]. VR enables simulation and assessment of complex tasks, as faced in actual surgery, without exposing patients to any harm. Furthermore, the management of rare but far-reaching intraoperative adverse events like hemorrhage or vascular injury can be trained to proficiency in VR simulation.
The comparison of studies is further impaired by the different scales and scores used to measure technical skill. Some studies use ordinal scales with different numbers of skill levels (good vs. bad; novice vs. intermediate vs. expert). Dichotomous classification of technical skill into good or bad performance seems obvious but remains highly subjective. Skill levels distinguishing novice, intermediate, and expert surgeons are often based on quantitative measures like operative volume or years in training but fail to reflect individual technical skill levels. Other studies used different interval scales (OSATS scores, GEARS scores, or Likert scales). In contrast to expert-annotated or quantitatively derived skill levels, OSATS and GEARS are scores that have proven reliability and construct validity for direct observation or video-based assessment [70,71]. However, for the purpose of AI model training there is no standardization of skill annotation. Which part of the task, using which ontology, and in which interval technical skill should be annotated by experts to reflect the overall skill level of study participants remains to be defined.
Most of the studies included in this systematic review have methodologic limitations. Overall, 84% of the studies included in this review are at risk of bias. The quality assessment of the included studies revealed that only 36% of the studies discussed the findings and implications in detail. Furthermore, only four studies included in this review used a multicentric dataset, and only four of the AI models studied were validated on an independent external dataset. Therefore, it is questionable whether the AI models included in this review would generalize to other settings, tasks, and institutions. Of the 50 included studies, 35 (70%) report accuracy. However, there is large variation in the reported performance metrics among the studies included in this systematic review. Due to the novelty of AI applications in the healthcare domain, and in surgery in particular, the literature lacks standards for the evaluation of AI methods and their performance. There is an urgent need for the application of guidelines to assess AI models and for studies comparing them head-to-head. Guidelines for early-stage clinical evaluation of AI [73] and clinical trials involving AI [74] have been published recently. However, the studies included in this review are all at a preclinical stage where these guidelines do not apply. A multi-stakeholder initiative recently introduced guidelines and flowcharts on the choice of AI evaluation metrics in the medical imaging domain [75]. For surgical video analysis, this effort has yet to be undertaken [76].
This systematic review is limited by the lack of generalizability and the methodologic limitations of the included studies. Therefore, a direct comparison of AI models and a meta-analysis summarizing the evidence of the included studies are not meaningful. To overcome these limitations, valid and representative datasets, the use of predefined performance metrics, and external validation in clinical implementation studies will be essential to develop robust and generalizable AI models for technical skill assessment. In conclusion, AI has great potential to automate technical skill assessment in minimally invasive surgery. AI models showed moderate to high accuracy in technical skill assessment. However, the studies included in this review lack standardization of datasets, performance metrics, and external validation. Therefore, we advocate for benchmarking of AI models on valid and representative datasets using predefined performance metrics and testing in clinical implementation studies.

Fig. 2
Fig. 2 Human technical skill assessment in minimally invasive surgery

Fig. 3
Fig. 3 Automated technical skill assessment in minimally invasive surgery by artificial intelligence

Fig. 4
Fig. 4 Quality assessment of the included studies. The numbers within the bars represent the respective number of studies

Table 1
Information summary of all studies included in this review. To ensure legibility, the data provided in Table 1 is limited to the accuracy metrics of the best performing model presented in each study. For the full information extracted see Supplementary Table S3