Optimizing surgical procedures by improving technique and implementing innovations has been shown to improve clinical outcomes. This indicates that a surgical procedure evolves over time and can be performed with varying technique and surgical quality. Awareness of this variation has major implications for evaluating surgical performance in daily clinical practice, as well as for determining the impact of surgery on clinical parameters in a research setting. However, most comparative studies in surgery are hampered by a lack of quality assurance, which might lead to underestimation of the clinical impact of a new surgical innovation or distort its relative contribution in multimodality treatment approaches (e.g. the added value of perioperative chemotherapy). It has been shown that the quality of surgery has a substantial impact on clinical outcomes, which is also reflected by suboptimal outcomes during surgical learning curves [1–5].

Currently, surgical competency is not objectively measured in clinical practice using surgical quality assessment (SQA) tools. In surgical education, a resident's competency to perform a specific operation independently is generally based on subjective rather than objective assessments. Since evidence of the association between technical skills and patient outcomes is growing, the surgical community as well as health care organizations are seeking solutions to objectively measure a surgeon's competence and to avoid the negative impact of variation and learning curves. Objective competence assessment is needed to improve the quality of surgery: it will enable performance-adjusted surgical education, accommodate the certification of surgeons after successful training and help to obtain robust data in clinical trials investigating new surgical techniques.

Many different tools have been developed for surgical assessment: direct assessment in the operating room by an expert or supervisor, self-assessment after a surgical procedure and postoperative video-based assessment. Especially in laparoscopic surgery, multiple video-based SQA tools have been described, which can be divided into four main categories: (1) global assessment scales (GAS), focusing on overarching qualities such as tissue handling [6, 7]; (2) error-based assessment scales (EBAS), in which errors are identified as a surrogate for the overall quality of the performance [8]; (3) procedure-specific assessment tools (PSAT), in which key steps and phases of the operation are assessed separately [9]; and (4) artificial intelligence (AI) machine learning algorithms, which can recognize anatomical structures and instrument movements to estimate or predict surgical quality [10].

Although many of these video-based SQA tools have been thoroughly investigated, their validation remains complex [11]. Given the increasing need for SQA for education and clinical trial purposes, we aim to provide a clear overview of the available video-based SQA tools, their relation to clinical outcomes and the evidence on their validity.

Methods

Protocol and registration

This systematic review was conducted in compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist [12]. This study, including the review protocol, is registered in PROSPERO (ID: 313008).

Search strategy

PubMed, Embase.com and Web of Science were systematically searched by two reviewers (AG and AvL) from inception up to September 1st, 2022, with the aid of a medical information specialist. The search strategy was created using terminology from studies that met the inclusion criteria and was primarily focused on laparoscopic surgery, quality assessment tools for technical skills, video-based evaluation and tool validation. Details of the search strategies are provided in Supplementary Tables 1a–c. References of included studies were screened to identify other eligible studies.

Inclusion and exclusion criteria

Studies were included if they evaluated video-based quality assessment of laparoscopic surgery in living patients. No restrictions regarding type of research methodology were applied. All domains of laparoscopic surgery were considered.

Studies were excluded if they focused on endoscopic (i.e. endoluminal) procedures or robot-assisted procedures, or if surgery was performed in a box trainer or virtual reality (VR) setting. Non-human studies, reviews, comment letters and articles written in a language other than English or Spanish were also excluded.

Selection process and data extraction

Two reviewers (AG and AvL) independently selected articles by screening titles and abstracts after removal of duplicates. Subsequently, they independently assessed the remaining articles in full text, including their potentially relevant references. Discrepancies between the reviewers were discussed and resolved by consensus with a third reviewer (JT). Using a data extraction template, AG and AvL independently extracted pre-defined characteristics of the identified studies, including study design, type of surgical procedure, number of videotaped procedures, number of surgeons, number of patients, name of the tool, number of reviewers, validation approach, results of validation and inter-rater reliability. A minimal sketch of such a template is shown below.
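For illustration, the extraction template can be represented as a simple structured record. The following Python sketch is an assumption: the field names mirror the characteristics listed above but are not the authors' actual form.

```python
# Hypothetical sketch of the data extraction template described above.
# Field names are illustrative assumptions, not the published template.
from dataclasses import dataclass


@dataclass
class ExtractionRecord:
    study_design: str
    surgical_procedure: str
    n_videotaped_procedures: int
    n_surgeons: int
    n_patients: int
    tool_name: str
    n_reviewers: int
    validation_approach: str
    validation_results: str
    inter_rater_reliability: str
```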

Validation methods and assessment of validity

All methods of validation were identified. Subsequently, the four most common validation methods were selected for analysis: validation by clinical patient outcomes, validation by the experience level of surgeons, validation by expert opinion and validation against another available assessment tool.

In addition, all studies were rated by the same two reviewers (AG and AvL) for evidence of validity using a scoring system provided by Beckman et al. [13], later adjusted by Ghaderi et al. [11] and Haug et al. [14]. That scoring system was further modified for the purpose of this systematic review, defining five dimensions of validity: content validity, response process, internal structure, relations to other variables and consequences (see Table 1). All included studies were rated on each dimension with a score from 0 to 3, yielding a maximum total score of 15. A score of 0–5 is associated with limited validity, a score of 6–10 with moderate validity and a score of 11–15 with substantial validity. The five domains of our validity evidence scoring list represent the subtypes of the concept 'validity', in which no domain is superior to another; therefore, the domains were weighted equally when calculating the total validity scores. Supplementary Table 2 shows the individual scores per item for each included article separately.
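As a minimal illustration of this scoring logic, the Python sketch below sums the five equally weighted dimension scores and maps the total onto the three evidence bands. The dimension and function names are our own illustrative choices, not part of any published implementation.

```python
# Minimal sketch of the validity scoring logic described above.
# Dimension names follow Table 1; function names are illustrative.

DIMENSIONS = (
    "content_validity",
    "response_process",
    "internal_structure",
    "relations_to_other_variables",
    "consequences",
)


def total_validity_score(ratings: dict) -> int:
    """Sum the five equally weighted dimension scores (each 0-3)."""
    for dim in DIMENSIONS:
        if not 0 <= ratings[dim] <= 3:
            raise ValueError(f"{dim} must be scored 0-3")
    return sum(ratings[dim] for dim in DIMENSIONS)


def evidence_level(total: int) -> str:
    """Map a total score (0-15) onto the three evidence bands."""
    if total >= 11:
        return "substantial"
    if total >= 6:
        return "moderate"
    return "limited"


# Example: dimension scores of 3, 1, 2, 3, 1 total 10, i.e. 'moderate' evidence.
```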

Table 1 Validity evidence scoring list, adopted from Beckman et al. [13], Ghaderi et al. [11] and Haug et al. [14], and modified for this review

Results

Literature search

The literature search yielded 6492 records, resulting in 3584 unique articles after removal of duplicates. After title and abstract screening, 128 full-text articles were assessed. A total of 73 articles were excluded for the reasons outlined in Fig. 1, resulting in 55 included studies [1–3, 8, 9, 15–64]. An overview of the included studies is provided in Table 2.

Fig. 1

PRISMA flowchart of the literature search

Table 2 Overview of the included studies

Characteristics of the assessment tools

The literature search identified 55 articles, presenting 41 different video-based tools for technical skills assessment across 9 different fields of surgery: bariatric, gynecologic, general, upper gastrointestinal, orthopedic, urologic, colorectal, pediatric and pulmonary surgery (see Table 2). The described SQA tools could be divided into four main categories: a global assessment scale (GAS) was investigated in 21 studies [1, 15, 16, 19, 21, 26, 29, 33, 36, 39–42, 44, 47, 50, 55, 56, 61–63], an error-based assessment scale (EBAS) in 6 studies [8, 26, 27, 34, 49, 58], a procedure-specific assessment tool (PSAT) in 31 studies [2, 3, 9, 17, 18, 20, 22–25, 29–33, 35, 38, 42–44, 46, 48, 50–54, 57, 59, 60, 64] and artificial intelligence (AI) in 3 studies [28, 37, 45].

In total, 12 articles focused on the correlation between assessment scores and clinical outcomes, of which 8 were performed in bariatric surgery and 4 in colorectal surgery (Table 3). A total of 26 tools were validated based on the experience level of surgeons. In most of these studies, assessment scores of experienced surgeons were compared with those of surgeons at an intermediate or beginner level (often surgical residents), based on either years of practice or number of performed procedures. A total of 12 studies validated their assessment tool against another available assessment tool, with the vast majority using the Objective Structured Assessment of Technical Skills (OSATS) or the Global Operative Assessment of Laparoscopic Skills (GOALS) as a comparative scale. Expert opinion was used to validate the assessment tool in 15 studies.

Table 3 Overview of studies with assessment tools validated by clinical outcomes

Global assessment scale (GAS)

In total, 21 studies investigated an assessment tool that could be categorized as a GAS, of which 12 used the Objective Structured Assessment of Technical Skills (OSATS) or modified versions of this tool, such as the Bariatric Objective Structured Assessment of Technical Skills (BOSATS). Six studies validated their GAS against clinical patient outcomes, the majority of which were performed in bariatric surgery (see Table 2). Two articles examined whether the quality of surgery as rated with the OSATS correlated with clinical outcomes. The study of Fecso et al. showed that a lower performance score (OSATS ≤ 29/35) was an independent predictor of major short-term outcomes in laparoscopic gastrectomy (OR 6.49, 95% CI 1.60–26.34, P = 0.009) [26]. In contrast, the results of Scally et al. revealed no difference in clinical outcomes between the 75th percentile (25% highest-rated surgeons) and the 25th percentile (25% lowest-rated surgeons) based on the OSATS score [55]. The other four papers, which investigated whether BOSATS correlated with patient outcomes, showed conflicting results [1, 21, 61, 62]. In one of these studies, the anastomotic leakage rate was significantly correlated with the technical execution of the operation [61]. In two other papers, a non-significant association was seen [1, 62]. In contrast, the study of Chhabra et al. showed that higher assessment scores for certain parts of laparoscopic sleeve gastrectomy were associated with increased leakage rates [21]. Three studies evaluated reoperation rates, of which two showed a significant correlation between the assessment score and the reintervention rate [1, 61, 62]. In two of the four studies focusing on surgical haemorrhage, a significant correlation was found [21, 62], while in the other two a trend was seen [1, 61]. A detailed overview of all studies with assessment tools validated by clinical outcomes is provided in Table 3.

Error-based assessment scale (EBAS)

A minority of the tools were classified as EBAS. The Objective Clinical Human Reliability Analysis (OCHRA) and the Generic Error Rating Tool (GERT) are the most frequently used in the literature to date; both were used in three studies. However, OCHRA was limited to the field of gastrointestinal surgery, while GERT was investigated in bariatric and gynecologic procedures (see Table 2). Two studies examined the correlation between EBAS and clinical outcomes. Fecso et al. found no statistically significant difference in the number of errors (P = 0.331), events (P = 0.758) and rectifications (P = 0.433) between patients without complications and the two groups of patients with either Clavien-Dindo grade I/II or grade III complications, although a trend towards a higher number of errors, events and rectifications was seen in the complication groups [26]. In addition, Foster et al. did find a statistically significant correlation between total error frequency per case, measured by OCHRA, and total blood loss (rs = 0.61, P = 0.004) [27], see Table 3.

Procedure-specific assessment tool (PSAT)

A total of 31 studies assessed surgical procedures with a procedure-specific assessment tool (PSAT). This category shows the greatest variety of tools, as these are built on a step-by-step approach that depends on the type of surgical procedure. The most frequently investigated tool is the Competency Assessment Tool (CAT), which was evaluated in three colorectal studies and one gynecological study. In total, five of the PSATs were validated by clinical outcomes (Table 3). In one of those studies, the quality of the surgeon was assessed with both the OSATS and a procedure-specific Colorectal Objective Structured Assessment of Technical Skill (COSATS), based on one laparoscopic right hemicolectomy. The authors compared postoperative complications between the highest and lowest quartiles of surgeons and showed that patients operated on by surgeons in the highest quartile had fewer complications (15.5% vs. 20.6%, P = 0.03), fewer unplanned reoperations (4.7% vs. 7.2%, P = 0.02) and lower rates of serious morbidity or death (15.9% vs. 21.4%, P = 0.02) than patients operated on by surgeons in the lowest quartile [3]. In addition, Varban et al. showed that a low PSAT score in laparoscopic sleeve gastrectomy increased the risk of surgical complications, hemorrhage and reoperation [60]. The study of Kurashima et al., focusing on laparoscopic distal gastrectomy, also showed a correlation between the PSAT score (high vs. intermediate vs. low) and operative time (229 vs. 266 vs. 311 min, P < 0.001), intraoperative complications (0% vs. 11.8% vs. 27.8%, P = 0.01) and postoperative complications (0% vs. 0% vs. 22.2%, P = 0.002) [43]. The association between quality of surgery and clinical outcomes was also investigated in colorectal surgery. Curtis et al. showed a statistically significant difference in 30-day morbidity after laparoscopic total mesorectal excision (TME) between the upper quartile, interquartile range and lower quartile (23.3% vs. 55.3% vs. 50%, P = 0.008), based on a procedure-specific performance tool. Performance was also correlated with operative time (median 178 vs. 255 vs. 290 min, P < 0.001) and blood loss (median 40 vs. 100 vs. 100 mL, P < 0.001) [2]. Furthermore, Mackenzie et al. showed that surgeons performing a right or left hemicolectomy with a high assessment score had more favorable patient outcomes: lower postoperative morbidity and surgical complication rates and a higher lymph node yield [46], see Table 3.

Artificial intelligence (AI)

Three of the included studies used AI to calculate parameters that estimate or predict surgical quality. In one of the studies, videos of laparoscopic cholecystectomy were analyzed with Kinovea 0.8.15 software. Three parameters were calculated: "path length", "average distance" (the distance the instrument tip moved per time frame) and "number of extreme movements" (defined as more than 1.0 cm of movement per frame). A formula using these parameters calculated a score between 0 and 1, with a higher score indicating better execution. The same videos were also scored with a CAT tool, and a statistically significant correlation between the two was observed (R² = 0.844) [28]. In the other two studies, a convolutional neural network (CNN) was built from multiple video fragments and proved able to differentiate between different levels or score groups of surgical skill. In the study of Kitaguchi et al., the CNN automatically classified video clips into three score groups with 75% accuracy, while in the remaining study by Lavanchy et al., the CNN distinguished good from poor quality with an accuracy of 87 ± 0.2% [37, 45].
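To make these motion parameters concrete, the sketch below computes the three described metrics from a sequence of instrument-tip coordinates. This is an illustration under assumptions: the cited study's tracking pipeline and the formula combining the parameters into a 0–1 score are not reproduced here, and the function name is our own.

```python
import numpy as np


def motion_metrics(tip_xy_cm):
    """Illustrative computation of the three motion parameters described
    above, from instrument-tip coordinates (one row per video frame, in cm).
    The 0-1 scoring formula from the cited study is not reproduced here."""
    steps = np.diff(tip_xy_cm, axis=0)            # per-frame displacement vectors
    dists = np.linalg.norm(steps, axis=1)         # per-frame distances (cm)
    return {
        "path_length": float(dists.sum()),        # total distance travelled
        "average_distance": float(dists.mean()),  # mean movement per frame
        "extreme_movements": int((dists > 1.0).sum()),  # frames moving > 1.0 cm
    }


# Example with synthetic tracking data (a random walk):
rng = np.random.default_rng(0)
track = np.cumsum(rng.normal(scale=0.3, size=(500, 2)), axis=0)
print(motion_metrics(track))
```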

Evaluation of validity evidence

The assessment tools and AI algorithms in all articles were scored on content validity, response process, internal structure, relations to other variables and consequences, as shown in Table 1. The validity evidence scores for these tools are presented in Tables 4 and 5. In total, 9 studies received a substantial evidence score (11–15), 38 studies a moderate evidence score (6–10) and the remaining 8 studies a limited evidence score (0–5). Table 4 shows an overview of all studies and tools arranged by strength of validity based on the validity evidence scoring list from Table 1.

Table 4 Articles/tools arranged by strength of validity based on the validity evidence scoring list from Table 1 (substantial, moderate and limited evidence)
Table 5 Articles/tools with substantial evidence based on the validity evidence scoring list from Table 1

In Table 5, all nine studies with substantial validity evidence (score between 11 and 15) are shown with their points per validity item. In total, 7 of the 9 studies (77.8%) received the maximum score of 3 points for clear and accurate tool content, having created the SQA tool using the Delphi method. For the item response process, which reflects the use of training or systems to reduce variation between assessors, only 1 study (11.1%) received the maximum score of 3 points. For the item internal structure, representing variability, consistency and generalizability, 4 of the 9 studies (44.4%) received all 3 points. Finally, 3 of the 9 studies (33.3%) scored the maximum of 3 points for the item relations to other variables.

Discussion

This systematic review provides a comprehensive overview of all video-based SQA tools for technical skills in laparoscopic surgery. In total, 41 tools were identified, which can be divided into four categories: global assessment scales (GAS), error-based assessment scales (EBAS), procedure-specific assessment tools (PSAT) and artificial intelligence (AI). PSAT and GAS show the most relevant associations with clinical outcomes. GAS seems more appropriate for general surgical skills during the first training years, while PSAT might be more suitable for evaluating whether someone can perform every step of a specific operation accurately. A "good" surgeon according to a GAS is not necessarily competent to perform a specialized surgical procedure independently. However, before implementing tools in education, clinical practice or research, validation of potential SQA tools is key.

Recently, Haug et al. [14] provided an adequate summary of assessment tools in laparoscopic colorectal surgery; however, a clear overview of the available video-based SQA tools across all fields of laparoscopic surgery, including a critical evaluation of their validity evidence, had not yet been published. Although validation of these tools against surgeon experience, other tools or expert opinion is of interest, the association between assessment scores and clinical patient outcomes is particularly relevant. Various surgical specialists, such as general surgeons, urologists and gynecologists, have investigated the value of SQA tools. However, studies that validated SQA against clinical patient outcomes are limited to bariatric and colorectal surgery. In bariatric surgery, statistically significant associations have been observed between two types of tools (GAS and PSAT) and intra- and postoperative outcomes, including decreased anastomotic leakage rates [61], hemorrhage [21, 60, 62], reoperation rates [60, 62] and overall complications [1, 26, 60], and an increased percentage of weight loss [21, 62]. The one study investigating EBAS, however, did not show an evident association between its score and clinical patient outcomes [26]. In colorectal surgery, only PSAT and EBAS have been investigated against patient outcomes. Higher PSAT scores seem to be associated with improved patient outcomes, including decreased operative time, postoperative morbidity, reoperation, readmission and death [2, 3, 46], while EBAS was only associated with reduced blood loss [27].

Many studies showed a correlation between high SQA scores and improved clinical outcomes. However, these studies were heterogeneous and showed only moderate validity evidence, owing to limited content quality, no clear training of assessors and high inter-observer variability. The three studies by Kurashima, Curtis and Stulberg, using the JORS-LDG tool (PSAT), the combined OSATS + COSATS (GAS + PSAT) and the Performance Tool (PSAT), respectively, showed decreased short-term morbidity with higher assessment scores and also received the best validity scores [2, 3, 65]. These tools therefore seem the most promising SQA tools at the moment. Looking at the 9 studies with the highest validity (Table 5), there is clearly room for improvement on several validity items. Although 77.8% of those articles show high quality of tool content, 8 of the 9 articles (88.9%) lack a clear response process in which assessors are trained to use the tool, which increases the chance of unwanted variation. In addition, optimal internal structure measurements, such as inter-rater, inter-item and inter-test variability analyses, were performed in only 44.4% of those articles, and only 33.3% compared their tool with clinical outcomes. Ideally, an SQA tool achieves maximum scores on all items before implementation: content established by Delphi consensus with experts (a widely used method to reach consensus on a complex problem) [75], optimal training of assessors, multiple measurements of variability and generalizability, and correlation with clinical patient outcomes.

Unlike aviation, where pilots must undergo yearly certification to prove their competency in the aircraft [66], current surgical practice in the Netherlands has no objective assessment and (re)certification of surgeons based on their technical performance. In most countries, as in the Netherlands, surgeons apply for periodic recertification by providing proof of a minimum number of surgical procedures in their field and a minimum number of continuing medical education points. This, however, does not necessarily reflect technical proficiency in the execution of those procedures. Since surgery is increasingly subject to new developments and research in which procedures and techniques change over time, the lack of competency assessment is notable. Within the UK, a national training program (LAPCO), in which surgeons were objectively assessed with a PSAT and a GAS tool, has been shown to improve clinical outcomes after laparoscopic colorectal surgery [67]. Multiple surgical training programs utilize some form of competency assessment, but structured (inter)national training programs that embed assessment of surgical skills remain scarce.

To implement training, proctoring and (re)certification, a degree of standardization of surgical procedures is necessary. This is challenging, as there are many acceptable variations within any single surgical procedure. In many fields of laparoscopic surgery, there is a lack of evidence and consensus regarding the 'best surgical technique'; therefore, it is unknown which steps and elements an objective SQA tool should contain. However, some included studies performed Delphi rounds to agree on best surgical practice in their field and developed a PSAT based on this consensus. This seems an appropriate first step towards objective assessment, allowing the development of detailed SQA tools with a high level of objectivity.

Clinical trials investigating new techniques often fail to demonstrate the real benefit of a specific change in a procedure. This may be a result of variation or differences in surgeons' proficiency. For example, in the field of laparoscopic right hemicolectomy, studies have compared D3 with D2 lymphadenectomy; however, whether a D2 or even D3 dissection implies the same level of lymphadenectomy among or within these studies is subject to debate [68]. Also, randomized clinical trials comparing different laparoscopic techniques (ROLARR, ALaCaRT) did not use quality control of surgery, which may have influenced the outcomes [69, 70]. The COLOR 3 study (an international randomized clinical trial comparing laparoscopic with transanal total mesorectal excision) is one of the first trials to perform video-based quality control, using a CAT both to assess the competence of potential participating centers in a pretrial phase and to control quality throughout the study by assessing videotapes of the surgery of all included patients [59, 71]. Robust competency assessment ensures the quality of trials and allows for better comparison of surgical procedures in a research setting.

This systematic review has some limitations. The present study included only tools assessing technical skills. Since teamwork, leadership, decision-making, situational awareness and communication are as important to the whole surgical process as technical skills, these non-technical skills have rightly received considerable attention in recent years [72]. The black box in the operating room is an example of an analytical data platform that could be adopted to aid process optimization and, as a result, also improve the non-technical skills of the operating theatre team [73]. In the future, combined assessment of technical and non-technical skills should become important. A further limitation is that we focused only on video-based SQA tools and not on live assessment of technical skills. We deliberately chose this focus because we believe it is the way forward: with the current use of minimally invasive techniques, it is relatively simple to record operations, which enables postoperative and remote assessment.

The assessments were all based on videotaped cases, which has the advantage that many assessors can evaluate the same procedure at the same time. Furthermore, independent scoring allows assessors to rewind a surgical step for repeated viewing while remaining blinded to the surgeon's identity and level of expertise, resulting in a more objective evaluation. On the other hand, video-based assessment might be labor intensive, time-consuming and prone to bias. In the future, AI could be used to automatically and rapidly identify crucial steps and operative tasks without the assistance of human reviewers. Although only one study included in this review described the use of AI to assess videos of laparoscopic surgery in the clinical setting [28], a systematic review published in 2022 already identified 66 studies detailing the application of AI for technical skill assessment in surgery [10]. In the near future, more of these developments will probably be put into practice.

Besides laparoscopic surgery, SQA tools could be of great use for quality control in minimally invasive robotic surgery, which is rapidly emerging and will probably play a more important role in the next decade [74]. Since endoscopic and robotic procedures also make use of a camera, these approaches seem suitable for assessment with video-based SQA tools. For robotic procedures, the laparoscopic SQA tools can be used, as these approaches are essentially similar; for endoscopic procedures, it would make sense to develop separate SQA tools. However, objective video-based quality assessment of open surgery might be more challenging, since adding a camera that provides a good and clear overview of the operative field may bring practical difficulties. In future research, it will be key to focus on SQA tools that incorporate both procedure-specific assessment and general skills. Future studies should ideally use tools developed with the Delphi technique, implement assessor training, use multiple measures of inter-rater reliability, internal consistency and generalizability, validate their tool against clinical outcomes and focus on interpretation and future use, such as cut-off values.

Conclusion

This systematic review evaluated a total of 41 different video-based SQA tools for technical skills, used in 9 fields of laparoscopic surgery. These tools could be divided into global assessment scales, error-based assessment scales, procedure-specific assessment tools and artificial intelligence machine learning algorithms. This study shows that well-validated SQA tools enable objective assessment of a surgeon's technical skills, with major relevance for patient outcomes. A global assessment scale combined with a procedure-specific assessment tool may have the greatest potential for use in education, research and certification.