Minimally invasive surgery has increasingly become the standard of care in many fields of colorectal surgery. Assessment of surgeons’ operative performance is highly relevant for quality assurance, training, and certification; technical skill scores have been shown to vary significantly, even amongst experienced surgeons, and to predict the likelihood of adverse clinical outcomes [1,2,3]. Prior results showed that the variation in surgeons’ technical skills, scored by an observational tool, was directly related to the variation in patient complications [2]. Therefore, measures to identify individuals who require further training, to highlight specific training needs, and to define areas of improvement are desirable but often lacking in the clinical setting.

A range of tools to objectively assess surgical performance have been developed and validated in most surgical specialties. They can be divided into three main categories: global rating scales (GRS), procedure-specific tools (PST), and error-based rating scales (ERS). GRS aim to assess general aspects of technical expertise and can be applied across surgical procedures [4,5,6]. The most cited and widely used tool in this category is the Objective Structured Assessment of Technical Skill (OSATS), developed by Martin et al. in 1997 [6]. GRS are reliable and valid for numerous procedures, but they do not provide feedback on a specific step or a particular technique. PST are dedicated to a single specific procedure, and each step or task area of an operation can be individually rated [7]. ERS aim to identify errors and near misses as a surrogate for the overall quality of the performance [8]. Analysis of error types, or of errors committed during specific parts of the procedure, can give detailed insight into skill- or procedure-specific areas that need further development.

Laparoscopic colorectal surgery and other minimally invasive techniques require some of the most complex skills in general surgery [9]. Especially in colon and rectal cancer surgery, surgical precision and completeness of the resection margins are highly relevant. The completeness of the mesorectal or mesocolic excision has been associated with reduced cancer recurrence rates and highlights the close relationship between surgical skill and patient outcome [10,11,12]. In such high-stakes surgical environments, the use of objective formative and summative assessment during training and beyond is highly relevant for quality assurance. Although reliable and valid assessment tools exist, their clinical implementation for the assessment of operative quality, especially in laparoscopic colon surgery, is sparse. Also, little is known about the validity evidence for such tools that would support an appropriate interpretation of assessment results [13, 14].

Therefore, the aim of this scoping review is to comprehensively identify tools for skill assessment in laparoscopic colon surgery, and to assess their validity as reported in the literature.

Material and methods

This scoping review was conducted according to the PRISMA guidelines with the Extension for Scoping Reviews [15]. As scoping reviews cannot be registered in the systematic reviews database PROSPERO, the protocol can be obtained on request from the corresponding author.

Eligibility criteria

Inclusion criteria were any research study assessing observational tools for technical skills in laparoscopic colon surgery, with the manuscript written in English. Studies performed on virtual reality simulators and studies solely assessing non-technical skills, such as communication, teamwork, leadership, and decision-making, were excluded. Studies assessing tools for both technical and non-technical evaluations were included in this review. Conference abstracts, reviews, and editorials were excluded. No restrictions on publication date were imposed.

Search strategy

The EMBASE and PubMed/MEDLINE databases were used to identify relevant studies, and the Cochrane database was also searched to include any reviews on the subject. All references of the included full-text articles were reviewed to identify studies that might have been overlooked. The PubMed/MEDLINE search was performed using free-text words describing competency assessment, colon surgery, and laparoscopy, combined with the Medical Subject Headings (MeSH) terms ‘clinical competence’, ‘colon resection’, and ‘laparoscopy’. A similar search strategy was applied to EMBASE, with modifications as needed. The final search was performed on 28 May 2021, and the search string used is presented in Supplemental Table 1.

Table 1 Definitions of validity sources.

Study selection

All studies examining assessment tools of technical skills in laparoscopic colon surgery were included. Assessment tools were defined as a blinded or non-blinded assessment of technical skills, performed live or on video, based on pre-defined rating criteria. Step-by-step descriptions of procedures were excluded if surgical performance was not translated into a summative result on a rating scale. Also not considered were non-observational tools such as dexterity-based systems (e.g. instrument path length or number of movements), and studies examining technical performance at task-specific stations rather than full-length procedures. Procedure counts and registration of postoperative complications were not considered observational assessments of technical skill.

Further, studies were only considered if the assessment tool described was aimed at laparoscopic colon procedures: right and sigmoid colectomies as well as total and subtotal colectomies were all included. Studies examining tools applied to ‘laparoscopic colorectal procedures’ in general, without further specification, were included in the review. No restrictions were made regarding the indication for the laparoscopic colonic procedure (benign/malignant) or the development, validation, or implementation stage of the tool. Studies assessing tools solely aimed at laparoscopic rectal surgery were not considered. Tools developed for open colon surgery or robotic colorectal surgery were also excluded.

Data collection and study assessment

All studies were screened individually by two authors (TH, MBO) using the systematic review software Covidence (Veritas Health Innovation, Melbourne, Australia). Full-text articles were retrieved for all eligible manuscripts. Details regarding the validation process were extracted separately by the two authors, comprising whether the tool was applied to surgical trainees or consultants, the number of assessors, the type of procedures evaluated, video versus live assessment, and the validation setting. The same two authors then rated the included studies for validity evidence according to the score provided by Beckman et al. [16], which was later broadened by Ghaderi et al. [13]. This scoring system provides a framework of five dimensions of validity: i) content, ii) response process, iii) internal structure, iv) relations to other variables, and v) consequences (Table 1).

In short, content validity describes the degree to which the tool’s content measures the construct of interest and refers to the themes, wording, and format of the tool items. The response process describes how the assessments given by the individual assessors are analysed. Evidence of internal structure refers to the degree to which the tool items fit the underlying constructs, and relations to other variables describes the relationship between the tool scores and external variables, e.g. surgeon experience level. Evidence of consequences is defined as the intended and unintended impact of the tool’s use. In the present study, each of these five dimensions was assigned a score ranging from 0 to 3, giving a maximum total score of 15. The total validity score was then graded as follows: 1–5 limited evidence, 6–10 moderate evidence, and 11–15 substantial evidence. The definitions of validity evidence used, with examples of numerical scores, can be found in Table 1. Any disagreement between the two authors regarding study selection, data extraction, or validity evidence was resolved by discussion.
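For illustration of the scoring arithmetic only (the values below are hypothetical and not taken from any included study), the total validity score is simply the sum of the five dimension scores,

$$S_{\text{total}} = \sum_{i=1}^{5} s_i, \qquad s_i \in \{0, 1, 2, 3\},$$

so a tool scored 2 (content), 1 (response process), 2 (internal structure), 3 (relations to other variables), and 1 (consequences) would total 9 and be graded as providing moderate validity evidence.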

Results

Literature search and study selection

The study selection process is described in Fig. 1. In short, the primary literature search revealed 1,853 studies. After removing 558 duplicates, the remaining 1,295 titles and abstracts were screened for relevance. Of these, 63 studies underwent a full-text review, of which 19 met the inclusion criteria [1, 2, 7, 8, 17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]. Three additional studies were included after reviewing full-text references [32,33,34].

Fig. 1

Flowchart of the included studies. AT: assessment tool, lap. colon: laparoscopic colon, other: language, review, protocol paper, editorial, conference abstract, commentary

Characteristics of the assessment tools

The search process identified 22 studies, which presented 14 different tools for technical skill assessment in laparoscopic colon surgery (Table 2). On reviewing the included tools’ contents, the tools were grouped into the three main categories: five were GRS [17,18,19,20, 32], one was an ERS [8], and eight were PST [22,23,24, 27, 29,30,31, 33]. The studies were primarily conducted in the United Kingdom, Canada, the United States, and Japan.

Table 2 Characteristics*

The identified tools comprised seven original tools, five modified versions of previously validated tools, and two tools that were a combination of these (Table 3). Eleven were evaluated on surgical procedures performed in the operating theatre, two were used in a laboratory setting (animal models), and one provided no setting information (Table 4). Five tools were applied to surgical trainees, four to surgical consultants, and another four to a combination of these. Concerning the mode of assessment, seven tools were applied to video-recorded cases, five to direct observation, one reported no preference, and one was applicable to both. One assessor per case was reported for all tools using direct observation, whereas two or more assessors were described for tools using video-recorded cases. Use of the assistant was considered in five tools: SAS, OSATS, OCRS, CT, and ASLAC. A large variation was observed in the number of surgical cases evaluated in the included studies, ranging from 0 to 750 [19, 31].

Table 3 Descriptive data of assessment tools
Table 4 Data describing the validation process of assessment tools

Evaluation of validity evidence

All tools were scored according to content, response process, internal structure, relations to other variables, and consequences, as exemplified in Table 1. The validity evidence score for all assessment tools is presented in Table 5.

Table 5 Evidence of validity

Content

The evidence of content validity varied across the tool categories (score 0–3). Eight studies provided moderate evidence (score 2) as they relied on previously validated tools or a combination of an original and a previously validated tool [8, 17, 19, 20, 22, 24, 32, 33]. Of these, three were modified versions of the OSATS [6]. Task analyses based on textbooks, articles, video recordings, and expert discussions were used to create the tool of Sarker et al. (TSALC) [22] and the GAS of Miskovic et al. [24]. More comprehensive methods that included systematic expert review (Delphi method) were used to establish content validity for the tools of Palter et al. (PSET) [7, 23], Miskovic et al. (CAT) [27], and Nakayama et al. [31]. In line with this, a consensus-achieving method was applied by Champagne et al. (ASCRS) [30], where a panel of experts modified previously validated tools by watching video-recorded laparoscopic right colectomies. Comprehensive methods supporting content validity were also found in the paper by Glarner et al. [29], where the tool was piloted in the operating room and revised through an iterative process until the researchers and colon surgeons reached consensus. In contrast, the tool by Wohaibi et al. (OpRate) [18] presented the lowest evidence (score 0), as this paper did not reveal how the content was chosen.

Response process

The evidence for response process validity varied across all studies (score 0–2). Some studies reported that a brief orientation was given to the assessors to obtain assessment consistency (Sidhu et al. (SAS) [17], Dath et al. (OCRS) [33], OpRate, PSET, and CAT); others provided no information regarding the response process (Watanabe et al. (IRT-GOALS) [20] and the TSALC).

Structured training of the assessors before initiating the assessment process was reported by four studies, including the papers of Niitsu et al. (OSATS) [32], Miskovic et al. (OCHRA) [8], and Jenkins et al. (GMAS) [19], and the ASCRS study. Although the ASCRS underwent modification in a pilot phase until the experts reached agreement, the assessors were not evaluated after they had completed rater training, which is why the ASCRS was graded with a moderate level of validity evidence. The GMAS exceeded the others by reporting continuous training of the assessors during the study period, although no data were provided regarding the impact of the rater training. None of the tools reported multiple sources of data examining the response process (score 3).

Internal structure

The most commonly reported evidence of internal structure was inter-rater reliability, which was reported for seven tools (50%) [8, 17, 22,23,24, 30, 33]. No consistent method of calculating inter-rater reliability was used, and the strategies included the intraclass correlation coefficient, Gwet’s AC1 coefficient, Pearson correlation, and Cronbach’s α. OCHRA was the only tool to report test–retest reliability, comparing error counts in cases performed by the same surgeon.
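For readers less familiar with these coefficients, the intraclass correlation coefficient can be written in its simplest conceptual form (a one-way random-effects model; the included studies did not uniformly report which ICC model was applied) as

$$\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{error}}},$$

i.e. the proportion of total score variance attributable to true differences between the performances being rated rather than to disagreement between assessors.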

Six studies reported item analysis: internal consistency (inter-item reliability) was described for SAS, OpRate, GAS, PSET, and ASCRS; task-to-task variation (inter-station reliability) was analysed for OCRS.
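Internal consistency is most often quantified with Cronbach’s α, although the exact computation was not reported uniformly across the included studies; for a tool with k items, item variances σ²ᵢ, and total-score variance σ²ₓ,

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right),$$

with higher values indicating that the tool items reflect a common underlying construct.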

The IRT-GOALS and CAT were the only tools for which extended measures of inter-item reliability were reported (score 3): item response theory was used for the IRT-GOALS, and the reliability coefficient of generalizability theory was used for the CAT, examining the effect of an increasing number of assessors and cases by applying D-studies.
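As an indicative sketch of the generalizability-theory approach (assuming a simple one-facet, fully crossed surgeon-by-assessor design; the CAT study may have modelled additional facets such as cases), the generalizability coefficient takes the form

$$E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n'_r},$$

where σ²ₚ is the score variance attributable to the surgeons being assessed, σ²ₚᵣ,ₑ the residual (surgeon-by-rater plus error) variance, and n′ᵣ the number of assessors assumed in the D-study; varying n′ᵣ indicates how many assessors are needed to reach an acceptable reliability.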

Relations to other variables

The evaluation of this dimension revealed that most studies provided either poor (score 0–1) or excellent (score 3) validity evidence. Nine studies (64%) compared performance scores across training levels or case experience; all reported improved scores with increased training level or greater case experience. Comparison with other assessment modalities was described for three tools: GMAS was compared to Direct Observation of Procedural Skills scores; OCHRA was compared to an overall “pass/fail” global score, operating time, and a measure of efficiency (dissecting-exposure ratio); and CAT was compared to an overall outcome statement (pass/fail) as well as OCHRA error counts. Finally, the relationship between assessment tool scores and patient outcomes was examined for CAT and ASCRS, both reporting reduced risks of postoperative morbidity for surgeons with high skill levels. Correlation with pathological examination was reported for CAT only, describing fewer harvested lymph nodes and a shorter distal resection margin for surgeons with low skill levels [1].

Consequences

In line with the findings for relations to other variables, the validity evidence for the consequences of the presented assessment tools was either low (score 0–1) or high (score 3).

Four studies reported data regarding the time needed to complete the assessment tool [24, 29, 30, 33], whereas three studies described implementation of the assessment tool in clinical surgical training programs: GMAS was used in the multimodal training program at St. Mark’s Hospital in London (2006–2010), and GAS and CAT were used in the National Training Program for consultant surgeons in England (2008–2009 and 2010–2012, respectively). While GMAS and GAS were used to provide formative feedback, CAT was used for summative assessment, reporting a cut-off score of 2.7 to differentiate between ‘pass’ and ‘fail’ surgeons. The educational impact of the tool score was clearly described for GAS, reporting the number of surgical cases required before trainees felt confident in performing a surgical procedure independently (proficiency-gain curve). Likewise, score accuracy was established for CAT and OCHRA using prediction models. Although not officially included in a national surgical education program, the IRT-GOALS study also provided a clear description of the impact of clinical implementation, with interpretation of assessment scores using item response theory results.

Discussion

This scoping review identified 14 tools for skill assessment in laparoscopic colon surgery and described their characteristics and validity. Most of the tools were evaluated in small studies with fewer than 30 participating trainees and 90 operative cases.

A majority of the identified tools were procedure-specific, which reflects the technical complexity of laparoscopic colon surgery, as most surgeons would be expected to have mastered generic laparoscopic skills before embarking on laparoscopic colon resection surgery. Interestingly, side-specific versions were only available for two tools, although it is well known that right and sigmoid colectomies differ considerably in technical complexity. For one-version tools, mastery of a complex procedural step, e.g. vascular dissection during a right hemicolectomy, might therefore not be correctly evaluated. As a result, the one-version tool design challenges content validity (how the tool content relates to the construct it intends to measure). However, it should be emphasised that most of the one-version tools included evaluation of both right- and left-sided procedures when results were correlated to other relevant outcomes.

The assessment was predominantly based on video-recorded cases, which offers the advantage of multiple assessors evaluating the same procedure at a time of their choosing. In addition, independent scoring allows assessors to rewind a surgical step for repeated viewing and to be blinded to the surgeon’s identity and training level, rendering a more objective assessment. On the other hand, video-based assessment can be time consuming. A possible future solution could be the use of artificial intelligence to automatically identify key steps and operative actions, as suggested by Kitaguchi et al. for laparoscopic hemicolectomies [35]. A further limitation of video-based assessment from a purely laparoscopic view is the lack of an external view and audio to assess technical and non-technical skills. As the operating table and theatre are not recorded, the amount of required supervision and support cannot easily be assessed.

The expertise of the assistant was only considered by five tools in this review. Especially during laparoscopic colonic procedures, tissue exposure relies heavily on the first assistant. Poor technical skills in camera navigation can prolong operating time, increase the frustration of the operating surgeon, and decrease the quality of the video submitted for skill evaluation. The use of first assistants should therefore be considered when surgical performance is evaluated, as it is the operating surgeon’s ultimate responsibility to always secure excellent exposure. However, the deliberate use of the assistant can be hard to assess when watching video-recorded procedures, so it might be more appropriate to include this aspect when evaluating non-technical skills such as leadership and communication. Another possibility would be to adjust for poor camera navigation in the evaluation of surgical performance, using the laparoscopic camera navigation scoring system by Huettl et al. [36].

More technical aspects should also be considered when evaluating the quality of video-recorded procedures. This has recently been addressed by the paper of Celentano et al. presenting the LAParoscopic surgery Video Educational GuidelineS (LAP-VEGaS) [37] as a standard framework for the publication and presentation of surgical videos. When education program directors consider implementation of video-based assessment tools, the role and experience of the camera assistant as well as the LAP-VEGaS guidelines could be helpful in standardising the overall quality of surgeons’ video-recorded procedures.

Overall, most tools in this review were validated in a clinical setting and reported an average assessment time, a common acknowledgment of clinical feasibility. Apart from assessment time, Glarner et al. measured feasibility by reporting the percentage of completed assessments [29]. Further, GAS utility was examined through surveys asking assessors about the perceived usefulness of the tool [24]. Similarly, surveys have been proposed to describe acceptability in the clinic, relevance of tool items, and educational impact for a novel tool in laparoscopic rectal cancer surgery (LapTMEpt) [3]. There seems to be broad agreement that ease of use may play an important role in the implementation of a novel assessment tool into clinical practice.

In contrast to the authors’ consideration of feasibility, none of the included studies evaluated the effect of rater training, which might be due to time constraints, increased cost, the obligation to meet physically, or lack of priority. Although it has previously been shown that trained assessors are more comfortable performing direct observation and more stringent in their evaluations than untrained assessors [38], the effect of rater training on the assessment process is unclear [39,40,41]. This is exemplified in the paper by Robertson and colleagues, who examined the reliability of four established assessment tools for suturing and knot-tying for trained versus untrained assessors [40]. In this paper, rater training tended to improve reliability among assessors, but the impact on the performance scores was unclear. Therefore, further studies are needed to determine the effect of rater training and to clarify how it should be implemented and evaluated.

Another prominent finding was the substantial number of tools that compared assessment scores to training level, often defined according to the postgraduate year (PGY) of the performing surgeon. As PGY simply refers to years of clinical experience, PGY levels do not necessarily reflect the quality of operative performance. The number of supervised procedures, and not just the number of procedures performed, has previously been reported to increase performance scores for laparoscopic colorectal surgeons [1]. Following this argument, technical abilities might vary considerably between trainees at the same PGY level. However, even though training level represents only a small facet of construct validity, most of the authors made no further attempt to examine possible correlations with other variables. The relationship between assessment scores and patient outcome was examined for only two of the procedure-specific tools: CAT and ASCRS [1, 2]. In both papers, postoperative complications following laparoscopic colectomies were directly associated with the technical skill as assessed by the tool.

For cancer surgery, the relationship between performance scores and the results of pathological examination is of particular interest, as the plane of surgery has previously been associated with improved patient survival [12]. Dissection performed in the wrong plane, damage to the mesocolon, and inadequate resection margins are all indicators of poor resection quality. Therefore, it would be beneficial to incorporate specimen quality into future tool assessment criteria, as presented by Curtis et al. [3] for laparoscopic rectal cancer surgery or as in the right hemicolectomy specimen scoring system by Benz et al. [42]. Although pathological evaluation was not included in the assessment criteria of the present tools, some authors did evaluate the relationship between assessment scores and pathological specimen examination. This has been illustrated for CAT scores, where low ratings were associated with a reduced number of harvested lymph nodes and a shorter distal resection margin in specimens from laparoscopic colorectal surgery [1]. In rectal cancer surgery, a similar correlation has been observed between the low error frequency described by OCHRA and the correct plane of dissection [43]. In light of the evidence above, well-established validity evidence describing relations to clinical variables is clearly essential for future surgical improvement initiatives.

A limitation applying to most of the included tools in this review was the lack of evidence for the reproducibility of the results. Several of the included tools have been used regularly in educational settings for technical assessment in laparoscopic colon surgery beyond their initial development and validation phase [8, 18, 22,23,24, 27, 32]. Some of these tools have been validated in other procedures, such as laparoscopic rectal surgery, hernia repair, and gynaecological procedures. However, none have specifically re-evaluated the validity evidence from the initial validation process in a different population of assessors or patients undergoing laparoscopic colon surgery. An assessment tool whose score provides valid inferences in a specific residency program under research conditions may need further evaluation before use at multiple institutions. Depending on the intended use and consequence of the assessment tool, validity should be demonstrated for each setting separately [44].

A single preferred tool for technical skill assessment in laparoscopic colon surgery has not emerged. However, we recommend that clinicians and training program directors consider implementing tools that are both easy to use and supported by well-established validity evidence. Based on the results of this review, GAS [24], ASCRS [30], and CAT [27] meet these requirements. Moreover, the assessment setting and endpoint should be considered: whereas GAS and ASCRS are used for formative evaluations, CAT is validated for summative evaluations; and whereas GAS is validated for live operations, ASCRS is validated for video-recorded procedures. As we move towards implementation of new techniques, such as laparoscopic complete mesocolic excision (CME), a procedure-specific tool is still lacking, as none of the available tools adequately evaluate the most important procedural aspects of this technique.

It is a limitation of the present study that only tools for technical skill assessment were included. In recent years, non-technical skills in surgery have gained wide interest, as it is evident that communication, teamwork, leadership, and decision-making are critical procedure-related skills complementing the surgeon’s technical abilities [45,46,47]. However, non-technical skill assessment is a major topic in its own right, so to maintain a clear scope for the present review, studies solely examining tools for non-technical skill assessment were excluded during study selection. Tools solely aimed at laparoscopic rectal surgery were not included, as the procedure-specific operative steps in rectal surgery differ too much from those in advanced laparoscopic colon surgery. Tools aimed at robotic surgery were not included either, as the surgical skills required for a robotic approach were thought to be substantially different from those required to control laparoscopic instruments, and robotic procedures are, in a clinical setting, often reserved for the most experienced surgeons. Furthermore, we chose not to include studies performed on virtual reality (VR) simulators, although some simulators include laparoscopic colectomy procedures [48]. Even though VR simulators are effective at improving basic laparoscopic skills, procedure-specific techniques may not generalise to the operating room, as VR simulators lack tactile feedback and do not reflect the variation in patient anatomy. Finally, it should be emphasised that evidence for the reproducibility of results from Ghaderi et al.’s scoring system is still lacking, although it has been used in reviews describing assessment tools available for other surgical procedures [49, 50].

Conclusion

In conclusion, several tools are available for the evaluation of laparoscopic colon cancer surgery, but few authors present substantial validity evidence for tool development and use. As we move towards the implementation of new techniques in laparoscopic colon surgery, it is imperative to establish validity before surgical skill assessment tools are applied to new procedures and settings. Therefore, future studies ought to examine different aspects of tool validity, especially the correlation with other variables that impact patient survival, such as patient morbidity and pathological reports.