Suboptimal surgical performance has been related to less favorable patient outcomes in complex minimally invasive surgical procedures [1,2,3,4,5]. Procedure-specific competency assessment tools (CATs) are an objective method to measure surgical performance [2, 6, 7]. CATs are instruments for structured (video-based) assessment used by experts and are currently viewed as the gold standard for measuring surgical performance for a given procedure. A procedure-specific CAT for minimally invasive esophagectomy (MIE), the MIE-CAT, was previously developed and validated by our group [8]. However, an important limitation of the MIE-CAT is that expert surgeons' time is scarce and video analysis is time-consuming and labor intensive, limiting its potential for broad application [2].

A way to assist expert review for MIE might be global performance assessment of video clips. Global performance assessment with the Global Operative Assessment of Laparoscopic Skills (GOALS) [9] could provide an initial performance estimate, potentially followed by detailed feedback with the MIE-CAT if desired. In addition, using video clips instead of full-length videos might reduce assessment time without loss of relevant information, but could introduce bias [10]. Another way to assist expert review might be crowdsourcing, in which large groups of anonymous, untrained workers evaluate surgical performance [11,12,13,14,15]. However, previous studies that investigated crowdsourcing for surgical performance video assessment have important limitations: (1) the outcomes of non-procedure-specific tools (e.g., GOALS or the Global Evaluative Assessment of Robotic Skills (GEARS) [16]) were not compared to procedure-specific CATs; and (2) the studies used videos of relatively simple tasks or procedures, often in dry-lab settings [15, 17,18,19,20,21,22]. Therefore, additional research is warranted to explore both global performance assessment of video clips and crowdsourcing for video assessment of MIE.

This study has two primary aims: (1) to evaluate global performance assessment of MIE and (2) to evaluate the correlation between MIE experts' and crowd workers' assessments. We hypothesize that global performance assessment by both MIE experts and crowd workers can reliably be used to assist expert review, providing rapid global feedback for MIE surgeons while significantly reducing assessment time.

Materials and methods

Two frameworks for scoring surgical performance of MIE were used in this study: (1) global performance assessment (GOALS) and (2) procedure-specific performance assessment (MIE-CAT) (Fig. 1). GOALS was assessed by both crowd workers and MIE experts, while the MIE-CAT was assessed by MIE experts only. Use of the MIE-CAT by crowd workers was considered unfeasible given the tool's technical complexity. Reliability and convergent validity of the surgical performance assessments were analyzed as defined by the COSMIN panel [23]. This study was carried out in accordance with applicable legislation and was reviewed by the ethical committee of the Radboud university medical center.

Fig. 1

Overview of the study with the two types of video assessments: GOALS (left) and MIE-CAT (right)

Video database and pre-processing

Full-length intraoperative MIE videos from the database of the Esophageal Center East Netherlands in Nijmegen, recorded thoraco-laparoscopically between 2011 and 2020, were used. Eight videos of transthoracic MIE with intrathoracic anastomosis for esophageal cancer were randomly included. Each included procedure was performed by two of the four consultant surgeons of a single surgical team. Procedures assisted by surgeons in training were excluded. Videos were divided into four experience groups based on the consecutive case date and the learning curve of 119 cases reported by van Workum et al. [24]: (1) novice (0 to 25 MIEs performed), (2) intermediate (26 to 119 MIEs performed), (3) advanced (120 to 200 MIEs performed), and (4) expert (> 200 MIEs performed) (Online Appendix A). Two randomly selected videos per experience group were included after written informed patient consent and were stripped of any patient or surgeon identifiers.

Video clips of at most 10 min were used for the global performance assessment with GOALS. Every video was cut into eight clips, each representative of one of the eight phases of the MIE-CAT [8] (Table 1) based on procedural-phase landmarks (Online Appendix A), resulting in 64 video clips in total. The eight full-length MIE videos (on average 3.5 h per video) were used for the procedure-specific assessments with the MIE-CAT.
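
The clip extraction itself is mechanical once the landmarks are set; below is a minimal sketch using ffmpeg from Python. The landmark timestamps and file names are hypothetical illustrations, not the study's actual landmarks (those are listed in Online Appendix A).

```python
# Minimal sketch of phase-based clip extraction; the landmark timestamps
# below are hypothetical illustrations, not the study's actual landmarks.
import subprocess

# (start, end) per MIE-CAT phase, chosen so each clip is at most 10 min
PHASE_LANDMARKS = {
    "gastric_mobilization": ("00:05:00", "00:15:00"),
    "gastric_tube_creation": ("00:42:30", "00:52:30"),
    # ... six further phases, each with its own landmarks
}

def extract_clip(source: str, phase: str, start: str, end: str) -> None:
    """Cut one phase clip from the full-length video without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-i", source, "-ss", start, "-to", end,
         "-c", "copy", f"{phase}.mp4"],
        check=True,
    )

for phase, (start, end) in PHASE_LANDMARKS.items():
    extract_clip("mie_full_length.mp4", phase, start, end)
```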

Table 1 Description of the eight phases of the MIE procedure from the MIE-CAT [8]

Reviewers

Crowd workers were recruited through Crowd-Sourced Assessment of Technical Skills® (C-SATS, Seattle, WA, USA) via Amazon Mechanical Turk (MTurk, https://mturk.com), an online marketplace that facilitates hiring crowd workers [25]. Crowd workers were eligible if they were 18 years or older, had an approval rate above 95% on previous evaluations, and lived in the USA. All crowd workers received a training video prior to assessment, including a skill comparison test with dry-lab videos of a high- and a low-performing surgeon. Crowd workers were invited on a 'first come, first served' basis.

MIE experts, identified through a literature search in the field of MIE and through previous research collaborations [8, 26,27,28], were contacted to participate in this study. In total, nine MIE experts from Europe (6), the USA (2), and Asia (1) were included, each having performed at least 120 MIEs [24] and currently performing at least 50 MIEs annually. Their average experience in esophageal surgery was 18 years (SD 8.3). All experts were trained in online video assessment with the MIE-CAT [8] during interactive online workshops of 1.5 h with two to five experts each.

Performance assessment tools

All video assessments were conducted via the online C-SATS platform. First, global performance of the 64 video clips was assessed by both crowd workers and experts with the validated, non-procedure-specific GOALS tool [9]. Intraoperative laparoscopic performance in every video clip was assessed on four domains: (1) depth perception, (2) bimanual dexterity, (3) efficiency, and (4) tissue handling. GOALS' autonomy domain was excluded because of the absence of sound. Each domain was scored on a 1–5 Likert scale. One GOALS assessment comprised all eight procedural phase-based video clips from one MIE procedure. To keep the 95% confidence interval within ± 1 point on the grading scale, at least 30 crowd workers per video clip assessment were deemed necessary based on previous research [17, 18, 29, 30]. In addition, two MIE experts were randomly assigned to each GOALS assessment. An average GOALS score was computed for each surgical phase video clip and for each full-length video, calculated as the mean over all crowd worker or expert assessments.
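
The choice of roughly 30 workers per clip is consistent with a simple normal-approximation calculation; a sketch, assuming an illustrative score standard deviation (the study itself based the number on prior crowdsourcing work [17, 18, 29, 30]):

```python
# Back-of-the-envelope crowd size for a 95% CI of +/- 1 point around the
# mean GOALS score; the assumed score SD is illustrative, not study data.
import math

def raters_needed(score_sd: float, half_width: float = 1.0, z: float = 1.96) -> int:
    """Smallest n such that z * sd / sqrt(n) <= half_width."""
    return math.ceil((z * score_sd / half_width) ** 2)

print(raters_needed(score_sd=2.8))  # -> 31, in line with ~30 workers per clip
```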

Second, the eight corresponding full-length MIE videos were assessed with the MIE-CAT by the same two MIE experts. Performance was assessed for each of the eight procedural phases of the MIE-CAT (Table 1) [8]. Each procedural phase was scored on four quality components (exposure, execution, adverse events, and end-product quality) using a 1–4 Likert scale [8]. Crowd workers were excluded from procedure-specific assessments, since these require procedure-specific expertise. The procedure-specific assessments resulted in one average MIE-CAT score per phase and per full-length video.
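
As an illustration of how the raw ratings reduce to the reported averages, a minimal aggregation sketch; the data frame below is a hypothetical fragment with illustrative column names, not the study's dataset:

```python
# Sketch of score aggregation: one row per rating, averaged per phase and
# per video; the ratings below are made up for illustration.
import pandas as pd

ratings = pd.DataFrame({
    "video":     [1, 1, 1, 1],
    "phase":     ["gastric_tube", "gastric_tube", "anastomosis", "anastomosis"],
    "component": ["exposure", "execution", "exposure", "execution"],
    "score":     [3, 4, 2, 3],  # MIE-CAT quality components, 1-4 scale
})

phase_scores = ratings.groupby(["video", "phase"])["score"].mean()  # per phase
video_scores = ratings.groupby("video")["score"].mean()             # per video
print(phase_scores, video_scores, sep="\n")
```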

Global performance assessment of MIE

Convergent validity verifies whether the GOALS scores of MIE performance correlate with similar constructs to the degree one would expect, via hypothesis testing [31, 32]. If at least 75% of the hypotheses are confirmed, convergent validity is considered sufficient for GOALS assessment of MIE performance [31]. The GOALS component scores investigated were the domain scores (4), the phase score (1), and the total GOALS score (1) per video. For hypothesis testing, the following correlations between related constructs and the GOALS (component) scores were studied using Pearson's correlation coefficients:

1. Experience: Correlations between the GOALS (component) scores and the experience of the surgical team, defined by consecutive case date. With the six GOALS components, six hypotheses were tested per group (crowd workers and experts separately).

2. Clinical parameters: Correlations between the GOALS (component) scores and two clinical parameters: blood loss (in milliliters) and operative time (in minutes). With the six GOALS components and two parameters, 12 hypotheses (6 × 2) were tested per group (crowd workers and experts separately).

3. Procedure-specific performance: Correlations between the GOALS (component) scores, assessed by crowd workers and experts, and the MIE-CAT (component) scores assessed by experts. Component scores were investigated in relation to each other: GOALS domain versus MIE-CAT quality component (4), GOALS phase versus MIE-CAT phase (1), and total GOALS versus total MIE-CAT (1). In total, therefore, six hypotheses were tested per group (crowd workers and experts separately).

First, positive correlations between 0.3 and 0.7 ('moderate') were considered acceptable for the hypothesized correlations between GOALS and experience [32]. Second, negative correlations between − 0.3 and − 0.7 ('moderate') were considered acceptable for the hypothesized correlations between GOALS and the two clinical parameters. Both the experience of the surgical team and the two clinical parameters were expected to be indicators of global performance of MIE, as confirmed for procedure-specific performance of MIE in our previous study [8]. In addition, Vassiliou et al. [9] showed increasing GOALS scores with increasing experience for laparoscopic cholecystectomy. Correlations weaker than (−)0.3 would indicate inadequate global performance assessment, while correlations stronger than (−)0.7 would indicate that GOALS correlates with experience or clinical outcome alone, whereas global surgical performance is expected to embody both. Third, a strong but not perfect positive correlation, between 0.5 and 0.8, between global performance assessment (GOALS) and procedure-specific performance assessment (MIE-CAT) was considered acceptable. Both scoring frameworks are expected to assess surgical performance of MIE and should therefore correlate at least moderately (> 0.5). However, the GOALS scores were not expected to correlate more than strongly (> 0.8), as the MIE-CAT assesses performance in more detail and the two tools thus capture the construct of performance quality in different ways. See Online Appendix B for an overview of the hypotheses for convergent validity. Reliability and validity of the MIE-CAT were established in a previous study [8].
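
The bookkeeping behind this hypothesis count can be expressed compactly; a sketch with scipy, using made-up scores and only the prespecified ranges defined above:

```python
# Sketch of the convergent-validity check: Pearson's r per prespecified
# pair, tested against its hypothesized range, with the share of confirmed
# hypotheses compared to the 75% threshold. All scores below are made up.
import numpy as np
from scipy.stats import pearsonr

HYPOTHESIZED = {                    # (low, high) acceptable range for r
    "experience":     (0.3, 0.7),   # moderate positive
    "blood_loss":     (-0.7, -0.3), # moderate negative
    "operative_time": (-0.7, -0.3), # moderate negative
    "mie_cat":        (0.5, 0.8),   # strong but not perfect positive
}

def confirmed(goals: np.ndarray, construct: np.ndarray, key: str) -> bool:
    r, _ = pearsonr(goals, construct)
    low, high = HYPOTHESIZED[key]
    return low <= r <= high

rng = np.random.default_rng(0)      # illustrative data, n = 8 videos
goals_total = rng.uniform(12, 19, size=8)
constructs = {k: rng.uniform(0, 1, size=8) for k in HYPOTHESIZED}

hits = [confirmed(goals_total, v, k) for k, v in constructs.items()]
print(f"{100 * sum(hits) / len(hits):.0f}% confirmed (threshold: >= 75%)")
```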

Correlation MIE experts and crowd workers

The average GOALS scores of both experts and the average GOALS scores of all crowd workers for the corresponding video clip were used to determine inter-rater reliability. Inter-rater reliability was calculated using a one-way intraclass correlation coefficient (ICC) based on mean ratings and consistency. An ICC above 0.7 was considered acceptable and ≥ 0.8 good. In addition, a rank order from the lowest to the highest scoring video (clip) and a Bland–Altman plot were made to display inter-rater agreement. Data analysis was performed using IBM SPSS Statistics for Windows, version 27.0 (IBM Corp., Armonk, NY, USA).
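
For readers reproducing the analysis outside SPSS, the same coefficient can be obtained in Python; a sketch using the pingouin package, where 'ICC1k' denotes the one-way, average-measures coefficient and the scores are illustrative, not study data:

```python
# Sketch of the inter-rater reliability computation with pingouin; 'ICC1k'
# is the one-way, average-measures ICC. Scores below are illustrative only.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "clip":  [1, 1, 2, 2, 3, 3],
    "rater": ["experts", "crowd"] * 3,
    "goals": [17.5, 14.2, 9.0, 13.8, 19.5, 15.6],  # averaged total GOALS
})

icc = pg.intraclass_corr(data=scores, targets="clip", raters="rater",
                         ratings="goals")
print(icc.set_index("Type").loc["ICC1k", ["ICC", "CI95%"]])
```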

Results

MIE videos

Patients in the eight randomly included videos had a median age of 68.5 years (IQR 55.5–70.5) and a median BMI of 25.6 (IQR 21.3–29.2). A detailed overview can be found in Online Appendix C.

Assessments

The nine experts conducted a total of 16 full-length video assessments (~ 51 video hours) with the MIE-CAT and 128 video clip assessments (~ 21 video hours) with GOALS over 3.5 months. Crowd workers assessed a total of 1984 video clips (over 330 video hours) with GOALS in less than 2 days. Overall, experts and crowd workers showed comparable mean GOALS scores (Table 2). Interestingly, the experts' GOALS video clip scores spanned a wider range than those of the crowd workers (7.50–19.50 versus 12.84–15.63). See Online Appendix D for an overview of the corresponding MIE-CAT scores.

Table 2 Mean domain, video clip, and total video scores assessed with GOALS by both experts and crowd workers (n = 64 video clips)

Global performance assessment of MIE

Less than 75% of the correlations between the GOALS component scores and the related constructs experience (Table 3), clinical parameters (Table 3), and MIE-CAT components (Table 4) agreed with our hypotheses, for both crowd and expert GOALS scores. For crowd workers, 14% of the correlations (5 of 36 hypotheses) agreed with our hypotheses; for experts, 56% (20 of 36 hypotheses) did.

Table 3 Correlations between GOALS components (domain, phase, and total scores), scored by both crowd workers and experts, and the related constructs experience and the two clinical parameters
Table 4 Correlations between GOALS components (domain, phase, and total scores), scored by both crowd workers and experts, and the related construct MIE-CAT components (quality components, phase, and total scores), scored by experts

There was a poor correlation (r < 0.3) between the GOALS scores and the experience of the surgical team for both groups (Table 3, Fig. 2a); none of these correlations (0/12) fell within the hypothesized range of 0.3 < r < 0.7. Between the GOALS scores and the two clinical parameters (blood loss and operative time), some correlations did fall within the hypothesized range (Table 3, Fig. 2b and c). For crowd workers, 5/12 correlations (blood loss versus the 'bimanual dexterity,' 'efficiency,' 'phase,' and 'total' GOALS scores, and operative time versus 'efficiency') and for experts 4/12 correlations (blood loss versus 'bimanual dexterity' and 'tissue handling,' and operative time versus 'bimanual dexterity' and 'efficiency') were between − 0.3 and − 0.7, as hypothesized. The correlations between blood loss and 'bimanual dexterity' and between operative time and 'efficiency' were within the hypothesized range for both groups. The correlations between the expert GOALS scores and blood loss that fell outside the hypothesized range were all stronger than expected (r < − 0.8).

Fig. 2

Total GOALS video clip scores assessed by experts (blue) and crowd workers (orange) versus (a) experience of the surgical team in consecutive case date, (b) blood loss (ml), (c) operative time (min), and (d) total MIE-CAT score assessed by experts, all with a fitted linear line (Color figure online)

Almost all correlations (16/18) between the experts' GOALS scores and the MIE-CAT scores fell within the hypothesized range of 0.5 to 0.8 (Table 4). The two correlations that were weaker than expected (< 0.5) were between the GOALS domain 'efficiency' and the MIE-CAT quality component 'end-product quality' (r = 0.40, 95% CI [− 0.42, 0.86]) and between the GOALS 'phase' and MIE-CAT 'phase' scores (r = 0.45, 95% CI [0.23, 0.63]). None of the correlations between the crowd workers' GOALS scores and the MIE-CAT exceeded 0.5 (Table 4). Figure 2d visualizes the strong correlation between the experts' total GOALS scores and the experts' total MIE-CAT scores (r = 0.76, 95% CI [0.12, 0.95]) versus the minimal correlation between the crowd workers' GOALS and expert MIE-CAT scores (r = 0.19, 95% CI [− 0.59, 0.79]).
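
The reported confidence intervals are consistent with Fisher's z-transformation over the eight videos; a sketch, assuming this standard method was used:

```python
# Reproducing a reported 95% CI for Pearson's r via Fisher's
# z-transformation; a sketch assuming this standard method.
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """95% CI for Pearson's r with n paired observations."""
    z = math.atanh(r)            # Fisher transform of r
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# experts' total GOALS vs. total MIE-CAT over the eight videos
low, high = pearson_ci(0.76, n=8)
print(f"r = 0.76, 95% CI [{low:.2f}, {high:.2f}]")  # ~[0.12, 0.95]
```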

Correlation MIE experts and crowd workers

A poor (ICC < 0.5) level of agreement on domain, video clip, and total GOALS scores was found between the experts and crowd workers (Table 2). All ICC values were below the acceptable threshold of 0.7, indicating insufficient agreement. In addition, a proportional bias can be seen (Online Appendix E): experts more frequently scored substantially below or above average, while crowd workers scored the same video clip as 'average.' Additionally, when the videos were ranked by total GOALS score, the two groups agreed only on the lowest scoring video (Online Appendix E).

Discussion

In this study, crowd workers and MIE experts scored 64 video clips of MIE using GOALS; these scores were compared with each other and with eight full-length video assessments using the MIE-CAT. The first aim was to evaluate global performance assessment of MIE with GOALS using convergent validity. We were unable to establish convergent validity for GOALS, whether scored by crowd workers or by experts. However, GOALS scored by experts (but not by crowd workers) correlated strongly with the experts' MIE-CAT scores. This suggests that video clip assessment by experts using GOALS could help shorten assessment time when in-depth video analysis with the MIE-CAT is not required. The second aim was to evaluate the correlation between MIE experts' and crowd workers' assessments. A poor correlation between the crowd workers' and experts' GOALS scores was observed, despite promising results in previous studies and the rapid turnaround of crowd worker assessments.

This study found a poor correlation between the GOALS scores of crowd workers and experts, in contrast to earlier findings [13,14,15]. Previous studies with comparable numbers of crowd workers found moderate to good agreement between crowd workers and experts for dry-lab or relatively simple laparoscopic task video assessments [15, 17,18,19,20,21], whereas low agreement was seen for more complex laparoscopic global performance assessment videos [30]. Crowd workers might be useful for relatively simple procedures, whereas the findings of the current study suggest that performance assessment of real-life surgery, especially complex procedures such as MIE, might be technically too difficult for laypeople [18].

Although we were unable to establish convergent validity for GOALS, whether scored by crowds or experts, GOALS scored by experts (but not by crowd workers) correlated strongly with the experts' MIE-CAT scores [8]. In our view, GOALS is not comprehensive enough to fully capture the quality of a specific complex operation such as MIE. This could explain the poor correlation with experience, leading to the loss of convergent validity. Moreover, GOALS cannot provide the procedure-specific feedback that CATs do [2, 3, 6, 33] and may therefore not be useful for in-depth performance assessment. However, the strong correlation between expert GOALS scores and expert MIE-CAT scores suggests that expert GOALS scores may be used where quicker screening of operative skills is required. If expert GOALS scores deviate from the desired scores, this screening may be followed by an in-depth review using the MIE-CAT, retaining the benefits of the MIE-CAT for detailed feedback, training, and quality improvement. Such a strategy might be a more efficient way to use the currently available tools.

This study had several limitations. First, the crowd workers' training included a skill comparison test with dry-lab videos. Ideally, this training would have been more extensive [30] and would have included low- and high-performance surgical MIE videos. Unfortunately, at the start of this study such examples did not exist and were therefore not available. Regardless, we question whether relatively short instructions can enable effective scoring of complex surgical videos. Second, the most representative content for video clip performance assessment of MIE remains unclear. The current video clips were selected by the study team based on procedural-phase landmarks (e.g., introduction of the stapler for the creation of the gastric tube) to provide comparable clips. Indeed, a strong correlation was observed between the video clips assessed with GOALS and the full-length surgical videos assessed with the MIE-CAT. We expect performance in video clips to be representative of performance in full-length videos, even when events such as a major bleeding are absent. Nonetheless, future research is needed to determine the optimal video clip content and length, so that clips are representative of the complete procedure. Third, with the current study design we could not determine the individual influence of GOALS and of video clips on the validity of global performance assessment of MIE by experts. Additional research is recommended to investigate this for future use.

Once again, video assessment has proven to be very labor intensive. Until performance assessments can be automated with artificial intelligence [11], further research could explore the reliability and validity of (para)medical workers, e.g., OR assistants or surgeons in training, for surgical performance assessment of complex laparoscopic procedures [34]. In addition, video clip assessment with GOALS seems to provide a fair reflection of global performance of MIE when executed by experts and could be investigated to advance MIE video assessment. Experts' GOALS assessments could be implemented in large-scale global performance video clip assessment as a screening or research tool, e.g., nationwide screening of performance of a specific procedural step, such as the creation of the anastomosis. Moreover, combining global performance assessment for initial screening with procedure-specific assessment for targeted feedback and research questions could help optimize essential MIE-CAT steps and minimize assessment time. Shorter assessment times would, in turn, enhance the applicability of the MIE-CAT in daily clinical use. At the same time, a valuable database of global and procedure-specific performance assessments for computer-assisted video assessment would be collected.

Conclusion

This study showed that global performance assessment of MIE by crowd workers is not useful for assisting expert assessments; MIE might be technically too difficult for laypeople to assess. GOALS, when used by experts, could be considered for large-scale global MIE performance video clip assessment and could help shorten assessment time. However, as GOALS is not comprehensive enough to assess MIE performance in detail, the MIE-CAT can be used for extensive procedure-specific performance analysis.