World Journal of Surgery, Volume 33, Issue 3, pp 440–447

Laparoscopic Surgical Skills Assessment: Can Simulators Replace Experts?

Authors

  • M. Pellen
    • Department of General Surgery, Northumbria Healthcare NHS Trust
  • Liam Horgan
    • Department of General Surgery, Northumbria Healthcare NHS Trust
  • J. Roger Barton
    • Department of Gastroenterology, Northumbria Healthcare NHS Trust
  • Stephen Attwood
    • Department of General Surgery, Northumbria Healthcare NHS Trust

DOI: 10.1007/s00268-008-9866-4

Cite this article as:
Pellen, M., Horgan, L., Roger Barton, J. et al. World J Surg (2009) 33: 440. doi:10.1007/s00268-008-9866-4

Abstract

Introduction

Global Rating Scales (GRS) quantify and structure subjective expert assessment of skill. Hybrid simulators measure performance during physical laparoscopic tasks through instrument motion analysis. We assessed whether motion analysis metrics were as accurate as structured expert opinion obtained with a GRS.

Methods

A random sample of 10 consultant laparoscopic surgeons, 10 senior trainees, and 10 novice students were assessed on a Sharp Dissection task. Coded video footage was reviewed by two blinded assessors and scored using a Likert Scale. Correlation with metrics was tested using Spearman’s rho. Inter-rater reliability was measured using intraclass correlation coefficient (ICC).

Results

Strongest GRS–Metric correlations were found for Time/Motion/Progress with Time (Spearman’s rho 0.88; p < 0.05) and Instrument Handling with Path Length (Spearman’s rho 0.8; p < 0.05). Smoothness correlated with Respect for Tissue in Rater 1 (rho 0.68) but not Rater 2 (rho 0.18). Mean GRS showed stronger inter-rater agreement than individual scale components (ICC 0.68). Correlation coefficients with actual experience group were 0.58–0.74 for mean GRS score and 0.67–0.78 for metrics (Spearman’s rho, p < 0.05).

Conclusions

Metrics correlate well with GRS assessment, supporting concurrent validity. Metrics predict experience level as accurately as global rating and are construct valid. Hybrid simulators could provide resource-efficient feedback, freeing trainers to concentrate on teaching.

Introduction

Surgical training in the United Kingdom is undergoing a paradigm shift with a restructured, competency-based curriculum. Restrictions imposed on working hours and changes in attitudes toward training in the workplace are placing a heavy burden on the bodies responsible for ensuring adequate training experience. However, crude case numbers may correlate poorly with actual technical skill, and many methods of measuring acquired skill are resource-consuming and may lack objectivity [1, 2].

Laparoscopy presents distinct psychomotor challenges yet features ever more frequently in the armamentarium of the general surgeon [3]. The technique must be trained responsibly to avoid the complications associated with its enthusiastic introduction during the last 20 years [4]. Laparoscopic simulators have emerged that may now offer both training outside the operating room and objective assessment roles [5–9]. However, the evidence base lags behind the technology and lacks a consistent, robust research methodology [10].

No “gold standard” measure of surgical skills exists. Validated Global Rating Scales (GRS) have improved biased subjective assessment, but are time-consuming and require multiple assessors. New hybrid simulators (ProMIS, Haptica, Dublin) allow subjects to perform structured tasks while performance is simultaneously recorded and instrument movements measured.

We compared the two assessment methods of motion analysis metrics and structured GRS to examine which could most reliably and accurately predict experience level, providing evidence of construct validity. We also examined whether simulator metrics correlated with assessments by consultant surgeons using a structured GRS (concurrent validity).

Methods

Volunteers were recruited from three experience groups: consultant laparoscopic surgeons (more than 100 procedures performed), senior surgical trainees (some laparoscopic experience: median 2, range 0–41, cases performed), and medical students (no experience). All were asked to perform a standardized laparoscopic simulator task on the ProMIS Simulator (Haptica, Dublin, Ireland) after a familiarization bimanual transfer exercise.

The ProMIS “hybrid” Laparoscopic Simulator blends a physical box trainer with instrument-tracking technology. The “body form” consists of a plastic shell and neoprene abdominal wall, mimicking the pivotal “fulcrum” effect. Instrument motion analysis data were acquired by two fixed-view web cameras that tracked commercially available instruments, the distal shafts of which were marked with adhesive strips. This allowed three-dimensional tracking, generating exportable fields of Time, Path Length, and Smoothness for the right and left instruments.
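The text does not specify how ProMIS derives these fields. As an illustrative sketch only, one conventional way to compute a path length and a jerk-based smoothness proxy from tracked three-dimensional coordinates is shown below (the function names, sampling model, and smoothness definition are assumptions, not the simulator's documented algorithm):

```python
import numpy as np

def path_length(coords: np.ndarray) -> float:
    """Total tip travel: summed Euclidean distance between consecutive
    samples. coords is an (n, 3) array of x, y, z positions in mm for
    one instrument, sampled at a fixed rate by the tracking cameras."""
    return float(np.sum(np.linalg.norm(np.diff(coords, axis=0), axis=1)))

def smoothness(coords: np.ndarray, dt: float) -> float:
    """Jerk-based smoothness proxy: summed magnitude of the third
    derivative of position (lower = smoother). ProMIS's internal
    definition may differ; this formulation is an assumption."""
    jerk = np.diff(coords, n=3, axis=0) / dt ** 3
    return float(np.sum(np.linalg.norm(jerk, axis=1)))
```

Each tracked instrument (left and right) would then yield the three exportable fields: Time (number of samples multiplied by the sampling interval), Path Length, and Smoothness.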

The Sharp Dissection task (Fig. 1) consisted of an air-inflated balloon (standardized to fit within a fixed container) secured within a latex glove marked with a standardized double-circle target. This target was divided into four quadrants, creating eight penalty margins. Subjects were instructed to dissect between the inner and outer circle (penalty) lines without puncturing the deeper (balloon) layer, using disposable laparoscopic scissors and a grasper. The penalty score was calculated by the principal investigator alone immediately on completion of the task and ranged from zero (no penalty lines crossed, maximum accuracy) to eight (all lines crossed, maximum inaccuracy).
Fig. 1

Sharp dissection task
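The penalty rule described above is simple enough to express directly; a minimal sketch, assuming each of the eight margins is recorded as crossed or not:

```python
def penalty_score(crossed: list[bool]) -> int:
    """Sharp Dissection penalty: 0 (no penalty lines crossed, maximum
    accuracy) to 8 (all crossed, maximum inaccuracy). One flag per
    margin; a margin counts once however many times it was crossed."""
    if len(crossed) != 8:
        raise ValueError("expected 8 margins (2 circle lines x 4 quadrants)")
    return sum(crossed)

# A subject who strayed over three of the eight margins:
print(penalty_score([True, False, True, False, False, True, False, False]))  # 3
```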

Motion data were downloaded to a standard laptop computer (Dell™ Inspiron™ 5160, Dell Inc., Texas, USA). Task footage was acquired by the web cameras and stored as digital (AVI) files. Data were formatted and examined using the Statistical Package for the Social Sciences (SPSS) for Windows, version 11 (SPSS Inc., Chicago, IL, USA).

Each group’s subjects were coded by task performance date. Ten subjects were then randomly selected from each of the three groups using commercially available software (www.randomizer.org). Each subject was assigned a random integer from 1 to 30 to avoid any bias in the ordering of subjects. Coded, unedited video footage was copied to compact disc for rating by two assessors.
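The random draws were made with an external tool (www.randomizer.org); a hypothetical Python equivalent of the sampling and re-coding steps, using the cohort sizes reported in the Results, might look like this:

```python
import random

# Eligible cohort sizes from the Results section.
pools = {"consultant": 18, "trainee": 57, "student": 50}

rng = random.Random()  # the actual draws used www.randomizer.org
sample = [(group, subject_id)
          for group, size in pools.items()
          for subject_id in rng.sample(range(size), 10)]

# Assign each of the 30 sampled subjects a shuffled code 1-30 so that
# the review order carries no information about experience group.
codes = rng.sample(range(1, 31), 30)
blinded = dict(zip(codes, sample))  # code -> (group, subject), hidden from raters
```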

Both assessors were consultants practicing advanced laparoscopic upper gastrointestinal and hernia surgery and additionally held recognized national and international faculty positions as laparoscopic tutors. A preliminary 1-hour briefing meeting with both raters introduced the assessment tool, using five examples to optimize rater agreement. Discrepancies in blinded scores were discussed and agreement was sought on performance in light of their comparison. Assessors were blinded to the identity and experience level of the subjects reviewed.

A bespoke, adapted Global Rating Scale was designed, based on the original OSATS form of Reznick et al. [11, 12]. The original scale used seven categories, some of which were not applicable or relevant to a synthetic laparoscopic task (e.g., use of assistants). Furthermore, the original scale lacked a measure reflecting the specific handling requirements of safe laparoscopic practice; we therefore described a new scale component based on principles of safe instrument use. The modified Global Rating Scale (Fig. 2) consisted of four categories (Respect for Tissue, Time/Motion/Progress, Instrument Handling, and Safety) and an “Overall Impression” score. Components were scored from 1 to 5 on a Likert scale structured with performance descriptors.
Fig. 2

Global rating scale/score sheet for sharp dissection task (modified after Reznick et al. [12])
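A sketch of the score sheet as a data structure, assuming that the composite “mean GRS” used later averages the four category scores and that Overall Impression is kept separate (the text does not state the formula explicitly):

```python
from dataclasses import dataclass

@dataclass
class GRSSheet:
    """Modified Global Rating Scale: four Likert components (1-5)
    plus a separate Overall Impression score."""
    respect_for_tissue: int
    time_motion_progress: int
    instrument_handling: int
    safety: int
    overall_impression: int

    def mean_grs(self) -> float:
        parts = (self.respect_for_tissue, self.time_motion_progress,
                 self.instrument_handling, self.safety)
        if not all(1 <= s <= 5 for s in parts):
            raise ValueError("Likert components are scored 1 to 5")
        return sum(parts) / 4

sheet = GRSSheet(4, 3, 4, 5, overall_impression=4)
print(sheet.mean_grs())  # 4.0
```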

Correlation was tested by using Spearman’s rho and significance level 0.05. Strength of relationship was interpreted as little or no relationship (R = 0–0.25), weak to fair (R = 0.25–0.5), moderate to good (R = 0.5–0.75), and excellent (R > 0.75) [13]. Inter-rater reliability was calculated by using intraclass correlation coefficient (ICC).
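A minimal sketch of both statistics in Python, assuming a single-measure consistency ICC (the text does not state which ICC form was used, so ICC(3,1) is an assumption):

```python
import numpy as np
from scipy.stats import spearmanr

def icc_3_1(ratings: np.ndarray) -> float:
    """Single-measure consistency ICC(3,1) from two-way ANOVA mean
    squares. ratings is an (n_subjects, n_raters) score matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_subj = k * np.sum((ratings.mean(axis=1) - grand) ** 2)
    ss_rater = n * np.sum((ratings.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ms_subj = ss_subj / (n - 1)
    ms_err = (ss_total - ss_subj - ss_rater) / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Illustrative placeholder data only: correlate a motion metric with
# mean GRS scores for 30 subjects.
rng = np.random.default_rng(1)
path = rng.normal(500, 100, size=30)            # hypothetical path lengths (mm)
grs = 5 - path / 200 + rng.normal(0, 0.3, 30)   # hypothetical mean GRS
rho, p = spearmanr(path, grs)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```

The rho magnitude is then read against the bands above (0–0.25 little or none, 0.25–0.5 weak to fair, 0.5–0.75 moderate to good, >0.75 excellent).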

Results

A total of 30 subjects, consisting of 10 consultant surgeons, 10 senior trainees, and 10 novice medical students, were randomly selected from a cohort of 125 eligible subjects that comprised consultants (n = 18), senior trainees (n = 57), and students (n = 50). Both raters independently reviewed 30 randomly selected Sharp Dissection video files in a blinded fashion. Median task time was 327 (range, 95–1138) seconds. Raters estimated a review time of 3 min per subject totalling 90 min (35% of total footage time).

Boxplots demonstrated an inverse relationship between each performance metric (time, path, smoothness) and group experience level (Fig. 3), but only a slight trend was detectable for penalty score. Spearman’s rho coefficients showed good to excellent correlation for time, path, and smoothness (−0.67 to −0.78; p < 0.05) but not accuracy (Table 1).
Fig. 3

Boxplots of sharp dissection metrics in 30 random subjects grouped by experience level

Table 1

Correlation coefficients for sharp dissection metrics and actual experience level

Metric              Spearman’s rho    p value
Time (s)            −0.78             <0.001*
Path length (mm)    −0.67             <0.001*
Smoothness          −0.75             <0.001*
Penalty             −0.27             0.143

* Correlation significant at 0.01 level (2-tailed)

Composite rating scores (mean GRS) also reflected expected skill level, as demonstrated by a linear relationship for both raters (Fig. 4). This was better demonstrated by rater 1, confirmed by a near-excellent Spearman’s rho of 0.74.
Fig. 4

Boxplot of raters’ GRS assessment of performance and actual group

Spearman’s rho correlations for GRS with metrics are shown in Table 2. Strongest correlations were found for Time/Motion/Progress with Time (Spearman’s rho 0.88; p < 0.05) and Instrument Handling with Path Length (Spearman’s rho 0.8; p < 0.05). Smoothness correlated well with Respect for Tissue in Rater 1 (rho 0.68) but not Rater 2 (rho 0.18).
Table 2

Spearman’s rho correlations between rater GRS and ProMIS metrics for sharp dissection

Rater     GRS component           Time (s)   Path length (mm)   Smoothness   Penalty
Rater 1   Respect for tissue      −0.71*     −0.72*             −0.68*       −0.53*
Rater 1   Time/motion/progress    −0.88*     −0.86*             −0.85*       −0.32
Rater 1   Instrument handling     −0.78*     −0.80*             −0.75*       −0.42*
Rater 1   Safety                  −0.64*     −0.67*             −0.61*       −0.61*
Rater 2   Respect for tissue      −0.18      −0.24              −0.18        −0.58*
Rater 2   Time/motion/progress    −0.53*     −0.47*             −0.55*       −0.24
Rater 2   Instrument handling     −0.69*     −0.60*             −0.67*       −0.46*
Rater 2   Safety                  −0.25      −0.26              −0.25        −0.30

Correlation coefficients with absolute value >0.5 indicate moderate to excellent association

* Significant at 0.05 level

A summary mean GRS showed stronger inter-rater reliability than its individual scale components (ICC 0.68 vs. 0.5–0.62); it was also a more reliable predictor of performance level than simply estimating experience level (overall impression ICC = 0.5).

The relationship between subjective overall impression and actual skill level was stronger in rater 1’s assessment (Spearman’s rho 0.74 vs. 0.62; Table 3). However, this measure showed poor reliability (ICC 0.5). Comparison of the two objective assessment methods revealed correlation with experience level was similar for both the consultant-assessed Mean GRS (Spearman’s rho 0.58–0.74) and ProMIS motion analysis metrics (0.67–0.78; p < 0.05).
Table 3

Correlation coefficients for GRS rater assessments and actual experience level

Score                Spearman’s rho, rater 1   Spearman’s rho, rater 2
Mean GRS             0.74*                     0.58*
Overall impression   0.74*                     0.62*

* Significant at 0.05 level

Discussion

As an assessment method, motion analysis assumes that novice features recognized in the motor behavioral literature (e.g., hesitancy, inaccuracy, inefficiency) can be inferred from instrument movement [14]. This rationale is consistent with recommendations by leading researchers in the field of surgical education and simulation and shares similar concepts with VR simulators, the Advanced Dundee Endoscopic Psychomotor Tester (ADEPT), and the Imperial College Surgical Assessment Device (ICSAD) [9, 15–23].

The Global Rating Scale, developed by Reznick and colleagues in Toronto, has seen several derivations, which have transformed potentially biased subjective opinion into valid, objective, structured assessment tools. These have been applied to live and recorded operations and to open and laparoscopic bench models [2, 11, 12, 21, 24–26]. Our scale adaptation did not adopt the same modifications as those of the GOALS (Global Operative Assessment of Laparoscopic Skills) group [24]. Our descriptor “Instrument Handling” was felt to encompass both depth perception and bimanual coordination as described in their study; it was believed that introducing multiple descriptors risked scoring the same behaviour repetitively.

There are time, cost, and resource implications for both assessment methods used in this study. Objective observer rating requires reliable equipment and methods of obtaining data, such as time-consuming direct observation in theater or tedious review of edited or complete procedural footage. This requires standardization, supervision of collection, blinding, and investment in adequate audio-visual recording and playback equipment. The voluntary participation of consultants in our study cannot be relied on in other centers, and remuneration costs may need consideration.

Simulation technology also comes at a price and although many systems are now available for less than U.S. $40,000, running and maintenance costs, as well as those of disposable materials in the case of hybrid systems such as ProMIS, must be considered.

This study was designed to reduce self-selection bias in motivated volunteers by randomly selecting a sample of ten subjects from each group for comparison. This had the additional effect of generating a feasible sample size that was amenable to rater assessment of each subject’s task footage.

Construct validity

Validity measures the degree of confidence one can place in inferences drawn from scores on a scale. In this study, it would be expected that consultants should outperform inexperienced trainees, who should perform better than complete novices (construct validity). A separate study in our unit has demonstrated ProMIS to possess construct validity [27].

Time, Path, and Smoothness scores all showed significant correlation with experience level, supporting construct validity. Penalty score was a poor group discriminator. This was not a built-in simulator measure but rather an attempted qualitative assessment of accuracy, relying on immediate examination of the stamped target. It was a compromise measure, using crossed lines as a proxy, that took no account of how many times each quadrant was crossed, only whether it was crossed at all. Furthermore, one cannot interpret how carefully the target was dissected from a simple summary value.

Although construct validity cannot be concluded from a single experiment, there was a sequential improvement in each motion analysis metric, which would support further work to examine significant group differences in a larger study and thus their ability to discriminate task performance by expected skill level.

Both raters’ GRS assessments identified actual experience level from task performance (Fig. 4). Comparisons between raters (both objective mean GRS and subjective impression) show consistently greater accuracy in rater 1. The relationship between scores and groups is consistent for mean GRS and overall impression in rater 1 (dark boxes) but not in rater 2 (light boxes). This may suggest that rater 1 used the GRS to inform the overall impression, whereas rater 2 drew conclusions from an instinctive interpretation. Overall, rater 2 was more critical and more likely to underestimate actual experience level.

The inferior GRS-experience correlations seen in rater 2 suggest the discrepancy in reliability may be due to rater 2’s interpretation, and therefore use, of the scale. This may reflect comparative unfamiliarity with objective assessment tools, incomplete examination of task footage, or competing commitments influencing concentration. Causal factors could be identified through rater feedback and discussion and then rectified, leading to improved agreement and accuracy. One must be cautious in assuming that the superior correlation of assessed performance and experience level in rater 1 is representative of this tool’s reliability in a wider setting.

Group characteristics

The consultant group showed a wide variation in laparoscopic experience and current practice, ranging from occasional cholecystectomy to bariatric and colorectal surgery. In contrast, laparoscopic operative experience was low in the “senior trainee” group. However, despite having obtained the MRCS diploma, the group from which these subjects were drawn was heterogeneous with respect to number of years qualified (median 7; range 4–28) and appointment (43% senior house officer, 37% specialist registrar, 20% research fellow/trust grade). Laparoscopic experience, where present in this group, was restricted to cholecystectomy and appendicectomy.

With such low case experience, the superiority of trainee performance metrics and GRS scores over student novices may seem surprising. However, although the seniors were inexperienced as first operators in laparoscopic surgery, one would still expect greater coordinated psychomotor skill given that this group had previously received basic instruction in laparoscopic skills on basic surgical skills courses. Furthermore, all had assisted at laparoscopic procedures as camera assistants (mean 33; range 6–240), enabling familiarity and adaptation to fulcrum effects, depth perception, tracking, and steady handling. One might further expect seniors to have acquired some advantage in dexterity compared with students during their surgical training experience.

That both simulator metrics and rater assessment detected performance differences between these groups, despite the absence of gross experiential contrast, is encouraging: both tools examined in this study have promising discriminatory value, and the finding supports the view that surgical skills are not acquired solely through operative case numbers.

Concurrent validity

Metrics corresponded appropriately and significantly with observer GRS assessments. Multiple correlations suggest a degree of overlap between performances in each category that cannot be summarized by one single psychometric measure. For example, good instrument handling might involve steady control (smoothness) and economic spatial movement (path length). The objective accuracy measure (penalty) corresponded well and independently with rater assessments of Safety or Respect for Tissue, which confirms that raters were formally reviewing each performance accurately and not allocating scores randomly or based simply on completion time.

This study did not attempt to analyse the intra-rating behavior of the assessors. However, such information may have identified a change in consistency and inter-rater agreement as familiarity with the rating scale progressed.

Unedited video assessment has been shown to be reliable and superior to edited footage in terms of correlation with live observational assessment and inter-rater reliability [25, 28, 29]. Editing is time-consuming and requires expertise in adhering to agreed criteria to ensure that informative behaviors are not screened out, potentially introducing bias. Videotape analysis may still miss subtleties of performance that can be gauged only in the operating room, such as instrument selection, communication, and the requirement for a senior surgeon to take over. However, in the context of a laboratory-simulated dexterity task such as ours, there was no concern that such factors might influence performance. Dath et al. demonstrated that rater scanning of videotape footage could be performed successfully without detriment to inter-rater reliability [25, 28].

Scanned footage review reduced total assessment time by one-third but required the valuable time of two consultant surgeons. Inter-rater reliability (IRR) coefficients demonstrated that rater agreement was inconsistent across the GRS components. Instrument Handling and Time/Motion/Progress showed good ICCs (0.60, 0.62), which may suggest that observer assessment of the flow and technical approach to the task was straightforward to interpret.

Time/Motion/Progress may have shown good agreement because raters were able to interpret the completion time from the recorded footage—a limitation of choosing not to edit the footage into a standardized format. Conversely, allowing the option of scanning through video footage rather than examining it in its entirety may have enhanced practicality at the expense of accurate assessment, leading to both poor correlations and discrepancies between the raters.

Inferior ICCs were demonstrated for Respect for Tissue and Safety. These may be more ambiguous terms, especially given the artificiality of a synthetic task. Furthermore, both may share some commonality in interpretation. Although performance descriptors were intended to reduce the subjectivity of the ordinal scale, terms such as “appropriate handling,” “safe dissection,” and “minimal damage” may still be subjective if raters differ in their opinion of what constitutes such behavior.

The category Safety was created specifically for this study, describing behavioral attributes appropriate to good laparoscopic practice. Although not a feature of the original GRS design, it shares some similarities with revisions made in other studies [24, 30]. Perhaps more clearly defined examples of poor safety, or an extended briefing discussion of example footage, might have improved these measures. However, this may have reduced the tool’s ease of use or frustrated and insulted the expertise of the raters.

Although rating scales such as Vassiliou et al.’s GOALS [24] have shown ICC reliability coefficients of 0.89, they experienced discrepancies in inter-rater agreement between categories, as in this study. The GOALS system was also applied in a real operative setting, one more familiar to senior consultant-level surgeons and thus likely to contribute to better agreement.

Of the two objective modalities used in our study, ProMIS motion analysis metrics showed at least an equivalent correlation with actual experience level as the composite (mean GRS) observer rating score (Spearman’s rho 0.67–0.78 vs. 0.58–0.74, respectively).

Simulation fidelity

The artificiality of the low-fidelity synthetic glove-balloon model used in our study makes skills interpretation considerably more difficult. Whilst the predictive validity and training value of this task are unknown, the value of low-fidelity laparoscopic simulations has been demonstrated.

Videotrainer drills (e.g., bean drop, running string) have been designed by groups such as Rosser et al. [30] and have been shown in randomized trials to improve objective performance in laparoscopic cholecystectomy [29, 31]. The Canadian McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) has been validated and shown to have training benefit in novices, where repetition of simple pegboard transfer tasks improved completion times of intracorporeal knots [32–35]. Studies have shown improved porcine laparoscopic cholecystectomy performance in subjects randomized to virtual reality manipulation exercises using simple geometric shapes [26, 36].

Similarly simple tasks, such as that used in our study, have been used by other authors and are valuable for their low cost, reproducibility, and adaptability to other centers, especially in an era when training and assessment resources and funding may be restricted [23]. If one is to avoid potential harm to patients, efforts must be made to identify representative simulations and to assess skill outside the operating room. It is ethically questionable to expect a junior trainee to be assessed on basic manipulative skills in the operating room, yet objective assessment at this stage is increasingly important, as competence in laparoscopy will be expected earlier in training than previously.

Whilst more complex simulations may be created (e.g., intracorporeal suturing), they introduce more variables, which may not be amenable to video assessment alone. This study deliberately selected a task that demanded reasonable coordination of laparoscopic skill without being too complex for a student or demanding a prolonged period of previous training, which might influence crude psychomotor ability, as with suturing. Furthermore, more complex tasks may introduce learning-curve effects whereby some subjects assimilate the cognitive aspects of the drill more quickly than others. This task was also reasonably independent of end-product analysis.

Conclusions

Our results support construct validity of the Sharp Dissection metrics with respect to experience level. Components of an unedited, video-based Global Rating Scale assessment informed a reliable and representative measure of actual skill level, but were rater-dependent. Whilst subjective Overall Impression and mean GRS showed similar correlation with actual group, the objective rating was more reliable.

Significant GRS-metric correlations support the conclusion that instrument motion analysis offers concurrent validity with observer assessment. The similar correlations with experience level found for both methods suggest motion analysis may be as accurate as consultant observer assessment in the setting of a simulated laparoscopic task.

Historical assessments of surgical skill have relied on the opinion of experts and mentors. Intuitively, the direct supervision of training provides an unrivalled informed impression of performance but arguably lacks objectivity. Acknowledging potential flaws in the design and application of the rating scale, the results of this study did not confirm any strong advantage of consultant observer assessment over instrument tracking in a simulated laparoscopic task. Rating scale improvements and rater “training” could potentially address inconsistent findings and reliability.

Simulator data are immediate, objective, and independent of time-constrained consultant surgeons and agreement considerations. Although these results do not prove the superiority of simulator motion analysis, they are promising and demand further investigation. With trainees increasingly pressured to attain competency with diminished operative exposure, and with increasing service demands on trainers, simulator technology may have a role in supplementing course assessment, providing resource-efficient, objective feedback, and freeing trainers to concentrate on teaching.

Copyright information

© Société Internationale de Chirurgie 2008