Cumulatively, over ten million robot-assisted procedures have been performed worldwide using da Vinci® surgical systems [1]. Surgeons have been quick to adopt robot-assisted surgery across various surgical specialities due to enhanced optics, greater freedom of movement and preferable ergonomics [2]. Despite the rapid uptake of robot-assisted procedures, training curricula and consensus on assessments that can be implemented to inform practice and assist accreditation are still in their infancy. Training often varies vastly and can lack structure, with most surgeons undergoing mentorship and observation rather than working towards specifically set performance targets [2]. Accessible training courses exist; however, their content varies greatly, and the practical training offered is inconsistent [3]. In the United States, several adverse events in robot-assisted surgery were reported from 2000 to 2013 by Alemzadeh et al. using Food and Drug Administration data from the Manufacturer and User Facility Device Experience database [4]. A lack of structured training was considered a contributing factor, and it was proposed that the implementation of uniform standards for training would reduce the incidence of adverse events [4]. Recently, the Orsi Academy and the British Association of Urological Surgeons have made vital progress in setting foundations for standardising robotic training curricula [5, 6].

Using a validated, objective and standardised assessment tool to assess progression and competency is essential for robotic surgical training programmes [7]. Objective clinical human reliability analysis (OCHRA) is an error-based hierarchical task analysis assessment tool that has been applied to laparoscopic surgery to detect and analyse technical errors and their consequences. It is based on granular observational data capture and provides the in-depth analysis of surgical performance that is required for meaningful feedback.

OCHRA has demonstrated construct and predictive validity, with excellent intra- and inter-rater reliability, in multiple studies across various surgical specialities for open and laparoscopic procedures [8,9,10,11,12,13]. It is more objective than global observational assessments because it captures performance at this granular level. Automated performance metrics (APMs), namely kinematic and system event data from recording devices such as the dVLogger®, provide an exciting prospect for future training [14]. OCHRA can facilitate the application of artificial intelligence (AI) and machine learning to complement and advance these early efforts alongside APMs, which have demonstrated construct and predictive validity in robotic surgery [15,16,17]. By embracing big data methodology such as deep learning, effective and meaningful large-scale performance feedback is on the horizon [18]. For AI to recognise and interpret errors, large volumes of manually annotated dry lab and clinical videos are required to train and test models. OCHRA is, to our knowledge, the only granular error analysis tool that has been validated within minimally invasive surgery to date, but there are no prior reports of its use in robotic surgery.

The aims of this study were to develop an OCHRA methodology that is applicable to assess generic basic robot-assisted skills and to test its feasibility, concurrent validity and reliability for the assessment of technical performance in robot-assisted surgical tasks.

Methodology

To explore the concurrent validity of OCHRA within basic robotic technical skills assessment, this study uses automated VR assessment scores and Global Evaluative Assessment of Robotic Skills (GEARS). VR scores have demonstrated face, content and predictive validity [19,20,21,22,23] and GEARS is one of the most widely used valid and reliable tools for robotic surgical assessments [24, 25].

Development of OCHRA methodology

Our OCHRA methodology was adapted and refined from the laparoscopic methodology described by Tang, Miskovic and Foster [8, 13, 26] so that it could be applied to robotic surgical procedures. Five categorisations of error events provide the OCHRA methodology framework: error type, consequence of error, external error mode, instrument and non-error event (Supplementary Table 1A–E). As this was a feasibility study, OCHRA experts refined this bespoke robotic-specific methodology, which retained the five domains of error events described in the laparoscopic OCHRA methodology [8, 13, 26]; this was done through author discussion after initial analysis of the videos and was piloted within the robotic setting.

We define error types for robotic surgical procedures as the observed errors seen during video analysis. OCHRA error events contain a “consequence” domain, which would be assigned a severity rating within clinical surgery [27]. However, severity rating was deemed not feasible for basic robotic skills, as errors are usually binary with no true gradation of severity, and it was therefore excluded from our analysis. External error modes are defined as the “external and observable manifestation of the error or behaviours exhibited by an operator, i.e. the physical form an error takes” [28]. For laparoscopic surgery, Joice and Tang et al. describe ten generic external error modes which represent “observed patterns of failure” and can be categorised into procedural or execution error modes according to the observed “underlying causative mechanism” [10, 26].
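As a concrete illustration of how a single annotated error event could be represented under this five-category framework, a minimal Python sketch is shown below; the field names and example values are hypothetical and do not reproduce the exact coding scheme given in Supplementary Table 1A–E.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OCHRAErrorEvent:
    """One error event annotated during retrospective video review.

    Field names are illustrative; the framework comprises five
    categorisations: error type, consequence of error, external
    error mode, instrument and non-error event.
    """
    task: str                       # e.g. "knot tying (dry lab)"
    task_zone: int                  # sequential component task zone (1-4)
    subtask: str                    # subtask from the hierarchical task analysis
    timestamp_s: float              # time of the event within the video
    error_type: str                 # e.g. "(N) Inappropriate/poor use of endowristed instrument"
    consequence: Optional[str]      # severity is not rated for basic skills, so a simple label suffices
    external_error_mode: str        # procedural (a-f) or executional (g-j) mode
    instrument: str                 # e.g. "(6) Fine grasper (Marylands, needle driver)"
    non_error_event: Optional[str]  # corrective or non-error event, e.g. "(f) Continue uninterrupted"

# Hypothetical annotation of a single event:
event = OCHRAErrorEvent(
    task="knot tying (dry lab)",
    task_zone=2,
    subtask="Wrapping suture thread around the instrument",
    timestamp_s=143.2,
    error_type="(N) Inappropriate/poor use of endowristed instrument",
    consequence="(15) Delay in progress of procedure",
    external_error_mode="executional (hypothetical mode label)",
    instrument="(6) Fine grasper (Marylands, needle driver)",
    non_error_event="(f) Continue uninterrupted",
)
```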

Task selection and hierarchical task analysis

Eight robotic tasks in total were selected for this study, comprising four VR tasks and their four dry lab equivalents: sea spikes (Fig. 1A), ring rollercoaster (Fig. 1B), big dipper needle driving (Fig. 1C) and knot tying (Fig. 1D). Together these tasks require a broad range of generic basic surgical skills; endowrist manipulation, simple grasping, instrument handling, camera control, clutching, needle control, needle driving and suture manipulation are all assessed to varying degrees.

Fig. 1

A, D are photographs of the dry lab sea spikes and knot tying tasks being performed, respectively. B, C are photographs of the VR ring rollercoaster and big dipper needle driving tasks being performed, respectively

For each task, a hierarchical task analysis was created through expert consensus with two surgical specialists (Supplementary Table 2A–D). A validated and reliable robotic surgical suturing checklist provided the basic framework for developing our task analyses for the big dipper needle driving and knot tying tasks [29]. To our knowledge, there are no validated checklists for the sea spikes or ring rollercoaster tasks within the literature. Each task was divided into four sequential component task zones, which were further subdivided into subtasks, with the correct action for each subtask clearly described.
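To make the structure of such a hierarchical task analysis concrete, a brief sketch is given below; the first two zone names for knot tying correspond to zones reported later in the results, while the remaining zone names and all subtask descriptions are hypothetical placeholders rather than the validated analysis in Supplementary Table 2.

```python
# Hypothetical hierarchical task analysis for the knot tying task:
# four sequential component task zones, each subdivided into subtasks
# with the correct action clearly described.
knot_tying_hta = {
    "task": "Knot tying",
    "task_zones": [
        {"zone": 1, "name": "Preparing suture thread for making loops",
         "subtasks": ["Grasp the suture at the appropriate point",
                      "Position the thread ready for looping"]},
        {"zone": 2, "name": "Wrapping suture thread around the instrument",
         "subtasks": ["Create the first loop around the needle driver",
                      "Maintain thread tension while looping"]},
        {"zone": 3, "name": "Grasping the free suture end",   # hypothetical zone name
         "subtasks": ["Grasp the free end with the wrapped instrument"]},
        {"zone": 4, "name": "Laying the knot",                # hypothetical zone name
         "subtasks": ["Pull the suture ends in opposing directions",
                      "Lay the knot flat without excess tension"]},
    ],
}
```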

Study design and data collection

Participants were invited as part of a University Robotic Curriculum and a surgical specialty society robotic training course (The Association of Laparoscopic Surgeons of Great Britain and Ireland), both with a standardised timetable. Institutional Review Board approval was not required as the study did not involve patient care. The study followed the ethical principles of the Declaration of Helsinki, and informed written consent was collected. Using the definitions of Sridhar et al., we define novice robotic surgeons as those who have performed no robotic surgical procedures [30].

The 5-day robot-assisted surgical skills courses were held at a pre-clinical surgical research and training hub. All recorded and assessed VR tasks for this study were performed using da Vinci SimNow®. All recorded and assessed dry model tasks were performed using either da Vinci® X or Xi robotic surgical systems. VR and dry lab tasks were video recorded and uploaded by participants to a secure cloud database server.

Attempt 1, for both VR and dry lab tasks, occurred after orientation and training on the first day of the course, with Attempt 2 on the fifth day. Each attempt was recorded and assessed using GEARS for both dry lab and VR tasks, and using automated VR scores for VR tasks. GEARS assessments were carried out by two trained, independent surgical residents. GEARS scores were obtained using a five-point Likert scale indicating competency in six domains: depth perception, bimanual dexterity, efficiency, force sensitivity, autonomy and robotic control, with a maximum score of thirty. Da Vinci SimNow® provided automated VR scores incorporating the total time to complete the task (scored out of fifty) and economy of motion (scored out of fifty), with a maximum score of one hundred. OCHRA error scores were obtained by analysing the video recordings retrospectively after the course. The OCHRA assessor was blinded to the other assessment scores.
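A minimal sketch of how these two score totals combine is shown below; the function names are our own and the ratings are hypothetical, so this illustrates the arithmetic only.

```python
GEARS_DOMAINS = ("depth perception", "bimanual dexterity", "efficiency",
                 "force sensitivity", "autonomy", "robotic control")

def gears_total(ratings: dict) -> int:
    """Sum six 5-point Likert ratings; maximum possible score is 30."""
    assert set(ratings) == set(GEARS_DOMAINS)
    assert all(1 <= r <= 5 for r in ratings.values())
    return sum(ratings.values())

def automated_vr_total(time_score: float, economy_of_motion: float) -> float:
    """Combine the two automated VR sub-scores (each out of 50); maximum 100."""
    assert 0 <= time_score <= 50 and 0 <= economy_of_motion <= 50
    return time_score + economy_of_motion

# Hypothetical ratings for one participant on one task:
print(gears_total({"depth perception": 4, "bimanual dexterity": 3, "efficiency": 3,
                   "force sensitivity": 4, "autonomy": 5, "robotic control": 4}))  # 23
print(automated_vr_total(time_score=38.0, economy_of_motion=42.5))                 # 80.5
```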

OCHRA specialist training

Two independent assessors underwent OCHRA training delivered by OCHRA specialists. Inter-rater reliability was tested and reviewed to refine the assessment process before all videos were reviewed. OCHRA was then applied to the videos of each task by the two assessors independently. Eleven video tasks were compared between the independent assessors using the OCHRA methodology, with an 80.9% percentage agreement for matched error events [182/216 (86.1%) error events matched].

Statistical analysis

A paired t test was used to assess each participant’s change in the sum of scores for each assessment tool from Attempt 1 to Attempt 2, i.e. at two separate points along the participant’s learning curve. The Wilcoxon matched-pairs signed-rank test, the non-parametric and thus assumption-free counterpart of the paired t test, was conducted to validate the findings. Missing data were accounted for using multiple imputation (regression method) when performing the paired t test and the Wilcoxon matched-pairs signed-rank test and when generating graphs.
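The study’s analyses were performed in SPSS; purely as an illustrative sketch of the paired comparisons described above, an equivalent computation in Python (with hypothetical per-participant summed scores, and omitting the multiple imputation step) might look as follows.

```python
import numpy as np
from scipy import stats

# Hypothetical sums of OCHRA error counts per participant for Attempt 1 and Attempt 2
attempt1 = np.array([210, 185, 240, 198, 175, 220, 205, 190, 230, 215])
attempt2 = np.array([140, 120, 160, 130, 118, 150, 135, 128, 155, 142])

# Paired t test on the within-participant change between attempts
t_stat, t_p = stats.ttest_rel(attempt1, attempt2)

# Wilcoxon matched-pairs signed-rank test as the non-parametric counterpart
w_stat, w_p = stats.wilcoxon(attempt1, attempt2)

print(f"paired t test: t = {t_stat:.2f}, two-tailed p = {t_p:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, two-tailed p = {w_p:.4f}")
```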

For each selected task, correlational analysis examined the degree to which scores from the separate assessment tools correlated. Spearman correlation coefficients were consistent with the direction and magnitude of the Pearson correlations but were reported instead of Pearson’s given the sample size. Inter-rater reliability was also assessed using the Spearman correlation coefficient. Two-tailed p values < 0.05 were considered significant.
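Similarly, a sketch of the Spearman correlation used for both the concurrent validity and inter-rater reliability analyses is given below (hypothetical scores; the actual analyses were run in SPSS).

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant scores for one task
ochra_error_counts = np.array([42, 35, 50, 28, 31, 45, 38, 26, 33, 40])
gears_scores       = np.array([18, 22, 15, 26, 24, 17, 20, 27, 23, 19])

rho, p_value = stats.spearmanr(ochra_error_counts, gears_scores)
print(f"Spearman rho = {rho:.2f}, two-tailed p = {p_value:.4f}")
# An inverse association (more errors, lower GEARS) gives rho < 0.
```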

All statistical analyses were carried out with SPSS Statistics 22.0 software (IBM Corporation and others, 1989–2013, USA).

Results

Twenty participants, aged 25 to 53 years (mean 32), performed 320 robotic skills tasks. There was an even split of female and male participants, with 10 (50%) females. All participants were qualified doctors who had undertaken their medical training in several different countries; the United Kingdom was the most common, with 10 (50%) doctors completing medical training there. Participants had varying open and laparoscopic surgical experience, and the majority, 12/20 (60%), had no years of open or laparoscopic surgical experience. Only one participant had over 10 years of open and laparoscopic surgical experience. Importantly, no participants had performed any robotic surgical procedures, so all were novice robotic surgeons.

Collected data

A total of 320 selected robotic skills tasks were assessed, comprising 160 VR videos and 160 dry lab videos. OCHRA video analysis was applied to an estimated 1632 min (over 27 h) of video data, with a total of 6488 error events analysed. GEARS scores were obtained for all 320/320 tasks (100%). A total of 48/320 (15%) videos were incomplete or missing across the dry lab and VR tasks; therefore, OCHRA video analysis was applied to 272/320 (85%) dry lab and VR tasks with OCHRA error scores recorded, and automated VR scores were obtained for 148/160 (92.5%) VR tasks.

Distribution of OCHRA error events

Error events most commonly occurred in the following task zones: for sea spikes [VR: ‘(2) Movement of ring in space’, 245/677 error events (36.2%)] [Dry lab: ‘(2) Movement of ring in space’, 186/582 error events (32.0%)], for ring rollercoaster [VR: ‘(2) Movement of ring along rollercoaster’, 805/1004 error events (80.2%)] [Dry lab: ‘(2) Movement of ring along rollercoaster’, 702/900 error events (78.0%)], for big dipper needle driving [VR: ‘(4) Exit of needle from tissue’, 813/1771 error events (45.9%)] [Dry lab: ‘(1) Preparing needle for tissue insertion’, 368/939 error events (39.2%)] and for knot tying [VR: ‘(2) Wrapping suture thread around the instrument’, 243/647 error events (37.6%)] [Dry lab: ‘(1) Preparing suture thread for making loops’, 126/324 error events (38.9%)].

The most common error type was ‘(N) Inappropriate/poor use of endowristed instrument’ for each task in both VR and dry lab. Specifically, for sea spikes [VR: 246/677 error events (36.3%)] [Dry lab: 258/582 error events (44.3%)], for ring rollercoaster [VR: 887/1004 error events (88.3%)] [Dry lab: 823/900 error events (91.4%)], for big dipper needle driving [VR: 633/1771 error events (35.7%)] [Dry lab: 532/939 error events (56.7%)] and for knot tying [VR: 292/647 error events (45.1%)] [Dry lab: 152/324 error events (46.9%)].

The most common consequence for sea spikes was ‘(15) Delay in progress of procedure’ in both VR and dry lab [VR: 368/677 error events (54.4%)] [Dry lab: 404/582 error events (69.4%)], for ring rollercoaster was ‘(13) Risk of injury to other structure/minor collision’ in both VR and dry lab [VR: 838/1004 error events (83.5%)] [Dry lab: 803/900 error events (89.2%)], for big dipper needle driving was ‘(13) Risk of injury to other structure/minor collision’ in VR [VR: 820/1771 error events (46.3%)] and ‘(8) Tissue/foam pad damaged’ in dry lab [Dry lab: 360/939 error events (38.3%)] and for knot tying was ‘(15) Delay in progress of procedure’ in both VR and dry lab [VR: 329/647 error events (50.9%)] [Dry lab: 187/324 error events (57.7%)].

External error modes can be divided into procedural errors (a–f) and executional errors (g–j). Executional errors (g–j) were more common than procedural errors (a–f) in each task: sea spikes executional errors [VR: 418/677 error events (61.7%)] [Dry lab: 335/582 error events (57.6%)], ring rollercoaster executional errors [VR: 925/1004 error events (92.1%)] [Dry lab: 860/900 error events (95.6%)], big dipper needle driving executional errors [VR: 1572/1771 error events (88.8%)] [Dry lab: 786/939 error events (83.7%)] and knot tying executional errors [VR: 496/647 error events (76.7%)] [Dry lab: 233/324 error events (71.9%)].

The only instruments used for each task were the ‘(6) Fine grasper (Marylands, needle driver)’ and the ‘(15) Camera endoscope’. Instruments were not exchanged during any of the tasks. A total of 6373/6844 error events (93.1%) were caused using the ‘(6) Fine grasper (Marylands, needle driver)’ across all tasks.

The most common non-error event for sea spikes was ‘(f) Continue uninterrupted’ in both VR and dry lab [VR: 347/677 error events (51.3%)] [Dry lab: 244/582 error events (41.9%)], for ring rollercoaster was ‘(b) Adjust hold to improve orientation’ in both VR and dry lab [VR: 798/1004 error events (79.5%)] [Dry lab: 762/900 error events (84.7%)], for big dipper needle driving was ‘(f) Continue uninterrupted’ in VR [VR: 725/1771 error events (40.9%)] and ‘(j) Corrective action within subtask’ in dry lab [Dry lab: 329/939 error events (35.0%)], and for knot tying was ‘(h) Requires repetition of step’ in both VR and dry lab [VR: 237/647 error events (36.6%)] [Dry lab: 111/324 error events (34.3%)].
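The distributions reported above are frequency tabulations of the annotated error events; a minimal pandas sketch of that tabulation (with hypothetical records and illustrative column names) is shown below.

```python
import pandas as pd

# Hypothetical annotated OCHRA error events
events = pd.DataFrame({
    "task":       ["knot tying", "knot tying", "ring rollercoaster", "ring rollercoaster"],
    "setting":    ["VR", "Dry lab", "VR", "Dry lab"],
    "error_type": ["(N) Inappropriate/poor use of endowristed instrument"] * 4,
    "consequence": ["(15) Delay in progress of procedure",
                    "(15) Delay in progress of procedure",
                    "(13) Risk of injury to other structure/minor collision",
                    "(13) Risk of injury to other structure/minor collision"],
})

# Frequency of each consequence per task and setting
consequence_counts = (events.groupby(["task", "setting", "consequence"])
                            .size()
                            .sort_values(ascending=False))
print(consequence_counts)

# Most common error type per task and setting
most_common = events.groupby(["task", "setting"])["error_type"].agg(lambda s: s.mode().iloc[0])
print(most_common)
```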

Learning curve of participants

Performance improved between the start of the course (Attempt 1) and the end of the course (Attempt 2) using all assessment tools for each participant (Fig. 2A–E). Paired samples t tests indicated that participants improved significantly (all two-tailed p values < 0.001) over the course with each assessment tool in both VR and dry lab tasks (Table 1). The mean difference ± standard deviation between the sum of Attempt 1 assessment scores and the sum of Attempt 2 assessment scores was 78.50 ± 55.49 for VR OCHRA error count scores, − 23.18 ± 11.11 for VR GEARS scores, − 142.57 ± 77.15 for automated VR scores, 62.70 ± 26.22 for dry lab OCHRA error count scores and − 21.95 ± 9.65 for dry lab GEARS scores. The corresponding median difference (interquartile range) was 61 (64.50) for VR OCHRA error count scores, − 22 (14.75) for VR GEARS scores, − 132.5 (96.38) for automated VR scores, 61.5 (34.75) for dry lab OCHRA error count scores and − 22 (17.25) for dry lab GEARS scores. Wilcoxon matched-pairs signed-rank tests, the non-parametric and thus assumption-free counterpart of paired t tests, validated these findings (Table 2).

Fig. 2

Performance change from the start of the course (Attempt 1) to the end of the course (Attempt 2) using the different assessment tools for each participant. A–C show performance change for VR tasks using each assessment tool (sum of automated VR scores, sum of GEARS scores and sum of OCHRA scores, respectively) for each participant. D, E show performance change for dry lab tasks using each assessment tool (sum of GEARS scores and sum of OCHRA scores, respectively) for each participant

Table 1 Paired samples t test comparing the sum of each participant’s assessment scores from Attempt 1 to Attempt 2 for each assessment tool
Table 2 Related-samples Wilcoxon signed-rank test summary (the non-parametric counterpart of the paired t test) comparing the sum of assessment scores for Attempts 1 and 2 for each assessment tool in VR and dry lab

Concurrent validity of OCHRA

Strong inverse and significant associations were seen between dry OCHRA error count scores and dry GEARS scores for dry lab tasks (range r = − 0.86 to − 0.76, mean r = − 0.82) with a breakdown seen in Table 3 and Fig. 3A–D.

Table 3 Correlation analysis using Spearman’s rho correlation coefficient to examine concurrent validity in scores between assessment tools for the different tasks in both VR and dry lab
Fig. 3

OCHRA error count scores compared to dry GEARS scores for sea spikes dry tasks (A), ring rollercoaster dry tasks (B), big dipper needle driving dry tasks (C) and knot tying dry tasks (D). Each point represents a different participant. The OCHRA error count has no maximum score. The dry GEARS maximum score is 30

Strong inverse and significant associations were seen between VR OCHRA error count scores and VR GEARS scores for VR tasks (range r = − 0.92 to − 0.67, mean r = − 0.78) with a breakdown seen in Table 3 and Fig. 4A–D.

Fig. 4

OCHRA error count scores compared to VR GEARS scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each point represents a different participant. The OCHRA error count has no maximum score. The VR GEARS maximum score is 30

Strong inverse and significant associations were seen between VR OCHRA error count scores and automated VR scores for VR tasks (range r = − 0.90 to − 0.64, mean r = − 0.77) with a breakdown seen in Table 3 and Fig. 5A–D.

Fig. 5

OCHRA error count scores compared to automated VR scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each point represents a different participant. The OCHRA error count has no maximum score. The automated VR maximum score is 100

Strong and significant positive associations were seen between VR GEARS scores and automated VR scores for VR tasks (range r = 0.68 to 0.88, mean r = 0.77) with a breakdown seen in Table 3 and Fig. 6A–D.

Fig. 6

Automated VR scores compared to VR GEARS scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each point represents a different participant. The automated VR maximum score is 100. The VR GEARS maximum score is 30

Inter-rater reliability

There was a very strong positive correlation between reviewer 1 and reviewer 2 (Spearman’s rs(11) = 0.926, p < 0.001), such that both reviewers scored participants similarly, demonstrating very strong inter-rater reliability (Table 4).

Table 4 Correlation analysis using Spearman’s rho correlation coefficient to examine the inter-rater reliability between two independent assessors for OCHRA error scores

Discussion

This feasibility study is the first to investigate the application of OCHRA as an assessment tool in robotic surgery. Our bespoke robotic OCHRA methodology has shown concurrent validity with GEARS and automated VR scores, as well as excellent inter-rater reliability. Furthermore, it was able to demonstrate participants’ proficiency learning curves. This OCHRA tool has the potential to be applied within the rapidly expanding domain of AI as a valuable educational adjunct to inform deep learning and robotic surgery accreditation programmes.

Using a validated, objective and standardised assessment tool to assess progression and competency is essential for robotic surgical training programmes [7]. Global rating scales lack the in-depth granular analysis of technical skill and objectivity that error-based analysis provides [31]. Despite the plethora of publications on assessment tools in robotic surgery [18, 25, 32,33,34,35,36,37,38,39], there is a lack of error-based analysis methods in robotic surgery for education or accreditation. For error assessment methods to be utilised widely, a clear definition of what is considered an error is required. In addition to this, an error tool needs to be generic and adaptable to multiple procedures whilst retaining its objectivity.

Current technical error-based assessment tools for minimally invasive surgery include the Generic Error Rating Tool and Fundamentals of Laparoscopic Surgery. These provide basic error analysis by identifying that an observed error event has occurred and in which step of the procedure [31, 40, 41]. The OCHRA methodology goes further by analysing the consequence of the error, its severity, the operator mechanism that produced it, the instrument that caused it and the corrective events that followed. Assessing each error in such detail supports more meaningful feedback, which is fundamental to learning and to the future application of AI methods. Our analyses showed that executional errors were far more common than procedural errors within the external error modes. Differentiating between procedural and executional errors is valuable for identifying areas in which to improve and reduce errors [26]. Executional errors suggest a lack of technical skill or inadequate equipment, whereas procedural errors suggest a lack of knowledge of the steps of a procedure.

Because of this granularity, OCHRA is likely to complement the assessment tool most commonly used in robotic surgery, GEARS. It has been postulated that the ability to identify errors is linked to surgical aptitude [42, 43], and recent research using OCHRA has demonstrated a direct link between intra-operative performance and errors and patient outcomes, thereby demonstrating predictive validity [12, 17, 44]. Incorporating error-based assessments such as OCHRA into robotic surgical training may accelerate trainees’ proficiency learning curves, as in-depth analysis of individual mistakes will enable surgeons to learn from and prevent them. Consequently, this can underpin future research that aims to improve patient safety in the operating room and within robotic surgical training [45].

The main barrier to wider uptake of OCHRA within surgical training programmes is the substantial time required to analyse surgical videos. Additionally, OCHRA requires prior training in the methodology and its application in order to manually annotate and code the errors, including their grading, severity and any consequences. Moreover, OCHRA assessments cannot provide immediate feedback in the way that GEARS and automated VR scores can, precluding its use as a formative tool. It may therefore be better suited to objective, summative competency assessments that can be used for accreditation in specific procedures. The National Training Programme for laparoscopic colorectal surgery (LAPCO programme) is an example of an effective training programme which uses summative competency assessments for accreditation [46, 47]. The next stage in the evaluation of OCHRA in robotic surgery is to demonstrate its construct validity with clinically relevant metrics.

The application of OCHRA to inform AI models is an exciting prospect. OCHRA can provide granularly annotated surgical videos, particularly in this context, for recognising a near miss or a consequential error. The long-term aim is an early warning system providing real-time operative feedback. This is a long way off; however, OCHRA can potentially provide an opportunity to train deep learning methods in the recognition of errors and near misses. Further large studies and fully validated data sets of objectively analysed operative videos are required to allow this progression towards fully automated, objective and meaningful performance assessments.

This study had certain limitations. Despite the large volume of video analysed using the OCHRA methodology (over 27 h), only 20 participants performed the selected tasks in this proof-of-concept study. Research with larger sample sizes and participants of different levels of experience is required before application in clinical practice and training programmes for robotic surgery.

Although this study confirmed the feasibility of recording the tasks and collating and analysing the data, 48 of the 320 (15%) selected task videos were unable to undergo OCHRA analysis due to missing data or incomplete recordings. This was mainly because external recording was used; in future studies, direct video capture will be employed.

This study may also lack external validity, as it used novice robotic surgical trainees as participants. In addition, we have not established construct or predictive validity for OCHRA in robotic surgical skill analysis, although it presumably has a high likelihood of demonstrating both, as it did in laparoscopic surgery [11,12,13].

The selected tasks required a suitable range of general and basic robotic surgical skills. However, more advanced skills such as diathermy and dissection were not assessed in any of the selected tasks. Further work is needed to investigate whether the OCHRA methodology is applicable to more complex tasks that require more advanced robotic skills. This was a feasibility study intended to test the application of the OCHRA methodology outside clinical practice in the first instance. Having demonstrated the feasibility, reliability and validity of the OCHRA methodology in the lab setting, we are now proceeding with its application in a multi-centre clinical study, which is currently recruiting (Video Analysis in Minimally Invasive Surgery; IRAS number 309024; ClinicalTrials.gov Identifier: NCT05279287).

Whilst we have demonstrated excellent inter-rater reliability for evaluating error frequencies, there is potential for variability if other observers use a different definition of what constitutes an error. In our task analyses, we clearly describe the correct action required to complete each subtask in order to minimise inter-rater variability when using the OCHRA methodology. When applying the OCHRA methodology to other robotic surgical tasks, it is essential that these descriptions are clear and well defined within the task analysis subtasks.

Conclusion

This study shows the feasibility, concurrent validity and excellent reliability of the OCHRA methodology for assessing the technical performance of basic robotic surgical skills. Further application of OCHRA to more advanced robotic surgical procedures is required to validate its future use in the operating room and to evaluate the application of AI for automated error recognition.