The application of objective clinical human reliability analysis (OCHRA) in the assessment of basic robotic surgical skills

Background Using a validated, objective and standardised assessment tool to assess progression and competency is essential for basic robotic surgical training programmes. Objective clinical human reliability analysis (OCHRA) is an error-based assessment tool that provides in-depth analysis of individual technical errors. We conducted a feasibility study to assess the concurrent validity and reliability of OCHRA when applied to basic, generic robotic technical skills assessment.

Methods Selected basic robotic surgical skill tasks, in virtual reality (VR) and dry lab equivalent, were performed by novice robotic surgeons during an intensive 5-day robotic surgical skills course on da Vinci® X and Xi surgical systems. For each task, we described a hierarchical task analysis. Our robotic surgery-specific OCHRA methodology, with a standardised definition of error events, was applied to the recorded videos. Statistical analyses were performed to assess concurrent validity with existing tools and inter-rater reliability.

Results OCHRA methodology was applied to 272 basic robotic surgical skills tasks performed by 20 novice robotic surgeons. Performance scores improved from the start of the course to the end using all three assessment tools: Global Evaluative Assessment of Robotic Skills (GEARS) [VR: t(19) = − 9.33, p < 0.001] [dry lab: t(19) = − 10.17, p < 0.001], OCHRA [VR: t(19) = 6.33, p < 0.001] [dry lab: t(19) = 10.69, p < 0.001] and automated VR [VR: t(19) = − 8.26, p < 0.001]. Correlation analysis of OCHRA against GEARS and automated VR scores shows a significant and strong inverse correlation in every VR and dry lab task: OCHRA vs GEARS [VR: mean r = − 0.78, p < 0.001] [dry lab: mean r = − 0.82, p < 0.001] and OCHRA vs automated VR [VR: mean r = − 0.77, p < 0.001]. There is very strong and significant inter-rater reliability between two independent reviewers (r = 0.926, p < 0.001).
Conclusion OCHRA methodology provides a detailed error analysis tool for basic robotic surgical skills, with high reliability and concurrent validity against existing tools. OCHRA requires further evaluation in more advanced robotic surgical procedures.

Supplementary Information The online version contains supplementary material available at 10.1007/s00464-023-10510-2.

enhanced optics, greater freedom of movement and preferable ergonomics [2]. Despite the rapid uptake of robot-assisted procedures, training curricula and consensus on assessments that can be implemented to inform practice and assist accreditation are still in their infancy. Training often varies vastly and can lack structure, with most surgeons undergoing mentorship and observation instead of working towards specifically set performance targets [2]. There are accessible training courses; however, content again varies greatly, and the practical training offered is inconsistent [3]. In the United States, several adverse events in robot-assisted surgery were reported from 2000 to 2013 by Alemzadeh et al. using Food and Drug Administration data from the Manufacturer and User Facility Device Experience database [4]. A lack of structured training has partly contributed to this, and it was proposed that the implementation of uniform standards for training would reduce the incidence of adverse events [4]. Recently, the Orsi Academy and the British Association of Urological Surgeons have made vital progress in setting foundations for standardising robotic training curricula [5, 6].
Using a validated, objective and standardised assessment tool to assess progression and competency is essential for robotic surgical training programmes [7]. Objective clinical human reliability analysis (OCHRA) is an error-based hierarchical task analysis assessment tool that has been applied to laparoscopic surgery to detect and analyse technical errors and their consequences. This provides the in-depth analysis required for meaningful feedback, based on granular observational data capture that yields a detailed assessment of surgical performance.
OCHRA has demonstrated construct and predictive validity, with excellent intra- and inter-rater reliability, in multiple studies across various surgical specialities for open and laparoscopic procedures [8–13]. It is more objective than global observation assessments because it is based on granular observational data capture. Automated performance metrics (APMs), namely kinematic and system event data from recording devices such as the dVLogger®, provide an exciting prospect for future training [14]. OCHRA can facilitate the application of artificial intelligence (AI) and machine learning to complement and advance these early efforts alongside APMs, which have demonstrated construct and predictive validity in robotic surgery [15–17]. By embracing big data methodology such as deep learning, effective and meaningful large-scale performance feedback is on the horizon [18]. For AI to recognise and interpret errors, large sets of manually annotated dry lab and clinical videos are required to train and test models. OCHRA is most likely the only granular error analysis tool validated within minimally invasive surgery to date, but there are no prior reports of its use in robotic surgery.
The aims of this study were to develop an OCHRA methodology applicable to the assessment of generic basic robot-assisted skills and to test its feasibility, concurrent validity and reliability for the assessment of technical performance in robot-assisted surgical tasks.

Methodology
To explore the concurrent validity of OCHRA within basic robotic technical skills assessment, this study uses automated VR assessment scores and the Global Evaluative Assessment of Robotic Skills (GEARS). VR scores have demonstrated face, content and predictive validity [19–23] and GEARS is one of the most widely used valid and reliable tools for robotic surgical assessments [24, 25].

Development of OCHRA methodology
Our OCHRA methodology was adapted and refined from the laparoscopic methodology described by Tang, Miskovic and Foster [8, 13, 26] for application to robotic surgical procedures. Five categorisations of error events provide the OCHRA methodology framework: error type, consequence of error, external error mode, instrument and non-error event (Supplementary Table 1A–E). As this was a feasibility study, OCHRA experts improved and refined the bespoke robotic-specific methodology, which retained the five domains of error events described in the laparoscopic OCHRA methodology [8, 13, 26]. This was done after initial analysis of the videos, through author discussion, and piloted within the robotic setting.
We describe error types for robotic surgical procedures as the observed errors seen during video analysis. OCHRA error events contain a "consequence" domain, which would be designated a severity rating within clinical surgery [27]. However, this was deemed not feasible within basic robotic skills, as errors are usually binary with no true severity rating, and it was therefore excluded from our analysis. External error modes are defined as the "external and observable manifestation of the error or behaviours exhibited by an operator i.e. the physical form an error takes" [28]. For laparoscopic surgery, Joice and Tang et al. describe ten generic forms of external error modes which represent "observed patterns of failure" and can be categorised into procedural or execution error modes relating to the observed "underlying causative mechanism" [10, 26].
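To make the framework concrete, an annotated event could be represented as a small record holding the categorisation domains described above (the field names and the example event are illustrative, not taken from the published coding sheets; the consequence domain is omitted because it was excluded from this analysis):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OCHRAEvent:
    """One annotated event from video review (illustrative field names)."""
    task_zone: int                 # which of the four sequential task zones
    subtask: str                   # subtask within that zone
    error_type: Optional[str]      # observed error; None for a non-error event
    external_error_mode: str       # observable manifestation of the error
    procedural: bool               # procedural (True) vs execution (False) mode
    instrument: Optional[str]      # instrument involved, if any

# A hypothetical annotation from a knot tying video:
event = OCHRAEvent(
    task_zone=2,
    subtask="first throw of knot",
    error_type="suture slipped from grasper",
    external_error_mode="inappropriate application of force",
    procedural=False,
    instrument="needle driver",
)
```

A flat record of this shape is also what a manual annotation exported for AI training would naturally look like, one row per observed event.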

Task selection and hierarchical task analysis
Eight robotic tasks in total were selected for this study: four VR tasks and four dry lab equivalents. They were sea spikes (Fig. 1A), ring rollercoaster (Fig. 1B), big dipper needle driving (Fig. 1C) and knot tying (Fig. 1D). These tasks necessitate a broad range of generic basic surgical skills for assessment; endowrist manipulation, simple grasping, instrument handling, camera control, clutching, needle control, needle driving and suture manipulation are all assessed to varying degrees.
For each task, a hierarchical task analysis was created through expert consensus with two surgical specialists (Supplementary Table 2A–D). A validated and reliable robotic surgical suturing checklist provided the basic framework for developing our task analyses for the big dipper needle driving and knot tying tasks [29]. To our knowledge, there are no validated checklists for the sea spikes or ring rollercoaster tasks within the literature. Each task was divided into four sequential component task zones, which were subdivided further into subtasks, with the aim that each subtask is clearly described.
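The zone/subtask hierarchy described above can be sketched as a nested structure; the zone contents below are hypothetical and stand in for the actual descriptions in Supplementary Table 2:

```python
# Hypothetical hierarchical task analysis for a knot tying task:
# four sequential component task zones, each subdivided into
# clearly described subtasks (wording invented for illustration).
knot_tying_hta = {
    1: ["grasp suture at marked point", "position suture across target"],
    2: ["wrap suture around instrument", "grasp free end of suture"],
    3: ["pull first throw tight with balanced tension"],
    4: ["repeat throw in opposite direction", "confirm knot lies square"],
}

# Error events found during video review are then attributed to the
# zone and subtask in which they occurred.
n_zones = len(knot_tying_hta)                          # 4
total_subtasks = sum(len(v) for v in knot_tying_hta.values())
```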

Study design and data collection
Participants were invited as part of a University Robotic Curriculum and a surgical specialty society robotic training course (The Association of Laparoscopic Surgeons of Great Britain and Ireland), with a standardised timetable. Institutional Review Board approval was not required as the study did not involve patient care. The study followed the ethical principles of the Declaration of Helsinki, and informed written consent was collected. Using the definitions of Sridhar et al., we define novice robotic surgeons as those who have performed no robotic surgical procedures [30].
The 5-day robot-assisted surgical skills courses were held at a pre-clinical surgical research and training hub. All recorded and assessed VR tasks for this study were performed using da Vinci SimNow®. All recorded and assessed dry model tasks were performed using either da Vinci® X or Xi robotic surgical systems. VR and dry lab tasks were video recorded and uploaded by participants to a secure cloud database server.
Attempt 1, for both VR and dry tasks, occurred after orientation and training on the 1st day of the course, with Attempt 2 on the 5th day. Each was recorded and assessed using GEARS for dry lab and VR tasks, and automated VR scores. GEARS assessments were carried out by two trained, independent surgical residents. GEARS scores were obtained using a five-point Likert scale indicating competency in six domains: depth perception, bimanual dexterity, efficiency, force sensitivity, autonomy and robotic control, with a maximum score of thirty. Da Vinci SimNow® provided automated VR scores incorporating the total time to complete the task (scored out of fifty) and economy of motion (scored out of fifty), with a maximum score of one hundred. OCHRA error scores were obtained by analysing the recorded videos of each task.
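The two composite scores can be reproduced arithmetically. A minimal sketch follows; the range checks are our own addition, and SimNow's internal weighting of its sub-scores is not described in the text, so only the stated 50 + 50 structure is assumed:

```python
def gears_total(domains: dict) -> int:
    """Sum the six GEARS domains, each rated 1-5 on a Likert scale (max 30)."""
    expected = {"depth perception", "bimanual dexterity", "efficiency",
                "force sensitivity", "autonomy", "robotic control"}
    assert set(domains) == expected
    assert all(1 <= v <= 5 for v in domains.values())
    return sum(domains.values())

def automated_vr_score(time_score: float, economy_score: float) -> float:
    """Automated VR score: time (out of 50) plus economy of motion (out of 50)."""
    assert 0 <= time_score <= 50 and 0 <= economy_score <= 50
    return time_score + economy_score

# A hypothetical mid-course assessment:
gears = gears_total({"depth perception": 4, "bimanual dexterity": 3,
                     "efficiency": 3, "force sensitivity": 4,
                     "autonomy": 2, "robotic control": 3})
vr = automated_vr_score(42.5, 38.0)
```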

OCHRA specialist training
Two independent assessors underwent specialist OCHRA training. Inter-rater reliability was tested and reviewed to improve assessment before all videos were reviewed. OCHRA was applied to videos of each task by the two assessors independently. Eleven video tasks were compared using OCHRA methodology by the independent assessors, with 80.9% percentage agreement for matched error events [182/216 (86.1%) error events matched].
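Percentage agreement for matched error events can be computed as the share of distinct annotated events that both assessors identified. The sketch below matches events on a (subtask, error type) pair, which is an illustrative criterion rather than the study's exact matching rule:

```python
def matched_agreement(events_a, events_b):
    """Fraction of all distinct annotated events identified by both assessors.

    Events are (subtask, error_type) tuples; matching on this pair is an
    illustrative criterion, not the study's published matching rule.
    """
    a, b = set(events_a), set(events_b)
    matched = a & b          # events both assessors annotated
    total = a | b            # all distinct events annotated by either
    return len(matched) / len(total) if total else 1.0

# Hypothetical annotations of one video by the two assessors:
rater1 = [("needle pickup", "multiple attempts"), ("throw 1", "suture slipped")]
rater2 = [("needle pickup", "multiple attempts"), ("throw 2", "air knot")]
agreement = matched_agreement(rater1, rater2)   # 1 matched of 3 distinct events
```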

Statistical analysis
A paired t test was used to assess each participant's change in the sum of scores for each assessment tool from Attempt 1 to Attempt 2, two separate points along the participant's learning curve. The Wilcoxon matched-pairs signed-rank test, the non-parametric and thus assumption-free counterpart of the paired t test, was conducted to validate findings. Missing data were accounted for using multiple imputation (regression method) when performing the paired t test and the Wilcoxon matched-pairs signed-rank test and when generating graphs.
For each selected task, correlational analysis examined the degree to which scores from each separate assessment tool correlated. Spearman correlation coefficients were consistent with the direction and magnitude of Pearson correlations, but Spearman coefficients were reported instead of Pearson's given the sample size. Inter-rater reliability was assessed using the Spearman correlation coefficient. Two-tailed p values < 0.05 were considered significant.
All statistical analyses were carried out with SPSS Statistics 22.0 software (IBM Corporation, USA).
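Equivalent analyses can be run outside SPSS. Below is a sketch with SciPy, on synthetic data invented purely to show the calls; scipy.stats.ttest_rel, wilcoxon and spearmanr mirror the paired t test, Wilcoxon matched-pairs signed-rank test and Spearman correlation described above:

```python
from scipy import stats

# Synthetic paired data: OCHRA error counts for 20 participants on one task,
# Attempt 1 vs Attempt 2 (fewer errors at Attempt 2 = improvement).
attempt1 = [14, 12, 15, 11, 13, 16, 10, 12, 14, 15,
            11, 13, 12, 16, 14, 10, 13, 15, 12, 11]
drops    = [ 4,  3,  5,  2,  4,  6,  3,  2,  5,  4,
             3,  4,  2,  5,  4,  3,  4,  5,  3,  2]
attempt2 = [a - d for a, d in zip(attempt1, drops)]

# Paired t test on the change from Attempt 1 to Attempt 2...
t_stat, t_p = stats.ttest_rel(attempt1, attempt2)
# ...validated with its non-parametric counterpart.
w_stat, w_p = stats.wilcoxon(attempt1, attempt2)

# Concurrent validity: Spearman correlation between OCHRA error counts and a
# synthetic GEARS-like score (more errors, lower score -> inverse correlation).
gears_like = [30 - e for e in attempt2]
rho, rho_p = stats.spearmanr(attempt2, gears_like)
```

With this construction both tests reject the null hypothesis of no change, and the correlation is strongly negative, matching the direction of the associations reported in the Results.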

Results
Twenty participants, aged 25 to 53 years (mean 32), performed 320 robotic skills tasks. There was an even split of female to male participants, with 10 (50%) females. Participants were all qualified doctors and undertook their medical training in several different countries; the United Kingdom was the most common, with 10 (50%) doctors completing medical training there. Participants had varying open and laparoscopic surgical experience: the majority, 12/20 (60%), had no years of open or laparoscopic surgical experience, and only one participant had over 10 years. Importantly, no participants had performed any robotic surgeries, so all were novice robotic surgeons.
The most common non-error event for sea spikes was '(f) Continue uninterrupted' in both VR and dry lab.

Learning curve of participants
Performance improved from the start of the course (Attempt 1) to the end of the week (Attempt 2) using all assessment tools for each candidate (Fig. 2A–E). Paired samples t tests indicated that participants significantly improved (all two-tailed p values < 0.001) over the course using each assessment tool in both VR and dry lab tasks (Table 1). The mean difference ± standard deviation between the sum of Attempt 1 assessment scores and the sum of Attempt 2 assessment scores is reported for each tool (Table 2).

Concurrent validity of OCHRA
Strong inverse and significant associations were seen between dry OCHRA error count scores and dry GEARS scores for dry lab tasks (range r = − 0.86 to − 0.76, mean r = − 0.82) with a breakdown seen in Table 3 and Fig. 3A-D.
Strong inverse and significant associations were seen between VR OCHRA error count scores and VR GEARS scores for VR tasks (range r = − 0.92 to − 0.67, mean r = − 0.78) with a breakdown seen in Table 3 and Fig. 4A-D.
Strong inverse and significant associations were seen between VR OCHRA error count scores and automated VR scores for VR tasks (range r = − 0.90 to − 0.64, mean r = − 0.77) with a breakdown seen in Table 3 and Fig. 5A-D.
Strong and significant associations were seen between VR GEARS scores and automated VR scores for VR tasks (range r = 0.68 to 0.88, mean r = 0.77) with a breakdown seen in Table 3 and Fig. 6A–D.

Inter-rater reliability
There is a very strong positive correlation between reviewer 1 and reviewer 2 (Spearman's r(11) = 0.926, p < 0.001), such that both reviewers scored participants similarly, showing very strong inter-rater reliability (Table 4).

Discussion
This feasibility study is the first to investigate the application of OCHRA as an assessment tool in robotic surgery. Our bespoke robotic OCHRA methodology has shown concurrent validity with GEARS and automated VR scores, plus excellent inter-rater reliability. Furthermore, it has shown the ability to demonstrate participants' proficiency learning curve. This OCHRA tool has the potential to be applied within the rapidly expanding domain of AI as a valuable educational adjunct to inform deep learning and robotic surgery accreditation programmes.
Using a validated, objective and standardised assessment tool to assess progression and competency is essential for robotic surgical training programmes [7]. Global rating scales lack the in-depth granular analysis of technical skill and the objectivity that error-based analysis provides [31]. Despite the plethora of publications on assessment tools in robotic surgery [18, 25, 32–39], there is a lack of error-based analysis methods in robotic surgery for education or accreditation. For error assessment methods to be utilised widely, a clear definition of what is considered an error is required. In addition, an error tool needs to be generic and adaptable to multiple procedures whilst retaining its objectivity.
Current technical error-based assessment tools for minimally invasive surgery include the Generic Error Rating Tool and Fundamentals of Laparoscopic Surgery. They provide basic error analysis by identifying that an observed error event has occurred and in what step of the procedure [31, 40, 41]. OCHRA methodology analyses events further: the consequences of the error, the severity of the error, the mechanism by which the operator produced the error, the instrument which caused the error and the corrective events following the error. Assessing each error in such detail aids more meaningful feedback, which is fundamental to learning and for future application of AI methods. Our analyses showed executional errors were far more common than procedural errors within external error modes. Differentiating between procedural and executional errors is valuable for identifying areas to improve and reduce errors [26]. Executional errors suggest a lack of technical skill or inadequate equipment, whereas procedural errors reflect gaps in knowledge of the correct procedural steps. Due to this granularity, OCHRA is likely to complement the assessment tool most commonly used in robotic surgery, GEARS. It has been postulated that the ability to identify errors is linked to surgical aptitude [42, 43], with recent research using OCHRA showing a direct link between intra-operative performance and errors and patient outcomes, demonstrating predictive validity [12, 17, 44]. Incorporating error-based assessments such as OCHRA into robotic surgical training may accelerate the proficiency learning curves of trainees, as in-depth analysis of individual mistakes will enable surgeons to learn from and prevent them. Consequently, this can underpin future research that aims to improve patient safety in the operating room and within robotic surgical training [45].
A barrier to wider uptake of OCHRA within surgical training programmes is the substantial time required to analyse surgical videos. Additionally, OCHRA requires prior training on the methodology and its manual application [46, 47]. The next stage of evaluation of OCHRA in robotic surgery is to demonstrate its construct validity with clinically relevant metrics. The application of OCHRA to inform AI models is an exciting prospect. OCHRA can provide granularly annotated surgical videos, particularly in this context, to recognise a near miss or a consequential error. The long-term aim is to have an early warning system with real-time operative feedback. This is a long way off; however, OCHRA can potentially provide an opportunity to train deep learning methods in the recognition of errors and near misses. Further large studies and fully validated data sets of objectively analysed operative videos are required to allow this progression to fully automated, objective, meaningful performance assessments.
This study had certain limitations. Despite the large number of videos analysed (over 27 h) using OCHRA methodology, only 20 participants performed the selected tasks in this proof-of-concept study. Research with larger sample sizes and different levels of experience is required before application in clinical practice and training programmes for robotic surgery.
Although this study confirmed the feasibility of recording the tasks and collating and analysing the data, 48 of 320 videos (15%) of the selected tasks were not able to undergo OCHRA analysis due to missing data or incomplete videos. This was mainly because of the use of external recording; future studies will use direct video capture.
This study may also lack external validity, as it used novice robotic surgical trainees as participants. In addition, we have not established construct or predictive validity for OCHRA in robotic surgical skill analysis, although it presumably has a high likelihood of achieving both, as it did in laparoscopic surgery [11–13].
The selected tasks required a suitable range of general and basic robotic surgical skills. However, more advanced skills such as diathermy and dissection were not assessed in any of the selected tasks. Further work is needed to investigate whether OCHRA methodology is applicable to more complex tasks that require more advanced robotic skills. This was a feasibility study intended to test the application of the OCHRA methodology outside clinical practice in the first instance. Having demonstrated the feasibility, reliability and validity of the OCHRA methodology in the lab setting, we are now proceeding with its application in a multi-centre clinical study, which is currently recruiting (Video Analysis in Minimally Invasive Surgery; IRAS number 309024; ClinicalTrials.gov Identifier: NCT05279287).
Whilst we have demonstrated excellent inter-rater reliability for evaluating error frequencies, there is potential for variability if other observers use a different definition of what constitutes an error. In our task analyses, we clearly describe the correct action to complete each subtask to minimise inter-rater variability when using the OCHRA methodology. When using OCHRA methodology for other robotic surgical tasks, it is essential that these descriptions are clear and well defined within the task analysis subtasks.

Conclusion
This study shows the feasibility, concurrent validity and excellent reliability of OCHRA methodology for assessing the technical performance of basic robotic surgical skills. Further application of OCHRA to more advanced robotic surgical procedures is required to validate its future use in the operating room and to evaluate the application of AI for automated error recognition.

Fig. 1 A, D are photographs of dry sea spikes and knot tying tasks being performed, respectively. B, C are photographs of VR ring rollercoaster tasks and big dipper needle driving tasks being performed, respectively

Fig. 2 Performance change from start of the course (Attempt 1) to end of course (Attempt 2) using the different assessment tools for each participant. A–C show performance change for VR tasks using each assessment tool; sum of automated VR scores, sum of GEARS

Fig. 3 OCHRA error count scores compared to dry GEARS scores for sea spikes dry tasks (A), ring rollercoaster dry tasks (B), big dipper needle driving dry tasks (C) and knot tying dry tasks (D). Each

Fig. 4 OCHRA error count scores compared to VR GEARS scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each

Fig. 5 OCHRA error count scores compared to automated VR scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each

Fig. 6 Automated VR scores compared to VR GEARS scores for sea spikes VR tasks (A), ring rollercoaster VR tasks (B), big dipper needle driving VR tasks (C) and knot tying VR tasks (D). Each point

Table 2 Related-samples Wilcoxon signed-rank test summary comparing the sum of assessment scores for Attempt 1 and Attempt 2 for each assessment tool in VR and dry lab. The non-parametric test is used to validate the paired t test findings. The null hypothesis (that the median of the differences between the sum of Attempt 1 scores and the sum of Attempt 2 scores equals 0) can be rejected with significance < 0.001 for each assessment tool in both VR and dry lab

Table 3 Correlation analysis using Spearman's rho correlation coefficient to examine concurrent validity in scores between assessment tools for the different tasks in both VR and dry lab

The most common external error mode for knot tying was '(15) Delay in progress of procedure' in both VR and dry lab [VR: 329/647 error events (50.9%)] [dry lab: 187/324 error events (57.7%)]

Table 4 Correlation analysis using Spearman's rho correlation coefficient to examine the inter-rater reliability between two independent assessors for OCHRA error scores. At least one of each selected task, in VR and dry lab equivalent, was analysed by both reviewers independently