
1 Introduction

Surgical education is a lengthy process that takes medical students from textbooks in the classroom to performing surgeries on live patients. There are currently numerous opportunities for surgical residents, i.e. doctors in training, to acquire nominal knowledge of human anatomy and physiology by means of both analog and digital tools. Conversely, hands-on practice of operations tends to be limited before the learner steps into a real operating room (OR). In the traditional educational path, gaining the know-how about a specific operation happens directly in front of a live patient: residents start by observing a senior surgeon performing the operation, then they operate under the supervision of an expert until they gain the independence to operate on their own. Before entering the OR, opportunities for honing manual dexterity and hand-eye coordination are rare and expensive. Residents might gain access to animal and/or human cadavers on a handful of occasions and with limited possibilities for repeated training sessions. In this context, what is lacking is the possibility for residents to quickly and independently assess their own performance after a surgical simulation, as well as compare it with the performance of experts or with their own past performance, with the aim of analyzing their learning curve over time. In recent years, researchers in medicine and engineering have come together to address this lack of resources for surgical practice by proposing numerous novel tools and environments targeting residents as individual learners who wish to hone their skills in a risk-free, controlled and accessible environment.

Table 1. Systematic review of metrics used in XR cranial neurosurgical training classified by educational setting. Based on [11].

Anatomically realistic phantom models and XR technologies, such as augmented and virtual reality, are two of the most prolific avenues of research from this perspective [9]. Combining virtual with real imagery, and integrating both with accurate replicas of human anatomy, enables numerous interactive, realistic and adaptable simulation scenarios. One of the greatest benefits of employing these accessible and relatively easy-to-develop emerging technologies is arguably the versatility of the resulting learning environments, which can fit different types of surgical practice. Furthermore, sensors and automatic data collection in modern applications greatly facilitate the assessment of simulation performance improvements over time. A consequence of this surge of research is the need for common criteria for measuring learning. To fully explore the versatility, scalability and accessibility of newly developed XR tools for surgical education, there needs to be a common set of metrics that quantify simulation performance as a measure of procedural knowledge acquisition and transfer. With widespread adoption of standard performance metrics by the research community, it may become possible to compare learning achievements both across different technologies and between different actors, potentially distantly located and with varying degrees of expertise. In the present paper, we propose a set of metrics for the assessment of outcomes in XR simulations of surgical operations, specifically in the field of cranial neurosurgery. This narrower field of research comes with a partially different set of challenges compared to, for instance, spinal neurosurgery, and, as we have previously shown, has so far been relatively uncharted territory for educational XR technologies [11]. Nevertheless, our proposed set of metrics can be applied to other types of surgery where procedural knowledge acquisition and transfer involves precise and efficient hand-eye coordination and manual dexterity.

2 Survey of the Domain of Practice

In a recent systematic review, we surveyed the adoption of XR technologies, with varying degrees of augmentation, in cranial neurosurgical education [11]. There, we defined education as a combination of learning, practicing and skill assessment, with the goal of acquiring the necessary knowledge to successfully perform a surgical procedure. Table 1 highlights the variability in the metrics considered among the 26 studies that measured user performance. Studies are grouped by education type according to the definition above, while performance metrics are arbitrarily categorized based on complexity, degree of aggregation and collection method (automatic vs. non-automatic) into the following:

  • Space-time metrics include time, position and orientation measures of the surgical performance, i.e. kinematic measures that can relate either to the surgical simulation as a whole or to only part of it. Examples are time to completion, instrument position and entry point location.

  • Force metrics include measures of forces applied by test subjects onto the instrument or onto the apparatus being used as a proxy for the patient (e.g. a phantom). Examples are bandwidth, ratio and sum of the forces applied.

  • Outcome metrics include frequencies and patterns of accuracy, errors, precision and consistency in the simulation. Metrics derived from comparisons with the intended outcome or an existing benchmark are also included. Examples are number of attempts, frequency of complications and success rate.

  • Qualitative metrics include non-automatic evaluations of surgical simulation performance made by experimenters or experts, according to either standard or arbitrary criteria. Examples are grades based on scales, answers to questions and scores given to operative dictation.

As shown in Table 1, despite the broad, arbitrary categorization of the performance metrics considered in the studies, there is no clear consensus on a common type of metric. There is notable variance in the frequency distribution of the individual metric categories considered here. The outcome category, while being the most frequent (65%), still falls short of being labeled as “widely adopted”. This is surprising, considering that simple measures of success, accuracy and precision fall into this category. Possible reasons may be the nature and scale of the research questions investigated, as well as the alternative adoption of more qualitative methods for assessing performance, i.e. the fourth category of metrics. In other words, not all studies present data related to, for instance, time to completion or distance to the target, because an approximate estimate of the outcome was made by experts. This qualitative category, on the other hand, is the least represented among the 26 studies (31%). It involves non-automatic assessment by senior surgeons grading through validated forms, which ensures systematic scoring. Finally, the quantity and variability of metrics also vary noticeably between the different types of education under scrutiny. In particular, while only a handful of the “learning” studies employ performance metrics at all (n=7), the “skill assessment” studies present considerably richer data (n=12). In the latter case, more than one category of metric is often considered simultaneously.

3 PARENT Metrics for Objective Assessment

As previously discussed, one of the many benefits of adopting XR technologies in surgical education is their versatility and scalability across different settings. By enabling asynchronous, distributed and independent procedural knowledge acquisition and transfer, these technologies increase the learning opportunities of a resident surgeon compared to traditional education. Given the traditionally co-located, synchronous learning in this field, portable and relatively inexpensive XR technologies can thus complement training through automatically assessed performance metrics. Such metrics need not be limited to low-dimensional measures of, e.g., kinematics and forces; “advanced” computed metrics that aggregate multiple “simple” ones into meaningful indexes can also be considered. Furthermore, the absence of a teacher during this mediated learning experience means that subjective evaluations of surgical performance do not scale in space and time: requiring an expert surgeon to assess and grade every simulation performance would undermine the residents’ ability to practice “anytime, anywhere”.

In order to propose a set of metrics that can reach consensus across multiple domains of expertise, their usefulness should be balanced with their scalability. While metrics tailored to a specific surgical operation are effective in providing the data needed for reliable evaluations, this approach is very sensitive to small variations in the learning scenario: for instance, the total volume of tumor removed may be relevant in tumor resection tasks, but not applicable at all in ventriculostomy tasks. On the other hand, metrics that are too abstract for the scenario may fall short of being informative enough for a learner aiming to assess their own performance by comparing it to the intended one (e.g. as performed by experts) or to their past performances. A simple grade on an arbitrary and opaque A–F scale is an example: the grade does not explain why the performance was graded as it was, or what the student did correctly or incorrectly. The following proposed metrics should therefore act as a concrete starting point for the automatic collection and aggregation of relevant performance indicators; a minimal computational sketch follows the list. If kept agnostic to the specific surgical procedure, they can be robust enough to enable cross-domain comparisons, and sufficient for a preliminary real-time assessment.

  • Precision of distances and angles: how close the measured values are to each other, i.e. their variability across multiple simulation trials. This can be inferred by calculating Euclidean and angular distances between the surgical instrument's positions and orientations at equivalent time frames in two or more trials.

  • Accuracy of distances and angles: how close the measured values are to the intended (target) value, i.e. their correctness for each simulation trial, inferred by comparing against benchmark baselines.

  • Rate of success: ratio between the number of successful simulation trials and total number of trials. It is complementary to the rate of error, the ratio between the number of unsuccessful trials and total number of trials. A clear definition of a threshold between success and error is warranted here.

  • Errors of measurement: robustness of the hardware in measuring performance indicators, expressed as the minimal detectable difference between two distinct observations over the range of values across all observations.

  • Number of attempts: count of simulation trials. This metric needs a clear definition distinguishing a re-start from a continuation of a previous attempt.

  • Time to completion: total time elapsed between the start and the end of a single simulation trial. A clear definition for procedure start and end, either as a location in space and/or a moment in time, is warranted here.
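As an illustration only, the short Python sketch below shows one possible way to derive several of these metrics from logged trial data. The Trial record, its field names, units and the success threshold are assumptions made for the example; they do not correspond to any specific XR platform or data format.

```python
# A minimal sketch of how the proposed metrics could be computed from logged trials.
# All names, fields, units and thresholds here are illustrative assumptions.
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass
class Trial:
    start_time: float     # s, simulation clock at procedure start
    end_time: float       # s, simulation clock at procedure end
    tip_position: tuple   # (x, y, z) in mm, instrument tip at the end of the trial
    tip_angle_deg: float  # deg, instrument insertion angle at the end of the trial


def time_to_completion(trial: Trial) -> float:
    """Time to completion: elapsed time of a single simulation trial."""
    return trial.end_time - trial.start_time


def accuracy_mm(trial: Trial, target_mm: tuple) -> float:
    """Accuracy of distances: Euclidean distance to the intended (target) point."""
    return math.dist(trial.tip_position, target_mm)


def accuracy_deg(trial: Trial, target_deg: float) -> float:
    """Accuracy of angles: absolute deviation from the intended insertion angle."""
    return abs(trial.tip_angle_deg - target_deg)


def precision_mm(trials: list) -> float:
    """Precision of distances: mean pairwise Euclidean distance between the tip
    positions reached in repeated trials (lower = more consistent).
    Assumes at least two trials."""
    pairs = list(combinations(trials, 2))
    return sum(math.dist(a.tip_position, b.tip_position) for a, b in pairs) / len(pairs)


def success_rate(trials: list, target_mm: tuple, threshold_mm: float) -> float:
    """Rate of success: successful trials over total trials, with 'success'
    defined here by an explicit, pre-agreed accuracy threshold."""
    successes = sum(accuracy_mm(t, target_mm) <= threshold_mm for t in trials)
    return successes / len(trials)


def number_of_attempts(trials: list) -> int:
    """Number of attempts: count of trials logged as distinct (re-)starts."""
    return len(trials)

# Errors of measurement are a property of the tracking hardware rather than of a
# single trial: the minimal detectable difference of the tracker bounds how finely
# any of the values above can meaningfully be compared.
```

The angular counterpart of the precision computation, or a per-time-frame comparison rather than one taken only at the end point, would follow the same pattern; the essential point is that each metric is computed automatically from the same logged data and is therefore comparable across trials, learners and tools.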

4 Conclusions and Future Work

Growing research on XR technologies in neurosurgical training calls for consensus on learning metrics. By observing current trends in the field and carefully balancing scalability and efficacy, we have proposed six concrete metrics for the objective quantification of procedural knowledge acquisition and transfer. Consensus over them and, ultimately, their adoption throughout the broader field of surgical education will potentially enable more impactful research results that are comparable across different application domains. In future research, these metrics may afford rigorous and quantitative comparisons between participant populations, simulated procedures, and XR tools. For validation, we plan to disseminate them in future workshops and to survey the research community.