Wearable technology-based metrics for predicting operator performance during cardiac catheterisation

Introduction Unobtrusive metrics that can auto-assess performance during clinical procedures are of value. Three approaches to deriving wearable technology-based metrics are explored: (1) eye tracking, (2) psychophysiological measurements [e.g. electrodermal activity (EDA)] and (3) arm and hand movement via accelerometry. We also measure attentional capacity by giving the operator an additional task: tracking an unrelated object during the procedure.
Methods Two aspects of performance are measured: (1) eye gaze and psychophysiology metrics and (2) attentional capacity, via an additional unrelated task (monitoring a visual stimulus of playing cards). The aim was to identify metrics that can be used to automatically discriminate between levels of performance, or at least between novices and experts. The study was conducted using two groups: (1) novice operators and (2) expert operators. Both groups made two attempts at a coronary angiography procedure using a full-physics virtual reality simulator. Participants wore eye tracking glasses and an E4 wearable wristband. Areas of interest were defined to track visual attention on display screens, including: (1) X-ray, (2) vital signs, (3) instruments and (4) the stimulus screen (for measuring attentional capacity).
Results Experts showed greater dwell time (63% vs 42%, p = 0.03) and fixations (50% vs 34%, p = 0.04) on display screens. They also showed greater dwell time (11% vs 5%, p = 0.006) and fixations (9% vs 4%, p = 0.007) when selecting instruments. The experts’ performance in tracking the unrelated object during the visual stimulus task negatively correlated with total errors (r = −0.95, p = 0.0009). Experts also had a higher standard deviation of EDA (2.52 µS vs 0.89 µS, p = 0.04).
Conclusions Eye tracking metrics may help discriminate between a novice and expert operator, by showing that experts maintain greater visual attention on the display screens. In addition, the visual stimulus study shows that an unrelated task can measure attentional capacity.
Trial registration This work is registered through clinicaltrials.gov, a service of the U.S. National Institutes of Health, and is identified by the trial reference: NCT02928796.


Introduction
Patient safety and the mitigation of medical errors are of growing importance [1]. Poor decision-making and lack of skill in clinical procedures can be significant factors in many of the errors that are reported [2]. Research into clinical skills would suggest a critical role for 'continual practice' and maximising training time to reach an 'appropriate' level of performance [3]. Simulation-based training has demonstrated that skills can be acquired as well as measured without the need to 'learn on real patients' [4,5]. Many healthcare tasks and procedures can be simulated using computer technology for training purposes and provide novices with a way to improve or maintain their skills [6][7][8]. In addition to technical skill acquisition, we know that the errors made in the clinical environment are also related to non-technical skills [9] and hence there is a need to understand the relationship between skill and cognitive load during procedures. For example, a high cognitive load may affect the non-technical leadership skills of the operator in the clinical environment.

Eye tracking in medical research
One interest in measuring performance is investigating the link between visual attention (eye gaze) and clinical performance. This domain investigates whether an operator's eye gaze behaviour is correlated with their level of competence during a clinical procedure [10][11][12][13][14]. The 'mind-eye hypothesis' [15] states that visual attention can indicate cognitive activity [16][17][18]. Put differently, where someone looks can be indicative of their cognitive experience and thus their level of expertise, situational awareness, uncertainty and perhaps the likelihood that their future actions could cause harm to a patient. A recent study of surgical tasks [11] showed that eye tracking metrics can discriminate between novices and experts.

Attentional capacity
Clinical decision-making comprises many steps including perception, attention, information processing, information storage (including organisation) and then knowledge retrieval from long-term memory at the appropriate time [19]. One aspect of cognition that has received no consideration in related literature is 'attention', yet this is of paramount importance to the interventional cardiologist who is learning a new set of skills. Attention refers to the ability to cognitively focus on an object or activity. It is well known that humans have a limited attentional capacity [20]. The human mind can only attend to a finite amount of information at any given time. When a novice clinical operator is acquiring new skills, they use almost all of their attentional resources to monitor what their hands are doing in addition to the spatial judgments and clinical decision-making. This leaves the novice with limited 'additional' attentional capacity [21], which is why this study includes a visual stimulus task.
This study aims to (1) use wearable technology to determine metrics that could be used to auto-assess operator and procedural performance and (2) determine whether a visual stimulus task can be used to measure attentional capacity and whether performance of this task is associated with operator errors. Both objectives were carried out using a state-of-the-art, high-fidelity, full-physics VR simulator which provided the means for recording the procedural performance of interventional cardiologists. This work could lead to 'smart operating rooms' that can provide live metrics on individual and team performances, providing critical automated analytical feedback.
Ethical approval for this study was granted across the island of Ireland: (1)

Methods
This study involved investigating the use of (1) eye tracking, (2) psychophysiological monitoring and (3) attentional capacity in surgical simulation-based assessment (specifically in coronary angiography). We recorded data from two different groups of interventional cardiologists to test the significance of metrics in discriminating between novices and experts. Data collection took place in the ASSERT Centre, University College Cork.

Study components
The study comprised a surgical simulator with simulated patient cases, eye tracking glasses and an E4 wristband for monitoring the operator's psychophysiology. For the visual stimulus task, an additional LCD display screen was used to display the playing cards.

Simulated coronary angiography
A Mentice VIST-Lab simulator and VIST G5 software (developed by Mentice, Sweden) provided the simulated coronary angiography cases (model details: VIST G5 + VIST-C LD, Coronary PRO v2.3.3, Coronary Angiography v1.3.3 and Coronary Educator v1.1.2). Two cases were assessed by a teaching- and consultant-level interventional cardiologist. One case allowed the participant to practise with the system, and the second case was the primary data collection session. Each participant was allocated 'up to 30 min' to practise using the first case, allowing the participant to gain a level of familiarity with the simulator. The investigator provided a demonstration on how to use the simulator. Participants were tasked with taking nine views controlling the C-arm:

Wearable technologies
SMI eye tracking glasses were used to measure visual attention during procedures. The glasses allow the participant to move freely while performing the procedure, while capturing temporal and spatial metrics. Empatica's E4 wristband provided real-time measurements of the participant's heart rate, inter-beat intervals (or heart rate variability), EDA (4 Hz), skin temperature (4 Hz) and accelerometry (32 Hz).

Visual stimulus card task to measure attentional capacity
To measure attentional capacity by proxy, each participant was given an additional visual stimulus to monitor (playing cards) and tasked to respond verbally with the word 'heart' when a given playing card (queen of hearts) appeared on the LCD screen. It was made clear that the priority should be performing the procedure, but that they should undertake this additional task if they could. Two variations of the stimulus were provided, one for each of the two attempts. The first acted as a baseline measurement with less additional attention required, and the second demanded greater attentional capacity. We increased the number of cards the participant could examine per minute between the first and second performance. This aspect of the study is based on the work of Weaver [22] and Smith [23]. In Smith's experiment, a playing card provides 5.7 bits/item of information. Using this measurement, the difference in information output between the stimulus tasks presented during the first and second procedures can be quantified. However, the exposure duration of the playing card is also important, and a 2 s exposure duration was determined to be appropriate for this study.
Participants were asked to examine the cards and to verbally acknowledge a specific card. Both variations (see 'Appendix A' for further detail) have the same design: continual blocks of 20 s, each containing one card to be acknowledged. Within these 20 s blocks, ten different cards would appear for 2 s each. Using a random number generator, the position of the specific card within each 20 s block was continually changed according to an integer from 1 to 10 (referencing its position in the block). This approach semi-randomised the appearance of the playing card while guaranteeing that the participant would have three cards to acknowledge every 60 s. The first performance attempt only provided three playing cards (5.7 bits/item) exposed for 2 s each and therefore an information output of 17.1 bits per 60 s. In contrast, the stimulus in the second performance involved a continuous sequence of cards and had an information output of 171 bits every 60 s.
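The block design and the two information rates quoted above can be illustrated with a short sketch. It is not the software used in the study: the function names and the use of Python's `random` module are hypothetical, and reading the 5.7 bits/item figure as log2(52) for a uniform 52-card deck is an assumption made here for illustration.

```python
import math
import random

BLOCK_SECONDS = 20   # each block lasts 20 s
CARD_SECONDS = 2     # each card is exposed for 2 s
SLOTS_PER_BLOCK = BLOCK_SECONDS // CARD_SECONDS  # ten card positions per block


def bits_per_card(deck_size=52):
    """Information in one card drawn uniformly from a deck (52 cards assumed)."""
    return math.log2(deck_size)


def target_positions(n_blocks, rng):
    """Draw a random slot (1-10) for the target card in every 20 s block."""
    return [rng.randint(1, SLOTS_PER_BLOCK) for _ in range(n_blocks)]


def target_onsets(positions):
    """Convert per-block slot numbers into onset times (seconds from task start)."""
    return [
        block * BLOCK_SECONDS + (slot - 1) * CARD_SECONDS
        for block, slot in enumerate(positions)
    ]


def bits_per_minute(cards_per_minute, bits=5.7):
    """Information output per 60 s using the rounded 5.7 bits/item figure."""
    return cards_per_minute * bits


rng = random.Random(0)                 # fixed seed only so the sketch is repeatable
positions = target_positions(3, rng)   # three blocks cover 60 s
onsets = target_onsets(positions)      # one target onset inside each block

print(round(bits_per_card(), 1))       # 5.7
print(round(bits_per_minute(3), 1))    # 17.1  (three cards per 60 s)
print(round(bits_per_minute(30), 1))   # 171.0 (thirty cards per 60 s)
```

Thirty cards per 60 s follows from ten 2 s cards in each of three 20 s blocks, which is consistent with the tenfold difference between 17.1 and 171 bits.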

Protocol
The protocol comprised four stages: (1) demonstration of the VIST-Lab simulator, (2) setting up the wearable technology, (3) participant attempts the first task and (4) participant attempts the second task. Details are as follows.

Explanation and demonstration of the VIST-Lab simulator
• Participants were informed that a 0.035 guide wire and 5F catheter with a contrast syringe were already connected for use.
• C-Arm controls to facilitate different views were demonstrated. They were asked to record nine views.
• They were shown how to start the case and select instruments.
• They were provided with a practice case and given up to 30 min, allowing for familiarity with the simulator.

Assistance with wearable technology
• Before the main procedure, it was necessary to calibrate the eye tracking glasses and begin recording data for both wearable devices.
• Wristband
  - Once comfortably fitted, the wristband was switched on and, using an iOS application, the recording session was initialised via Bluetooth.
• Eye tracking glasses
  - Once comfortable, the glasses were connected via USB to the portable recording device.
  - Three-point calibration was completed.

Procedural performance
The following performance metrics were exported from the VIST simulator after each session:
• Performance duration (minutes)
• Total errors
  - Type 1: vessel wall scraping
  - Type 2: moving without wire
  - Type 3: too deep in ostium
  - Type 4: wire in small branch
• Wire and catheter use (including counts for each time selected and subsequently detected entering the simulator)

Stimulus card task
Using laboratory cameras and eye tracking footage, the cards that were correctly acknowledged by each participant in each performance were counted against all stimulus cards that appeared. The percentage of correctly acknowledged cards was used as an assessment metric.

Eye tracking metrics
Four AOIs were defined: the instruments selection screen, the stimulus screen displaying the cards, the X-ray and the vital signs (see Fig. 1). Eye gaze metrics are derived from fixations and saccades. A fixation occurs when the participant fixates on a single location using foveal vision, and a saccade is the rapid movement (or the vector) between two fixations [24]. The following eye tracking metrics, which have been used in similar studies [25][26][27][28], were calculated:
• AOI specific metrics: dwell %, fixation %, first fixation duration (ms).
Fixation transitions count the direct switching of fixations from one AOI to another. Additionally, the counts for transitions between AOIs were totalled into a new metric called total transitions. Another metric was developed using total transitions against procedure duration, i.e. fixation transition frequency (transitions between AOIs per second).
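The transition metrics described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names, the AOI labels and the example sequence are hypothetical, and the sketch assumes the input already contains only fixations that landed on one of the four AOIs, in temporal order.

```python
from collections import Counter


def count_transitions(aoi_sequence):
    """Count direct switches between consecutive fixations on different AOIs.

    Returns a Counter keyed by (from_aoi, to_aoi); repeated fixations on the
    same AOI are not counted as transitions.
    """
    pairs = zip(aoi_sequence, aoi_sequence[1:])
    return Counter((a, b) for a, b in pairs if a != b)


def transition_frequency(transitions, duration_seconds):
    """Total transitions divided by procedure duration (transitions per second)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return sum(transitions.values()) / duration_seconds


# Hypothetical sequence of fixated AOIs (not study data).
sequence = ["xray", "xray", "instruments", "xray",
            "stimulus", "stimulus", "vitals", "xray"]

transitions = count_transitions(sequence)
total_transitions = sum(transitions.values())                        # 5
frequency = transition_frequency(transitions, duration_seconds=20.0)  # 0.25

print(total_transitions, frequency)
```

Keeping the per-pair counts in a `Counter` allows both the pairwise transition table and the totalled metric to be derived from the same pass over the data.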

Wristband measurements
Measurements recorded from the E4 wristband include heart rate (bpm), inter-beat interval (SD of inter-beat intervals taken as heart rate variability), EDA (micro-Siemens or µS), skin temperature (°C) and triaxial accelerometry (X-, Y-, Z-axis values at 32 Hz). From the latter, we computed the acceleration magnitude (ACC) as the Euclidean norm of the three axis values.
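As a hedged illustration of the two derived wristband measures, the sketch below computes the acceleration magnitude per sample and the SD of inter-beat intervals. The function names and sample values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the source does not state which SD estimator was applied.

```python
import math
import statistics


def acc_magnitude(x, y, z):
    """Euclidean norm of one triaxial accelerometer sample."""
    return math.sqrt(x * x + y * y + z * z)


def sd_of_ibi(ibi_seconds):
    """Sample SD of inter-beat intervals, taken here as heart rate variability."""
    return statistics.stdev(ibi_seconds)


# Hypothetical values for illustration only (not study data).
samples = [(3, 4, 0), (1, 2, 2), (0, 0, 64)]
magnitudes = [acc_magnitude(*s) for s in samples]   # [5.0, 3.0, 64.0]

ibis = [0.80, 0.82, 0.78, 0.80]                     # seconds between beats
hrv = sd_of_ibi(ibis)

print(magnitudes)
print(round(hrv, 4))
```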

Statistical methods
The R programming language was used for the data analysis. Summary statistics for groups are presented as mean and standard deviation (mean ± SD). Different significance tests were chosen to perform depending on (1)  either Mann-Whitney U or Welch tests as no equal variances were found. Either the Pearson product moment coefficient (r) or the Spearman rank-order correlation coefficient (ρ) was used for correlation analysis, depending on the normality of the variables. The Shapiro-Wilk test was used for normality testing in this instance (the null hypothesis is that the data are normally distributed). Also, Bonferroni-corrected alpha values are presented for transparency. Table 1 describes the demographics of the novices and experts in this study.
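The choice between the two correlation coefficients can be sketched as below. The study itself used R; this standard-library Python version is only an illustration. The function names and example data are hypothetical, and the normality decision is passed in as a flag rather than computed, because the Shapiro-Wilk test is not reimplemented here.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank-order coefficient: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation(x, y, both_normal):
    """Use Pearson r when both variables look normal, otherwise Spearman rho."""
    return pearson_r(x, y) if both_normal else spearman_rho(x, y)


# Hypothetical values for illustration only (not study data).
card_ack_percent = [90, 80, 70, 60, 50]
total_errors = [1, 2, 4, 3, 9]

rho = correlation(card_ack_percent, total_errors, both_normal=False)
print(round(rho, 2))
```

Computing Spearman's coefficient as Pearson's r on average ranks handles tied values, which a rank-difference shortcut would not.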

Results
Novices had a mean experience in years of 2.8 ± 1.8 versus 19.9 ± 5.9 for experts (p < 0.001). Novices had participated in a mean 113 ± 91 coronary angiogram cases in the past 12 months versus 464 ± 225 for experts (p < 0.01). Experts had more experience in simulation-based training (86% vs 43%). Almost all participants had never used the VIST-Lab simulator (0% vs 14%, 7% in total). Participants were also asked to declare whether they were left- or right-handed (1L and 6R vs 2L and 5R). The only females in the study (n = 3) were novices. Experts were more likely to signal 'early' (before 30 min was complete) that they were ready to begin the next case. Experts had a mean practice time of 19 min compared with 28 min for novices (p = 0.04).
Table 2 presents the key metrics for procedural performance for both attempts. Table 3 presents changes in errors and the stimulus task card acknowledgement %, either improvement or deterioration, between the first and final attempt. It is notable that experts increased their total errors compared with novices, along with a poorer card acknowledgement rate. Figure 2 shows the correlation between the less demanding stimulus card task (first procedure attempt) and total errors. There is a moderate but statistically non-significant positive correlation between card acknowledgement rate and total errors (ρ = 0.42, p = 0.13). Similar correlation values exist for novices and experts (novices: r = 0.38 [p = 0.39] vs experts: ρ = 0.38 [p = 0.40]). Figure 3 shows the correlation coefficients between card acknowledgement rates and total errors for the final procedure attempt (involving the more demanding card stimulus task). When including
Table 4 presents the group comparison of AOI specific eye tracking metrics: instruments, vital signs, X-ray and stimulus. Experts had a significantly larger dwell % (11.1 ± 4.3 vs 4.7 ± 1.6, p = 0.006) and fixation % (8.5 ± 3.5 vs 3.5 ± 1.4, p = 0.007) on the instruments screen. In addition, experts had a significantly higher totalled dwell % (63 ± 10% vs 42 ± 20%, p = 0.03) and fixation % (50.2 ± 9.6 vs 33.5 ± 17, p = 0.04). Table 5 presents the general eye gaze metrics, none of which differed significantly between the groups. Table 6 shows the fixation transitions between all AOIs. None of the transition counts differed significantly between groups. Figure 4 shows the group difference for transition frequency: transitions made between any of the AOIs per second.
Table 7 presents the statistical analysis of the signals EDA, HRV (i.e. inter-beat intervals), skin temperature and accelerometry (ACC) that were recorded from the wearable E4 wristband. The table provides summary statistics (i.e. mean, min, max and SD) for each signal. No strong statistical correlations were found between the E4 wristband signals and the groups. As shown in Fig. 5, the only significant difference of note is that experts had a larger SD of EDA (2.52 ± 2.38 vs 0.89 ± 0.74 µS, p = 0.04). However, if Bonferroni-corrected alpha values are applied, then these are not significant findings (Bonferroni-corrected alpha values for 16 tests = 0.003 and for four tests = 0.013).
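The Bonferroni-corrected alpha values quoted above can be reproduced with a one-line calculation. The sketch below assumes a family-wise alpha of 0.05, which the text does not state explicitly but which is consistent with the reported 0.003 and 0.013; the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


FAMILY_ALPHA = 0.05  # assumed family-wise alpha, not stated in the text

alpha_16 = bonferroni_alpha(FAMILY_ALPHA, 16)  # 0.003125, reported as 0.003
alpha_4 = bonferroni_alpha(FAMILY_ALPHA, 4)    # 0.0125, reported as 0.013

print(alpha_16, alpha_4)
```

A p value is then declared significant only if it falls below the corrected threshold for the family of tests it belongs to.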

Discussion
This is the first study to use eye tracking and psychophysiological monitoring in this setting. This is also the first study to use a visual stimulus task as a proxy to measure attentional capacity during surgical procedures. This study resulted in several metrics that could be used in a model to automatically discriminate between novices and experts, perhaps leading to the assessment of proficiency in the real setting. Experts had greater dwell time on the X-ray which perhaps indicates their superiority in spatial awareness and coordination; however, this was not statistically significant (p = 0.13). Experts also had greater transitions between AOIs which could indicate their intention for more frequent cross-referencing (although this did not quite reach statistical significance, p = 0.06). The wristband produced only a small number of metrics that are of interest. Regarding accelerometer-recorded movement, the hands/fingers would be of higher value in future analysis and would therefore necessitate a different type of wearable monitoring tool. Most interestingly, we discovered that card acknowledgement rate during the stimulus task is predictive of the number of handling errors in a procedure (for experts only). It is also interesting to observe the lack of visual attention dedicated to the patient vital signs from both the novice and expert groups (1.6% and 1.7%, respectively). There is potentially significant value in quantified behaviour during high stakes operations within various environments, from the operating room to the cockpit of a commercial aircraft. Despite the difficult and time-consuming methods required to capture these data, their value when used with machine learning techniques could result in smarter, more responsive environments with intelligent feedback provided to the operators.
Experts completed their first attempt faster than novices; however, in the final attempt, there was little difference. This could be indicative of the confounding effect that the added stimulus task had on procedural performance: whatever effect it had on the novices, it could be much more pronounced with experts. Experts had fewer total errors in their first attempt in comparison with novices, and in the second performance this was reversed, with experts committing more errors than novices. This is a surprising result; however, it is not statistically significant (p = 0.20). One interesting difference is that in the first performance, the experts had 0 ± 0 scraping vessel wall errors reported from the simulator, while in comparison the novices had 1 ± 3. However, in the final attempt, which included a much more demanding stimulus task, this inverted despite both groups performing the same case for a second time (in theory, one would expect a better performance), with experts reporting 2 ± 2 compared with 1 ± 2 for novices.
It can be speculated that experts are affected more by the second variation of the stimulus task compared to the novices. Other than this, it can be suggested that either the sample size is too small or that the experts have possibly lost concentration or have demonstrated a waning interest in the challenge by the second attempt.
The stimulus card task produced mixed results when looking at both performances. There were no significant differences in how the groups performed on the additional task while carrying out the procedure. In the second performance, novices improved their correct card acknowledgement rate while the expert % deteriorated slightly with the change to a more demanding stimulus task. It could be speculated that the distraction of the cards had a greater impact on experts, perhaps since experts can quickly become 'in the flow' given they are more influenced by automatic muscle memory and 'autopilot' abilities. Likewise, perhaps the novices are less 'set' in the process and, expecting a challenge, are therefore able to adapt better. Hence, while experts should have more attentional capacity to undertake an additional task, they are influenced by routine automatic muscle memory, which makes it difficult to use an additional task as a proxy for measuring attentional capacity.
The largest effect size found among the key correlations is in the final (second) performance, where the expert card acknowledgement % is strongly negatively correlated with total errors. This relationship for the final attempt is also seen (though not as strongly) across all participants once one outlier has been removed. With the less demanding and less frequent stimulus provided to the participants, card acknowledgement % seems to be weakly positively correlated with total errors. This is consistent in both groups, with almost no difference in coefficient and p value.
This study has suggested that eye tracking could have a role to play in the automated assessment of interventional cardiologist trainees with this type of high-fidelity surgical simulator. The eye tracking metrics have been able to quantify that experts devote significantly more visual attention (both dwell % and its encompassing fixation %) to the display screens than novices. This might be intuitive to those familiar with surgery, who may predict it as an expected consequence of superior spatial awareness, analogous to an experienced driver (where the expert makes many actions automatically, without delay and without the need to visually attend to the objects their hands interact with). On average, the expert spends much more of their visual attention looking at the instruments display screen (selecting and changing instruments). We also found that on average the expert will have a higher frequency of fixation transitions between the display screens compared with the novice.
Finally, the attempt to analyse psychophysiological measurements acquired using the E4 wristband has provided little insight. One outcome is that the expert will record a significantly higher SD of EDA for their measurements over time in comparison with the novice. What greater SD in quantified arousal from skin conductance means in a clinical performance setting is up for debate.

Limitations
Despite the high fidelity of the laboratory and virtual reality simulator, these data were not recorded in a real clinical environment with real patients. Moreover, we did not fully simulate environmental features such as noise and ongoing staff interactions. We acknowledge that it may never be possible to simulate a procedure that is on a par with the real event, since the psychological fidelity is very difficult to emulate. This is a limitation of this study since we are assuming that metrics acquired in simulation settings are transferable to real-life settings. The low sample numbers, while understandable (given the feasibility of recruiting numerous extremely busy operators to partake in a study during a three-month period), hinder what can be inferred from the results. While the sample size is small, each correlation coefficient is accompanied by a statistical test and p value that considers the sample size (degrees of freedom) in its calculation. A further limitation is that one of the procedures included the 'distraction' of undertaking a secondary unrelated task, i.e. card acknowledgements. In addition, we acknowledge the lack of a proper control group to compare with the procedure that included this additional distracting card acknowledgement task. Also, we must acknowledge that there was no baseline psychophysiological measurement of the participant before the session. For example, the analysis does not account for a participant who was already stressed, or for participants who may have been eager to leave within a certain time, which could have rushed their final procedure attempt. Another limitation is that we removed an outlier for a correlation computation because this outlier was 5.98 SDs (or standard deviations/units) from the mean distance (or mean residual) from the regression line; however, outliers in small samples can often be meaningful and removing them can dramatically change results. We acknowledge the limitation of multiple hypothesis tests, which increase the likelihood of type 1 errors (false positives) and false discovery rates; however, we have included Bonferroni-corrected alpha values where appropriate. We also acknowledge that prior exposure to the simulation technology can be a confounder in studies that measure performance on a simulator where some subjects have had prior experience of the technology and others have not, which raises the question of whether some operator performances are partly influenced by their proficiency with the simulator technology. However, only 7% of subjects had prior experience with the simulator.

Future work
Some metrics almost statistically discriminate between the two groups but perhaps lack significance due to the low sample numbers. As a result, we have provided guidance in 'Appendix B' for future recruitment using power calculations based on the effect sizes in this study. Future studies attempting a similar experimental set-up should consider the length of time provided to participants for practice and familiarisation with the surgical simulator. This would reduce the confounder of computer literacy. For further testing of the stimulus card task, other metrics may follow in future work, such as mean saccade latency (ms) specific to the stimulus card (from the moment it appears on screen to the moment it is acknowledged); this would be a more precise measurement of attentional capacity in comparison with the rudimentary count of correctly acknowledged cards. Furthermore, while we only used the procedure errors as detected by the Mentice VIST simulator, other procedural errors could be classified in future studies, such as those described by Mazomenos et al. [29]. Other future work could determine the extent to which brief prior exposure to, or proficiency in using, the simulator technology can affect the operator's procedural performance on the simulator. Put differently, can knowledge of the simulator be a confounder in studies such as the one described in this paper?
Looking beyond the simulation laboratory setting, capturing psychophysiological metrics and measurements in a real clinical environment, while still running a simulation, would add to the validity of the data captured. In the case of this procedure, this would mean using a simulated operating room with full immersion: leads, scrubs and a theatre team to support the participant. This could drive larger differences between genuine novices and experts. Beyond that, it would seem that this work is linked with a greater goal of creating what could be called 'smart theatres'.

Conclusions
This work contributes to the future of sensor-based smart theatres and the 'quantified physician' for assessing trainees and operators and to perhaps provide ongoing automated analytical feedback to individuals and teams to drive performance. The study captured a unique dataset with psychophysiological metrics along with a novel measurement of attentional capacity recorded during an important highly skilled clinical procedure. Only a few significant differences between groups have been found when using these metrics: most notably the dwell % and fixation % spent on the display screens. However, the point of this exploratory study is to highlight a number of novel variables that warrant further investigation for assessing proficiency, namely: dwell time on screens, fixation transition frequency between screens, SD of EDA signal and card acknowledgement rates (when using an additional task to measure attentional capacity).
We do acknowledge that this paper mainly focuses on 'construct validity' since we wanted to determine whether the metrics can distinguish between novices and experts before providing a more granular analysis which would require a greater number of subjects. Overall, this study provides incentive for further work in the area, with larger sample sizes, a larger range of procedures and higher fidelity environments.

Acknowledgements
participant travel and accommodation for the laboratory investigator while the study took place over 7-8 weeks. Tracy Ahern (UCC) assisted with ethics submission for the Clinical Research Ethics Committee of Cork Teaching Hospitals and aided the recruitment process by coordinating with Dr. Peter Kearney. David Power (UCC) volunteered as the initial test participant for the protocol and provided technical assistance with the Mentice VIST-Lab simulator and setting up lab ceiling cameras. Kevin McGuire (UCC) provided technical assistance with the Mentice VIST-Lab simulator and provided laboratory ceiling camera recordings.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Informed consent Informed consent was obtained from all subjects, as approved by the ethics committee.

Ethical approval
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Other metrics that were close to significance such as total dwell duration and total fixation duration on the instruments screen are strongly correlated with dwell % and fixation % for that AOI and should not be pursued over the %-based metrics. Fixation count on the instruments screen is another but is also strongly correlated with fixation % for the same AOI.

Fixation transitions between AOIs
None of these metrics were statistically significant between the groups, despite the experts consistently having higher counts. In a future study:

Wristband measurements
The only measurement approaching significance and potentially worth further investigating is SD of skin temperature. It is still debatable what insight this is providing but nevertheless, experts measure a higher skin temperature SD (1.4 °C ± 1.0 vs 0.6 °C ± 0.3, p = 0.07, power = 0.56) and a future study would require 17 subjects per group.

All the above
Using the individual metric recommendations above, if this study were to be repeated without alteration, the minimum number of participants would be 36 subjects per group for a high likelihood of detecting these significant differences (if they truly exist) between novices and experts performing simulated coronary angiography.