This is the first study to use eye tracking and psychophysiological monitoring in this setting, and the first to use a visual stimulus task as a proxy for attentional capacity during surgical procedures. The study produced several metrics that could feed a model to automatically discriminate between novices and experts, perhaps eventually enabling proficiency assessment in the real clinical setting. Experts had greater dwell time on the X-ray, which may reflect superior spatial awareness and coordination; however, this was not statistically significant (p = 0.13). Experts also made more transitions between AOIs, which could indicate more frequent cross-referencing (although this did not quite reach statistical significance, p = 0.06). The wristband produced only a small number of metrics of interest. Regarding accelerometer-recorded movement, data from the hands and fingers would be more informative in future analyses and would therefore require a different type of wearable sensor. Most interestingly, we found that the card acknowledgement rate during the stimulus task is predictive of the number of handling errors in a procedure (for experts only). It is also notable how little visual attention both the novice and expert groups dedicated to the patient vital signs (1.6% and 1.7%, respectively).
There is potentially significant value in quantified behaviour during high-stakes operations within various environments, from the operating room to the cockpit of a commercial aircraft. Despite the difficult and time-consuming methods required to capture these data, their value when combined with machine learning techniques could result in smarter, more responsive environments with intelligent feedback provided to the operators.
Experts completed their first attempt faster than novices; however, in the final attempt there was little difference. This could indicate a confounding effect of the added stimulus task on procedural performance: whatever effect it had on the novices may have been much more pronounced for the experts. Experts made fewer total errors in their first attempt than novices, whereas in performance two this flipped, with experts committing more errors than novices. This is a surprising result, although it is not statistically significant (p = 0.20). One interesting difference is that in the first performance the experts had 0 ± 0 scraping-vessel-wall errors reported by the simulator, compared with 1 ± 3 for the novices. In the final attempt, which included a much more demanding stimulus task, this inverted despite both groups performing the same case for a second time (in theory, one would expect improved performance), with experts reporting 2 ± 2 errors compared with 1 ± 2 for novices.
One can speculate that experts were affected more than novices by the second variation of the stimulus task. Alternatively, the sample size may simply be too small, or the experts may have lost concentration or shown waning interest in the challenge by the second attempt.
The stimulus card task produced mixed results across the two performances. There were no significant differences in how the groups performed on the additional task while carrying out the procedure. In the second performance, novices improved their correct card acknowledgement rate, while the experts’ percentage deteriorated slightly after the change to a more demanding stimulus task. One could speculate that the distraction of the cards had a greater impact on experts, perhaps because experts quickly settle ‘into the flow’, relying more on automatic muscle memory and ‘autopilot’ abilities. Conversely, novices may be less ‘set’ in the process and, expecting a challenge, better able to adapt. Hence, while experts should have more attentional capacity for an additional task, they are influenced by routine automatic muscle memory, which makes it difficult to use an additional task as a proxy for measuring attentional capacity.
The largest effect size among the key correlations is that, in the final performance (performance two), the experts’ card acknowledgement percentage is strongly negatively correlated with total errors. The same relationship in the final attempt is also seen (though less strongly) across all participants once one outlier is removed. With the less demanding and less frequent stimulus, card acknowledgement percentage appears weakly positively correlated with total errors; this is consistent in both groups, with almost no difference in correlation coefficient or p value.
This study has suggested that eye tracking could play a role in the automated assessment of interventional cardiology trainees on this type of high-fidelity surgical simulator. The eye tracking metrics quantified that experts devote significantly more visual attention (both dwell % and the fixation % it encompasses) to the display screens than novices. This may seem intuitive to those familiar with surgery, as an expected consequence of superior spatial awareness, analogous to an experienced driver who performs many actions automatically, without delay and without needing to visually attend to the objects their hands interact with. On average, experts spent much more of their visual attention on the instruments display screen (selecting and changing instruments). We also found that, on average, experts had a higher frequency of fixation transitions between the display screens than novices.
Finally, the analysis of psychophysiological measurements acquired with the E4 wristband provided little insight. One outcome is that experts recorded a significantly higher SD of EDA over time than novices. What a greater SD in arousal, as quantified by skin conductance, means in a clinical performance setting remains open to debate.
Despite the high fidelity of the laboratory and virtual reality simulator, these data were not recorded in a real clinical environment with real patients. Moreover, we did not fully simulate environmental features such as noise and ongoing staff interactions. We acknowledge that it may never be possible to simulate a procedure on a par with the real event, since the psychological fidelity is very difficult to emulate. This is a limitation of this study, since we assume that metrics acquired in simulation settings are transferable to real-life settings.

The low sample numbers, while understandable (it is difficult to recruit numerous extremely busy operators to partake in a study within a three-month period), limit what can be inferred from the results. While the sample size is small, each correlation coefficient is accompanied by a statistical test and p value that account for the sample size (degrees of freedom). A further limitation is that one of the procedures included a ‘distraction’ in the form of a secondary, unrelated task, i.e. card acknowledgements, and we lacked a proper control group against which to compare the procedure that included this distracting task. We must also acknowledge that no baseline psychophysiological measurement of the participant was taken before the session: a participant who was already stressed, or who was eager to leave within a certain time and therefore rushed the final procedure attempt, is not accounted for. Another limitation is that we removed an outlier from one correlation computation because its residual from the regression line was 5.98 standard deviations from the mean residual; however, outliers in small samples can be meaningful, and removing them can dramatically change results.
We acknowledge the limitation of multiple hypothesis tests, which increase the likelihood of type 1 errors (false positives) and the false discovery rate; however, we have included Bonferroni-corrected alpha values where appropriate. We also acknowledge that prior exposure to the simulation technology can be a confounder in studies that measure performance on a simulator, where some subjects have previously used the technology and others have not; this raises the question of whether some operators’ performances are partly influenced by their proficiency with the simulator itself. However, only 7% of subjects had prior experience with the simulator.
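The Bonferroni correction mentioned above is simple to apply: with m hypothesis tests, each p value is judged against alpha / m rather than alpha. A minimal sketch with hypothetical p values (not results from this study):

```python
alpha = 0.05
p_values = [0.004, 0.03, 0.20, 0.06]  # hypothetical p values from m tests
m = len(p_values)

# Bonferroni: divide the family-wise alpha by the number of tests.
corrected_alpha = alpha / m  # 0.05 / 4 = 0.0125
significant = [p for p in p_values if p < corrected_alpha]
print(corrected_alpha, significant)  # → 0.0125 [0.004]
```

Note how a result at p = 0.03, nominally significant at alpha = 0.05, no longer survives once four tests are accounted for; this is the trade-off the correction makes to control false positives.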
Some metrics came close to statistically discriminating between the two groups but may have lacked significance due to the low sample numbers. We have therefore provided guidance in ‘Appendix B’ for future recruitment, using power calculations based on the effect sizes in this study. Future studies attempting a similar experimental set-up should consider the length of time participants are given to practise and familiarise themselves with the surgical simulator; this would reduce the confounder of computer literacy. For further testing of the stimulus card task, other metrics could follow in future work, such as the mean saccade latency (ms) specific to the stimulus card, measured from the moment it appears on screen to the moment it is acknowledged; this would be a more precise measure of attentional capacity than the rudimentary count of correctly acknowledged cards. Furthermore, while we only used the procedural errors detected by the MENTICE VIST simulator, other procedural errors could be classified in future studies, such as those described by Mazomenos et al. Other future work could determine the extent to which brief prior exposure to, or proficiency with, the simulator technology affects an operator’s procedural performance on it; put differently, can familiarity with the simulator be a confounder in studies such as the one described in this paper?
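As a rough illustration of the kind of power calculation referenced for ‘Appendix B’, the sketch below uses the common normal-approximation formula for a two-sided, two-sample comparison; the effect sizes passed in are illustrative, not values taken from this study.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided two-sample test,
    given Cohen's d, using the normal approximation:
        n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_power = z(power)          # ≈ 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.8))  # large effect → ≈ 25 per group
print(n_per_group(0.5))  # medium effect → ≈ 63 per group
```

This illustrates why near-significant group differences in a small cohort motivate concrete recruitment targets: halving the expected effect size roughly quadruples the required sample.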
Looking beyond the simulation laboratory, capturing psychophysiological metrics and measurements in a real clinical environment, while still running a simulation, would add to the validity of the data captured. For this procedure, that would mean a simulated operating room with full immersion: leads, scrubs, and a theatre team to support the participant. This could drive larger differences between genuine novices and experts. Beyond that, this work ties into the greater goal of creating what could be called ‘smart theatres’.