Self-supervised representation learning for surgical activity recognition

Purpose: Virtual reality-based simulators have the potential to become an essential part of surgical education. To make full use of this potential, they must be able to automatically recognize the activities performed by users and assess them. Since annotations of trajectories by human experts are expensive, there is a need for methods that can learn to recognize surgical activities in a data-efficient way.
Methods: We use self-supervised training of deep encoder–decoder architectures to learn representations of surgical trajectories from video data. These representations allow for semi-automatic extraction of features that capture information about semantically important events in the trajectories. Such features are processed as inputs of an unsupervised surgical activity recognition pipeline.
Results: Our experiments document that the performance of hidden semi-Markov models used for recognizing activities in a simulated myomectomy scenario benefits from using features extracted from representations learned while training a deep encoder–decoder network on the task of predicting the remaining surgery progress.
Conclusion: Our work is an important first step in the direction of making efficient use of features obtained from deep representation learning for surgical activity recognition in settings where only a small fraction of the existing data is annotated by human domain experts and where those annotations are potentially incomplete.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11548-021-02493-z.

Description of the surgical procedure
Figure S1 shows the simplified version of the hierarchical task decomposition model that we refer to in Section 3.2 of the main paper. It describes the basic procedure of a hysteroscopic myomectomy in a flow chart. A description of the individual activities is given in Table S1.

Activity | Description
diagnosis | The summary of multiple diagnostic activities, which includes the inspection of the entire uterine cavity and the inspection of the myoma.
position hysteroscope | The activity of positioning the hysteroscope, which is done to prepare for a cutting, coagulation or handle chips activity.
cutting | The activity of cutting off part of the tissue.
coagulation | The activity of staunching a bleeding.
clear view | The activity of flushing the system to remove blood that opacifies the view.
handle chips | The activity of collecting and extracting the pieces of tissue ("chips") that have been cut off.

Tab. S1. Description of the activities performed in a hysteroscopic myomectomy.

* contributed equally to this work

Fig. S1. Simplified hierarchical task decomposition model (HTDM) of a hysteroscopic myomectomy that defines the activities performed during the procedure. The diagnosis task jointly with the steps of the operative part of the procedure form the set of activities that we considered in the context of this work.
Hyperparameter optimization for the self-supervised representation learning
Here we provide additional details regarding the hyperparameter optimization for the spatio-temporal models described in Section 4 of the main paper. We used a ten-fold stratified cross-validation approach for the hyperparameter optimization. To that end, we divided the surgical trajectories into three categories. Similar to [5], the trajectories whose duration was less than the lower quartile of the empirical duration distribution of all trajectories in the data set were classified as short trajectories, those lasting longer than the upper quartile were classified as long trajectories, and the remaining ones as medium trajectories. The data set was split such that the relative abundance of sequences of these three classes was approximately the same in each fold. Thereby, we ensured that the different folds were comparable despite the large inter-sample variance in the data set. The length of a sequence served as a simple proxy for the comparability of different surgical trajectories. Finally, at each iteration the training portion of the data (i.e., the data without the fold held out for cross-validation) was additionally split into the set of sequences used to train the model (80%) and a validation set (20%). The validation set was used to monitor the performance of the model within each iteration of the cross-validation and, e.g., to stop the training early if the validation score had not improved for 20 epochs. We trained every model until convergence of the validation loss, which took up to 300 epochs.
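The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names, the round-robin assignment, the random seeds and the synthetic durations are all assumptions introduced here.

```python
import random
import statistics


def duration_category(duration, q1, q3):
    """Label a trajectory as short, medium or long by quartile thresholds."""
    if duration < q1:
        return "short"
    if duration > q3:
        return "long"
    return "medium"


def stratified_folds(durations, n_folds=10, seed=0):
    """Assign trajectory indices to folds, balancing the duration categories."""
    q1, _, q3 = statistics.quantiles(durations, n=4)
    by_category = {"short": [], "medium": [], "long": []}
    for idx, d in enumerate(durations):
        by_category[duration_category(d, q1, q3)].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for indices in by_category.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar share
        # of each category (assumed scheme, for illustration only).
        for k, idx in enumerate(indices):
            folds[(offset + k) % n_folds].append(idx)
        offset += len(indices)
    return folds


def train_validation_split(indices, validation_fraction=0.2, seed=0):
    """Split the non-held-out trajectories into training and validation sets."""
    indices = list(indices)
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * validation_fraction)
    return indices[n_val:], indices[:n_val]


# Hypothetical trajectory durations in seconds (not data from the paper).
durations = [60 + 7 * i for i in range(100)]
folds = stratified_folds(durations)
for held_out in range(len(folds)):
    rest = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
    train, val = train_validation_split(rest, seed=held_out)
    # A model would be trained on `train`, monitored on `val` (e.g. for
    # early stopping) and evaluated on folds[held_out].
```

Stratifying on the duration category rather than on an activity label keeps the sketch independent of any annotation, which matches the use of sequence length as a proxy for comparability.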
The hyperparameter settings that we analyzed to identify the deep architecture that performed best on the VRSHM data set are summarized in Table S2. The performances of the five best encoder–decoder models among those we analyzed, as well as the performances of two baseline models, an AlexNet [4] and a ResNet18 [2] trained on single frames from the video sequences, are shown in Table S3.

Hyperparameter | Description | Searched Domain

Tab. S3. Performance overview of five of the best performing model configurations analyzed during the hyperparameter search using the discussed ten-fold stratified cross-validation approach. All shown encoder–decoder models were trained using an $\ell_2$ weight regularization of $10^{-5}$. The first two models are the CNN models that served as baselines. A '*' indicates the application of an attention layer to the output of the CNN encoder as part of the architecture. The GRU/LSTM column provides information regarding the architecture of the respective GRU/LSTM layers of the encoder–decoder models by describing the number of units in the different layers.

Pruned Exact Linear Time (PELT) algorithm
To automatically identify important change points in the multivariate time series given by the learned spatio-temporal representations, as described in Section 5 of the main paper, we applied the PELT algorithm. Here we provide a formal description of the algorithm.
Algorithm 1: PELT algorithm by [3]
input : A data sequence
        A constant penalty term β
output: The change points of the data sequence recorded in cp(n)
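For orientation, the dynamic program with pruning that underlies PELT can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the configuration used in the paper: it handles a univariate sequence, uses the sum of squared deviations from the segment mean as segment cost, and sets the pruning constant K to 0. The function and variable names are introduced here for illustration.

```python
def pelt(data, beta, K=0.0):
    """Return the change points of `data` found by PELT.

    Assumed segment cost: sum of squared deviations from the segment mean.
    A change point t means that a new segment starts at index t.
    """
    n = len(data)
    # Prefix sums allow any segment cost to be evaluated in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, y in enumerate(data):
        s1[i + 1] = s1[i] + y
        s2[i + 1] = s2[i] + y * y

    def cost(start, end):
        """Squared-error cost of the segment data[start:end]."""
        total = s1[end] - s1[start]
        return (s2[end] - s2[start]) - total * total / (end - start)

    F = [0.0] * (n + 1)      # F[t]: optimal penalized cost of data[:t]
    F[0] = -beta
    last = [0] * (n + 1)     # last[t]: most recent change point before t
    candidates = [0]
    for t in range(1, n + 1):
        values = [F[tau] + cost(tau, t) + beta for tau in candidates]
        best = min(range(len(candidates)), key=values.__getitem__)
        F[t] = values[best]
        last[t] = candidates[best]
        # Pruning: keep tau only if F[tau] + cost(tau, t) + K <= F[t].
        candidates = [tau for tau, v in zip(candidates, values)
                      if v - beta + K <= F[t]]
        candidates.append(t)

    # Backtrack through the stored predecessors to recover cp(n).
    change_points = []
    t = n
    while last[t] > 0:
        t = last[t]
        change_points.append(t)
    return change_points[::-1]


# Synthetic piecewise-constant signal with mean shifts at indices 20 and 40.
signal = [0.0] * 20 + [5.0] * 20 + [0.0] * 20
detected = pelt(signal, beta=10.0)
```

On this noise-free signal the sketch recovers the two mean shifts. A larger penalty β yields fewer change points, a smaller one yields more.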

Hidden semi-Markov models
To complement the short description of the hidden semi-Markov models in Section 2.3 of the main paper, we provide a more formal definition of such models in the following. Formally, an HSMM is a tuple $\theta = (S, O, \Pi, P, B, D)$, where $S$ is the state set, $O$ is the observation space, $\Pi = \{\pi_j \mid j \in S\}$ is the initial state distribution, $P = \{p_{i,j} \mid (i, j) \in S \times S\}$ is the transition model, $B = \{b_j(o) \mid j \in S, o \in O\}$ is the emission model and $D = \{d_j(u) \mid j \in S, u \in \mathbb{N}\}$ is the duration model. The latter makes it possible to explicitly model the distribution over the number of time steps the system stays in any given state, which is what distinguishes HSMMs from the simpler hidden Markov models (HMMs). In HMMs, these distributions are constrained to be geometric and are not explicitly modeled.
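The geometric constraint in HMMs can be made explicit. Writing $a_{j,j}$ for the self-transition probability of state $j$ in an HMM (a symbol introduced here for illustration only; it is not part of the HSMM tuple), the probability of staying in state $j$ for exactly $u$ consecutive time steps is:

```latex
\[
  \Pr(\text{duration} = u \mid \text{state } j)
    = a_{j,j}^{\,u-1}\,\bigl(1 - a_{j,j}\bigr),
  \qquad u = 1, 2, \dots,
\]
\[
  \mathbb{E}[\text{duration} \mid \text{state } j] = \frac{1}{1 - a_{j,j}} .
\]
```

This distribution is monotonically decreasing in $u$, so the most likely duration is always a single time step. An explicit duration model $d_j(u)$ removes this restriction.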
Inference in HSMMs can be performed in a similar way as in HMMs. First, one iteratively computes maximum-likelihood estimates for the parameters of the HSMM using the Baum-Welch algorithm. Second, one computes the most probable state sequence given the observations, i.e., the maximum a posteriori (MAP) sequence, using the Viterbi algorithm [1]. In our setting, after performing inference on a set of surgical trajectories, the MAP sequence for a specific surgical trial is the sequence of activities A that best describes the observed data.

Fig. S2. Visualization of an HSMM and its data-generating process: A dynamic system described by an HSMM remains in a state s for a duration d that is sampled from its duration model and then transitions to another state s′ sampled from the transition model. The state sequence is not observed. Instead, one only observes noisy observations o that depend on the state s of the system and are sampled according to the emission model. Adapted from [6].
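The data-generating process of an HSMM can be sketched as a short sampler. This is a toy illustration under stated assumptions: all distributions are discrete and given as dictionaries, and the two states, the binary observation and every probability below are hypothetical values chosen for the example, not parameters estimated in the paper.

```python
import random


def sample_hsmm(initial, transition, emission, duration, n_steps, seed=0):
    """Sample hidden states and observations from a discrete toy HSMM.

    initial:    dict state -> probability                      (Pi)
    transition: dict state -> dict next_state -> probability   (P)
    emission:   dict state -> dict observation -> probability  (B)
    duration:   dict state -> dict duration (>= 1) -> probability (D)
    """
    rng = random.Random(seed)

    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    states, observations = [], []
    state = draw(initial)
    while len(states) < n_steps:
        # Stay in the current state for a sampled number of time steps.
        for _ in range(draw(duration[state])):
            states.append(state)
            observations.append(draw(emission[state]))
        # Then move on to the next state according to the transition model.
        state = draw(transition[state])
    return states[:n_steps], observations[:n_steps]


# Hypothetical two-state example (values are illustrative only).
initial = {"cutting": 0.5, "coagulation": 0.5}
transition = {"cutting": {"coagulation": 1.0},
              "coagulation": {"cutting": 1.0}}
emission = {"cutting": {True: 0.9, False: 0.1},
            "coagulation": {True: 0.3, False: 0.7}}
duration = {"cutting": {2: 0.5, 3: 0.5},
            "coagulation": {4: 1.0}}

states, observations = sample_hsmm(initial, transition, emission,
                                   duration, n_steps=50)
```

Only `observations` would be available to the recognition pipeline; `states` plays the role of the unobserved activity sequence that the Viterbi algorithm tries to recover.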

Description of the observables
A variety of different categorical observables was extracted from the continuous and categorical sensor data. The superset O of all observables, together with their respective realization spaces (domains) as given in Table S4, defines the observation space. The observables used for the experiments in the paper and referred to in Section 6 of the main paper are described in Table S4.

… | … | {True, False}
seen last pedal use | Indicator of whether no pedal will be used in the remainder of the procedure. | {True, False}

Tab. S4. Overview of the observables derived from the sensor measurements and the self-supervised learning approach.
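As an illustration of what the "seen last pedal use" observable encodes, the following sketch computes it retrospectively from a complete recording of pedal activity. This is an assumption-laden example: the input format (one Boolean per time step), the function name and the reading of "remainder" as "strictly after the current time step" are choices made here, and this section does not state how the observable is derived in the paper.

```python
def seen_last_pedal_use(pedal_pressed):
    """Return a list whose entry t is True iff no pedal is pressed after step t.

    pedal_pressed: list of bool, one entry per time step (assumed format).
    """
    indicator = [False] * len(pedal_pressed)
    pedal_used_later = False
    # Scan backwards so that each step only needs the accumulated future.
    for t in range(len(pedal_pressed) - 1, -1, -1):
        indicator[t] = not pedal_used_later
        pedal_used_later = pedal_used_later or pedal_pressed[t]
    return indicator


# Hypothetical pedal signal: the last pedal use happens at step 4.
pedal_pressed = [False, True, True, False, True, False, False]
indicator = seen_last_pedal_use(pedal_pressed)
```

The indicator switches from False to True exactly once, at the last time step on which a pedal is used.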