Multimodal semi-supervised learning for online recognition of multi-granularity surgical workflows

Purpose: Surgical workflow recognition is a challenging task that requires understanding multiple aspects of surgery, such as gestures, phases, and steps. However, most existing methods focus on single-task or single-modal models and rely on costly annotations for training. To address these limitations, we propose a novel semi-supervised learning approach that leverages multimodal data and self-supervision to create meaningful representations for various surgical tasks.
Methods: Our representation learning approach consists of two stages. In the first stage, time contrastive learning is used to learn spatiotemporal visual features from video data, without any labels. In the second stage, a multimodal variational autoencoder (MVAE) fuses the visual features with kinematic data to obtain a shared representation, which is fed into recurrent neural networks for online recognition.
Results: Our method is evaluated on two datasets, JIGSAWS and MISAW. We confirmed that it achieved comparable or better performance in multi-granularity workflow recognition compared to fully supervised models specialized for each task. On the JIGSAWS Suturing dataset, we achieve a gesture recognition accuracy of 83.3%. In addition, our model is more annotation-efficient, as it maintains high performance with only half of the labels. On the MISAW dataset, we achieve 84.0% AD-Accuracy in phase recognition and 56.8% AD-Accuracy in step recognition.
Conclusion: Our multimodal representation exhibits versatility across various surgical tasks and enhances annotation efficiency. This work has significant implications for real-time decision-making systems within the operating room.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11548-024-03101-6.

Fig. A1 Example of acceptable delay and transitional delay in the evaluation on the MISAW dataset [1]

A.1 Datasets
In the JIGSAWS dataset, operators performed three tasks, Suturing (SU), Knot Tying (KT), and Needle Passing (NP), on the da Vinci surgical system, controlling the patient-side manipulators (PSMs) with the right and left master tool manipulators (MTMs) through the stereo endoscope. Demonstrations were recorded with synchronized kinematic and video data at 30 Hz. Kinematic data consist of 76-dimensional samples encompassing positions, linear velocities, rotation matrices, rotational velocities, and gripper angle velocities for the left and right MTMs and the two PSMs. Video data are available at 640 × 480 pixels from both the left and right cameras. For validation, we utilize five demonstrations for SU, four for KT, and three for NP.
The MISAW challenge was a one-time event at MICCAI 2020, and its dataset contains 27 demonstrations of micro-surgical anastomosis on artificial blood vessels performed by three surgeons and three engineering students. We split the demonstrations into 14 for training, 3 for validation, and 10 for testing. The dataset provides stereo-microscope frames (920 × 540).

A.2 Skill recognition
Our method utilizes the unsupervised MVAE representations, the same ones used for workflow recognition, to recognize skill labels with an LSTM. Skill labels are denoted as $S = \{s_d\}_{d=1}^{D}$, consisting of a skill label $s_d$ for each demonstration. The LSTM classifier is trained with a cross-entropy loss as a discriminative model for offline skill recognition, represented as $p(s_d \mid z_{1:T_d})$.
To validate the performance, we conducted three-class skill classification using LOSO cross-validation on the JIGSAWS dataset. For LSTM training, we downsampled the representations to 1 FPS and, unlike workflow recognition, did not apply dropout.
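As an illustrative sketch, the offline skill classifier $p(s_d \mid z_{1:T_d})$ can be realized as an LSTM over the latent sequence followed by a linear head on the final hidden state; layer sizes and identifiers below are our own assumptions, not the configuration in Table A1.

```python
import torch
import torch.nn as nn

class SkillLSTMClassifier(nn.Module):
    """Illustrative classifier for p(s_d | z_{1:T_d}) over MVAE latents.

    Hidden size and class count are assumptions for this sketch,
    not the configuration reported in Table A1.
    """

    def __init__(self, latent_dim: int, hidden_dim: int = 128, num_skills: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_skills)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, T_d, latent_dim), latents downsampled to 1 FPS
        _, (h_n, _) = self.lstm(z)
        # One prediction per demonstration from the final hidden state
        return self.head(h_n[-1])

# Training uses a standard cross-entropy loss:
# loss = nn.CrossEntropyLoss()(model(z_batch), skill_labels)
```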

A.3 Metrics
In gesture recognition on JIGSAWS, we use the following metrics outlined in [2], referencing the code in [3]. We calculate these metrics for each demonstration and compute the mean and standard deviation over all demonstrations.

Accuracy for gesture recognition: this metric calculates the ratio of correctly identified frames (denoted as $N_c$) to the total number of frames ($N$) in a demonstration, presented as a percentage:

$$\mathrm{Accuracy} = \frac{N_c}{N} \times 100$$

Edit score: the edit score is calculated as the Levenshtein distance between the true ($\tau$) and predicted ($\gamma$) segment sequences and scaled to $[0, 100]$ based on the maximum number of segments ($G$) in $\tau$ or $\gamma$:

$$\mathrm{Edit\ score} = \left(1 - \frac{\mathrm{Lev}(\tau, \gamma)}{G}\right) \times 100$$

For skill recognition, we utilized accuracy:

$$\mathrm{Accuracy} = \frac{S_c}{S} \times 100$$

where $S_c$ is the number of correct skill predictions and $S$ is the number of demonstrations.
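For concreteness, the following minimal Python sketch implements the two gesture metrics above; the helper names are ours, and the official implementation is the code in [3].

```python
def segments(labels):
    """Collapse a frame-wise label sequence into its segment-level sequence."""
    return [lab for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1]]

def frame_accuracy(true, pred):
    """Ratio of correctly identified frames, as a percentage."""
    correct = sum(t == p for t, p in zip(true, pred))
    return 100.0 * correct / len(true)

def edit_score(true, pred):
    """Levenshtein distance between segment sequences, scaled to [0, 100].

    Assumes non-empty label sequences.
    """
    tau, gamma = segments(true), segments(pred)
    # Standard dynamic-programming Levenshtein distance
    d = [[0] * (len(gamma) + 1) for _ in range(len(tau) + 1)]
    for i in range(len(tau) + 1):
        d[i][0] = i
    for j in range(len(gamma) + 1):
        d[0][j] = j
    for i in range(1, len(tau) + 1):
        for j in range(1, len(gamma) + 1):
            cost = 0 if tau[i - 1] == gamma[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    G = max(len(tau), len(gamma))
    return 100.0 * (1.0 - d[-1][-1] / G)
```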
In the case of MISAW, we utilized the balanced application-dependent accuracy (AD-Accuracy), which re-estimates accuracy using acceptable-delay thresholds within a transitional window (Fig. A1) [1]. AD-Accuracy is calculated based on the balanced accuracy (BAC) over the $M$ classes:

$$\mathrm{AD\text{-}Accuracy} = \frac{1}{M} \sum_{m=1}^{M} \frac{TP_m}{TP_m + FN_m}$$

where $TP_m$ and $FN_m$ denote the true positives and false negatives of class $m$ after applying the acceptable-delay correction.
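The exact delay-handling rules are specified in [1]; the sketch below illustrates one plausible reading, in which frames within the acceptable-delay window after a ground-truth transition are accepted if the prediction matches any label active in that window, and per-class recalls are then averaged.

```python
from collections import defaultdict

def ad_balanced_accuracy(true, pred, delay):
    """One plausible reading of AD-Accuracy; see [1] for the exact rules.

    Within `delay` frames after a ground-truth transition, a prediction
    matching any label that was ground truth inside that window is
    accepted as correct; balanced accuracy is then averaged per class.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for i, (t, p) in enumerate(zip(true, pred)):
        start = max(0, i - delay)
        accepted = set(true[start:i + 1])  # labels active in the window
        totals[t] += 1
        hits[t] += int(p in accepted)
    recalls = [hits[c] / totals[c] for c in totals]
    return 100.0 * sum(recalls) / len(recalls)
```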

A.4 Implementation Details
We implemented our model in PyTorch 2.0.0. The model can be trained on a GPU with a minimum of approximately 20 GB of memory. TCN was fed the right-camera frames resized to 224 × 224, and data augmentation techniques were applied, including shifts, scales, rotations, brightness, and saturation. We implemented MVAE based on [4], computed the reconstruction losses using PyTorch's distribution framework, and tuned the scale parameters. We summarize the hyperparameters in Table A1; they were set based on prior experience and domain expertise. In future work, we plan to conduct a comprehensive analysis of the hyperparameters and their sensitivity to investigate our model's properties and enhance its performance across multiple tasks.
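As an illustration of the reconstruction-loss computation mentioned above, a Gaussian likelihood with a tunable scale can be expressed through PyTorch's distribution framework roughly as follows; the scale value here is a placeholder, not one of the tuned values in Table A1.

```python
import torch
from torch.distributions import Normal

def gaussian_rec_loss(recon: torch.Tensor, target: torch.Tensor,
                      scale: float = 0.1) -> torch.Tensor:
    """Negative log-likelihood reconstruction loss via torch.distributions.

    `scale` is a placeholder value. A smaller scale weights the
    reconstruction term more heavily relative to the KL term of the
    VAE objective.
    """
    return -Normal(loc=recon, scale=scale).log_prob(target).sum(-1).mean()
```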

A.5 Comparison targets in evaluation of workflow recognition
We provide the methods and evaluation setup of comparison targets used in the main manuscript.

A.5.1 JIGSAWS
We compared our model with state-of-the-art models applicable to online inference. When multiple results were reported in their papers, we selected the best outcome under conditions that did not use future data, to align the conditions as closely as possible. Our model was evaluated at 5 FPS and 30 FPS on both LOUO and LOSO cross-validations and predicted labels only from current and past data. It is important to acknowledge that performance differences observed in the comparisons may stem from varying settings, such as the use of future data, frame rates (FPS), pre-processing and post-processing techniques, and different evaluation methods. In practical implementations, recognition systems must be carefully designed to balance recognition frequency, accuracy, and segmental coherence, all of which can be affected by such factors. A summary of the comparison targets follows; please refer to the original papers for more details.
Forward LSTM [5] was trained using positions, velocities, and gripper angles from PSMs at 5 FPS, run online, and evaluated on LOUO cross-validation.
Skip-Chain Conditional Random Field (SC-CRF) [6,7] utilizes data from a distant previous state instead of the neighboring one, enabling it to capture gesture transitions over a longer range than a typical linear-chain conditional random field. In the model, potential functions rely only on the current and previous states. The sequences are predicted using a modified version of the Viterbi algorithm followed by a median filter. We employed the results using all PSMs at 30 FPS on LOUO cross-validation for comparison.
3D-CNN [8] utilizes a 3D convolutional neural network to extract spatiotemporal information from the video. We employed the result of their model that was initialized with weights pre-trained on Kinetics [9], did not incorporate future data, and was evaluated at 5 FPS on LOUO cross-validation.
Fusion-KV [10] first predicts the workflow labels for each modality. Visual data are processed by a combination of a CNN pre-trained on ImageNet and a temporal convolutional network. Additionally, a forward LSTM and a temporal convolutional network process normalized kinematic data, which include positions, velocities, gripper angles, and Euler angles from the PSMs. These predictions are then combined using a weighted voting method. The evaluation was carried out using LOUO cross-validation. It is worth noting that although this model is applicable to online inference, the results on the JIGSAWS dataset were derived from non-causal convolutional layers, which use future data.
MRG-Net [11] first creates latent representations from video and kinematic data with temporal convolutional networks and LSTM units. These features from each modality are then integrated through a relational graph convolutional network. For the kinematic input, they convert the rotation matrices into Euler angles and normalize all data from the PSMs to zero mean and unit variance. Their evaluation on the JIGSAWS dataset, carried out using LOUO cross-validation, did not specify the FPS value or clarify whether causal convolutions were employed for their stated "online approach".
MA-TCN [12] processes visual and kinematic data separately with temporal convolutional networks and then integrates them while weighting dynamically over time through multimodal attention. They utilized smoothed and normalized positions, velocities, and gripper angles from the PSMs. The model was evaluated at 5 FPS on LOUO cross-validation with rectification of the 12 annotation errors. Our model was compared with their model in the causal setting, which relies only on past and current data.
Motion2vec [13] uses contrastive learning with partially labeled data to learn a latent space that reflects manual and generated gesture labels, and predicts segment labels with multiple classifiers. They treated a total of 78 videos from the two available streams of the stereo cameras as the complete dataset and downsampled them to 3 frames per second, with an average duration of 3 minutes per video. The evaluation was based on the average accuracy calculated over four iterations of the LOSO test set, which were created by randomly choosing four demonstrations from each surgeon for the training set. We employed the condition with the highest performance except for the bi-directional LSTM, which uses future data. That condition used a combination of labeled triplet and sv-TCN losses for representation learning on a CNN initialized with ImageNet pre-trained weights, together with a k-nearest-neighbor classifier. This combination can operate without future data.
Cross-modal [14] first trains an encoder-decoder network by predicting kinematic data from optical flows, then freezes the weights and truncates the decoder. The representation from the encoder is fed into an XGBoost classifier to recognize gestures. To align the conditions, we used results where the training and test data were identical. They conducted gesture classification, a segmental classification that predicts a gesture for each segment under the assumption that the boundaries of each gesture are known. The gesture recognition performed by all the other methods is frame-wise classification, which is significantly more challenging, as it requires the simultaneous determination of gesture boundaries and labels, i.e., a finer and more dynamic understanding of the scene [6]. Regardless, the comparison remains valid, as both tasks evaluate a model's performance by recognizing the corresponding gesture for the given data. In addition, they employed balanced leave-one-trial-out (LOTO) cross-validation, which takes one trial for the test set while balancing the gesture classes.
In contrast, we employed leave-one-supertrial-out (LOSO) cross-validation, which extends LOTO cross-validation to a broader array of demonstrations by taking one trial from each surgeon. Both assess model performance on new demonstrations; while the comparison is valid, LOSO is more difficult because it requires generalizability across all surgeons. In summary, this comparison, though not a direct equivalence due to the outlined differences, provides valuable insights. Despite facing more challenging conditions, our method demonstrated significant performance improvements, highlighting the effectiveness of our multimodal representation learning in complex scenarios.

A.5.2 MISAW
For comparison in phase and step recognition, our main manuscript used models from the MISAW challenge report, including six uni-task and multi-task learning models. The performance of multi-task models depends on their ability to capture relationships between multiple tasks, potentially leading to synergistic benefits. However, if they cannot effectively capture these relationships, the tasks' complexity increases. To assess the performance at each granularity directly, we compared our model against the following four uni-task models, excluding the two models trained only with multi-task learning.
UniandesBCV extracted features of video frames using a ResNet and fed them into a SlowFast model pre-trained on Kinetics.
Wr0112358 also employed a vision-based model, a DenseNet121 CNN, with data augmentation and regularization.
MedAir utilized MRG-Net [11] for workflow recognition based on both video and kinematic data, correcting predictions with a median filter and a post-processing strategy (PKI).
IMPACT also used video and kinematic data. Video data were rescaled and normalized, kinematic data were normalized to zero mean and unit variance, and all data were downsampled to 5 FPS. A VGG19 pre-trained on ImageNet processed the video data, an adapted ResNet processed the kinematic data, and the outputs from each modality network were concatenated to predict workflow labels.
For more details about these models, refer to [1].

B.1 Visualization of a latent representation
We show comprehensive UMAP visualizations of the representations from TCN and MVAE on the JIGSAWS dataset in Figs. B2–B8. Particularly in the Suturing (SU) and Knot Tying (KT) tasks, TCN forms solid clusters for each gesture, implying that it effectively breaks down the surgical process into gesture-level components by comprehending the complex video sequences. However, we observe a different pattern in the Needle Passing (NP) task. TCN demonstrates proficiency in understanding the sequential properties of NP but does not effectively separate the distribution by individual gestures. This outcome suggests that in NP there might be a limited connection between the video flow and the gestures, making it challenging to deconstruct the representation into gesture-level components.
Furthermore, we visualize the output of the projection head, denoted b, in Fig. B5. These visualizations demonstrate that b focuses only on the temporal dimension, which strongly compresses the latent space. A comparison between Fig. B2 and Fig. B5 reveals the impact of incorporating the projection head, which mitigates this intense compression while preserving the sequential property.
MVAE forms clusters of gesture labels, with some areas showing the gathering of temporal neighbors. In addition, it effectively forms distinct clusters that align with different skill levels, as shown in Figs. B6b, B7b, and B8b. This highlights MVAE's ability to differentiate skill levels by incorporating kinematic data. Notably, the MVAE visualization separates novice and expert, with intermediate samples positioned within one of the two. The intermediate level in JIGSAWS represents a transition from novice to expert, so this distribution may result from two categories of intermediates: some closer to novice and others closer to expert. This pattern is consistent with distributions observed in a previous representation learning approach [14] and reinforces previous findings of lower recognition accuracy for intermediates in skill assessment studies [15]. These visualizations demonstrate MVAE's ability to distinguish skill levels and gestures while retaining temporal context.
In Figs. B10 and B11, we present four visualizations of these representations on the MISAW dataset: normalized frame indexes (video sequence), phase labels, step labels, and demonstration indexes. TCN captures some local information, such as phases and steps, but struggles to capture the global video sequence. This shows the difficulty of time contrastive learning in tasks with multi-stage, complex processes.
Conversely, introducing MVAE leads to significant enhancements in the quality of the representation. MVAE creates clusters based on the characteristics of each demonstration, as shown in the visualization of the demonstration indexes (Fig. B11). This capability allows MVAE to effectively express the sequential features, phases, and steps within each cluster, as shown in Figs. B10d, B10e, and B10f. By combining kinematic data, organizing information, and distinguishing features for each demonstration, MVAE facilitates a hierarchical understanding of the workflow.
These findings underscore the synergistic benefits of combining TCN and MVAE and the high generalization performance of MVAE.
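For reference, the qualitative visualizations above can be reproduced in outline with the umap-learn package; the snippet below projects a stack of latent vectors to 2-D and colors points by gesture label (variable and function names are ours).

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # umap-learn package

def plot_latents(z: np.ndarray, gestures: np.ndarray) -> None:
    """Project latents to 2-D with UMAP and color points by gesture label.

    z: (N, latent_dim) array of TCN or MVAE latents; gestures: (N,) ints.
    """
    emb = umap.UMAP(n_components=2, random_state=0).fit_transform(z)
    plt.scatter(emb[:, 0], emb[:, 1], c=gestures, s=2, cmap="tab10")
    plt.colorbar(label="gesture")
    plt.show()
```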

B.2 Results of gesture recognition
We present the results for JIGSAWS Suturing, Knot Tying, and Needle Passing in Table B2. We observed a significant performance drop in gesture recognition in NP compared with SU and KT. This drop can be attributed to TCN's challenges in comprehending gestures in NP, as previously discussed. These findings indicate that our model's representations might not always align perfectly with human intent, partly due to our reliance on unlabeled data. This insufficient representation learning could also contribute to the failure to recognize gestures G9 and G10.
To cope with these issues, future work should explore efficient ways to incorporate annotation information, such as fine-tuning and active learning strategies. As detailed in Section 5.3 of the main manuscript and as observed in the UMAP visualizations, our representation learning could be influenced by variability in small datasets, including variations in skill level and individual demonstration characteristics. Insufficient data hinders generalization and promotes overfitting, resulting in difficulty learning robust representations. Differences in data volume across JIGSAWS tasks are one of the contributing factors to the lower and less stable performance on the KT and NP datasets in the challenging cross-validations, which evaluate generalization to an unseen surgeon or to unseen demonstrations from all surgeons.

B.3 Results of skill recognition
We assess our representation's capacity for skill recognition using the same representation used for gesture recognition. We compared our results with a fully supervised model specifically trained for skill recognition using a 3D-CNN [15] and with a model that conducted similar experiments recognizing both gestures and skill from one representation [14], as presented in Table B3. Our model achieved approximately 10% lower accuracy than the supervised skill recognition model. However, compared with [14], our model exhibited a remarkable performance improvement of nearly 20% for SU and NP. As shown in Fig. B12, most of the prediction errors are associated with the intermediate class, aligning with the observations from the visualization of the latent representation. Skill labels in this study were determined based on participants' hours of experience in robotic surgery; they do not account for the performance quality of each demonstration. The variability at the instance level within the intermediate class could significantly influence the performance.
In addition to multi-granularity recognition, our model demonstrates the ability to discern skill levels, showcasing the further versatility of our representation learning. Looking ahead, multi-task learning for both skill and workflow recognition appears promising for enhancing performance in both cases. As shown in the visualizations, our representation forms clusters based on skill levels, suggesting a potential synergistic effect when the model is trained while considering the relationship between skills and workflow. This capability also opens the door to advanced surgical training platforms, which could offer real-time instructions tailored to current behavior and assess skill levels upon completion.
Furthermore, the model can recognize a surgeon's skill online by changing the LSTM classifier from the current mechanism, which holds the cell state over the whole sequence, to one that operates within a constant window, similar to [16]. This shift could expand the application range to, for example, immediate surgical quality assessment and dynamic adjustment of the support level based on the quality of the surgeon's behavior, thereby providing more adaptable and responsive assistance.
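A minimal sketch of this sliding-window inference is given below, under the assumption of a fixed window over the most recent latents; the window length and function names are illustrative, not taken from [16].

```python
import torch

@torch.no_grad()
def online_skill_predictions(model, z_seq: torch.Tensor, window: int = 300):
    """Re-run the classifier on a fixed window of recent latents each step.

    z_seq: (T, latent_dim) latent sequence; `window` is an assumed length.
    Only current and past frames are used, so inference remains online.
    """
    preds = []
    for t in range(1, z_seq.size(0) + 1):
        chunk = z_seq[max(0, t - window):t].unsqueeze(0)  # (1, <=window, D)
        preds.append(model(chunk).argmax(dim=-1).item())
    return preds
```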

B.4 Results on the MISAW dataset
As mentioned in the main text, the lower performance compared with supervised models could be due to high variability in a small dataset. In the MISAW dataset, surgeons' videos average 2.5 minutes in length while students' average 4.0 minutes, and the dataset exhibits a standard deviation of 1.8 minutes in total length (Table B5). This high variation in video duration might affect the LSTM we employed. The current LSTM mechanism, designed to hold the cell state throughout the entire video, may encounter difficulties with state updates due to the differences in video length, especially as the video progresses. This is illustrated by the noticeable decline in performance after Suture making, as shown in the confusion matrix (Fig. B15f). To better accommodate the properties of specific surgical tasks, alternatives such as a temporal convolutional network [17] or an LSTM with a sliding window, which process data in fixed time segments, could be beneficial compared with the continuous-sequence approach of the current LSTM model; a sketch of the causal convolution building block underlying the former is given below.
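The following generic causal dilated convolution illustrates the key property that makes temporal convolutional networks suitable for online inference; it is a standard building block, not the specific architecture of [17].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution padded on the left only, so the output at
    time t depends solely on inputs at times <= t (online-safe)."""

    def __init__(self, in_ch: int, out_ch: int,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad past frames only, never future ones
        return self.conv(F.pad(x, (self.pad, 0)))
```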

Fig. B3 TCN visualization (visual feature v) on JIGSAWS Knot Tying: "sequence" is colored by frame indexes normalized from 0.0 (blue) to 1.0 (red) on each demonstration, "gesture" is colored by gesture labels, and "skill" is colored by novice (red), intermediate (purple), and expert (green)

Fig. B4 TCN visualization (visual feature v) on JIGSAWS Needle Passing: "sequence" is colored by frame indexes normalized from 0.0 (blue) to 1.0 (red) on each demonstration, "gesture" is colored by gesture labels, and "skill" is colored by novice (red), intermediate (purple), and expert (green)

Fig. B12 Confusion matrix of skill recognition on JIGSAWS using LOSO cross-validation

Fig. B13 Confusion matrix of gesture recognition on JIGSAWS SU and KT

Fig. B17 Color-coded ribbon visualization of closest below median on MISAW

Table A1 Hyperparameters for each module

Table B2 Gesture recognition performance on JIGSAWS SU, KT, and NP. 1 Evaluated on leave-one-trial-out. 2 We trained LSTM only.

Table B3 Skill recognition performance on JIGSAWS SU, KT, and NP using LOSO cross-validation

Table B5 Video duration in MISAW

Table B6 Phase and step recognition performance on MISAW. 1 We trained LSTM only.