Abstract
Purpose
To produce a surgical gesture recognition system that can support a wide variety of procedures, either a very large annotated dataset must be acquired, or fitted models must generalize to new labels (so-called zero-shot capability). In this paper we investigate the feasibility of the latter option.
Methods
Leveraging the Bridge-Prompt framework, we prompt-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can utilize extensive data from outside the video itself, such as text, and can also make use of label meta-data and weakly supervised contrastive losses.
Results
Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature extractor training schema.
Conclusion
Bridge-Prompt and similar pre-trained + prompt-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task (gesture) specific retraining, makes them invaluable.
Introduction
Proposed intra-operative robotic gesture recognition systems are described as enabling automation, rapid skill assessment and pedagogic feedback, and more general surgical support [1]. However, current methods [2,3,4] for gesture recognition attempt to estimate a fixed set of supervised gestures from contemporaneously collected video and/or kinematic datastreams [5,6,7,8]. These systems are often trained in a fully supervised manner, which requires frame-wise gesture labels for the entire training set, and importantly require specification of the target labels during the whole training phase.
We believe that in order to produce a system that can provide support to a wide variety of procedures, gesture recognition as a sub-field will need either (a) a large number of annotated datasets for each procedure, as previous fully supervised methods demand, or (b) a system that can generalize well to new label sets with limited supervision. The latter option is likely more efficient given the expense of annotation and fits within the “zero-shot learning” paradigm, where a general representation is reused for labelling tasks unseen during training [9].
We thus investigate the feasibility of such a zero-shot gesture recognition method using weak supervision and text augmentation. We focus on improving visual feature extraction from video data streams, as these data offer rich information about surgical gestures (surgemes) and kinematic data require access to the research API, limiting current use cases. We use the bridge-prompt [10] framework which specifically relaxes the fully supervised constraints: Bridge-Prompt is able to use large weakly supervised datasets, which are relatively inexpensive and numerous in comparison with densely annotated data, and our experiments suggest that Bridge-Prompt generalizes well to unseen labels (i.e. novel gestures).
In order to validate these claims and evaluate Bridge-Prompt’s overall efficacy and zero-shot capability, we demonstrate usage on the JIGSAWS [11] and RARP-45 [6] datasets, simulating both standard and zero-shot use cases on cross-validated training schemes. We compare image encoders with differing configurations, varying experimentally the amount of label information (both annotation presence/absence and label text). We then compare the performance of experimental cases and standard baselines from the literature using a standard prediction model (MS-TCN++, [12]). Our experiments show the benefit of using Bridge-Prompt image encoders for gesture recognition tasks.
In summary, in the present work, we do the following:
- We demonstrate the within-task and zero-shot capability of Bridge-Prompt.
- We show that gesture labels’ text descriptions do not improve training.
- We provide open-source code (Footnote 1) for prompt-tuning encoders with Bridge-Prompt.
Terminology
Throughout this manuscript we use the phrase “pre-trained” exclusively to denote previously trained instances of encoder networks that we either directly use without modifying or further train using a different weakly supervised loss function. We refer to this further training phase as “prompt tuning”.
Regardless of whether an encoder is only pre-trained or is additionally prompt-tuned, its weights are then frozen and used unchanged for the supervised training phase, wherein a supervised network is trained to predict gesture labels from the frozen embeddings.
Related work
Surgical robotic gesture recognition has an extensive history of study [13,14,15]. While multiple disparate modalities are each reasonable to incorporate into a recognition system, video is by far the most popular [5, 7, 16, 17], followed by integration of video with robotic kinematic data [6, 18, 19], and discrete surgical event streams [20]. Notably, almost all proposed methods rely on video data, or on video derived features such as optical flow [21].
Gesture recognition is usually considered in a temporal context, and thus many of the prediction models are taken from time-series prediction and forecasting domains: HMM, LSTMs, temporal convolution, and attention methods [6, 22,23,24]. The Temporal Convolutional Network (TCN) has emerged as a common selection for the majority of deep learning-based temporal gesture studies [12, 24, 25]. MS-TCN and MS-TCN++ [12, 25] are the most relevant instances of this family of temporal modelling methods, as they were specifically designed for temporal action segmentation, and have been used in several surgical workflow analysis tasks [26, 27]. Our objective is to test the effectiveness of differing image encoders and not the prediction head architectures; thus, we use a generic MS-TCN++ implementation throughout the experiments.
Feature extraction has become an essential part of deep learning systems [28], particularly in computer vision [29]. Initially performed by fully convolutional architectures trained in a supervised end-to-end schema, recent image encoders are typified by extended pre-training [30] using proxy tasks and/or self-supervised, contrastive methods. In surgical video, standard methods include Inflated-3D [2], 3DCNN [3] and 3DResNet [4]; these methods fall into the category of fully convolutional supervised end-to-end trained encoders. Bridge-Prompt, which uses the pre-trained CLIP [29] vision-text joint embedding model, has been proposed as a generic video encoder. The objective of this paper is to measure both bridge-prompt’s performance on the standard surgical gesture task and its zero-shot generalization ability.
Zero-shot learning
In traditional learning, models learn from labelled samples for each class and make predictions on previously unseen samples of the same classes. In zero-shot learning, models are trained on a subset of classes and tested on a different set of classes without any overlap [9]. This is especially crucial in surgical robot videos where collecting annotated gestures for every possible class is impractical or expensive, leading to a vast amount of surgical video being unlabelled.
Method
This section describes the two parts of our model: (1) the video encoder (bridge-prompt) and (2) the downstream gesture recognition model. The former we construct during prompt-tuning and is the focus of our empirical study, while the latter is used in evaluation of each of the encoders, including both the proposed encoder (bridge-prompt) and baseline encoders. Though neither is first proposed by this manuscript, we include descriptions of both, as understanding their nuances (or simplicity, in the case of the downstream model) is necessary for contextualizing experiments.
Bridge-prompt prompt-tuning architecture
Bridge-Prompt is a training protocol for constructing high-quality image encoders for sequential labelling of video frames. It starts with an image-text joint encoder, for which the standard model is the CLIP model [29]. CLIP is in turn based on a ViT-B/16 vision transformer trained on three large natural image datasets [31,32,33], alongside a text transformer analogous to GPT-2 [34]. In the CLIP training protocol, these two models were trained to have matching encodings, so that from either the image or the text one could predict the other. Bridge-Prompt starts at this pre-trained CLIP state and prescribes additional video sequence-based training. This fine-tuning to surgical video (and to the particular surgical gesture labels) is the first phase of our empirical work (Fig. 1).
We follow the Bridge-Prompt protocol and first split videos into sub-videos with a fixed number of frames, but possibly at different sampling rates. All resulting sub-videos have the same number of frames, alongside a label for each frame (which may be missing/undefined). Every defined label may also have a text description, e.g. “orienting needle” or “pulling suture with left hand”, though this may also be left undefined, or, as we experiment with, replaced with a categorical placeholder.
For each sub-video four text prompts are constructed from the sub-video’s labels:
1. Statistical text prompt: “this video contains K actions in total”
2. Ordinal text prompt: “this is the ith action in the video” (defined for each distinct label interval in the video)
3. Semantic text prompt: “{\(Ord_i\)}, the person is performing {Gesture i text description}” where \(Ord_i\) refers to “Firstly” (Footnote 2), “Secondly”, etc.
4. Integrated text prompt: the concatenation of all of the semantic text prompts.
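The four prompt types can be illustrated with a minimal Python sketch. It is not the released implementation: the function names, the ordinal word lists, the use of a digit for K, and the single-space separator for the integrated prompt are all assumptions made for illustration.

```python
from itertools import groupby

# Assumed word lists; the text above only specifies "Firstly", "Secondly", etc.
ORDINAL_ADJ = ["first", "second", "third", "fourth",
               "fifth", "sixth", "seventh", "eighth"]
ORDINAL_ADV = ["Firstly", "Secondly", "Thirdly", "Fourthly",
               "Fifthly", "Sixthly", "Seventhly", "Eighthly"]


def label_intervals(frame_labels):
    """Collapse per-frame labels into the ordered list of distinct label intervals.

    Frames with an undefined label (None) do not form an interval.
    """
    return [label for label, _ in groupby(frame_labels) if label is not None]


def build_prompts(frame_labels, descriptions):
    """Build the four prompt types for one sub-video.

    frame_labels: per-frame gesture ids (None where the label is undefined).
    descriptions: hypothetical mapping from gesture id to its text description.
    """
    gestures = label_intervals(frame_labels)
    statistical = f"this video contains {len(gestures)} actions in total"
    ordinal = [f"this is the {ORDINAL_ADJ[i]} action in the video"
               for i in range(len(gestures))]
    semantic = [f"{ORDINAL_ADV[i]}, the person is performing {descriptions[g]}"
                for i, g in enumerate(gestures)]
    # Assumed separator: the text only says the semantic prompts are concatenated.
    integrated = " ".join(semantic)
    return {"statistical": statistical, "ordinal": ordinal,
            "semantic": semantic, "integrated": integrated}


if __name__ == "__main__":
    descriptions = {"A": "orienting needle", "B": "pulling suture with left hand"}
    prompts = build_prompts(["A", "A", "A", "B", "B", None], descriptions)
    for name, value in prompts.items():
        print(name, "->", value)
```

Passing categorical placeholders such as “Gesture 9” as the values of `descriptions` would yield prompts without semantic text, which is one way to express the placeholder variant mentioned above.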
These prompts are then sent to the text encoder, to form \(z_{\textrm{stat}}\), \(z_{\textrm{ord}}\), \(z_{\textrm{sem}}^k\) and \(z_{\textrm{int}}\), respectively, where k indexes the k-th gesture in a video clip.
Each image frame \(x_t\) is passed through the initial image encoder f; this is the encoder that will be reused for the downstream surgical gesture recognition task. However, for fine-tuning only, the encodings \(f(x_t)\) are then processed as a sequence by a “fusion module”, which also receives as input the ordinal text prompts (\(z_{\textrm{ord}}\)) and summary statistic frame-level indicators (count tokens, split indicators, and length/position indicators). The outputs of the fusion module (the “fusion encodings”) include \(z_c^k\) for each gesture, the mean-pooled \(\bar{z}_c\), and a separate count embedding \(z_{\textrm{count}}\). A count embedding \(z_{\textrm{count}}^k\) is output for the sub-clip of each gesture, but only the mean-pooled aggregate \(\bar{z}_{\textrm{count}}\) is used. These encodings are the focus of the loss components which drive the contrastive fine-tuning.
Contrastive pre-training losses
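For two vectors \(z_x,z_y\) in the same space, the cosine similarity is
$$s(z_x,z_y) = \frac{z_x^{\top } z_y}{\Vert z_x\Vert \,\Vert z_y\Vert }. \tag{1}$$
We can then define a batch similarity matrix from sets \(Z_x = \{z_{x,b}\}\) and \(Z_y = \{z_{y,b}\}\)
$$\text {S}(Z_x,Z_y) = \big [\, s(z_{x,b},z_{y,b'}) \,\big ]_{b,b'=1}^{B} \in \mathbb {R}^{B\times B}, \tag{2}$$
where b denotes the sub-video (batch) index and B is the batch size. We construct three different similarity matrices: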
\(S_{\textrm{sem}}^k =\text {S}(Z_{c}^k,Z_{\textrm{sem}}^k)\), the similarity between the frame-wise encodings and the semantic text prompt embeddings; \(S_{\textrm{int}} = \text {S}(\bar{Z}_c,Z_{\textrm{int}})\), the similarity between the mean-pooled embeddings and the integrated text prompt embeddings; and \(S_{\textrm{stat}} = \text {S}(\bar{Z}_{\textrm{count}},Z_{\textrm{stat}})\), the similarity between the mean-pooled count embeddings and the statistical text prompt embeddings. After computing each similarity matrix, soft-max is applied row-wise and column-wise to form the text-wise and clip-wise normalized matrices, respectively; we denote the normalized matrices \(\bar{S}_{\textrm{sem}},\bar{S}_{\textrm{int}}\), and \(\bar{S}_{\textrm{stat}}\).
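The batch similarity matrix and its two soft-max normalizations can be sketched in plain Python as below. This is an illustrative sketch only: the function names are hypothetical, no temperature or logit scaling is applied (CLIP-style models usually include one), and embeddings are represented as plain lists of floats rather than tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(Zx, Zy):
    """B x B matrix whose (b, b') entry compares clip b with text b'."""
    return [[cosine(zx, zy) for zy in Zy] for zx in Zx]


def softmax(values):
    """Numerically stable soft-max of a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def row_softmax(S):
    """Normalize each row (each clip against all texts in the batch)."""
    return [softmax(row) for row in S]


def col_softmax(S):
    """Normalize each column (each text against all clips in the batch)."""
    cols = [softmax(list(col)) for col in zip(*S)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    clip_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy clip embeddings
    text_embeddings = [[1.0, 0.1], [0.2, 1.0]]  # toy text embeddings
    S = similarity_matrix(clip_embeddings, text_embeddings)
    print(row_softmax(S))
    print(col_softmax(S))
```

Matching clip-text pairs sit on the diagonal of `S`, so after either normalization a well-trained encoder should place most of the probability mass there.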
Within each batch, matching image-text pairs are taken as positive contrastive pairs, while mismatched pairs (i.e. videos paired with label text from a different video) are taken as negative contrastive pairs; that is, in our contrastive learning problem we optimize the normalized similarity matrices towards the identity matrix. Towards this end we define three losses, one for each of the three matrices:
where the (generalized KL) divergence D is defined for square matrices of matching dimension
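As a hedged sketch of how these losses could be written (not necessarily the exact form used in our implementation), consider the following. The superscripts “text” and “clip” for the two soft-max normalizations, the argument order of D, the equal weighting of the two terms, and the convention \(0 \log 0 = 0\) are assumptions of this sketch; for the semantic loss the same expression would apply to each gesture index k.

```latex
% One loss per normalized similarity matrix, with target I (the B x B identity):
\mathcal{L}_{\bullet}
  = \tfrac{1}{2}\Big[ D\big(I \,\|\, \bar{S}_{\bullet}^{\text{text}}\big)
                    + D\big(I \,\|\, \bar{S}_{\bullet}^{\text{clip}}\big) \Big],
  \qquad \bullet \in \{\text{sem}, \text{int}, \text{stat}\}

% Generalized KL divergence between square matrices of matching dimension:
D(P \,\|\, Q)
  = \sum_{i=1}^{B} \sum_{j=1}^{B}
    \Big( P_{ij} \log \frac{P_{ij}}{Q_{ij}} - P_{ij} + Q_{ij} \Big)

% If Q is row- or column-stochastic, then \sum_{ij} Q_{ij} = B = \sum_{ij} I_{ij},
% so the linear terms cancel and only the diagonal contributes:
D(I \,\|\, Q) = -\sum_{i=1}^{B} \log Q_{ii}
```

Under these assumptions each loss reduces to a cross-entropy on the diagonal of the normalized similarity matrix, which is minimized when each clip is matched to its own text prompt and vice versa.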
Gesture recognition model
To measure the effectiveness of the prompt-based video encoder, during evaluation we freeze the weights of the video encoder and compute frame-wise visual embeddings. We then train a predictive model for gesture recognition on those fixed embeddings. Our chosen downstream predictive model is MS-TCN++ [12]. Temporal convolutional networks have become a common choice in action segmentation, and MS-TCN++ is a refinement of the original MS-TCN. MS-TCN++ uses a simple two-stage training and a slightly modified convolutional configuration with dilations. We avoid deeper or more nuanced architectures (such as those incorporating attention, or longer context windows) to ensure that performance is due to the quality of the features and not the complexity of the classifier itself.
Experiments
Datasets, implementation, training and evaluation
Datasets We demonstrate Bridge-Prompt and baseline methods on two standard datasets: the JHU-ISI gesture and skill assessment working set (JIGSAWS) [11], which is composed of endoscopic video of suturing and knot-tying in a phantom environment, and robot-assisted radical prostatectomies (RARP-45) [6], which is composed of endoscopic video recordings of prostatectomies. Both datasets are collected from 8 surgeons with varying skill levels using the da Vinci Surgical System (dVSS, Intuitive Surgical, Sunnyvale, CA, USA) and are annotated for multiple gestures at the image-frame level: 15 gestures at 30 Hz for JIGSAWS and 8 gestures at 60 Hz for RARP-45. JIGSAWS additionally has robotic kinematic recordings, but we do not use these data in Bridge-Prompt trained methods. We also omit the JIGSAWS needle passing task due to data quality.
Implementation We implement multiple Bridge-Prompt variants as well as two baseline image encoders (3DResNet [4] and I3D [2]) in PyTorch. Each Bridge-Prompt implementation is composed of a backbone image encoder (either the default ViT-B/16 [35] or ResNet-50 [36]) and a text encoder analogous to GPT-2 [34], which is discarded after training. These backbones are initialized using standard pre-trained weight-sets (Footnote 3). In the prompt-tuning phase, the Bridge-Prompt encoders are then prompt-tuned using the Adam optimizer to minimize the sum of the losses in Eq. 3.
After prompt-tuning we freeze the weights of each image encoder and then train a standard prediction network for the supervised task (MS-TCN++ [12]). This second directly supervised phase we call the supervised training phase. The supervised training phase has its own hold-out test set, and it is on this hold-out that we report performance metrics. For JIGSAWS we use leave-one-user-out (LOUO) [11] cross-validation, and for RARP-45 we choose 10 videos (out of 36 total) as the test set.
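The leave-one-user-out scheme can be sketched as follows. The function name and the `(video_id, user_id)` representation are hypothetical, and the video identifiers in the example are illustrative rather than an exact listing of the dataset.

```python
def louo_splits(videos):
    """Yield one (held_out_user, train_ids, test_ids) split per user.

    videos: list of (video_id, user_id) pairs.
    All videos of the held-out user form the test set; the rest form the training set.
    """
    users = sorted({user for _, user in videos})
    for held_out in users:
        train = [vid for vid, user in videos if user != held_out]
        test = [vid for vid, user in videos if user == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Illustrative identifiers only.
    videos = [("Suturing_B001", "B"), ("Suturing_B002", "B"),
              ("Suturing_C001", "C"), ("Suturing_D001", "D")]
    for held_out, train, test in louo_splits(videos):
        print(held_out, "train:", train, "test:", test)
```

Holding out whole users, rather than individual videos, keeps any surgeon-specific style out of the training set for that fold.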
Training samples Even though we are constructing frame-by-frame image encoders, their training requires multi-frame segments (video clips) and their corresponding sequences of gesture labels. Each video clip is sampled from the original video of JIGSAWS or RARP-45 in 16-frame windows at three separate temporal sampling rates (one sampled frame every 4/8/16 frames for JIGSAWS and every 6/15/30 frames for RARP-45). We resize each frame to \(224\times 224\) pixels. The input format for the video clips for all methods (Bridge-Prompt [10], I3D [2], 3DResNet [4]) is the same, and input sets are changed only by the stated experimental condition. For the label sequences from JIGSAWS, we provide two additional placeholder labels/prompts for the unlabelled frames at the beginning and end of each video: “Waiting and preparing for the surgery” for beginning frames and “Finishing the surgery” for ending frames.
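The clip sampling can be sketched as below. The function name is hypothetical, and the placement of windows (non-overlapping, starting at frame 0, with incomplete trailing windows dropped) is an assumption of this sketch; only the clip length of 16 and the sampling strides come from the description above.

```python
def sample_clips(num_frames, clip_len=16, strides=(4, 8, 16)):
    """Return lists of frame indices, one list per sampled clip.

    For each stride, a clip takes one frame every `stride` frames until it
    holds `clip_len` frames. Windows are tiled without overlap (assumption).
    """
    clips = []
    for stride in strides:
        last_offset = (clip_len - 1) * stride  # offset of the final frame in a clip
        step = clip_len * stride               # distance between window starts
        for start in range(0, num_frames - last_offset, step):
            clips.append([start + i * stride for i in range(clip_len)])
    return clips


if __name__ == "__main__":
    clips = sample_clips(256)  # JIGSAWS-style strides 4/8/16
    print(len(clips), "clips; first clip starts with", clips[0][:4])
    rarp_clips = sample_clips(1000, strides=(6, 15, 30))  # RARP-45-style strides
    print(len(rarp_clips), "clips")
```

Each returned index list would then be paired with the gesture labels of the same frames to form one training sample.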
Training time All experiments were run on NVIDIA A40 GPUs or NVIDIA A5000 GPUs. The prompt-tuning phase for all Bridge-Prompt variants was conducted on two GPUs; otherwise, only a single GPU was used. In JIGSAWS, prompt-tuning for 50 epochs for each tested variant of Bridge-Prompt takes approximately 8 h using two A40 GPUs. It takes 5 min to train the MS-TCN++ during the supervised training phase, using pre-extracted image encodings.
Performance metrics We assess the outcomes using five standard evaluation metrics, Accuracy (Acc.), Edit Distance, and F1@{10, 25, 50} [24]. For JIGSAWS we condition this on pre-training task (Knot Tying or Suturing), and for zero-shot cases we also condition on each unseen label.
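For concreteness, the segment-level metrics can be sketched in plain Python as below. This follows the commonly used definitions of the segmental edit score and F1@k from the action segmentation literature [24]; the function names are hypothetical, and the sketch assumes non-empty, equal-length prediction and ground-truth sequences.

```python
def segments(labels):
    """Collapse frame labels into (label, start, end) runs, with end exclusive."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs


def accuracy(pred, true):
    """Frame-wise accuracy in percent."""
    return 100.0 * sum(p == t for p, t in zip(pred, true)) / len(true)


def edit_score(pred, true):
    """Normalized Levenshtein distance between segment label sequences, as a score."""
    p = [s[0] for s in segments(pred)]
    q = [s[0] for s in segments(true)]
    prev = list(range(len(q) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(q)
        for j in range(1, len(q) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (p[i - 1] != q[j - 1]))
        prev = cur
    return 100.0 * (1.0 - prev[-1] / max(len(p), len(q)))


def f1_at_k(pred, true, overlap):
    """Segmental F1 in percent; a predicted segment is a true positive if its
    IoU with an unmatched same-label ground-truth segment is >= overlap."""
    p_segs, t_segs = segments(pred), segments(true)
    used = [False] * len(t_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(t_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= overlap and not used[best_j]:
            tp += 1
            used[best_j] = True
        else:
            fp += 1
    fn = used.count(False)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    true = [0] * 10 + [1] * 10
    pred = [0] * 10 + [2] * 2 + [1] * 8
    print(accuracy(pred, true), edit_score(pred, true),
          [f1_at_k(pred, true, k) for k in (0.10, 0.25, 0.50)])
```

The edit score penalizes out-of-order or spurious segments regardless of their duration, while F1@k penalizes over-segmentation but tolerates small boundary shifts.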
Experimental conditions
We first measure the quality of the features learned by the various encoders operating under normal (non-zero-shot) conditions. Results are reported in Tables 1 and 2. These results show that Bridge-Prompt improves gesture recognition performance as measured by all but one performance metric. Further, due to the similarity between Bridge-Prompt performance with either ResNet50 or ViT backbones, this performance gain does not appear to be due to the transformer architecture of ViT. However, ResNet50 backbones appear to be slightly less stable in RARP-45 training; across multiple runs we experienced exploding gradients, and the method failed to converge. Tuning the learning rate might resolve this issue, but we did not have the resources to tune that parameter. We include JIGSAWS results reported in van Amsterdam et al. [6] for contextualization (Footnote 4), as we believe this to be a state-of-the-art contemporary system utilizing all available data streams (both visual and kinematic), and (presumably) many model architecture optimizations and procedure refinements.
Zero-shot capability We measure the efficacy of the Bridge-Prompt image encoder for describing unseen gestures, i.e. zero-shot generalization capability. This is done by selectively training encoders using only subsets of the gestures: for JIGSAWS, we test subsets with only gestures 1–5, only gestures 1–10, and no gestures at all (the “pure CLIP [29] encoder”), presenting results in Table 3. We disaggregate performance by task, gesture and model in Table 5. The 1–10 gesture model necessarily contains 6 and 8, and thus performance is not reported. For RARP-45 we test subsets with only gestures 1–3 and gestures 1–6, presenting results in Table 4. Per-gesture performance disaggregation is provided as a table in the Appendix. For JIGSAWS, we also trained video encoders on one task but evaluated them on the other (“cross-task”), and we report these results at the bottom of Table 3. This experiment is not possible in RARP-45, as it only has one task. In general we find that Bridge-Prompt prompt-tuning with only a subset of gestures, or even on a different task, still provides improvement for most performance measures. This, to us, indicates that Bridge-Prompt has relatively high zero-shot capability.
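One way to simulate the restricted label sets is sketched below. This sketch assumes that, during prompt-tuning only, frames whose gesture lies outside the seen subset are treated as unlabelled (set to `None`); dropping such frames entirely would be another reasonable reading. The function names and the integer gesture ids are hypothetical.

```python
def restrict_labels(frame_labels, seen):
    """Hide gestures outside `seen` by marking their frames as unlabelled (None).

    Intended for the prompt-tuning phase only; the supervised training phase
    would still use the full label set.
    """
    seen = set(seen)
    return [g if g in seen else None for g in frame_labels]


def unseen_gestures(all_gestures, seen):
    """Gestures that the encoder never observes a label for during prompt-tuning."""
    return sorted(set(all_gestures) - set(seen))


if __name__ == "__main__":
    frame_labels = [1, 1, 2, 2, 6, 6, 8, 3]  # toy per-frame gesture ids
    seen = range(1, 6)                        # the "gestures 1-5" condition
    print(restrict_labels(frame_labels, seen))
    print(unseen_gestures(frame_labels, seen))
```

Zero-shot performance would then be read off the downstream model's predictions on exactly the gestures returned by `unseen_gestures`.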
Text description ablation Finally, we show evidence that text descriptions provide negligible benefit for bridge-prompt encodings. We measured the value of gesture text descriptions by ablating them to simple “Gesture Index” categorical descriptors. For example, gesture 9 could be described as “using right hand to help tighten suture” or as “Gesture 9”. Results from these experiments are included in Tables 3, 4 and 5.
Conclusion and discussion
In this paper we have shown that the Bridge-Prompt framework provides both cutting-edge gesture-prediction performance for the standard within-task paradigm and strong zero-shot performance on unseen gestures. We believe that this latter case will be essential for any eventual surgical support system. The vocabulary of gestures is too large to learn purely from databases of supervised annotated cases; we should instead plan for situations with weak supervision and novel gestures in deployment. While the Bridge-Prompt framework may not be a component of that eventual system, we believe it makes a significant step towards such a system by demonstrating zero-shot capacity.
Notes
2. The original Bridge-Prompt paper uses “Firstly” instead of “First”, and we do not modify it here.
3. Both Bridge-Prompt and 3DResNet methods prescribe the use of weights from another task and training phase before the prompt-tuning phase.
4. Van Amsterdam et al. [6] report results for models that include input data from both video and robotic kinematic streams and thus are not entirely comparable for selecting video encoders. Moreover, the particular architecture, training details and data split cannot be reproduced without access to their codebase. We exclude their RARP-45 reported values as they evaluate on all 45 videos, but only 36 videos in RARP-45 are publicly available.
References
Amsterdam B, Clarkson MJ, Stoyanov D (2021) Gesture recognition in robotic surgery: a review. IEEE Trans Biomed Eng 68(6):2021–2035
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
Hara K, Kataoka H, Satoh Y (2017) Learning spatio-temporal features with 3D residual networks for action recognition. In: Proceedings of the IEEE international conference on computer vision workshops, pp 3154–3160
Zhang J, Nie Y, Lyu Y, Li H, Chang J, Yang X, Zhang JJ (2020) Symmetric dilated convolution for surgical gesture recognition. In: Medical image computing and computer assisted intervention—MICCAI 2020: 23rd international conference, Lima, Peru, October 4–8, 2020, proceedings, Part III 23, pp 409–418. Springer
Van Amsterdam B, Funke I, Edwards E, Speidel S, Collins J, Sridhar A, Kelly J, Clarkson MJ, Stoyanov D (2022) Gesture recognition in robotic surgery with multimodal attention. IEEE Trans Med Imaging 41(7):1677–1687
Zhang J, Nie Y, Lyu Y, Yang X, Chang J, Zhang JJ (2021) SD-Net: joint surgical gesture recognition and skill assessment. Int J Comput Assist Radiol Surg 16:1675–1682
Goldbraikh A, Avisdris N, Pugh CM, Laufer S (2022) Bounded future MS-TCN++ for surgical gesture recognition. In: European conference on computer vision, pp 406–421. Springer
Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: settings, methods, and applications. ACM Trans Intell Syst Technol TIST 10(2):1–37
Li M, Chen L, Duan Y, Hu Z, Feng J, Zhou J, Lu J (2022) Bridge-prompt: towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19880–19889
Gao Y, Vedula SS, Reiley CE, Ahmidi N, Varadarajan B, Lin HC, Tao L, Zappella L, Béjar B, Yuh DD et al (2014) JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In: MICCAI workshop: M2cai, vol 3
Li S, Farha YA, Liu Y, Cheng M-M, Gall J (2020) MS-TCN++: multi-stage temporal convolutional network for action segmentation. IEEE Trans Pattern Anal Mach Intell 45:6647–6658
DiPietro R, Lea C, Malpani A, Ahmidi N, Vedula SS, Lee GI, Lee MR, Hager GD (2016) Recognizing surgical activities with recurrent neural networks. In: Medical image computing and computer-assisted intervention—MICCAI 2016: 19th international conference, Athens, Greece, October 17–21, 2016, proceedings, Part I 19. Springer, pp 551–558
Tao L, Zappella L, Hager GD, Vidal R (2013) Surgical gesture segmentation and recognition. In: Medical image computing and computer-assisted intervention—MICCAI 2013: 16th international conference, Nagoya, Japan, September 22–26, 2013, proceedings, Part III 16. Springer, pp 339–346
Reiley CE, Lin HC, Varadarajan B, Vagvolgyi B, Khudanpur S, Yuh DD, Hager GD (2008) Automatic recognition of surgical motions using statistical modeling for capturing variability. In: MMVR, vol 132, pp 396–401
Funke I, Bodenstedt S, Oehme F, Bechtolsheim F, Weitz J, Speidel S (2019) Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 467–475
Lea C, Reiter A, Vidal R, Hager GD (2016) Segmental spatiotemporal CNNs for fine-grained action segmentation. In: Computer vision—ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, Part III 14. Springer, pp 36–52
Zappella L, Béjar B, Hager G, Vidal R (2013) Surgical gesture classification from video and kinematic data. Med Image Anal 17(7):732–745
Long Y, Wu JY, Lu B, Jin Y, Unberath M, Liu Y-H, Heng PA, Dou Q (2021) Relational graph learning on visual and kinematics embeddings for accurate gesture recognition in robotic surgery. In: 2021 IEEE international conference on robotics and automation (ICRA). IEEE, pp 13346–13353
Qin Y, Feyzabadi S, Allan M, Burdick JW, Azizian M (2020) davincinet: Joint prediction of motion and surgical state in robot-assisted surgery. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 2921–2928
Wu JY, Tamhane A, Kazanzides P, Unberath M (2021) Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery. Int J Comput Assist Radiol Surg 16:779–787
Tao L, Elhamifar E, Khudanpur S, Hager GD, Vidal R (2012) Sparse hidden Markov models for surgical gesture classification and skill evaluation. In: Information processing in computer-assisted interventions: third international conference, IPCAI 2012, Pisa, Italy, June 27, 2012. Proceedings 3. Springer, pp 167–177
DiPietro R, Ahmidi N, Malpani A, Waldram M, Lee GI, Lee MR, Vedula SS, Hager GD (2019) Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks. Int J Comput Assist Radiol Surg 14(11):2005–2020
Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 156–165
Farha YA, Gall J (2019) MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3575–3584
Yuan K, Holden M, Gao S, Lee W (2022) Anticipation for surgical workflow through instrument interaction and recognized signals. Med Image Anal 82:102611
Czempiel T, Paschali M, Keicher M, Simson W, Feussner H, Kim ST, Navab N (2020) Tecno: surgical phase recognition with multi-stage temporal convolutional networks. In: Medical image computing and computer assisted intervention—MICCAI 2020: 23rd international conference, Lima, Peru, October 4–8, 2020, proceedings, part III 23. Springer, pp 343–352
Bengio Y, Courville AC, Vincent P (2012) Unsupervised feature learning and deep learning: a review and new perspectives. CoRR arXiv:1206.5538
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp 1597–1607
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, Part V 13. Springer, pp 740–755
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123:32–73
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L-J (2016) Yfcc100m: The new data in multimedia research. Commun ACM 59(2):64–73
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2021) An image is worth \(16 \times 16\) words: transformers for image recognition at scale. In: International conference on learning representations
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Acknowledgements
This research was supported in part by NSF 2321684 and the Wellcome LEAP SAVE program. The authors declare that they have no conflict of interest. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors. This article does not contain patient data.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rao, M., Qin, Y., Kolouri, S. et al. Zero-shot prompt-based video encoder for surgical gesture recognition. Int J CARS (2024). https://doi.org/10.1007/s11548-024-03257-1
DOI: https://doi.org/10.1007/s11548-024-03257-1