SIG-Former: monocular surgical instruction generation with transformers

Purpose: Automatic surgical instruction generation is a crucial component of intra-operative surgical assistance. However, understanding and translating surgical activities into human-like sentences is particularly challenging due to the complexity of the surgical environment and the modal gap between images and natural language. To this end, we introduce SIG-Former, a transformer-backboned generation network that predicts surgical instructions from monocular RGB images. Methods: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, followed by transformer attention blocks that model its visual representation, text embedding and visual–textual relational features. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach to directly optimize the CIDEr score after regular training. Results: We validate our proposed method on the DAISI dataset, which contains 290 clinical procedures from diverse medical disciplines. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising performance in both quantitative and qualitative evaluations. Conclusion: Our experiments demonstrate that SIG-Former is capable of mapping dependencies between visual features and textual information. Surgical instruction generation is, however, still at a preliminary stage. Future work includes collecting larger clinical datasets, annotating more reference instructions per image and preparing pre-trained models on medical images.


Introduction
With the increasing demands of surgical training, intra-operative surgical assistance and decision support in modern clinical rooms, a digital context-aware surgical system is an essential component of next-generation surgery. It aims to leverage the information available inside operating rooms to assist clinicians during surgical practice. Among the related techniques, surgical instruction generation is the process of generating human-like guidance from surgical views. It is particularly important when an emergency situation is detected or on-site mentoring is unavailable. However, the complexity of the surgical environment and the modal gap between images and natural language make this task challenging. Moreover, in related generation tasks the output often follows a few fixed sentence templates, each focusing on one specific topic, whereas in surgical instruction generation both the surgical content and the way of describing it are diverse.
To the best of our knowledge, the only prior work on surgical instruction generation is [20]. In that study, the authors collect the Database for AI Surgical Instruction (DAISI) and build a baseline model with a bidirectional recurrent neural network (RNN). This work, however, has two main limitations. On the one hand, although RNNs are designed to memorize historical information and generate sequences of arbitrary length, they have limited representational ability and are bottlenecked by vanishing and exploding gradients [17]. On the other hand, evaluating the quality of generated sentences from different points of view is important for verifying their correctness; nevertheless, they use the BLEU score [16] as the only evaluation metric, which is insufficient.
In this paper, we propose SIG-Former, a transformer-backboned encoder-decoder architecture with self-critical reinforcement optimization, to generate instruction texts from surgical images. Our approach is based on two insights: (1) With self-attention and multi-head attention mechanisms, transformers achieve strong performance in sequence generation, for example in machine translation and natural-domain image captioning [7,23]. (2) Sequence generation models are often trained in teacher-forcing mode [18]: during training, the model uses the ground-truth token to predict the next token, while at inference it uses the previously generated token. This discrepancy between training and inference causes accumulated errors.
Given a surgical image, we first extract its attentive feature map using a pre-trained ResNet-101 model [12]. We then feed the feature map into a transformer encoder to obtain a position-wise latent feature map. The transformer encoder-decoder attention module and decoder self-attention module further model the visual-textual dependency and token-wise textual information, respectively. Finally, after standard cross-entropy training for some epochs, we employ self-critical reinforcement learning to alleviate the aforementioned mismatch between training and inference.
We validate the effectiveness of our approach on the DAISI dataset. This work is an extended version of our MICCAI 2021 conference paper [27], in which we improve the training strategy by exploring the data distribution. The results demonstrate that our new training setup further improves the performance of SIG-Former in surgical instruction generation.

Methods
The architecture of SIG-Former is shown in Fig. 1. It consists of two parts: (1) surgical instruction generation with a transformer-based encoder-decoder model (see Sect. 2.1) and (2) self-critical reinforcement learning (see Sect. 2.2).

Surgical instruction generation with transformers
Rather than computing tokens sequentially in a recurrent style, the transformer [23] builds non-local relationships concurrently between different positions in a sequence. Our SIG-Former model has two components, a transformer encoder and a transformer decoder, each composed of a stack of attentive layers. The encoder takes feature maps extracted from surgical images as input, while the decoder outputs a sequence of tokens (i.e., the surgical instruction).
Given a surgical image, we first extract its feature map, with a dimension of 14 × 14 × 2048, using the last convolutional layer of a pre-trained ResNet-101 [12]. We reduce the feature map to 14 × 14 × 512 with a linear embedding layer followed by a ReLU activation layer and a dropout layer. We further flatten the feature map to the shape of 196 × 512 and feed this visual embedding as the input to the first transformer encoder layer.
The transformer encoder aims to build position-wise relationships between input image regions. It is composed of a stack of six identical attention modules, where each module consists of a multi-head self-attention layer followed by a feed-forward layer. For a given input X ∈ R^{N×D}, where N is the number of entries and D is the feature dimension, the attention layer first linearly converts the input into queries (Q = XW^Q, W^Q ∈ R^{D×D_k}), keys (K = XW^K, W^K ∈ R^{D×D_k}) and values (V = XW^V, W^V ∈ R^{D×D_v}). The scaled dot-product attention is then computed by

Attention(Q, K, V) = softmax(QK^T / √D_k) V,

where D_k is the dimension of the queries and keys and D_v is the dimension of the values (D_k = D_v in our implementation). In order to jointly attend to different sub-spaces, multi-head attention is applied (#heads = 8). The outputs of the 8 heads are concatenated and multiplied by a learned projection matrix W^O:

MultiHead(Q, K, V) = Concat(head_1, ..., head_8) W^O.

The feed-forward layer then applies two position-wise linear transformations with a ReLU activation in between:

FFN(x) = W_2 max(0, W_1 x + b_1) + b_2,

where W_1, b_1 and W_2, b_2 are the learnable weights and biases of the two MLP layers. The transformer decoder is likewise a stack of six identical attention modules, where each module has two multi-head attention layers (one for self-attention over words and one for cross-attention over the output of the last encoder layer) followed by one position-wise feed-forward layer. For a detailed explanation of the decoder, we refer readers to [23].

Self-critical reinforcement optimization
Sequence generation models usually suffer from two shortcomings: (1) During training, the model predicts the next word from the previous ground-truth word, while at inference it predicts the next word from the previously generated word. This discrepancy is called exposure bias [18], because the model is only exposed to the training data distribution rather than its own predictions. As a result, errors accumulate quickly if the initial predictions are wrong. (2) Another mismatch lies in the loss calculation. During training, the word-level cross-entropy loss is normally used to maximize the likelihood of the next correct word, whereas non-differentiable language evaluation metrics (e.g., BLEU or CIDEr) are applied to evaluate the model at test time.
In order to eliminate the above discrepancies, following standard practice in natural-domain image captioning, we first train our model with word-level cross-entropy and then fine-tune it using a self-critical reinforcement learning approach. In the reinforcement fine-tuning step, we use the predicted words as input to generate the next word during training and directly optimize the model with the CIDEr score as the reward, because it correlates well with human judgment [24].
Following the details from [19], given the model parameters θ, the policy p_θ and a sampled sentence w^s, the gradient for one sample can be expressed as

∇_θ L(θ) = −(r(w^s) − b) ∇_θ log p_θ(w^s),

where r(·) is the reward function and b = r(ŵ) is the self-critical baseline, obtained by decoding with the current model under the inference algorithm applied at test time. As a result, the update increases the probability of samples with a reward above the baseline and penalizes those below it. For a detailed derivation, please refer to [19].
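Under this formulation, one gradient step reduces to a simple surrogate loss on the sampled sequence's log-probability, weighted by the advantage over the greedy baseline. The sketch below assumes per-sequence CIDEr rewards are computed externally (e.g., with an off-the-shelf captioning evaluation package); the function name and signature are illustrative.

```python
import torch

def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss whose gradient is -(r(w^s) - b) * grad log p(w^s).

    sample_log_probs: (B,) summed log-probabilities of the sampled sentences
    sample_reward:    (B,) CIDEr score of each sampled sentence, r(w^s)
    greedy_reward:    (B,) CIDEr score of the greedy-decoded sentence, b = r(w_hat)
    """
    advantage = sample_reward - greedy_reward
    # detach(): the reward is a constant w.r.t. the model parameters.
    # Minimizing this loss raises the probability of samples that beat
    # the greedy baseline and lowers that of samples that fall below it.
    return -(advantage.detach() * sample_log_probs).mean()

loss = self_critical_loss(
    torch.tensor([-1.0]),  # log p(w^s)
    torch.tensor([2.0]),   # r(w^s)
    torch.tensor([1.0]),   # b = r(w_hat)
)
```

A positive advantage with a negative log-probability yields a positive loss, so gradient descent pushes the sampled sentence's probability up, exactly as the sign of the analytic gradient dictates.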

Experimental details
Dataset. We evaluate our proposed method on the DAISI dataset [20]. The dataset contains 17,256 color images of 290 medical procedures from 20 clinical disciplines, including laparoscopic inguinal hernia repair, open cricothyroidotomy, laparoscopic sleeve gastrectomy, etc. The dataset is available upon request (https://engineering.purdue.edu/starproj/). Each procedure contains surgical images with corresponding text descriptions of how to complete a step of the procedure. We further clean the dataset by deleting irrelevant and noisy images and captions, e.g., images and captions that only describe the surgeon performing a particular procedure. This leaves 16,413 images with one caption per image. Rather than splitting the images randomly as in [27], we split the data intra-procedurally: inside each procedure, we randomly assign 80% of the images for training, 10% for validation and 10% for testing. We finally have 13,035 training images, 1618 validation images and 1760 test images.

Text preprocessing. Text preprocessing is a significant part of any natural-language-related task: the unstructured raw text needs to be converted into a more digestible and predictable format so that the model can learn meaningful features and perform better. We follow four steps to clean the raw text: (1) convert all words into lower case; (2) expand abbreviations, including English contractions (e.g., "aren't" to "are not") and medical abbreviations (e.g., "m." to "muscle"); (3) remove all numbers, whitespace and punctuation; (4) tokenize the sentence into words. Moreover, we set the sentence length threshold to 16 and replace words that appear fewer than 5 times in the dataset with "UNK", which yields a vocabulary of 2212 words. The top-20 word distribution and sentence length distribution are shown in Fig. 2.
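The four preprocessing steps above can be sketched as a single function. The abbreviation tables here contain only the examples from the text; the paper's full expansion lists are larger, and the exact tokenizer used by the authors is not specified.

```python
import re

# Illustrative expansion tables (the paper's full abbreviation lists are larger).
CONTRACTIONS = {"aren't": "are not", "isn't": "is not", "don't": "do not"}
MED_ABBREV = {"m.": "muscle"}

def preprocess(sentence, vocab_counts=None, min_freq=5, max_len=16):
    s = sentence.lower()                                   # (1) lower-case
    for k, v in {**CONTRACTIONS, **MED_ABBREV}.items():    # (2) expand abbreviations
        s = s.replace(k, v)
    s = re.sub(r"[^a-z\s]", " ", s)                        # (3) drop numbers/punctuation
    tokens = s.split()[:max_len]                           # (4) tokenize, truncate to 16
    if vocab_counts is not None:                           # map rare words to UNK
        tokens = [t if vocab_counts.get(t, 0) >= min_freq else "UNK" for t in tokens]
    return tokens

tokens = preprocess("Aren't the M. fibers visible? 123")
```

Applying the rare-word filter over the cleaned corpus then yields the 2212-word vocabulary reported above.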

Implementation details
To comprehensively explore the performance of SIG-Former in surgical instruction generation, we additionally implement two LSTM-based generation models as baselines. We implement all methods using PyTorch and train them on two GeForce RTX 2080 Ti GPUs.
Our method. We extract the input feature map with the last convolutional layer of a pre-trained ResNet-101 [12], followed by a spatial adaptive max-pooling layer, and flatten it to 196 × 2048. For the standard cross-entropy training, we set the batch size to 16. The learning rate is initialized to 3 × 10^−4 and follows a learning-rate scheduling strategy with 20,000 warm-up steps. After 50 epochs of cross-entropy training, we employ the self-critical reinforcement strategy to optimize the CIDEr score with a fixed learning rate of 1 × 10^−5 for another 10 epochs. The batch size for this stage is set to five.

LSTM-based methods. We further provide two LSTM-based baselines for the surgical instruction task (a vanilla LSTM and an LSTM-based soft-attention model) similar to [25,26], as this task is fairly new and LSTM is a milestone architecture for sequence processing. We also use the last convolutional layer of a pre-trained ResNet-101 to extract visual features for these two baselines. For the vanilla LSTM, we apply average pooling to obtain a 2048-d feature; it is trained with an initial learning rate of 5 × 10^−4 and a batch size of 16 for 70 epochs using the cross-entropy loss.
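The warm-up schedule mentioned above can be sketched as follows. The paper only states the 3 × 10^−4 base rate and the 20,000 warm-up steps; the particular decay shape after warm-up (linear warm-up followed by inverse-square-root decay, the common transformer schedule) is our assumption.

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=20000):
    """Linear warm-up to base_lr over warmup_steps, then inverse-sqrt decay.

    The decay shape is an assumption; the paper specifies only the base
    learning rate and the number of warm-up steps.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    scale = min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    return base_lr * scale
```

Under this schedule the rate climbs linearly to 3 × 10^−4 at step 20,000 (e.g., it is half that at step 10,000) and decays thereafter, which is compatible with switching to the fixed 1 × 10^−5 rate for the reinforcement stage.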
The LSTM-based soft-attention model shares the same feature map (196 × 2048 dimension) with our SIG-Former model. For the cross-entropy training stage, we initialize the learning rate to 5 × 10 −4 and batch size to 16. For the reinforcement optimization step, the learning rate is fixed to 1 × 10 −5 with batch size at five.
All the models are optimized using the ADAM optimizer.

Comparison with the state of the arts
Since we clean the data by removing noisy and inappropriate image-text pairs, a new evaluation benchmark is required. We re-implement the state-of-the-art network Bi-RNN [20] for comparison, as their code is not publicly available. Following the details of Bi-RNN, we extract the 4096-d feature using a pre-trained VGG16 network [21]. The Bi-RNN model is trained for 50 epochs with a learning rate of 5 × 10^−4 and a batch size of 10. Rather than randomly splitting the data as in [27], we split the data intra-procedurally so that the model can acquire prior information for the test stage (see Sect. 3.1). We compare the performance of our method with the state-of-the-art and other baseline methods in Table 1. To verify the efficacy of reinforcement learning, we also ablate our method by removing it, denoted "Transformer only" in Table 1.
It can be seen that the Bi-RNN method shows relatively lower performance than the other proposed baselines on all evaluation metrics. For example, CIDEr [24] is an evaluation metric specifically designed for image captioning, based on the consensus between predicted instructions and reference descriptions. The CIDEr score of Bi-RNN is 30% lower than ours, which indicates the weakness of a simple RNN model in capturing the visual-textual relationship for instruction generation. Among the other three approaches, the transformer model with reinforcement learning outperforms all other methods on all evaluation metrics. The promising results of the transformer models indicate the robustness of the encoder-decoder attentive layers.
From another perspective, one intuitive difference between natural-domain image captioning and surgical instruction generation is that there are strong contextual relationships and temporal dependencies between images within the same type of surgical operation, which is particularly important for surgical content analysis. In Table 1, we see that the intra-procedure split further improves model performance: the scores of the transformer-only model are ≈ 7 points higher than with the random split. Since in the intra-procedure setting 80% of the images in each procedure are assigned to the training set and the rest to the validation and test sets, each model is equipped with more prior information from the training set. Figure 3 shows qualitative comparisons: we randomly select 9 images with predicted instructions from the best LSTM model and our SIG-Former.

Ablation study
To explore the effectiveness of each design in our method, we decompose our network into five configurations:

C1: LSTM only
C2: Soft-attention only
C3: Transformer only
C4: Soft-attention + self-critical reinforcement learning
C5: Transformer + self-critical reinforcement learning

The results are presented in Table 2, from which we observe the following.

C1 versus C2: We add soft-attention on top of the LSTM model so that each word position can sequentially attend to different image regions when making a prediction.

C1, C2 versus C3: Without using any sequence-aligned recurrent units, the transformer attention mechanism processes the sequence as a whole. Compared with the other baselines, the transformer-backboned framework improves performance on all evaluation metrics, which demonstrates the capability of transformers in processing multi-modal context.

C2 versus C4 and C3 versus C5: After the standard training stage with the cross-entropy loss, we use reinforcement learning to directly optimize the CIDEr score. From Table 2, we see that this optimization step improves not only the CIDEr score but also the results on the other metrics. The improvement is especially obvious for the transformer-only model.

Limitations and challenges
In this section, we discuss the limitations and challenges in the surgical instruction generation task.
1. Limited dataset scale. Surgical instruction generation is a multi-modal task that involves visual information, textual information and the dependencies between them, so its parameter space is much larger than that of single-modal tasks (e.g., classification or object detection). It requires a large amount of data to tune complex hyper-parameters and prevent overfitting. In the natural image domain, the COCO image captioning task [15] has more than 120k samples, while the DAISI dataset has fewer than 20k images. Furthermore, the comparisons in Table 1 show that contextual information contributes to model performance. However, the dataset contains only one sample for a few surgical procedures, so only spatial prior information can be learned. With a larger dataset, not only spatial information but also contextual temporal relationships could be learned to improve instruction generation.

2. No fine-grained supervision. In the natural image domain, models are usually pre-trained on a large dataset (e.g., COCO or ImageNet [9]) to support downstream tasks such as object detection and attribute feature identification. Nonetheless, such pre-trained models are difficult to apply in the medical domain, as labeling surgical images requires expert annotators.

3. One reference instruction per image. In reality, the content of the same image can be described in multiple ways. The COCO image captioning dataset [15] equips each image with 5 different references, while we have only one in the DAISI dataset, so an appropriate prediction may be penalized only because it differs from the single reference description.

Conclusion
In this paper, we propose a self-critical transformer, named SIG-Former, to generate surgical instructions from a monocular image. The network is composed of a transformer encoder to model visual features, a transformer decoder to model textual information and encoder-decoder attention to capture multi-modal dependencies. In addition, we use a reinforcement learning approach to alleviate the discrepancy between training and inference by directly optimizing the CIDEr metric. The performance of our method demonstrates the effectiveness of attention blocks in handling the multi-modal sequence-to-sequence problem.
Funding Open Access funding enabled and organized by Projekt DEAL.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent This article does not contain patient data.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.