Advancing surgical VQA with scene graph knowledge

Purpose The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with natural language capabilities is emerging as a necessity. Our work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question–condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. Methods First, we propose a surgical scene graph-based dataset, SSG-VQA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. We then propose SSG-VQA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module, which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Results Our comprehensive analysis shows that our SSG-VQA dataset provides a more complex, diverse, geometrically grounded, unbiased and surgical action-oriented dataset compared to existing surgical VQA datasets and SSG-VQA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. Conclusion We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. We point out that the bottleneck of the current surgical visual question–answer model lies in learning the encoded representation rather than decoding the sequence. 
Our SSG-VQA dataset provides a diagnostic benchmark to test the scene understanding and reasoning capabilities of the model. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-VQA. Supplementary Information The online version contains supplementary material available at 10.1007/s11548-024-03141-y.


Introduction
Surgical data science is a rapidly growing field that aims to streamline clinical workflows and enable the development of real-time intra-operative decision-support systems [16,18]. Recent advancements in surgical video analysis, such as surgical workflow phase recognition, fine-grained surgical action detection, and surgical semantic scene segmentation, show evidence of the progress [17,3,21]. However, the scope of these methods is limited because they mainly use visual-only data to perform classification or recognition tasks, thereby offering limited user interaction. The next generation of surgical data science applications also demands approaches operating at the crucial intersection of vision and language to offer intuitive user interaction during intra-operative surgical procedures. Surgical Visual Question Answering (VQA) is emerging as a notable solution in that direction: it aims to provide precise answers to user queries in natural language by analyzing a given surgical image [1,10,20,19].
Developing an effective surgical VQA system is inherently challenging because a typical surgical scene contains multiple anatomical structures and instruments connected through diverse spatial and action relationships. While a few works have explored VQA tasks in the surgical context [20,19], they are typically limited to datasets and models that ignore detailed scene knowledge. From the dataset perspective, one key challenge is the lack of a dataset with a vast set of question-answer pairs covering various aspects of the surgical scene. The current surgical VQA datasets are small and only consider simple scene information, e.g., object/action occurrence, as shown in Fig. 1. Moreover, these datasets contain question-answer pairs with significant question-condition bias, where answers can be derived from just the questions without performing any visual processing. This limits the utility of these datasets as surgical VQA benchmarks.
From the model perspective, the current surgical VQA architectures operate on the global visual representation of the surgical image, ignoring the detailed understanding of surgical scene knowledge. This can be detrimental, especially when object-level visual reasoning is essential to answer fine-grained questions. Our key contributions are therefore twofold: the introduction of a new surgical scene-aware VQA dataset called SSG-VQA and a novel surgical VQA model called SSG-VQA-Net.
The SSG-VQA dataset uses a semantic scene graph [14] as a suitable representation to generate diverse question-answer pairs. A semantic scene-graph representation provides scene knowledge by detecting objects and their attributes and by connecting the relationships and interactions between objects in the scene. To develop the surgical semantic scene graph, we train semantic segmentation and tool detection models on publicly available datasets [11,8] and apply these models to estimate object spatial relationships. We then estimate the surgical action relationships, i.e., <instrument, verb, target>, among the objects using the public CholecT45 dataset [17].
We then develop a surgery-specific question engine that ingests the surgical scene graph and manually designed question templates to produce a variety of question-answer pairs. Combining detailed surgical scene understanding with question templates helps us to generate question-answer pairs covering various aspects of the surgical scene, for example, fine-grained action recognition ("What is the action being performed on peritoneum?"), semantic scene reasoning ("What anatomy is at the top-mid of the frame?"), and surgical object attribute reasoning ("What is the name of the anatomy that is being retracted?"). Furthermore, we apply balancing and sampling strategies based on surgery-specific knowledge and class distribution to remove the questions that contain question-condition bias, e.g., "How many livers are in the frame?", which counts the number of certain anatomical structures. The overall pipeline is illustrated in Fig. 2.
Given a large-scale SSG-VQA dataset containing fine-grained surgical question-answer pairs, we propose a multi-modal surgical VQA model called SSG-VQA-Net. Existing surgical VQA models use a highly parameterized multi-modal transformer encoder to fuse the textual embeddings from a question and the patches from a global visual representation of a surgical image [20,19]. However, these patches do not contain object-wise information about the surgical scene and hence miss the geometric scene understanding. Our key idea is to exploit object-wise local features and fuse geometric scene information in the VQA model design. To enable this, we train a fast and lightweight object detector, YOLOv7 [22], on the bounding-box labels of the SSG-VQA dataset. The trained object detector allows us to extract object-wise local representations of the surgical scene objects using RoIAlign pooling [6]. Furthermore, we integrate the geometric spatial coordinates and class labels of detected bounding boxes into the VQA model by introducing a lightweight multi-modal transformer encoder named the Scene-embedded Interaction Module (SIM). SIM uses a scene graph of detected bounding boxes where each node contains the class label and bounding-box coordinate information. The scene graph is refined by cross-attention between the scene graph and the textual inputs, highlighting specific graph nodes correlated to the complex question query. These refined text-aware scene embeddings are then combined with the object-wise local representations of the surgical scene and the textual embeddings through a transformer encoder layer to generate an accurate response. Experimental results show that our method outperforms prior works while achieving a low parameter count. We summarize our main contributions as follows:
• We present a new surgical scene graph-based VQA dataset, SSG-VQA, providing complex, diverse, geometrically grounded, and surgical action-oriented question answers.
• We present a surgical VQA model, SSG-VQA-Net, utilizing a novel scene-aware feature extraction strategy to obtain state-of-the-art performance.

SSG-VQA dataset
In this section, we explain the SSG-VQA dataset generation process, which consists of creating surgical scene graphs, designing a question engine with diverse templates, and employing a sampling strategy to mitigate data imbalance and question-condition bias.

Scene graph generation
We build our SSG-VQA dataset using the publicly available CholecT45 [17], m2cai16-tool-locations [11] and CholecSeg8k [8] datasets. Specifically, we train a tool detection model [12] on m2cai16-tool-locations [11] and a semantic segmentation model [4] on CholecSeg8k [8] to extract bounding boxes of surgical objects, including surgical instruments and anatomies. Then, we build the surgical semantic scene graph using the detections. A surgical scene graph can be formulated as a set of nodes and edges, where the nodes represent surgical objects that contain a set of attributes, i.e., color, location, and type, and the edges represent the spatial and action relations among the objects. The spatial relations are calculated by comparing the centroids of objects, and the action relations are provided by the triplet annotations from CholecT45 [17]. Then, we use the generated scene graphs as input to a question engine, as described below, to generate diverse question-answer pairs. Note that to create a clean test set of question-answer pairs, we manually correct the bounding boxes and class labels of scene graphs in the test videos.
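The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `SceneObject` class, the 3x3 grid used for the location attribute, the relation names, the frame size, and the example boxes and colors are all assumptions made for the example. Only the ideas of attribute-carrying nodes, centroid-based spatial edges, and triplet-based action edges come from the text.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical container for one detected surgical object."""
    name: str    # e.g. "gallbladder", "hook"
    kind: str    # "instrument" or "anatomy"
    color: str
    box: tuple   # (x1, y1, x2, y2) in pixels

    @property
    def centroid(self):
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def grid_location(obj, width, height):
    """Map the centroid to a 3x3 grid cell such as 'top-mid' (assumed scheme)."""
    cx, cy = obj.centroid
    col = ("left", "mid", "right")[min(int(3 * cx / width), 2)]
    row = ("top", "mid", "bottom")[min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def spatial_relations(a, b):
    """Relations of `a` with respect to `b`, obtained by comparing centroids."""
    (ax, ay), (bx, by) = a.centroid, b.centroid
    rels = []
    if ax < bx:
        rels.append("left of")
    elif ax > bx:
        rels.append("right of")
    if ay < by:          # image y-axis grows downward
        rels.append("above")
    elif ay > by:
        rels.append("below")
    return rels


def build_scene_graph(objects, triplets, width, height):
    """Nodes carry attributes; edges carry spatial and action relations."""
    nodes = [{"name": o.name, "type": o.kind, "color": o.color,
              "location": grid_location(o, width, height)} for o in objects]
    edges = []
    for i, a in enumerate(objects):
        for j, b in enumerate(objects):
            if i != j:
                edges.extend((i, rel, j) for rel in spatial_relations(a, b))
    index = {o.name: i for i, o in enumerate(objects)}
    # Action edges from <instrument, verb, target> triplets.
    for instrument, verb, target in triplets:
        if instrument in index and target in index:
            edges.append((index[instrument], verb, index[target]))
    return {"nodes": nodes, "edges": edges}


objects = [
    SceneObject("hook", "instrument", "silver", (100, 200, 200, 300)),
    SceneObject("gallbladder", "anatomy", "yellow", (300, 50, 500, 250)),
]
graph = build_scene_graph(objects, [("hook", "dissect", "gallbladder")], 854, 480)
```

In this toy scene the spatial edges are fully connected between the two objects, and the single triplet adds one action edge from the instrument node to the anatomy node.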

Question engine
The question engine, which is responsible for generating diverse questions across various categories, requires two inputs: a scene graph and question templates. We use the CLEVR engine [13,10] and extend it to the surgical context. Specifically, the question engine can change the parameters of the question templates, conditioned on the surgical scene graph, to express diverse questions. For example, the question "what is the tool to the left of yellow anatomy?" can be formed from the template "what is the tool <R> <C> <L> <T>?" by replacing the parameters <R>, <C>, <L> and <T> with "to the left of", "yellow", "null" and "anatomy". The questions are parameterized by five parameters, namely <C> (color), <L> (location), <T> (type), <N> (name), and <R> (relationship). In total, there are 40 question templates containing different types of questions, such as querying object (e.g., "what is the name of instruments to the left of the gallbladder?"), querying attribute (e.g., "there is an object that is both to the left of the yellow thing and below the brown anatomy; what is its location?"), querying relation (e.g., "what is the action being performed?"), confirming existence (e.g., "is there a bipolar in the top-mid location?"), and counting (e.g., "how many instruments are in the bottom-right?"). Generated questions also fall into three categories depending on their complexity: zero-hop, one-hop, and single-and. Each requires different visual reasoning steps to resolve. Specifically, solving these three types of questions involves understanding the relations between zero, one, or two surgical objects, respectively. Examples from each category are provided in the supplementary material. The question engine allows us to freely control the complexity, length, and number of questions per image.
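The template substitution in the example above can be illustrated with a short Python sketch. The `fill_template` helper is hypothetical and is not part of the CLEVR engine; it only shows how a null parameter can be dropped when a template is instantiated, assuming placeholders of the form `<X>`.

```python
import re


def fill_template(template, params):
    """Replace <X> placeholders; parameters set to None (null) are dropped."""
    def substitute(match):
        value = params.get(match.group(1))
        return value if value else ""

    text = re.sub(r"<([A-Z])>", substitute, template)
    text = re.sub(r"\s+", " ", text)     # collapse gaps left by null parameters
    text = re.sub(r"\s+\?", "?", text)   # no space before the question mark
    return text.strip()


question = fill_template(
    "what is the tool <R> <C> <L> <T>?",
    {"R": "to the left of", "C": "yellow", "L": None, "T": "anatomy"},
)
print(question)
```

A real engine would additionally check each instantiated question against the scene graph to compute the answer; that step is omitted here.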

Sampling and balancing
Here, we introduce the strategies applied to reduce the effect of class imbalance and question-condition bias during SSG-VQA dataset construction. Surgical VQA is a multi-choice task, which mainly includes questions about the surgical objects. Therefore, an imbalance in the occurrence of surgical objects could lead to an imbalance in their class distribution. To address this, we set the number of sampled frames based on the surgical phase and tool presence labels from the Cholec80 dataset, instead of sampling evenly as in Cholec80-VQA [20]. Also, to address question-condition bias, where information leaks from poorly formulated questions, we remove the question templates that may contain such bias. We also eliminate poorly formulated questions, such as "What is the location of the <N>?" with <N>=gallbladder when there is no gallbladder in the scene. These processing strategies reduce question-condition bias and avoid generating degenerate question-answer pairs. The overall pipeline is shown in Fig. 2.
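As a minimal sketch of the filtering step for poorly formulated questions, the check below keeps a candidate question that names an object only if that object is present in the scene. The `is_well_posed` helper and the candidate dictionaries are hypothetical; the actual dataset construction may apply further rules that are not shown here.

```python
def is_well_posed(candidate, scene_names):
    """Keep a question naming an object (<N>) only if the object is in the scene."""
    name = candidate.get("N")
    return name is None or name in scene_names


candidates = [
    {"template": "What is the location of the <N>?", "N": "gallbladder"},
    {"template": "What is the location of the <N>?", "N": "liver"},
    {"template": "what is the action being performed?"},
]
scene_names = {"liver", "hook"}   # objects detected in this (toy) frame

kept = [c for c in candidates if is_well_posed(c, scene_names)]
```

Here the gallbladder question is discarded because no gallbladder is present, while the liver question and the name-free question are kept.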

SSG-VQA-Net

Pipeline
Here, we explain the pipeline of SSG-VQA-Net. Given the textual form of a question, we first extract textual embeddings of the question using a pre-trained tokenizer [20], denoted as $T = \{t_1, \dots, t_K\}$. From the surgical scene image, we extract a feature map using the ResNet18 [7] visual backbone. Then we use a trained object detector, YOLOv7 [22], to detect the surgical objects and extract $N$ object-wise visual embeddings using RoIAlign pooling, denoted as $V = \{v_1, \dots, v_N\}$. The object detector is trained on the bounding boxes of surgical objects from the SSG-VQA training dataset.
We build the scene embeddings using the detected surgical objects' information. Specifically, we initialize the graph nodes as a concatenation of the objects' class labels and spatial coordinates, as shown in Fig. 3. These low-dimensional embeddings are projected using a linear layer, called the Scene Encoder, to match the dimensionality of the textual embeddings. These scene embeddings $S = \{s_1, \dots, s_N\}$ are then passed through our proposed Scene-embedded Interaction Module (SIM), explained below, to obtain text-aware scene embeddings ($S_r$). These text-aware scene embeddings ($S_r$) are then concatenated with the visual embeddings ($V$) and the textual embeddings ($T$) and passed through a self-attention-based transformer module. Finally, the features are average-pooled and mapped to a predefined answer set to generate the output answer.
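The node initialization and Scene Encoder projection can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: the one-hot encoding of class labels, the normalization of box coordinates by the frame size, the embedding width of 8, and the random weights are all choices made for the example, and the helper names are hypothetical.

```python
import random


def node_features(class_id, box, num_classes, width, height):
    """Concatenate a one-hot class label with normalized box coordinates."""
    one_hot = [1.0 if i == class_id else 0.0 for i in range(num_classes)]
    x1, y1, x2, y2 = box
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    return one_hot + coords


def linear(x, weight, bias):
    """Plain linear layer: weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


num_classes, d_model = 4, 8      # 4 classes as in the Fig. 3 example; d_model is assumed
in_dim = num_classes + 4         # class label plus (x1, y1, x2, y2)

rng = random.Random(0)           # untrained stand-in for learned weights
weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(d_model)]
bias = [0.0] * d_model

detections = [(0, (300, 50, 500, 250)), (1, (100, 200, 200, 300))]  # (class id, box)
S = [linear(node_features(c, b, num_classes, 854, 480), weight, bias)
     for c, b in detections]
```

Each detected object thus yields one scene embedding of width `d_model`, which is the shape needed to interact with the textual embeddings.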

Scene-embedded interaction Module
In SSG-VQA-Net, the initial scene embeddings $S$ capture global surgical scene semantics. To handle complex questions that require localized focus, we introduce a lightweight Scene-embedded Interaction Module (SIM). The main objective of SIM is to correlate the textual embeddings with the scene embeddings. SIM consists of two interaction layers. Each layer comprises self-attention, cross-attention, and feedforward sub-layers, as shown in Fig. 4. The attention mechanism is defined as:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]
where $Q$ is the query, $K$ is the key, $V$ is the value, and $d_k$ is the dimension of the keys.
In SIM, we first apply cross-attention, $S_r = \text{Cross-Attention}(S, T, T)$, to the textual and scene embeddings, using the textual embeddings $T$ as key and value and the scene embeddings $S$ as the query. This results in refined scene embeddings, which incorporate textual cues. Then, the refined scene embeddings are passed to the self-attention layer, $S_r = \text{Self-Attention}(S_r, S_r, S_r)$, to interact with themselves. By interacting with the textual embeddings and the scene embeddings, we obtain the text-aware scene embeddings $S_r$. Through our ablation experiments, we show that $S_r$ significantly contributes to providing correct answers to fine-grained questions.
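The two attention steps can be sketched in plain Python, assuming the standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch deliberately omits the learned query/key/value projections, multiple heads, residual connections, normalization, and the feedforward sub-layer, so it illustrates only the data flow of one interaction layer and not the trained module. The function names and toy embeddings are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def sim_layer(S, T):
    """One simplified interaction layer: cross-attention, then self-attention."""
    S_r = attention(S, T, T)             # scene nodes query the text tokens
    S_r = attention(S_r, S_r, S_r)       # refined nodes interact with each other
    return S_r


S = [[0.5, 0.1, -0.2, 0.0], [0.0, 0.3, 0.4, -0.1]]        # N = 2 scene nodes
T = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5],
     [0.2, 0.2, 0.2, 0.2]]                                  # K = 3 text tokens
S_r = sim_layer(S, T)
```

Because the scene embeddings act as the query, the output keeps one row per scene node, so each detected object receives its own text-conditioned representation.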

Dataset comparison
The SSG-VQA dataset contains the same train and test set videos as the CholecT45 [17] dataset, which contains 40 laparoscopic cholecystectomy videos for training and 5 videos for testing. Our SSG-VQA dataset contains 960k questions from 25k surgical scenes. Table 7 presents the comparison between SSG-VQA and the typical datasets from prior work, i.e., EndoVis-18-VQA and Cholec80-VQA from [20], showing an 8× and 22× increase in the number of questions, respectively. Also, our SSG-VQA dataset contains more diverse questions per scene (38.9 vs. 6.5) and much longer questions (12.8 words vs. 5.8 words). Furthermore, SSG-VQA contains a wider range of categories for object attributes, names, and inter-object relationships, as shown in the appendix. Also, compared to the Cholec80-VQA dataset, which provides 51 questions for all surgical scenes, our SSG-VQA has more diverse questions (501k) that are unique to surgical scenes, which prevents the model from overfitting to specific question patterns.

Table 1: We show that the SSG-VQA dataset is more challenging and balanced than the EndoVis-18-VQA [20] and Cholec80-VQA [20]. SSG-VQA dataset includes more attributes and complexities in the questions.

Question-condition bias
VQA systems can exploit the question-condition bias from the dataset as a shortcut to answer questions without understanding the visual scenes. To quantify this bias, we train language-only models like ClinicalBert [9] to answer the questions without any visual information on existing VQA datasets, such as EndoVis-18-VQA and Cholec80-VQA, and on our SSG-VQA dataset. As shown in Table 2, the language-only model ClinicalBert outperforms the vision-language multi-modal models on EndoVis-18-VQA, suggesting that the questions from EndoVis-18-VQA contain simple shortcuts to the correct answer. Cholec80-VQA and SSG-VQA have a lower bias as their questions are more vision-relevant. SSG-VQA further reduces bias by using scene-graph-based diverse questions. In the following, we perform the experiments on the Cholec80-VQA and SSG-VQA datasets due to their low question-condition bias.
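The idea of a "blind" baseline can be illustrated with something much simpler than a language model. The sketch below is not the ClinicalBert setup used in the paper; it swaps in a question-prior lookup that answers each test question with the most frequent training answer for the same question text, never looking at an image. The function names and the toy question-answer pairs are hypothetical.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_pairs):
    """Record the most frequent answer per question string, plus a global fallback."""
    by_question = defaultdict(Counter)
    overall = Counter()
    for question, answer in train_pairs:
        by_question[question][answer] += 1
        overall[answer] += 1
    table = {q: c.most_common(1)[0][0] for q, c in by_question.items()}
    fallback = overall.most_common(1)[0][0]
    return table, fallback


def blind_accuracy(table, fallback, test_pairs):
    """Accuracy of answering from the question text alone."""
    hits = sum(table.get(q, fallback) == a for q, a in test_pairs)
    return hits / len(test_pairs)


train = ([("is a grasper present?", "yes")] * 3
         + [("is a grasper present?", "no")]
         + [("what organ is operated on?", "gallbladder")] * 2)
test = [("is a grasper present?", "yes"),
        ("is a grasper present?", "no"),
        ("what organ is operated on?", "gallbladder"),
        ("how many tools are there?", "2")]

table, fallback = fit_question_prior(train)
acc = blind_accuracy(table, fallback, test)
```

A high blind accuracy on a dataset indicates that many answers are predictable from the question alone, which is the kind of shortcut the language-only comparison is meant to expose.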

Results of SSG-VQA-Net

Results on SSG-VQA
Comparison to other works. As shown in Table 3, SSG-VQA-Net outperforms baseline models like VisualBert [15] and VisualBert ResMLP [20] in metrics such as mAP, Recall, and F-score. We also train an upper-bound model, called SSG-VQA-Net (oracle), that uses the scene embeddings from detection labels of the SSG-VQA dataset instead of using the trained YOLOv7 object detector. This model outperforms prior works significantly, emphasizing the importance of high-quality scene embedding inputs.
Analysis by question type. As shown in Table 4, SSG-VQA-Net can handle various question types. For "counting" questions, it outperforms VisualBert by 7.3 points in mAP. For the "existence" and "query object" types, the model again outperforms the baseline models.
Analysis by complexity. The SSG-VQA dataset provides a diagnostic setup to pinpoint the weaknesses of the model. As shown in Table 4, we compute the performance of our models on questions that require different levels of visual reasoning complexity, i.e., zero-hop, one-hop, and single-and. Our model shows consistent gains on both simple and complex question queries. For one-hop and single-and questions, SSG-VQA-Net achieves a 4.1 and 8.1 point mAP increase over VisualBert ResMLP, respectively. This indicates that the inclusion of scene context can aid in resolving complex queries.

Results on Cholec80-VQA
We also conduct experiments on the other publicly available surgical VQA dataset, Cholec80-VQA. As illustrated in Table 5, SSG-VQA-Net significantly outperforms SurgicalGPT [19], which requires heavy sequence decoding using the GPT-2 architecture. This highlights that the bottleneck of the current surgical VQA problem lies in visual scene understanding rather than text generation. Also, even using YOLOv7 for object detection, our model achieves higher performance.

Table 2: Identification of question-condition bias in existing datasets. We use the Accuracy, Recall, and Fscore metrics from SurgicalVQA [20]. Endovis-VQA contains significant question-condition bias because the language model with pure language inputs can outperform the model with vision and language inputs.

Methods | Endovis-VQA [20] (Acc, Recall, Fscore) | Cholec80-VQA [20] (Acc, Recall, Fscore) | SSGQA (Acc, Recall, Fscore)
L+Bert [5] | 57.5 | 45.9

Table 4: Breakdown results of the prior models and our models. We show that our model outperforms the baselines by a large margin, especially on the complex questions that require visual reasoning. Also, the results on different sets of questions show that our dataset is not dominated by one type of question. We report the mAP here.

Ablation study
Table 6 shows that combining both the Scene-embedded Interaction Module (SIM) and RoIAlign (ROI) pooling significantly boosts the model's performance. This suggests that these modules are not just individually beneficial but are actually complementary. Specifically, the model attains the highest mAP (54.9%) when both components are added. Also, the improvement indicates that introducing scene knowledge representation learning is crucial for robust surgical visual question answering.

Conclusion
In this paper, we tackle the problem of visual question answering (VQA) in the context of fine-grained surgical scene understanding. First, we introduce a new dataset called SSG-VQA, which uses a surgical scene graph as an underlying representation and a question-answer generation engine to generate diverse, geometrically grounded, and surgical-action-oriented question-answer pairs. The question-answer pairs are also sampled to mitigate the question-condition bias that exists in the current surgical VQA datasets. We also propose a novel model called SSG-VQA-Net to explicitly incorporate scene knowledge and object-wise local features in the VQA model design to improve the reasoning ability on complex questions. The results show that SSG-VQA-Net outperforms existing baseline models by a large margin.

Acknowledgments
This work has received funding from the European Union (ERC, CompSURG, 101088553). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work was also partially supported by French state funds managed by the ANR under Grants ANR-20-CHIA-0029-01 and ANR-10-IAHU-02. This work was granted access to the HPC resources of IDRIS under the allocations AD011013704R1, AD011011631R2, and AD011011631R3 made by GENCI. The authors would like to acknowledge the High Performance Computing Center of the University of Strasbourg for supporting this work by providing scientific support and access to computing resources. Part of the computing resources were funded by the Equipex Equip@Meso project (Programme Investissements d'Avenir) and the CPER Alsacalcul/Big Data.

assign attributes to each surgical object, e.g., type, color, and location of the objects. Note that the spatial relations are fully connected among the objects. We show the important relations for a simpler demonstration.

Complexities of Question-answer Pairs
As the SSG-VQA dataset is generated automatically with a question engine and question templates, we control the complexity of question-answer pairs by altering the question templates. The question-answer pairs with different complexities also offer the diagnostic ability to pinpoint the weaknesses of the VQA model. As shown in Fig. 7, we have five kinds of question-answer pairs at three levels of complexity. Here, the complexity is measured by how many reasoning steps are required to answer the question. For example, the fifth question-answer pair in Fig. 7 requires two steps of reasoning to solve, highlighted in red. As a result, this question is more complex than the others.

Qualitative Results
In this section, we compare our SSG-VQA-Net to the state-of-the-art surgical VQA models on questions that require the understanding of attributes (Sec. 5.2.1), action triplet occurrence (Sec. 5.2.2) and reasoning steps (Sec. 5.2.3) to solve.

Attributes
As shown in Fig. 8, the SSG-VQA-Net model outperforms the prior works when the questions contain attributes, such as location and color.For example, the second subfigure in Fig. 8 shows that the SSG-VQA-Net model understands the location attributes such as "top-mid" and "bottom-mid" to resolve the question.Compared to the other methods, it contains more detailed knowledge about the surgical scene.

Action Triplets
We also compare the SSG-VQA-Net model to the prior works on the questions that require an understanding of surgical action triplets. Fig. 9 shows that the SSG-VQA-Net model can answer the questions that are generated based on the triplet annotations.

Reasoning Steps
Fig. 10 shows that SSG-VQA-Net can perform multiple steps of reasoning to generate the correct answer to complex questions.Compared to the prior works, it achieves better results when the questions are more complex.

Fig. 2: Pipeline of SSG-VQA construction. The dataset is constructed by the question engine, which takes the scene graph as input and changes the parameters of question templates to generate diverse question-answer pairs.

Fig. 3: Pipeline of the SSG-VQA-Net. It requires three types of inputs: textual, visual, and scene knowledge. The textual and scene embeddings are fed into the SIM to generate refined scene embeddings. The visual embeddings are generated by RoIAlign. Finally, we concatenate them and feed them into the self-attention transformer to get the final answer. Here, G, H, A, and L represent class labels; $x_1$, $y_1$, $x_2$, and $y_2$ represent bounding box coordinates (G: gallbladder; H: hook; A: abdominal wall cavity; L: liver).

Fig. 4: Scene-embedded interaction module. It is a stack of layers of cross-attention and self-attention. The cross-attention modulates the scene embeddings based on the text queries, while the self-attention refines the scene embeddings.

Fig. 7: SSG-VQA dataset, different complexities: We generate different types of question-answer pairs for each surgical scene image. The text in red indicates the reasoning step in the image, e.g., the fifth question requires two steps of reasoning to solve and is therefore categorized as single-and type.

Fig. 8: Qualitative Results: Comparison of SSG-VQA-Net with baseline surgical VQA models on the question-answer pairs based on spatial, color, counting, anatomy, and tool presence attributes.

Fig. 9: Qualitative Results: Comparison of SSG-VQA-Net with baseline surgical VQA models on the question-answer pairs based on surgical action triplets.

Fig. 10: Qualitative Results: The questions from our SSG-VQA dataset require visual reasoning to solve. Also, our SSG-VQA-Net with scene knowledge outperforms the prior baselines on complex questions.

Table 3: Comparison results for baselines and our models. The SSG-VQA-Net with scene knowledge achieves the best results. The SSG-VQA-Net (oracle) model refers to the model that uses detection labels from the SSG-VQA dataset to construct the scene embeddings instead of using the trained YOLOv7 object detector.

Table 5: Results on the Cholec80-VQA. SSG-VQA-Net achieves better results than the state-of-the-art models, including SurgicalGPT, which contains a heavy GPT-2 sequence decoding module.

Table 6: Effect of different modules. RoIAlign pooling boosts results, and the Scene-embedded Interaction Module further enhances them. Both modules offer complementary benefits.