Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications

In surgical computer vision applications, obtaining labeled training data is challenging due to data-privacy concerns and the need for expert annotation. Unpaired image-to-image translation techniques have been explored to automatically generate large annotated datasets by translating synthetic images to the realistic domain. However, preserving the structure and semantic consistency between the input and translated images presents significant challenges, mainly when there is a distributional mismatch in the semantic characteristics of the domains. This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications, explicitly focusing on semantic consistency. We extensively evaluate various state-of-the-art image translation models on two challenging surgical datasets and downstream semantic segmentation tasks. We find that a simple combination of structural-similarity loss and contrastive learning yields the most promising results. Quantitatively, we show that the data generated with this approach yields higher semantic consistency and can be used more effectively as training data. The code is available at https://gitlab.com/nct_tso_public/constructs.


Introduction
The field of surgical data science has witnessed a resurgence in recent years, propelled by rapid advancements in data science methodologies and deep learning (DL) techniques [35]. Meanwhile, clinical innovations such as robot- and computer-assisted surgical systems have revolutionized minimally-invasive surgery over the last decade by providing intraoperative guidance and decision support [17,36,5].
However, the field encounters a significant constraint in the limited access to large annotated datasets, which impedes the potential for training large and powerful models [36,35]. Multiple challenges contribute to this limitation, including the technical complexities in acquiring patient data directly from the operating room [16], legal regulations on data sharing, and the substantial costs involved in expert labeling, given the restricted availability of domain specialists (i.e., surgical professionals). One potential solution to overcome these challenges is adopting synthetic training data generated through computer simulations [48,43,63,44]. Synthetic data presents the advantage of automatically generating substantial volumes of fully labeled data. Nonetheless, enforcing real-world characteristics in such synthetic datasets can be a significant hurdle.

Figure 1. Generation of realistic data from synthetic surgical images with an unpaired image translation method. The semantic mismatch between domains can lead to inconsistent translations, like blood texture (red color) getting mapped onto different structures (highlighted in white boxes). Some regions with consistent semantic translation are indicated in yellow boxes.
Image-to-image translation (I2I) methods are generative modeling techniques that have gained popularity for translating images between different domains. Within the field of data generation, the applicability of paired image translation methods [23,34] is limited. Conversely, unpaired image translation methods [69], which do not require corresponding image pairs, have emerged as promising solutions for various computer vision tasks. These methods are employed in tasks such as translating synthetic images into realistic ones, performing style transfer, and adapting images across different domains [43,3,22,33,66,8,19,11,30]. Overall, unpaired image translation methods are suitable for surgical applications, but they face challenges in preserving contextual and semantic details across the domains.
In practice, translation methods aim to align the image statistics between the two domains. In addition to the difference in image distributions, semantic variations in distributions also exist, which is commonly referred to as "unmatched semantic statistics" [24] and poses a critical problem in preserving the semantics during translation. As displayed in Figure 1, when faced with unmatched semantic distributions, forcibly attempting to align the distributions between translated and target images can result in spurious solutions, where semantic information is distorted [15,24].
Several approaches have been proposed to preserve semantics during image translation and mitigate semantic distortion. However, these methods might require additional supervision or pre-trained models [53,19]. Alternatively, some approaches are excessively restrictive, tailored to specific datasets, and prone to introducing artifacts [3,67]. Furthermore, methods based on mutual information between the images [15] and on the optimization of a robustness loss [24] have also been explored to mitigate this issue.
In real surgical scenarios, an additional challenge arises from the variations in lighting conditions, which may not be adequately reflected in existing baseline datasets [10,23,9]. While synthetic images can incorporate such parameters, creating such a realistic environment takes time and effort. The central idea would be to develop simple virtual scenarios and utilize DL approaches to enhance realism. Semantic consistency can be affected when such variations exist, and addressing these shortcomings is essential; without doing so, the generated data lacks practical utility for subsequent training of models (Section 4.4).
To the best of our knowledge, this study represents the first comprehensive investigation of unpaired image translation techniques to generate data in the context of surgical applications. We summarize our contributions as follows.
• We empirically analyze various methods for unpaired image translation by assessing both the semantic consistency of the translated images and their utility as training data in diverse downstream tasks.
• We tackle the underexplored problem of creating semantically consistent datasets with annotations. We focus on translating synthetic anatomical images into realistic surgical images on datasets from minimally-invasive surgeries, namely, cholecystectomy and gastrectomy.
• Guided by our analysis, we define a novel combination of an image quality assessment metric [62] as a loss function with the contrastive learning framework [41] as a simple yet effective modification to tackle the challenge of semantic distortion.
• Our results indicate that this method is more effective than many existing unpaired translation methods, highlighting its strength in maintaining semantic consistency.

Related Work & Background
Image-to-Image translation. The objective is to generate images in a desired target domain while preserving the structure and semantics of the input. Generative adversarial networks (GANs) [13] have proven to be a powerful approach for image translation, learning the mapping between input and output images. While alternative methods like diffusion models and variational autoencoders (VAEs) have been proposed, research is still at a nascent stage, and they are mainly employed in supervised conditional settings or purely generative contexts [47,58,60,46,52]. GANs still remain a prime choice for unpaired image translation tasks. In the case of unpaired translation, a technique called cycle consistency [69] was introduced, which seeks to learn the reverse mapping between different domains by leveraging a reconstruction loss. Various approaches have been proposed to address multi-modal and domain translations, focusing on disentangling images' content and style information in distinct spaces [33,22,43,8,32,70]. Although these approaches effectively exploit cycle consistency, they often rely on the assumption of a bijective relationship between domains, which can be overly restrictive. Achieving perfect reconstruction becomes challenging, mainly when a semantic mismatch exists between the domains [41,24].
To address this limitation, one-sided translation methods have been proposed as alternatives to cycle consistency. For instance, GcGAN [12] incorporates an equivariance constraint for geometric image transformations, while DistanceGAN [55] enforces consistency regularization based on distances between the images. Similarly, HarmonicGAN [67] enforces visual similarities between domains, and TraVeLGAN [2] preserves the arithmetic properties of embedding vectors. Attention-based techniques [54,61,1,49,59] have also been proposed to maintain semantic coherence before and after the translation process. Efforts such as [11,25,38] have been made to minimize the perceptual or content loss by utilizing a pre-trained VGG model to decrease the content disparity between the domains. However, this approach is computationally expensive and lacks adaptability to the available data. A contrastive learning-based image translation method was proposed in the CUT [41] model.
Semantic robustness via losses. Recently, two approaches were proposed to minimize semantic distortion during translation. SRUNIT [24], based on CUT, proposed a semantic robustness loss that is optimized between the input features of the domain X and a perturbed variant of the same. The intuition is that the semantics of the output should remain invariant to any perceptual (image-level) changes in the input. The loss is defined as

L_sr = E_{x∼p(X)} [ ||F_r(G_r^f(x)) − F_r(G_r^f(x) + τ·ε)||_2^2 ],

where F_r indicates the feature extractor, G_r^f is the feature extracted from the f-th layer of the generative model G, τ is the perturbation parameter, and ε is a random perturbation direction. Similarly, a structural consistency constraint (SCC) [15] was proposed to maintain the semantics. The color randomness in the pixel values of the images before and after the translation was reduced by exploiting mutual information. The SCC loss is defined as

L_SCC = −(1/N) Σ_{i=1}^{N} rSMI( R(x_i), R(T(y)) ),

where rSMI is the relative squared-loss mutual information, N is the number of samples, and R(·) denotes the random variables for pixels in x_i and T(y). Methods like NEGCUT [61] trained a separate generator to generate negative samples dynamically, effectively bringing positive and query examples closer together, whereas F-LeSim [68] focused on preserving scene structures by minimizing losses based on spatially-correlative maps.
Medical imaging. Cross-modality image synthesis was proposed in [64,65] to improve sample quality and efficiency in MR images based on cycle consistency. Similarly, cycle consistency was used for endoscopic image synthesis [48], while contrastive learning has been used for medical image segmentation in [6,21,42]. In this study, we focus on developing semantically consistent unpaired image translation of surgical images, which differ in modality from MR images and the applications proposed so far. Pfeiffer et al. [43] proposed a variant of MUNIT [22] combining MS-SSim [62] as a loss for laparoscopic surgery applications. This model still falls behind in maintaining semantic consistency (Table 1). Hereafter, we refer to this model as LapMUNIT.

Model setup
Our goal is to maintain the content and semantic correlation between the anatomical structures during translation.
In this section, we define various components of the translation approach.

Adversarial learning
GANs [13] have been promising candidates for image translation tasks. The main goal of such an image translation technique is to learn a mapping between two domains, X and Y, based on training samples x_i and y_j drawn from the distributions p(X) and p(Y), respectively. The generator G_XY learns the mapping between domains and generates the translated image T(y) = G_XY(x), and the discriminator D_Y is trained to distinguish between original images y and translated images. The adversarial loss is defined as

L_GAN(G_XY, D_Y) = E_{y∼p(Y)} [log D_Y(y)] + E_{x∼p(X)} [log(1 − D_Y(G_XY(x)))].   (3)

Typically, the loss is used to encourage the distributional match between the translated images and images from domain Y.
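As a minimal illustration, the value of this adversarial objective can be sketched as follows; the function name and the NumPy-array interface are our own choices for this sketch, not part of the original method.

```python
import numpy as np

def gan_loss(d_real, d_fake, eps=1e-8):
    """Value of the adversarial objective in Eq. (3).
    d_real: discriminator outputs D_Y(y) on real target images,
    d_fake: outputs on translated images; both arrays of probabilities in (0, 1).
    The discriminator maximizes this value, the generator minimizes it."""
    return float(np.mean(np.log(d_real + eps))
                 + np.mean(np.log(1.0 - d_fake + eps)))
```

Note that the study itself trains with the LS-GAN variant [37] (Section 4.3), which replaces the log terms with least-squares penalties; the log form above is only the textbook objective.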

Patch contrastive learning
This framework was formulated on noise contrastive estimation (NCE), aiming to maximize the mutual information between the domains. The InfoNCE loss [40] was used to learn embeddings between the domains and establish associations between corresponding patches of input and output images while disassociating them if unrelated. Let s be the query vector and s^+ and s^- be the positive and negative vectors from the images, respectively. The s^- vectors are sampled at N different locations in the input. Finally, the loss is formulated as an (N+1)-way classification and defined as

ℓ(s, s^+, s^-) = − log ( exp(s · s^+ / τ) / ( exp(s · s^+ / τ) + Σ_{n=1}^{N} exp(s · s_n^- / τ) ) ),

where τ is a scaling parameter to factor the distances between the vectors.
A multilayer patch-based contrastive loss was further employed within the CUT framework, formally defined as PatchNCE. It leverages the ready availability of the generator G_XY to extract features from L layers at S spatial locations. The PatchNCE loss is defined as

L_Patch(X) = E_{x∼p(X)} Σ_{l=1}^{L} Σ_{s=1}^{S_l} ℓ( ẑ_l^s, z_l^s, z_l^{S∖s} ),

where ẑ_l^s and z_l^s denote the features of the output and input image at the s-th location of the l-th layer, and z_l^{S∖s} denotes the negatives drawn from the other locations. Despite its aim to promote semantic consistency between input and output images, CUT still faces challenges when the two domains have different semantic characteristics (Section 4.4). This challenge stems from the limited ability of the contrastively learned semantics to enforce correspondence across different domains effectively.
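The (N+1)-way classification above can be sketched as a small NumPy function. The name `info_nce` and the flat vector shapes are illustrative assumptions; a real implementation (as in CUT) operates on batched feature maps across layers and locations.

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE as an (N+1)-way classification: the query patch should be
    matched to its positive and repelled from N negatives.
    query, positive: (D,) vectors; negatives: (N, D) array."""
    # L2-normalize so dot products behave like cosine similarities
    q = query / np.linalg.norm(query)
    pos = positive / np.linalg.norm(positive)
    negs = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    logits = np.concatenate(([q @ pos], negs @ q)) / tau
    # cross-entropy with the positive at index 0 (log-sum-exp for stability)
    m = logits.max()
    return float(m + np.log(np.sum(np.exp(logits - m))) - logits[0])
```

The loss is near zero when the query coincides with its positive and the negatives are dissimilar, and grows as the positive becomes indistinguishable from the negatives.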

Semantic consistency
Next, we define the multi-scale structural similarity (MS-SSim) [62] metric. This measure was proposed as a metric for image quality assessment. The extracted structure information from the images is compared at varying image resolutions with a weighting factor for each. Initially, given two images x and y, let

l(x, y) = (2 μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1).

Then the contrast sensitivity (cs) and structure map (ss) are defined as

cs(x, y) = (2 σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2),   ss(x, y) = (σ_{x,y} + C_2/2) / (σ_x σ_y + C_2/2),

where μ(·) and σ(·) are the mean and variance and σ_{x,y} is the covariance between x and y. C_1 and C_2 are small constants depending on the pixel values. The MS-SSim metric is defined as

MS-SSim(x, y) = [l_K(x, y)]^{W_K} · Π_{i=1}^{K} [cs_i(x, y) · ss_i(x, y)]^{W_i},

where i = 1, …, K denotes the number of different image scales and W_i the weight for the i-th scale. Hereafter, we refer to this loss as the semantic loss. It is defined as

L_SS = E_{x∼p(X)} [1 − MS-SSim(x, T(y))].

Contrastive learning coupled with MS-SSim. The semantic loss concentrates on maintaining the feature structure and considers the lighting conditions. We couple Contrastive Learning with Structural Similarity (ConStructS) as a model to tackle semantic distortion. To the best of our knowledge, this combination has not been proposed yet.
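A simplified sketch of the semantic loss, assuming single-channel images in [0, 1]: the per-scale statistics are computed globally instead of with the usual sliding Gaussian window, the contrast and structure terms are merged via the standard C3 = C2/2 identity, and the weights are the default MS-SSim weights from [62]. All function names are our own.

```python
import numpy as np

def _lum_cs(x, y, c1=0.01**2, c2=0.03**2):
    # Luminance and combined contrast/structure terms, computed over the
    # whole image -- a simplification of the usual sliding-window statistics.
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    lum = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    cs = (2 * cov + c2) / (var_x + var_y + c2)  # contrast * structure (C3 = C2/2)
    return lum, cs

def _halve(img):
    # 2x2 average pooling to move to the next, coarser scale
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def ms_ssim(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """Multi-scale SSim for images in [0, 1]; assumes positively
    correlated inputs so the fractional powers stay real."""
    score = 1.0
    for i, w in enumerate(weights):
        lum, cs = _lum_cs(x, y)
        # luminance enters only at the coarsest scale, as in MS-SSim
        score *= (lum * cs) ** w if i == len(weights) - 1 else cs ** w
        x, y = _halve(x), _halve(y)
    return score

def semantic_loss(x, t_x):
    # L_SS = 1 - MS-SSim between the input and its translation
    return 1.0 - ms_ssim(x, t_x)
```

In the study, this loss is applied to the brightness channel (average over the color channels) so that style-related hue changes are not penalized (Section 4.3).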
The model overview is shown in Figure 2. The generated image should preserve the content information from domain X, while the style should be drawn from domain Y. Additionally, semantic consistency should be maintained. The final objective is defined as

L = L_GAN + λ_x · L_Patch(X) + λ_y · L_Patch(Y) + λ_ss · L_SS,

where λ_x, λ_y, and λ_ss are weighting parameters for the PatchNCE and semantic losses, respectively. The L_Patch(Y) term resembles the identity loss [69] and is applied to the images y to prevent degenerate cases.

Experiments
In this section, we outline our experiments, in which the performance of several popular unpaired image translation models is compared, namely, CycleGAN [69], the VGG-based perceptual loss [25], GcGAN [12], DistanceGAN [55], DRIT++ [32], LapMUNIT [43], UGAT-IT [27], and NEGCUT [61]. We demonstrate the effectiveness of ConStructS in translating synthetic data to the realistic domain with minimal semantic distortion. Also, various configurations of contrastive-based models were investigated. The CUT model was trained with the SCC loss [15] and SRC [26]. FeSim and LeSim [68] were trained with CUT as the backbone. The feature-perturbed version of CUT, SRUNIT [24], was also compared.
In particular, the existing baselines exhibit distinct strengths and weaknesses. While certain baselines excel in specific tasks, they may falter in others. Except for LapMUNIT [43], no tailored approach exists for surgical scenarios. Consequently, we evaluate ConStructS against several other methods to align with the prevailing research.
Finally, we provide a rationale for the design choices made in the ConStructS model to ensure semantic consistency with an ablation study. We train the model without the semantic loss, which reverts to the basic CUT model [41], and without the PatchNCE loss. Similarly, we combined the semantic loss with cycle consistency in the CycleGAN model as a further combination.

Data
We evaluated the translation methods mentioned above on two different surgical datasets.
Cholecystectomy dataset. This surgery serves to remove the gallbladder. For the simulated domain X, we utilized the publicly available synthetic dataset resembling laparoscopic scenes [43]. The dataset consists of different anatomical structures such as the liver, liver ligament, gallbladder, abdominal wall, and fat, as well as surgical tools. A total of 20,000 rendered images forms the synthetic dataset. The real images for the domain Y are taken from 80 videos of the Cholec80 dataset (videos of 80 laparoscopic cholecystectomies) [56]. The videos in which the gallbladder was still intact were identified, and frames were extracted. We finally created a training dataset of approximately 26,000 images from 75 patients. A separate segmentation dataset of 5 patients was chosen. The liver was manually segmented in 196 images for the downstream evaluation (Section 4.2).
The images were cropped to 256 x 512 pixels, and the training set consists of 17,500 images, with the remaining 2,500 serving as the test set.
Gastrectomy dataset. For this case, we utilized the real and synthetic dataset from [63], based on 40 real surgical videos of distal gastrectomy. Along with the surgical tools, five different structures exist: the gallbladder, liver, pancreas, spleen, and stomach. The dataset consists of 3400 synthetic and 4500 real images with corresponding segmentation masks. 2400 images constituted the training set, with 1000 images as the test set. The images were resized and cropped to 512 x 512 pixels.

Evaluation
The careful selection of appropriate metrics to quantitatively evaluate translation performance is paramount. Our specific focus lies in maintaining semantic consistency and realism during translation. While widely used metrics like FID, KID, and MMD [18,14,4] have gained popularity, they do not account for the unmatched semantics inherent in unpaired image translation datasets [24]. Therefore, we opted for two different quantitative evaluation schemes to overcome these limitations. We also provide qualitative evaluations.
Train: Real → Eval: Synthetic. Firstly, we adopted the practice of computing metrics based on an off-the-shelf segmentation model, following [9,12,41,63]. We train a segmentation model on the real images of the specific dataset. Then the translated synthetic images are tested using this pre-trained model, i.e., the metrics are computed against the ground-truth labels of the synthetic images. The underlying intuition of this approach is that, if the translation model is able to reduce the domain gap between the real and synthetic images, then the segmentation accuracy of this pre-trained model on the translated synthetic images will be higher [23]. This method assesses both the quality and the semantic consistency of the translated images. We refer to this method as eval-1.
Translated images as training data: Furthermore, we assess the practical utility of the translated images in a downstream task, as explored in [43]. We show the usefulness of the translated synthetic images in two different ways. Firstly, we train a segmentation model using only the translated images and evaluate the performance of this model on segmenting the liver on a patient-wise categorized dataset consisting of real images. This approach aligns with the intuition mentioned above and provides insights into the realism of the translated images. Secondly, we fine-tune this model on the real data and evaluate it on the same test set of real images. The performance is also compared to a baseline model trained only on real images. By adopting these methodologies, we effectively demonstrate the value of utilizing the translation approach to generate realistic training data. Hereafter, we refer to this method as eval-2.

Implementation details
We intentionally matched the architectures and hyperparameters to enable a fair performance comparison to the CUT [41] and CycleGAN [69] models and their variants. We used a ResNet-based [25] generator and a PatchGAN [23] discriminator. The LS-GAN loss [37] was used while training the generator with the Adam optimizer and a batch size of 1. The λ_x and λ_y were maintained at a value of 1. Following [43], the semantic loss was operated on the images' brightness (average over the channels), as this retains the brightness variations while avoiding the penalization of style-related changes in hue. All the models were trained based on the authors' code. The same image size was maintained throughout the training of all the baseline models.
We used a DeepLabV3+ [7] model for both evaluation schemes. This model generally performs well for semantic segmentation of surgical images [45,63,29,50]. A 3-fold cross-validation was performed for eval-1. We report mean pixel accuracy (pxAcc), class accuracy (class-weighted pixel accuracy, clsAcc), and mean intersection-over-union (mIOU) metrics. For the segmentation of the liver on the real images (eval-2), the median dice scores are reported. For more details on training, the reader can refer to the supplementary material.
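The reported metrics can be sketched from a confusion matrix as follows; this is an illustrative NumPy version (function and variable names assumed), not the evaluation code used in the study.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pxAcc, clsAcc (class-weighted pixel accuracy), and mIOU from
    integer label maps of equal shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows: ground truth
    px_acc = np.trace(cm) / cm.sum()
    # per-class recall, averaged over classes (absent classes count as 0 here)
    per_class_acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    cls_acc = per_class_acc.mean()
    # IoU = TP / (TP + FP + FN), averaged over classes
    union = cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm)
    miou = np.mean(np.diag(cm) / np.maximum(union, 1))
    return px_acc, cls_acc, miou
```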

Results
Cholecystectomy dataset. The quantitative results are presented in Table 1, highlighting the performance of different models. Comparatively, the CycleGAN model with the VGG loss demonstrates better performance than the SCC loss variant. The geometric consistency in GcGAN [12] leads to a class-accuracy value comparable with ConStructS while outperforming DistGAN [55] and DRIT++ [31]. The LapMUNIT [43] model achieves better scores than the attention-based models. As for the variants of CUT, the addition of the SCC loss did not improve its performance further. The SRC [26] loss helps achieve a class-accuracy score of 0.43, which coincides with LeSim [68]. SRUNIT [24] and NEGCUT [61] show similar performance. Overall, as evidenced by the results, the ConStructS model minimizes semantic distortion to a greater extent and outperforms the recent methods [24,15].
Table 2 indicates the results of the eval-2 method. When the translated images are solely used as training data, the ConStructS model yields results on segmenting the liver comparable to GcGAN [12] and CycleGAN [69]. A gain of 9% in dice score is obtained compared to the baseline model. Fine-tuning the same model on real data shows that the ConStructS method outperforms the baseline models, showing a 4%−6% improvement in dice scores. Overall, a 25% improvement is obtained with this model. The qualitative results in Figure 3 indicate that the ConStructS model reduces the semantic distortion, although not completely, but better than most other translation methods.

Gastrectomy dataset. As shown in Table 3, quantitative analysis reveals that LapMUNIT [43] outperforms both GcGAN [12] and CycleGAN [69], while ConStructS outperforms all the other models.

Method           pxAcc           clsAcc          mIOU
CycleGAN [69]    0.39 ± 0.12     0.17 ± 0.14     0.09 ± 0.10
GcGAN [12]       0.40 ± 0.13     0.18 ± 0.01     0.10 ± 0.01
LapMUNIT [43]    0.43 ± 0.01     0.21 ± 0.10     0.11 ± 0.09
CUT [41]         0.42 ± 0.01     0.22 ± 0.02     0.11 ± 0.05
SRUNIT [24]      0.44 ± 0.01     0.20 ± 0.01     0.10 ± 0.05
SRC [26]         0.

Ablation study. The qualitative results of the ablation study are presented in Figure 5. When examining ConStructS without the semantic loss (i.e., the basic CUT model), we observe that the structure is well preserved during translation. However, there is a noticeable mismatch in texture in regions with reduced brightness, which can be attributed to variations in lighting conditions. In the absence of the PatchNCE loss, as there is no explicit control over image patches, structure information is mixed, resulting in the mapping of styles from different structures (e.g., fat or blood) to unlikely regions. Lastly, the combination of the semantic loss with the CycleGAN model yields an improvement compared to the basic CycleGAN model. Regardless, as seen from Table 1, this combination still lacks performance.
Sensitivity analysis. In this section, we study the sensitivity of the parameter λ_SS and the direct influence of the semantic loss. The parameter λ_SS is varied between the values 1, 2, 3, 5, and 10. As the λ_SS value is increased, there is a performance improvement up to a certain threshold. Figure 6 and Table 4 indicate that setting the appropriate λ_SS (here, 5) effectively controls the irregular texture between the gallbladder and the tool. However, it is necessary to note that large values limit the benefits of the semantic loss, as the model primarily focuses on reducing structure distortion and disregards the style information. A binary search could identify the largest λ_SS value that maintains the semantic character.
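Such a search could be sketched as follows, assuming a hypothetical oracle `is_semantic_ok(lam)` (e.g., a qualitative or metric-based check of a model trained with that weight) that is monotone in λ_SS: True below some threshold, False above it.

```python
def largest_good_lambda(is_semantic_ok, lo=1.0, hi=10.0, tol=0.25):
    """Binary search for the largest lambda_SS that still maintains the
    semantic character. is_semantic_ok is a hypothetical, assumed-monotone
    oracle; in practice each query means training/evaluating one model."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_semantic_ok(mid):
            lo = mid  # semantics preserved: search higher
        else:
            hi = mid  # semantics degraded: search lower
    return lo
```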

Discussion
Traditional approaches prioritizing distance preservation, such as DistanceGAN [55] or the L1 reconstruction loss used in CycleGAN [69], typically do not effectively enhance semantic consistency. These pixel-based metrics tend to be highly sensitive to structural transformations and variations in lighting conditions, which can introduce artifacts in the generated images during translation. While SRUNIT [24] shows promise in reducing semantic distortion by introducing perturbations in the feature space, it alone is insufficient for the specific application at hand. On the contrary, the NEGCUT [61] model aims to preserve the overall structure during translation but is less accurate in mapping textures between these structures. The same limitation has been observed in the LeSim [68] model.
The results of our ablation study demonstrate the crucial role of combining PatchNCE with the semantic loss in mitigating semantic distortion. We posit that leveraging the contrastive learning approach on the generator's encoded feature space makes learning higher-level attributes, such as organ or tool structures, possible. However, relying solely on this aspect for matching semantic information has limitations [24]. To address this, we introduced the semantic loss as a regularizer that operates on multiple scales of the images (i.e., different resolutions), checking for structure consistency and lighting conditions (Equation 6). The combination of PatchNCE and semantic loss proves effective in preserving the semantic characteristics throughout the translation process.
Limitations. The combination of semantic loss and contrastive learning holds promise for mitigating semantic inconsistencies; however, it is essential to acknowledge its limitations and potential failure cases. Achieving a comprehensive and universal solution to the semantic distortion challenge solely through a single image-to-image translation method is a complex task. Existing approaches primarily focusing on one-sided translation have overlooked the synthesis of multi-modal data. In this context, the ConStructS model emerges as a promising candidate for future exploration and research. By incorporating multi-modal outcomes, this model can be utilized to generate diverse data with improved semantic consistency.

Conclusion
In this study, we conducted an empirical investigation of the issue of semantic inconsistency in unpaired image translation, focusing on its relevance to surgical applications where labeled data is scarce. We extensively evaluated several state-of-the-art unpaired translation methods, explicitly targeting the translation of images from a simulated domain to a realistic environment. Addressing the problem of semantic distortion, we found a novel combination of a structural similarity metric with contrastive learning to be the most effective. Surprisingly, this simple model reduces semantic distortion while preserving the realism of the translated images and shows the highest utility as training data for downstream tasks.

Supplementary material: Exploring Semantic Consistency in Unpaired Image Translation to
Generate Data for Surgical Applications

A. Dataset
For the cholecystectomy dataset, the liver meshes were taken from a public dataset (3D-IRCADb-01 dataset, IRCAD, France), while all other structures were designed manually. The camera was moved around along with the light source of the laparoscope, and the synthetic images were rendered. For the real dataset, the images were extracted at a frame rate of five frames per second. A total of 75 videos were chosen, and the images were then curated to remove scenes with only anatomical structures or tools; finally, a dataset of 26,000 images was composed. The remaining 5 videos were chosen for the downstream evaluation. The synthetic dataset was downloaded from http://opencas.dkfz.de/image2image/. Similarly, for the gastrectomy dataset, the entire dataset along with labels was downloaded from https://www.kaggle.com/datasets/yjh4374/sisvse-dataset. Figure 7 shows some examples of the dataset.

B. Training details
The architectures and hyperparameters were intentionally matched to have a reasonable performance comparison to the CUT [41] and CycleGAN [69] models and their variants. Following [43], the semantic loss was operated on the images' brightness (average over the channels), as this retains the brightness variations while avoiding the penalization of style-related changes in hue.
For the ConStructS model, we used a discriminator similar to CUT [41] but replaced the normalization layers with the spectral norm [39] for stabilized training. The Adam optimizer [28] was utilized with a learning rate of 2e−4 and a linear decay of the learning rate. The generator also serves as the feature encoder to compute the contrastive loss. Correspondingly, the features were encoded from the 1st, 4th, 8th, 12th, and 16th layers of the generator. This was kept constant for both datasets. The encoded features are passed through a two-layer MLP with 256 neurons each to extract the feature vectors, which are normalized with the L2 norm. These feature vectors were utilized for computing the PatchNCE loss at 256 different locations. A batch size of 1 was employed throughout. For the cholecystectomy dataset, the model was trained for approximately 600K iterations, whereas for the gastrectomy dataset, the training was carried out for 500K iterations. The models were trained on single NVIDIA RTX A5000 GPUs with 24GB memory.

C. Evaluation details
For the cholecystectomy dataset, the CholecSeg8K dataset [20], an annotated subset of real images from the Cholec80 dataset, was used for training. The training dataset consists of 6000 images, with a test set of 2020 images. The DeepLabV3+ [7] model was chosen as the segmentation network. We curated the dataset and defined six classes (liver, abdominal wall, fat, ligament, gallbladder, and surgical tools) plus background. The evaluation was posed as a multiclass segmentation problem. The different partitions of the tools were fused into a single tool class. Similarly, for the gastrectomy dataset, six classes were defined: surgical tools, liver, stomach, spleen, pancreas, and gallbladder. Following [63], the models were trained on three different folds of train and test datasets.
For the downstream evaluation method, five separate videos were chosen from the Cholec80 dataset. These videos were chosen such that they were not present in the CholecSeg8K dataset. The translation models were not exposed to these images during training. Since annotating all the tissues and tools would require the guidance of a medical professional, to simplify the process, only the liver tissue was annotated. The labelme [57] package was used to manually annotate the liver tissue in 196 images from five different patients. The regions of the liver with minimal lighting were under-segmented in case of doubt to ease the annotation process. A similar DeepLabV3+ model was employed to segment the liver organ. The models were trained and evaluated in a leave-one-patient-out manner, i.e., the model is trained on four patients and evaluated on one patient, and this procedure is repeated five times. Finally, the mean dice scores are reported. The Adam optimizer was used with a learning rate of 1e−5. The OneCycle [51] scheduler was used to modulate the learning rate during training.
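The dice score used for the liver evaluation can be sketched as follows (an illustrative NumPy version; the function name is our own):

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice coefficient for a binary liver mask (1 = liver, 0 = background).
    eps guards against division by zero when both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```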

D.1. Cholecystectomy
We provide additional qualitative results for both datasets. Figure 8 further indicates that the ConStructS model is able to maintain both structural and semantic consistency compared to many of the other models. Furthermore, the additional results of the ablation study in Figure 9 show the importance of combining the PatchNCE loss with the semantic loss to reduce semantic distortion. The model performance deteriorates in the absence of L_Patch(Y).

D.2. Gastrectomy
The visual results depicted in Figure 10 demonstrate that ConStructS significantly mitigates semantic mismatches compared to other models, particularly in regions characterized by differing specularity. For the gastrectomy dataset, we noticed that the real images contain three extra classes compared to the synthetic domain. In particular, there existed a class (white gauze) whose proportion exceeded that of the other classes. This semantic imbalance was reflected in the translation performance, where many models mapped this texture onto other regions, especially those with higher specularity.

D.3. Downstream evaluation
In Table 9, the segmentation scores of various models are shown for multi-class downstream evaluation. The translated images from LapMUNIT [43] outperform most of the SOTA models. However, the images from the ConStructS model improve the scores by up to 6%. In Table 6, the results of the eval-2 method are reported. Compared to the baseline models, using only the translated images from ConStructS as training data leads to an overall improvement of 9% in Dice scores.

D.4. Metrics
We report FID scores for the different models on the cholecystectomy dataset. The FID values for both datasets on various layers are shown in Table 7 and Table 8, respectively. The general practice is to report the value at the layer with 2048 features. However, we find that different layers yield noticeably different values, which is one of the reasons for adopting several evaluation schemes in this study.
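The layer dependence of FID can be illustrated with a minimal numpy sketch of the Fréchet distance between Gaussians fitted to activations of any chosen InceptionV3 layer. The function name and the feature dimensions below are illustrative assumptions; a production implementation would typically use a library FID routine.

```python
import numpy as np

def fid(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two activation sets.

    feats_*: (N, D) feature activations from a chosen InceptionV3 layer
    (e.g. D = 64, 192, 768, or 2048 depending on the layer), so the same
    image pairs can produce different FID values per layer.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # trace of sqrtm(cov_a @ cov_b): for PSD covariances the eigenvalues of
    # the product are real and nonnegative, so we can sum their square roots
    eigs = np.linalg.eigvals(cov_a @ cov_b)
    tr_covmean = np.sqrt(np.maximum(eigs.real, 0.0)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * tr_covmean)
```

Identical feature sets give a distance of zero, and a pure mean shift contributes its squared norm, which makes the metric's sensitivity to the chosen feature layer easy to probe.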

Figure 2. The overview of the ConStructS model with different loss functions.

Figure 3. Qualitative results of various translation methods on the cholecystectomy dataset. At the junction of two structures, the textures were interchanged in most of the models. Although the problem is not solved completely, the ConStructS model reduces semantic inconsistency. Some regions are highlighted in white boxes.
Figure 4 demonstrates that ConStructS significantly mitigates semantic mismatches compared to other models, particularly in regions characterized by differing specularity.

Figure 4. Qualitative samples from the gastrectomy dataset. The white boxes highlight some regions. The red box indicates one of the failure cases of ConStructS, where a tool-like texture is mapped onto the liver.

Figure 5. Qualitative results of the ablation study on the cholecystectomy dataset. Texture mismatch occurs in low-lighting regions without the semantic loss. As seen in the 2nd row, without the PatchNCE loss there is no explicit boundary between the liver and the abdominal wall, leading to both regions having the same semantic textures.

Figure 6. Sensitivity analysis examples on the cholecystectomy dataset. For a λSS value of 5, the liver texture is maintained (1st row) and the tool texture (grey lining) is avoided at the junction of the structures (2nd row).

Figure 7. Examples from the surgical datasets.

Figure 8. Additional results from the cholecystectomy dataset.

Figure 9. Additional results from the ablation study on the cholecystectomy dataset.

Figure 10. Additional results from the gastrectomy dataset.

Figure 11. Additional results from the sensitivity analysis on the cholecystectomy dataset. The structure of the surgical tool is best maintained with λSS = 5. Similarly, for the same value, we find that blood texture is not mixed with either the liver or the junction between the ligament and the liver.

Table 1. The results of various translation models on the cholecystectomy dataset. pxAcc and clsAcc denote the pixel and mean class accuracy, respectively. mIOU is the mean intersection-over-union score. The best result is indicated in bold, and the second best is underlined.

Table 2. The quantitative results for the eval-2 method. Pt. refers to a patient, and Dice scores are reported. The baseline model is trained only on real images. Training with translated images from ConStructS and fine-tuning on real images leads to a large gain in segmentation performance.

Table 3. The quantitative results of the translation models on the gastrectomy dataset.

Table 4. Quantitative results of the sensitivity analysis on λSS for the cholecystectomy dataset.

Table 5. The consistency evaluation results for the ablation study in which the ConStructS method is trained with and without the L_Patch(Y) loss.

w/o L_Patch(Y)   0.50 ± 0.06   0.41 ± 0.13   0.25 ± 0.08
ConStructS       0.59 ± 0.07   0.44 ± 0.12   0.29 ± 0.09