Introduction

The rapid advancements in deep learning (DL) techniques in the last decade have led to the growth of surgical data science [1]. However, the potential for training large and powerful models is impeded by the requirement of large annotated datasets [1, 2]. Multiple challenges contribute to this limitation, including the technical complexity of acquiring patient data directly from the operating room [3], legal regulations on data sharing, and the substantial costs involved in expert labeling, given the restricted availability of domain specialists (i.e., surgical professionals). One potential solution to overcome these challenges is adopting synthetic training data generated through computer simulations [4,5,6]. Synthetic data offer the advantage that substantial volumes of fully labeled data can be generated automatically. Nonetheless, enforcing real-world characteristics in such synthetic datasets can be a significant hurdle.

Fig. 1

Generation of realistic data from synthetic surgical images with unpaired image translation method. The semantic mismatch between domains can lead to inconsistent translations, like blood texture (red color) getting mapped onto different structures (highlighted in white boxes). Some regions with consistent semantic translation are indicated in blue boxes

Image-to-image translation (I2I) methods are generative modeling techniques that have gained popularity for translating images between different domains. Within the field of data generation, the applicability of paired image translation methods [7] is limited. Conversely, unpaired image translation methods [8], which do not require corresponding image pairs, have emerged as promising solutions for various computer vision tasks. Overall, these methods are suitable for surgical applications, but they face challenges in preserving contextual and semantic details across domains.

In practice, translation methods aim to align the image statistics between the two domains. In addition to the difference in image distributions, the semantic distributions of the two domains can also differ, a situation commonly referred to as “unmatched semantic statistics” [9], which poses a critical problem for preserving semantics during translation. As displayed in Fig. 1, when faced with unmatched semantic distributions, forcibly aligning the distributions of translated and target images can result in spurious solutions, in which semantic information is distorted [9, 10].

In real surgical scenarios, an additional challenge arises from the variations in lighting conditions, which may not be adequately reflected in existing baseline datasets [7, 11]. While synthetic images can incorporate such parameters, creating such a realistic environment takes time and effort. Moreover, semantic consistency can be affected when such variations exist; addressing these shortcomings is essential, as otherwise the generated data lack practical utility for subsequent model training (Section "Results").

Our contribution

To the best of our knowledge, this study represents the first comprehensive investigation of unpaired image translation techniques to generate data in the context of surgical applications. We summarize our contributions as follows.

  • We empirically analyze various methods for unpaired image translation by assessing both the semantic consistency of the translated images and their utility as training data in diverse downstream tasks.

  • We tackle the underexplored problem of creating semantically consistent datasets with annotations (see Fig. 2). We focus on translating synthetic anatomical images into realistic surgical images on datasets from minimally invasive surgeries, namely cholecystectomy and gastrectomy.

  • Guided by our analysis, we define a novel combination of an image quality assessment metric [12], used as a loss function, with the contrastive learning framework [13] as a simple yet effective modification to tackle the challenge of semantic distortion.

  • We find this simple combination to be more effective than many of the existing unpaired translation methods in maintaining semantic consistency. When the translated images from this method are mixed with the real images, we observe a \(22\%\) improvement in segmentation score compared to a model trained only on the real images.

Fig. 2

The structure and semantic characteristics of the translated images and their correspondence to the semantic labels. The ConStructS method shows consistent translation performance, leading to the generation of a semantically consistent dataset with labels

Related work

Image-to-image translation

The objective is to generate images in a desired target domain while preserving the structure and semantics of the input. Generative adversarial networks (GANs) [14] have proven to be a powerful approach for image translation, learning the mapping between input and output images. In the case of unpaired translation, cycle consistency [8] was introduced, which seeks to learn the reverse mapping between different domains by leveraging a reconstruction loss. Various approaches have been proposed to address multi-modal and multi-domain translation, focusing on disentangling images’ content and style information in distinct spaces [4, 15,16,17]. In the context of surgical applications, [18] utilized cycle consistency for endoscopic image synthesis. Paired translation was adopted in [5], whereas cycle consistency combined with structural similarity was used to generate laparoscopic image [4] (LapMUNIT) and video data [6], respectively. Although these approaches effectively exploit cycle consistency, they often rely on the assumption of a bijective relationship between domains, which can be overly restrictive. Achieving perfect reconstruction becomes challenging, and they still fall short of maintaining semantic consistency during translation.

In contrast, one-sided translation methods have been proposed, such as GcGAN [19], which incorporates an equivariance constraint, and DistGAN [20], which enforces consistency regularization based on distances between images. Efforts such as [21, 22] have been made to minimize a perceptual or content loss by utilizing a pre-trained VGG model to decrease the content disparity between the domains. However, this approach is computationally expensive and lacks adaptability to the available data. Our approach is based on the contrastive learning method proposed in CUT [13], where embeddings are learned by associating similar signals in contrast to negatives.

Semantic robustness via losses

Despite its aim to promote content (structure) consistency, the CUT [13] method still faces challenges when the two domains have different semantic characteristics. This challenge stems from the limited ability of the contrastively learned semantics to effectively enforce correspondence across different domains. Recently, two approaches were proposed to minimize semantic distortion during translation. SRUNIT [9], based on CUT, defines a semantic robustness loss that is optimized between the input features of the domain \({\mathcal {X}}\) and a perturbed variant of the same features. Similarly, a structural consistency constraint (SCC) [10] was proposed to maintain the semantics; it reduces the color randomness in the pixel values of the images before and after translation by exploiting mutual information.

Methods like NEGCUT [23] trained a separate generator to generate negative samples dynamically, effectively bringing positive and query examples closer together, whereas F-LeSim [24] focused on preserving scene structures by minimizing losses based on spatially-correlative maps. The standalone use of any of these models fails to simultaneously reduce the domain gap and maintain semantic consistency during translation.

In this work, we devise an approach that is a novel combination of different losses, namely the patch-based contrastive loss along with the multi-scale structural similarity [12], which regularizes the model at various image resolutions, thereby maintaining consistent translations between the simulated and realistic domains. This approach relies neither on cycle consistency nor on additional networks during translation, thereby paving the way for one-sided, unpaired image translation. Many of the stated approaches have focused primarily on realism as the central concept during translation. However, for the surgical application at hand, it is equally important to assess both the semantic consistency and the usefulness of the translated images in downstream applications.

Model setup

In this section, we provide an overview of the essential components for the formulation of the approach that preserves both the content and semantic correlation between the anatomical structures during translation.

Adversarial learning

GANs [14] have been promising candidates for image translation tasks. The main goal of such an image translation technique is to acquire the ability to map between two domains, \({\mathcal {X}}\) and \({\mathcal {Y}}\), based on training samples \({x_i}\) and \({y_j}\) drawn from the distributions p(X) and p(Y), respectively. The generator \(G_{\mathcal{X}\mathcal{Y}}\) learns the mapping between domains and generates the translated image \({\mathcal {T}}(x) = G_{\mathcal{X}\mathcal{Y}}(x)\), while the discriminator \(D_{{\mathcal {Y}}}\) is trained to distinguish between real images \(y \in {\mathcal {Y}}\) and translated images. The adversarial loss is defined as,

$$\begin{aligned} {\mathcal {L}}_{GAN}\left( G_{\mathcal{X}\mathcal{Y}}, D_{{\mathcal {Y}}}\right)&= {\mathbb {E}}_{y \sim p(Y)}\left[ \log D_{{\mathcal {Y}}}(y)\right] \nonumber \\&\quad + {\mathbb {E}}_{x \sim p(X)}\left[ \log \left( 1-D_{{\mathcal {Y}}}\left( G_{\mathcal{X}\mathcal{Y}}(x)\right) \right) \right] \end{aligned}$$
(1)

Typically, the loss is used to encourage the distributional match between the translated images and images from domain \({\mathcal {Y}}\).
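To make the objective concrete, the following is a minimal PyTorch sketch of Eq. (1); it is not taken from the paper's code. `G_xy` and `D_y` are placeholder generator and discriminator modules, and a standard binary cross-entropy form is assumed (many unpaired translation implementations instead use a least-squares GAN objective).

```python
import torch
import torch.nn.functional as F

def adversarial_losses(G_xy, D_y, x, y):
    """Sketch of Eq. (1): D_y separates real images y from translated images G_xy(x)."""
    fake_y = G_xy(x)                                   # translated image T(x)

    # Discriminator terms: real -> 1, translated -> 0
    real_logits = D_y(y)
    fake_logits = D_y(fake_y.detach())                 # detach: do not update G through D's loss
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Generator term: try to make the discriminator classify translations as real
    g_logits = D_y(fake_y)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss, fake_y
```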

Patch contrastive learning

This framework is based on noise contrastive estimation (NCE) and aims to maximize the mutual information between the domains. The InfoNCE loss [25] is used to learn embeddings between the domains and to associate corresponding patches of the input and output images while disassociating them if unrelated. The central idea lies in associating a “query” point with the “positive” points while contrasting it away from other “negative” points in the dataset. Let s be the query vector and \(s^{+}\) and \(s^{-}\) be the positive and negative vectors from the images, respectively. The \(s^{-}\) vectors are sampled at N different locations in the input. Finally, the loss is formulated as an (N+1)-way classification problem and defined as

$$\begin{aligned} {\mathcal {L}}_{NCE} = -\log \left[ \frac{\exp \left( {\varvec{s}} \cdot {\varvec{s}}^{+} / \tau \right) }{\exp \left( {\varvec{s}} \cdot {\varvec{s}}^{+} / \tau \right) +\sum _{n=1}^N \exp \left( {\varvec{s}} \cdot {\varvec{s}}_n^{-} / \tau \right) }\right] \nonumber \\ \end{aligned}$$
(2)

where \(\tau \) is a temperature parameter that scales the similarities between the vectors. The query vector is drawn from the translated images, while \(s^{+}\) and \(s^{-}\) are the corresponding and non-corresponding image (feature) vectors from the input images. We refer to the suppl. material for the computation procedure of these vectors.
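For illustration, a minimal sketch of the (N+1)-way classification in Eq. (2) is given below, assuming the query, positive, and negative feature vectors have already been extracted and L2-normalized; the temperature value of 0.07 is the CUT default and may differ in this work.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, tau=0.07):
    """Eq. (2): query (D,), positive (D,), negatives (N, D) -> scalar InfoNCE loss."""
    pos_logit = (query * positive).sum(dim=-1, keepdim=True) / tau   # similarity to s+
    neg_logits = negatives @ query / tau                             # similarities to the N negatives
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)         # (1, N+1) classification logits
    target = torch.zeros(1, dtype=torch.long, device=query.device)   # the positive is class 0
    return F.cross_entropy(logits, target)
```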

A multilayer patch-based contrastive loss was further employed within the CUT framework, formally defined as PatchNCE. It leverages the ready availability of the generator \(G_{\mathcal{X}\mathcal{Y}}\) to extract features from L layers at S spatial locations. The PatchNCE loss is defined as,

$$\begin{aligned} {\mathcal {L}}_{\textrm{Patch}}(X)={\mathbb {E}}_{{\varvec{x}} \sim X} \sum _{l=1}^L \sum _{s=1}^{S} {\mathcal {L}}_{NCE} \end{aligned}$$
(3)
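The multilayer PatchNCE loss of Eq. (3) can then be sketched as a double sum over layers and spatial locations, reusing the info_nce helper from the previous sketch. The lists `feats_x` and `feats_tx` are assumed to hold per-layer patch embeddings (tensors of shape (S, D)) of the input and translated images, extracted from the generator as in CUT.

```python
import torch

def patch_nce(feats_x, feats_tx, tau=0.07):
    """Eq. (3): sum of InfoNCE losses over L layers and S spatial locations."""
    total = 0.0
    for f_x, f_tx in zip(feats_x, feats_tx):                  # loop over the L feature layers
        S = f_x.shape[0]
        for s in range(S):                                    # loop over the S patch locations
            query = f_tx[s]                                   # patch embedding of the translated image
            positive = f_x[s]                                 # corresponding input patch
            negatives = torch.cat([f_x[:s], f_x[s + 1:]])     # the other (non-corresponding) patches
            total = total + info_nce(query, positive, negatives, tau)
    return total
```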

Semantic consistency

Next, we define the multi-scale structural similarity (MS-SSim) [12], a measure originally proposed for image quality assessment. Structural information extracted from the images is compared at varying image resolutions, with a weighting factor for each scale. Initially, given two images, \({\textbf{x}}\) and \({\textbf{y}}\), let \(v_1 = 2 \sigma _{xy} + C_2\) and \(v_2 = \sigma _{x}^{2} + \sigma _{y}^{2} + C_2\). Then the contrast sensitivity (\(\textbf{cs}\)) and structure map (\(\textbf{ss}\)) are defined as,

$$\begin{aligned} {\text {cs}}({\textbf{x}},{\textbf{y}}) = \frac{v_1}{v_2}, \quad {\text {ss}}({\textbf{x}},{\textbf{y}}) = \frac{(2 \mu _{x} \mu _{y} + C_1) v_1}{(\mu _{x}^{2} + \mu _{y}^{2} + C_1) v_2} \end{aligned}$$
(4)

where \(\mu _{(\cdot )}\) and \(\sigma _{(\cdot )}^{2}\) are the mean and variance of the image pixels and \(\sigma _{xy}\) is the covariance between \({\textbf{x}}\) and \({\textbf{y}}\). \(C_1\) and \(C_2\) are stability constants computed as \((K_i L)^2\), with \(K_i \ll 1\) and L the dynamic range of the pixel values (\(0-255\)). The MS-SSim metric is defined as,

$$\begin{aligned} {\text {MS-SSim}}({\textbf{x}}, {\textbf{y}})=\prod _{i=1}^{K}\left[ \textbf{cs}_i \cdot \textbf{ss}_i\right] ^{W_i} \end{aligned}$$
(5)

where \(i=1,\ldots ,K\) indexes the image scales and \(W_i\) is the weight for the \(i\)th scale. Hereafter, we refer to the resulting loss as the semantic loss. It is defined as,

$$\begin{aligned} {\mathcal {L}}_{\textrm{semantic}} = 1 - {\text {MS-SSim}}(x,y) \end{aligned}$$
(6)
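A compact sketch of the semantic loss of Eqs. (4)-(6) is given below. The 11x11 uniform averaging window, the average-pooling downsampling between scales, and the default scale weights \(W_i\) are assumptions and may differ from the authors' implementation, which presumably follows the reference MS-SSim formulation [12].

```python
import torch
import torch.nn.functional as F

def _local_stats(x, y, win=11):
    """Local means, variances and covariance with a uniform window (an assumption)."""
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, win, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, win, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, stride=1, padding=pad) - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy

def semantic_loss(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333),
                  K1=0.01, K2=0.03, data_range=1.0):
    """Eq. (6): 1 - MS-SSim(x, y); x, y are (B, C, H, W) tensors in [0, data_range]."""
    C1, C2 = (K1 * data_range) ** 2, (K2 * data_range) ** 2
    ms_ssim = 1.0
    for i, w in enumerate(weights):
        mu_x, mu_y, var_x, var_y, cov_xy = _local_stats(x, y)
        v1, v2 = 2 * cov_xy + C2, var_x + var_y + C2
        cs = (v1 / v2).mean()                                      # Eq. (4), contrast sensitivity
        ss = (((2 * mu_x * mu_y + C1) * v1) /
              ((mu_x ** 2 + mu_y ** 2 + C1) * v2)).mean()          # Eq. (4), structure map
        ms_ssim = ms_ssim * (cs * ss).clamp(min=1e-6) ** w         # Eq. (5), per-scale weighted product
        if i < len(weights) - 1:
            x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)          # move to the next (coarser) scale
    return 1.0 - ms_ssim                                           # Eq. (6)
```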

Contrastive learning coupled with MS-SSim

We couple the strengths of contrastive learning with structural similarity (ConStructS) as a model to tackle semantic distortion. To the best of our knowledge, this combination has not been proposed before. With the combined loss, image features are learned at the patch level to enforce correspondences during translation. The final objective is defined as,

$$\begin{aligned} {\mathcal {L}}_{\textrm{total}} = {\mathcal {L}}_{GAN} {+} \lambda _{x} {\mathcal {L}}_{\textrm{Patch}}(X) {+} \lambda _{y} {\mathcal {L}}_{\textrm{Patch}}(Y) {+} \lambda _{ss} {\mathcal {L}}_{\textrm{semantic}}\nonumber \\ \end{aligned}$$
(7)

where \(\lambda _x\) and \(\lambda _y\) are the weighting parameters for the two PatchNCE losses and \(\lambda _{ss}\) for the semantic loss. The \({\mathcal {L}}_{\textrm{Patch}}(Y)\) term resembles the identity loss [8] and is applied between the images \(y \in {\mathcal {Y}}\) and their translated versions. The ConStructS approach is a one-sided unpaired translation method that relies on no additional generators or discriminators; imposing the \({\mathcal {L}}_{\textrm{Patch}}(Y)\) component is necessary to prevent degenerate solutions from the generator.
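Putting the pieces together, the generator objective of Eq. (7) can be sketched as follows, reusing the loss helpers from the previous sketches. The default weights of 1 are placeholders; the actual hyperparameters are reported in the suppl. material.

```python
def constructs_generator_loss(g_loss, x, t_x, feats_x, feats_tx, feats_y, feats_ty,
                              lam_x=1.0, lam_y=1.0, lam_ss=1.0, tau=0.07):
    """Eq. (7): adversarial + PatchNCE(X) + PatchNCE(Y) + semantic loss."""
    loss = g_loss                                             # adversarial term from the GAN sketch
    loss = loss + lam_x * patch_nce(feats_x, feats_tx, tau)   # PatchNCE between x and G_xy(x)
    loss = loss + lam_y * patch_nce(feats_y, feats_ty, tau)   # identity-style PatchNCE between y and G_xy(y)
    loss = loss + lam_ss * semantic_loss(x, t_x)              # MS-SSim regularizer between x and its translation
    return loss
```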

Experiments

In this section, we outline our experiments, in which the performance of several popular unpaired image translation models is compared. The models include CycleGAN [8], the VGG-based perceptual loss [22], DRIT\({++}\) [26], LapMUNIT [4], and UGAT-IT [27] using cycle consistency, as well as one-sided approaches such as GcGAN [19] and DistGAN [20]. Also, various configurations of contrastive-based models were investigated. The CUT model was trained with the SCC loss [10] and SRC [28]. F-LeSim [24], SRUNIT [9] and NEGCUT [23] were trained with CUT as the backbone. We demonstrate the effectiveness of ConStructS in translating synthetic data to the realistic domain with minimal semantic distortion. In particular, the existing baselines exhibit distinct strengths and weaknesses: while certain baselines excel in specific tasks, they may falter in others. Except for LapMUNIT [4] and CycleGAN [8], no tailored approach exists for surgical scenarios. Consequently, we evaluate ConStructS against several other methods to align with the prevailing research.

Finally, we provide a rationale for the design choices made in the ConStructS model to ensure semantic consistency with an ablation study. We train the model without the semantic loss, which reverts to the basic CUT model [13], and without the PatchNCE loss. Similarly, we combined the semantic loss with cycle consistency in the CycleGAN model as a further variant. For implementation details, the reader is referred to the suppl. material.

Data

We evaluated the methods mentioned above on two different surgical datasets consisting of anatomical structures such as the liver, liver ligament, gallbladder, abdominal wall, and pancreas, as well as surgical tools.

Cholecystectomy dataset

This surgery serves to remove the gallbladder. For the simulated domain \({\mathcal {X}}\), we utilized the publicly available synthetic dataset resembling laparoscopic scenes [4]. A total of 20,000 rendered images forms the synthetic dataset. The real images for the domain \({\mathcal {Y}}\) are taken from the Cholec80 dataset [29]. We finally created a training dataset of approximately 26,000 images from 75 patients. A separate segmentation dataset of 5 patients was chosen, in which the liver was manually segmented in 196 images for the downstream evaluation (Sect. 4.2.2). The images were cropped to 256 × 512 pixels, and the training set consists of 17,500 images, with the remaining 2500 serving as the test set.

Gastrectomy dataset

For this case, we utilized the real and synthetic dataset from [5], based on 40 real surgical videos of distal gastrectomy. The dataset consists of 3400 synthetic and 4500 real images with corresponding segmentation masks. 2400 images constituted the training set, with 1000 images as the test set. The images were resized and cropped to 512 × 512 pixels.

Evaluation

We adopted two different schemes to assess both the semantic consistency of the translated images and their usefulness as training data.

Train:Real \(\xrightarrow {}\)Eval:Synthetic

Firstly, we adopted the practice of computing metrics based on an off-the-shelf segmentation model, following [5, 11, 13, 19]. We train a segmentation model on the real images of the specific dataset. Then the translated synthetic images are tested using this pre-trained model, i.e., the metrics are computed against the ground-truth labels of the synthetic images. The underlying intuition is that, if the translation model is able to reduce the domain gap, then the segmentation accuracy of this pre-trained model on the translated synthetic images will be higher [7]. This method assesses both the quality and the semantic consistency of the translated images. We refer to this method as consistency evaluation.
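A minimal sketch of this consistency evaluation is shown below; `seg_model` (pretrained on real images) and `translated_loader` (translated synthetic images paired with their original synthetic labels) are assumed inputs, and the mIoU computation is illustrative rather than the exact metric implementation used in the paper.

```python
import torch

@torch.no_grad()
def consistency_eval(seg_model, translated_loader, num_classes):
    """Mean IoU of a real-pretrained segmentation model on translated synthetic images."""
    seg_model.eval()
    inter = torch.zeros(num_classes)
    union = torch.zeros(num_classes)
    for translated_img, synthetic_mask in translated_loader:
        pred = seg_model(translated_img).argmax(dim=1)       # (B, H, W) class predictions
        for c in range(num_classes):
            p, g = pred == c, synthetic_mask == c
            inter[c] += (p & g).sum()
            union[c] += (p | g).sum()
    iou = inter / union.clamp(min=1)                         # per-class IoU against synthetic GT labels
    return iou.mean()                                        # higher mIoU = smaller domain gap
```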

Translated images as training data

Furthermore, we assess the practical utility of the translated images in a downstream task in two different ways. Firstly, we train a segmentation model using only the translated images and evaluate its performance on segmenting the liver in real images. Secondly, we fine-tune this model on the real data and evaluate it on the same test set of real images. The performance is also compared to a baseline model trained only on real images. This approach aligns with the intuition mentioned above and provides insights into the realism of the translated images. We report the mean dice scores for this method. Hereafter, we refer to this method as downstream evaluation.
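The protocol can be sketched as follows; `make_segmentation_model` and `train_segmentation` are hypothetical helpers standing in for the actual training setup, and the Dice computation is shown for a single foreground (liver) class.

```python
import torch

@torch.no_grad()
def mean_dice(model, loader, organ_class=1, eps=1e-6):
    """Mean Dice on real test images for one foreground class (here: liver)."""
    model.eval()
    scores = []
    for img, mask in loader:
        pred = model(img).argmax(dim=1) == organ_class
        gt = mask == organ_class
        scores.append((2 * (pred & gt).sum() + eps) / (pred.sum() + gt.sum() + eps))
    return torch.stack(scores).mean()

def downstream_eval(translated_loader, real_train_loader, real_test_loader):
    model = make_segmentation_model()                    # hypothetical model factory
    train_segmentation(model, translated_loader)         # (1) train on translated images only
    dice_translated_only = mean_dice(model, real_test_loader)
    train_segmentation(model, real_train_loader)         # (2) fine-tune on real training images
    dice_finetuned = mean_dice(model, real_test_loader)
    return dice_translated_only, dice_finetuned
```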

Table 1 Consistency evaluation results of various translation models on the cholecystectomy dataset. pxAcc and clsAcc denote the pixel and mean class accuracy, respectively. mIOU denotes the mean intersection over union score

Results

Cholecystectomy dataset

The quantitative results are presented in Table 1, highlighting the performance of different models. Comparatively, the CycleGAN model with the VGG loss demonstrates better performance than the SCC variant. The geometric consistency in GcGAN [19] leads to a class-accuracy value comparable to ConStructS while outperforming DistGAN [20] and DRIT++ [26]. The LapMUNIT [4] model achieves better scores than the attention-based models. As for the variants of CUT, the addition of the SCC loss did not improve its performance further. Overall, as evidenced by the results, the ConStructS model minimizes semantic distortion to a greater extent and outperforms the recent methods.

Table 2 The quantitative results (mean dice scores) for the downstream evaluation. Pretraining followed by fine-tuning on real images leads to a considerable performance gain when using images from the ConStructS method

Table 2 shows the results of the downstream evaluation. When only the translated synthetic images are used as training data, the ConStructS model yields results on liver segmentation comparable to GcGAN [19]. A gain of \(3\%\) in dice score is obtained compared to the baseline model. Fine-tuning the same model on real data shows that the ConStructS method outperforms all the models, with an improvement of \(22\%\) compared to the baseline. The qualitative results in Fig. 3 indicate that the ConStructS model reduces semantic distortion, although not completely, but better than most other translation methods.

Gastrectomy dataset

As presented in Table 3, quantitative analysis reveals that LapMUNIT [4] outperforms both GcGAN [19] and CycleGAN [8] models. Conversely, the ConStructS model significantly mitigates semantic mismatches and exhibits a moderate improvement in performance compared to all the other models. Readers can refer to suppl. material for additional results.

Fig. 3

Qualitative results of various translation methods on the cholecystectomy dataset. At the junction of two structures, the textures were interchanged in most of the models. Although not solved completely, the ConStructS model reduces semantic inconsistency. Some regions are highlighted in white boxes

Ablation study

The qualitative results of the ablation study are presented in Fig. 4. When examining ConStructS without the semantic loss, i.e., the basic CUT model, we observe that the structure is well preserved during translation. However, there is a noticeable texture mismatch in regions with reduced brightness. In the absence of the PatchNCE loss, as there is no explicit control over image patches, structure information is mixed, resulting in different styles (e.g., fat or blood) being mapped to unlikely structures. Lastly, the combination of the semantic loss with the CycleGAN model yields an improvement compared to the basic CycleGAN model. Nevertheless, as seen from Table 1, this combination still falls short in performance.

Table 3 The quantitative results of the consistency evaluation on the gastrectomy dataset
Fig. 4

Qualitative results of the ablation study on the cholecystectomy dataset. Texture mismatch occurs in low-lighting regions without the semantic loss. As seen from the 2nd row, without the PatchNCE loss no explicit boundary exists between the liver and abdominal wall, leading to both regions having the same semantic texture

Discussion

Traditional approaches, such as DistGAN [20] or the \(L_1\) reconstruction loss in CycleGAN [8], typically do not effectively enhance semantic consistency. They are susceptible to structural transformations and variations in lighting conditions, which can introduce artifacts during translation (Fig. 3). While SRUNIT [9] and CUT [13] show promise in reducing semantic distortion, they alone are insufficient for the surgical application. In contrast, the NEGCUT [23] model aims to preserve the overall structure during translation but is less accurate in mapping textures between these structures. The same limitation has been observed in the F-LeSim [24] model. Although LapMUNIT [4] utilizes the semantic loss with cycle consistency, semantic inconsistency still prevails and is reflected in the results (Fig. 3). Enforcing the perceptual loss [22] with additional networks did not improve performance.

The results of our ablation study demonstrate the crucial role of combining PatchNCE with the semantic loss in mitigating semantic distortion. We posit that leveraging the contrastive learning approach makes it possible to learn higher-level attributes, such as organ or tool structures. However, relying solely on this aspect for matching semantic information has limitations [9]. To address this, we introduced the semantic loss as a regularizer that operates on multiple scales of the images (i.e., different resolutions). This loss additionally checks the images’ perceptual quality, factoring in the challenging lighting conditions (Eq. 4). This combination of losses proves effective in preserving the semantic characteristics throughout the translation process.

Limitations

The ConStructS model holds promise for mitigating semantic inconsistencies; however, it is essential to acknowledge its limitations. Notably, this method overlooks the synthesis of multi-modal data. By incorporating multi-modal outcomes with additional a priori information (such as segmentation masks), this model emerges as a promising candidate for generating structure-specific and diverse surgical images. Additionally, adding per-frame consistency would enable the generation of temporally consistent surgical video datasets. As a future line of work, we believe ConStructS to be a valuable model to address the challenges in developing annotated video datasets.

Conclusion

In conclusion, we conducted an empirical investigation of the issue of semantic inconsistency in unpaired image translation, focusing on its relevance to surgical applications where labeled data are scarce. We extensively evaluated several state-of-the-art unpaired translation methods, explicitly targeting the translation of images from a simulated domain to a realistic environment. Addressing the problem of semantic distortion, we found a novel combination of a structural similarity metric with contrastive learning to be the most effective. Surprisingly, this simple model reduces semantic distortion while preserving the realism of the translated images and shows the highest utility as training data for downstream tasks.