Unsupervised Domain Transfer with Conditional Invertible Neural Networks

Synthetic medical image generation has evolved as a key technique for neural network training and validation. A core challenge, however, remains in the domain gap between simulations and real data. While deep learning-based domain transfer using Cycle Generative Adversarial Networks and similar architectures has led to substantial progress in the field, there are use cases in which state-of-the-art approaches still fail to generate training images that produce convincing results on relevant downstream tasks. Here, we address this issue with a domain transfer approach based on conditional invertible neural networks (cINNs). As a particular advantage, our method inherently guarantees cycle consistency through its invertible architecture, and network training can efficiently be conducted with maximum likelihood training. To showcase our method's generic applicability, we apply it to two spectral imaging modalities at different scales, namely hyperspectral imaging (pixel-level) and photoacoustic tomography (image-level). According to comprehensive experiments, our method enables the generation of realistic spectral data and outperforms the state of the art on two downstream classification tasks (binary and multi-class). cINN-based domain transfer could thus evolve as an important method for realistic synthetic data generation in the field of spectral imaging and beyond.


Fig. 1: Pipeline for data-driven spectral image analysis in the absence of labeled reference data. A physics-based simulation framework generates simulated spectral images with corresponding reference labels (e.g., tissue type or oxygenation (sO2)). Our domain transfer method based on cINNs leverages unlabeled real data to increase their realism. The domain-transferred data can then be used for supervised training of a downstream task (e.g., classification).
The success of supervised learning methods in the medical domain has led to countless breakthroughs that might be translated into clinical routine and have the potential to revolutionize healthcare [8,15]. For many applications, however, labeled reference data (ground truth) may not be available for training and validating a neural network in a supervised manner. One such application is spectral imaging, which comprises various non-interventional, non-ionizing imaging techniques that can resolve functional tissue properties such as blood oxygenation in real time [1,3,4,25]. While simulations have the potential to overcome the lack of ground truth, synthetic data is not yet sufficiently realistic [11]. Cycle generative adversarial network (GAN)-based architectures are widely used for domain transfer [5,14] but may suffer from issues such as unstable training, hallucinations, or mode collapse [17]. Furthermore, they have predominantly been used for conventional RGB imaging and one-channel cross-modality domain adaptation, and may not be suitable for other imaging modalities with more channels. We address these challenges with the following contributions:
Domain transfer method: We present an entirely new sim-to-real transfer approach based on conditional invertible neural networks (cINNs) (cf. Fig. 1). Our architecture features inherent cycle consistency and the possibility of conducting maximum likelihood learning while still maintaining the high visual quality of adversarial networks, without the possibility of mode collapse.
Instantiation to spectral imaging: We show that our method can generically be applied to two complementary modalities: photoacoustic tomography (PAT; image-level) and hyperspectral imaging (HSI; pixel-level).
Comprehensive validation: In comprehensive validation studies based on more than 2,000 PAT images (real: ∼ 1,000) and more than 6 million spectra for HSI (real: ∼ 6 million), we investigate and subsequently confirm our two main hypotheses: (H1) Our cINN-based models can close the domain gap between simulated and real spectral data better than current state-of-the-art methods regarding spectral plausibility. (H2) Training models on data transferred by our cINN-based approach can improve their performance on the corresponding (clinical) downstream task without them having seen labeled real data.
Concept overview. Our domain transfer approach (cf. Fig. 2) is based on the assumption that data samples from both domains carry domain-invariant information (e.g., on optical tissue properties) and domain-variant information (e.g., modality-specific artifacts). The invertible architecture, which inherently guarantees cycle consistency, transfers both simulated and real data into a shared latent space. While the domain-invariant features are captured in the latent space, the domain-variant features can either be filtered (during encoding) or added (during decoding) by utilizing a domain label D. The additional tissue label Y for simulated data implicitly carries information to aid the spectral consistency, whereas the randomly generated proxy label for the unlabeled real data does not. The joint distribution is learned using the maximum likelihood loss. In addition, adversarial training with two multiscale discriminators, Dis_sim and Dis_real, ensures high visual quality of the generated data.

Materials and Methods
Model design. The proposed cINN (cf. Fig. 2) is roughly based on the work of Ardizzone et al. [2] and consists of multiple scales i, each comprising N_i chained affine conditional coupling (CC) blocks [9]. These scales are necessary to increase the receptive field of the network and are achieved by Haar wavelet downsampling [13]. A CC block consists of subnetworks that can be freely chosen depending on the data dimensionality (e.g., fully connected or convolutional networks), as they are only evaluated in the forward direction. The CC blocks receive a condition consisting of two parts, the domain label and the tissue label, which are concatenated to the input along the channel dimension. In the case of PAT, the tissue label is a full semantic segmentation map for the simulated data and a random segmentation map for the real data. In the case of HSI, the tissue label is a one-hot encoded vector of organ labels.
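To make the role of the condition concrete, the following minimal sketch implements a single affine conditional coupling block in plain Python. It is an illustration under stated assumptions, not the network used in this work: `toy_subnet` is a hypothetical fixed function standing in for a learned subnetwork, the condition is a two-element one-hot domain label concatenated with a three-element one-hot tissue label, and the input is a four-element toy spectrum. The sketch shows that the block is exactly invertible for a fixed condition and that decoding with a different domain label yields a different sample.

```python
import math

def toy_subnet(u, cond, n_out):
    # Hypothetical stand-in for a learned subnetwork. It sees the untouched
    # half `u` concatenated with the condition and returns one log-scale and
    # one shift per element of the half that gets transformed.
    inp = u + cond
    h = sum((k + 1) * v for k, v in enumerate(inp))
    log_s = [math.tanh(0.1 * h + 0.05 * j) for j in range(n_out)]
    shift = [0.5 * h - 0.2 * j for j in range(n_out)]
    return log_s, shift

def cc_forward(x, cond):
    # Affine coupling: the first half passes through unchanged, the second
    # half is scaled and shifted by values computed from the first half.
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, shift = toy_subnet(x1, cond, len(x2))
    y2 = [v * math.exp(s) + t for v, s, t in zip(x2, log_s, shift)]
    log_det = sum(log_s)  # log |det Jacobian| of the affine transform
    return x1 + y2, log_det

def cc_inverse(y, cond):
    # The subnetwork is again evaluated in the forward direction only;
    # inversion just undoes the shift and the scale.
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, shift = toy_subnet(y1, cond, len(y2))
    x2 = [(v - t) * math.exp(-s) for v, s, t in zip(y2, log_s, shift)]
    return y1 + x2

# Assumed encoding of the two-part condition: domain label + tissue label.
D_SIM, D_REAL = [1.0, 0.0], [0.0, 1.0]
y_tissue = [0.0, 1.0, 0.0]

x_sim = [0.3, 0.7, 0.5, 0.9]                     # toy "simulated" sample
z, log_det = cc_forward(x_sim, D_SIM + y_tissue)  # encode with D_sim, Y_sim
x_back = cc_inverse(z, D_SIM + y_tissue)          # same condition: round trip
x_real_like = cc_inverse(z, D_REAL + y_tissue)    # decode with D_real, Y_sim

print("round-trip error:", max(abs(a - b) for a, b in zip(x_sim, x_back)))
print("transferred sample:", x_real_like)
```

A single block leaves the first half of the input untouched, which is why several blocks are chained in practice, typically with the roles of the two halves swapped or permuted between blocks.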
Model training. In the following, the proposed cINN with its parameters θ will be referred to as f(x, DY, θ) and its inverse as f^{-1}, for any input x ∼ p_D from domain D ∈ {D_sim, D_real} with prior density p_D and its corresponding latent space variable z. The condition DY is the combination of the domain label D and the tissue label Y ∈ {Y_sim, Y_real}. The network is trained with the maximum likelihood loss ML, evaluated for each training sample x_i. For the adversarial training, we employ the least squares training scheme [20] for generator Gen_D = f_D^{-1} ∘ f_{D'} and discriminator Dis_D for each domain, with x_{D'} as input from the source domain and x_D as input from the target domain. The full loss for the proposed model combines the maximum likelihood and adversarial terms.
Model inference. The domain transfer is done in two steps: 1) A simulated image is encoded in the latent space with conditions D_sim and Y_sim to its latent representation z. 2) z is decoded to the real domain via D_real with the simulated tissue label Y_sim.
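The training objectives and the two-step inference described above can be written out as follows. This is a sketch under assumptions rather than a reproduction of the original formulas: it assumes a standard normal latent prior (the usual choice for cINNs), the common least squares GAN targets of 1 for real and 0 for generated samples, and it uses D' as an assumed notation for the respective other (source) domain. Any weighting between the maximum likelihood and adversarial terms in the full loss is left unspecified.

```latex
% Maximum likelihood loss for one sample x_i (assumes a standard normal latent prior)
\mathcal{L}_{\mathrm{ML}}(x_i) = \frac{\lVert f(x_i, DY, \theta) \rVert_2^2}{2} - \log \lvert \det J_i \rvert,
\qquad J_i = \left. \frac{\partial f}{\partial x} \right|_{x_i}

% Least squares adversarial losses for target domain D (assumed targets: 1 = real, 0 = generated)
\mathcal{L}_{\mathrm{Gen}_D} = \mathbb{E}_{x_{D'}} \left[ \left( \mathrm{Dis}_D(\mathrm{Gen}_D(x_{D'})) - 1 \right)^2 \right]

\mathcal{L}_{\mathrm{Dis}_D} = \mathbb{E}_{x_D} \left[ \left( \mathrm{Dis}_D(x_D) - 1 \right)^2 \right]
  + \mathbb{E}_{x_{D'}} \left[ \mathrm{Dis}_D(\mathrm{Gen}_D(x_{D'}))^2 \right]

% Two-step inference: encode with (D_sim, Y_sim), decode with (D_real, Y_sim)
z = f(x_{\mathrm{sim}}, D_{\mathrm{sim}} Y_{\mathrm{sim}}, \theta),
\qquad x_{\mathrm{sim} \to \mathrm{real}} = f^{-1}(z, D_{\mathrm{real}} Y_{\mathrm{sim}}, \theta)
```

The first term of the maximum likelihood loss pulls latent codes toward the origin, while the log-determinant term rewards the network for not collapsing the input volume, which is what makes exact likelihood training possible for invertible architectures.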

Spectral Imaging Data
Photoacoustic tomography data. PAT is a non-ionizing imaging modality that enables the imaging of functional tissue properties such as tissue oxygenation [24]. The real PAT data (cf. Fig. 3) used in this work are images of human forearms that were recorded from 30 healthy volunteers using the MSOT Acuity Echo (iThera Medical GmbH, Munich, Germany) (all regulations followed under study ID: S-451/2020, and the study is registered with the German Clinical Trials Register under reference number DRKS00023205). In this study, 16 wavelengths from 700 nm to 850 nm in steps of 10 nm were recorded for each image. The resulting 180 images were semantically segmented into the structures shown in Fig. 3 according to the annotation protocol provided in [22]. Additionally, a full sweep of each forearm was performed to generate more unlabeled images, thus amounting to a total of 955 real images. The simulated PAT data (cf. Fig. 3) used in this work comprises 1,572 simulated images of human forearms. They were generated with the toolkit for Simulation and Image Processing for Photonics and Acoustics (SIMPA) [10] based on a forearm literature model [23] and with a digital device twin of the MSOT Acuity Echo.
Fig. 3: Training data used for the validation experiments. For PAT, 960 real images from 30 volunteers were acquired. For HSI, more than six million spectra corresponding to 460 images and 20 individuals were used. The tissue labels for PAT correspond to 2D semantic segmentations, whereas the tissue labels for HSI represent 10 different organs. For PAT, ∼ 1600 images were simulated, whereas around 210,000 spectra were simulated for HSI.
Hyperspectral imaging data. HSI is an emerging modality with high potential for surgery [6]. In this work, we performed pixel-wise analysis of HSI images. The real HSI data was acquired with the Tivita® Tissue (Diaspective Vision GmbH, Am Salzhaff, Germany) camera, featuring a spectral resolution of approximately 5 nm in the spectral range between 500 nm and 1000 nm. In total, 458 images, corresponding to 20 different pigs, were acquired (all regulations followed under study IDs: 35-9185.81/G-161/18 and 35-9185.81/G-262/19) and annotated with ten structures: bladder, colon, fat, liver, omentum, peritoneum, skin, small bowel, spleen, and stomach (cf. Fig. 3). This amounts to 6,410,983 real spectra in total. The simulated HSI data was generated with a Monte Carlo method (cf. algorithm provided in the supplementary material). This procedure resulted in 213,541 simulated spectra with annotated organ labels.

Experiments and Results
The purpose of the experiments was to investigate hypotheses H1 and H2 (cf. Sec. 1). As the state-of-the-art baseline for the experiments, an unsupervised image-to-image translation (UNIT) network [18] was implemented both in its original (fully convolutional) version and in a version adapted to the one-dimensional HSI data. To make the comparison fair, the tissue label conditions were concatenated with the input.
Realism of synthetic data (H1): According to qualitative analyses (Fig. 4), our domain transfer approach improves simulated PAT images with respect to key properties, including the realism of skin, background, and sharpness of vessels. A principal component analysis (PCA) performed on all artery and vein spectra of the real and synthetic datasets demonstrates that the distribution of the synthetic data is much closer to the real data after applying our domain transfer approach (cf. Fig. 5a). The same holds for the absolute difference, as shown in Fig. 5b. Slightly better performance was achieved with the cINN compared to the UNIT. Similarly, our approach improves the realism of HSI spectra, as illustrated in Fig. 6 for spectra of five exemplary organs (colon, stomach, omentum, spleen, and fat). The cINN-transferred spectra generally match the real data very closely. Failure cases where the real data has a high variance (translucent band) are also shown.
Benefit of domain-transferred data for downstream tasks (H2): We examined two classification tasks for which reference data generation was feasible: classification of veins/arteries in PAT and organ classification in HSI. For both modalities, we used the completely untouched real test sets, comprising 162 images in the case of PAT and ∼ 920,000 spectra in the case of HSI. For both tasks, a calibrated random forest classifier (sklearn [21] with default parameters) was trained on the simulated, the domain-transferred (by UNIT and cINN), and the real spectra. As metrics, the balanced accuracy (BA), area under the receiver operating characteristic curve (AUROC), and F1-score were selected based on [19].
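For readers unfamiliar with the three metrics, the sketch below computes them in plain Python on a tiny, made-up artery/vein example. The labels and scores are hypothetical and serve only to illustrate the definitions. The support-weighted aggregation of the F1-score is an assumption about how class imbalance is handled. In practice, `balanced_accuracy_score`, `roc_auc_score`, and `f1_score` from `sklearn.metrics` provide these computations.

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of the per-class recalls, so each class counts equally
    # regardless of how many samples it has.
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def f1_for_class(y_true, y_pred, c):
    # F1 = 2TP / (2TP + FP + FN) for class c treated as the positive class.
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    # Assumed aggregation: per-class F1 weighted by class support.
    n = len(y_true)
    return sum(
        y_true.count(c) / n * f1_for_class(y_true, y_pred, c)
        for c in set(y_true)
    )

def binary_auroc(y_true, scores, positive):
    # Probability that a random positive sample receives a higher score
    # than a random negative sample (ties count as one half).
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: four artery and two vein spectra.
y_true = ["artery", "artery", "artery", "artery", "vein", "vein"]
vein_score = [0.1, 0.2, 0.4, 0.6, 0.9, 0.3]  # predicted probability of "vein"
y_pred = ["vein" if s >= 0.5 else "artery" for s in vein_score]

ba = balanced_accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
auroc = binary_auroc(y_true, vein_score, positive="vein")
print(f"BA={ba:.3f}  weighted F1={f1:.3f}  AUROC={auroc:.3f}")
```

In this toy example the plain accuracy (4 of 6 correct) is higher than the balanced accuracy, because the majority class is classified better than the minority class. This is the effect that balanced metrics are meant to expose.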
As shown in Fig. 7, our domain transfer approach dramatically increases the classification performance for both downstream tasks. Compared to physics-based simulation, the cINN obtained a relative improvement of 37% (BA), 25% (AUROC), and 22% (F1-score) for PAT, whereas the UNIT only achieved a relative improvement in the range of 20%-27% (depending on the metric). For HSI, the cINN achieved a relative improvement of 21% (BA), 1% (AUROC), and 33% (F1-score), and it outperformed the UNIT in all metrics except the F1-score. For all metrics, training on real data still yields better results (see supplementary material).
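The relative improvements quoted above are measured against the score obtained when training on simulated data. A minimal sketch of that arithmetic follows. The two balanced accuracy values are hypothetical placeholders chosen only to illustrate the calculation, not results from this work.

```python
def relative_improvement(score, baseline):
    # Improvement of `score` over `baseline`, expressed as a fraction
    # of the baseline (0.37 corresponds to 37%).
    return (score - baseline) / baseline

# Hypothetical balanced accuracies, for illustration only.
ba_simulated = 0.50     # classifier trained on simulated spectra
ba_transferred = 0.685  # classifier trained on domain-transferred spectra

rel = relative_improvement(ba_transferred, ba_simulated)
print(f"relative improvement: {rel:.0%}")
```

Because the measure is relative to the baseline, the same absolute gain translates into a larger percentage when the simulation baseline is low.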

Discussion
With this paper, we presented the first domain transfer approach that combines the benefits of cINNs (exact maximum likelihood estimation) with those of GANs (high image quality). A comprehensive validation involving qualitative and quantitative measures for the remaining domain gap and downstream tasks suggests that the approach is well-suited for sim-to-real transfer in spectral imaging. For both PAT and HSI, the domain gap between simulations and real data could be substantially reduced, and a dramatic increase in downstream task performance was obtained, also when compared to the popular UNIT approach. The only similar work on domain transfer in PAT used a cycle GAN-based architecture on a single wavelength, with only photon propagation as the PAT image simulator instead of full acoustic wave simulation and image reconstruction [16]. This potentially leads to spectral inconsistency in the sense that the spectral information either is lost during translation or remains unchanged from the source domain instead of adapting to the target domain. Outside the spectral/medical imaging community, Liu et al. [18] and Grover et al. [12] employed variational autoencoders and invertible neural networks for each domain, respectively, to create the shared encoding. They both combined this approach with adversarial training to achieve high-quality image generation. Das et al. [7] built upon this approach by using labels from the source domain to condition the domain transfer task. In contrast to previous work, which used en-/decoders for each domain, we train a single network, as shown in Fig. 2, with a two-fold condition consisting of a domain label (D) and a tissue label (Y) from the source domain, which has the advantage of explicitly aiding the spectral domain transfer.
The main limitation of our approach is the high dimensionality of the parameter space of the cINN, as dimensionality reduction of the data is not possible due to the information- and volume-preserving property of INNs. This implies that the method is not suitable for arbitrarily high dimensions. Future work will comprise the rigorous validation of our method with tissue-mimicking phantoms for which reference data are available.
In conclusion, our proposed approach of cINN-based domain transfer is a novel method enabling the generation of realistic spectral data. As it is not limited to spectral data, it could develop into a powerful method for domain transfer in the absence of labeled real data for a wide range of image modalities in the medical domain and beyond.


Fig. 2: Proposed architecture based on cINNs. The invertible architecture transfers both simulated and real data into a shared latent space (right). By conditioning on the domain D (bottom), a latent vector can be transferred to either the simulated or the real domain (left), for which the discriminators Dis_sim and Dis_real calculate the losses for adversarial training.

Fig. 4: Qualitative results. In comparison to simulated PAT images (left), images generated by the cINN (middle) resemble real PAT images (right) more closely. All images show a human forearm at 800 nm.

Fig. 5: Our domain transfer approach yields realistic spectra (here: of veins). The PCA plots in a) represent a kernel density estimation of the first and second components of a PCA embedding of the real data, which represent about 67% and 6% of the variance in the real data, respectively. The distributions on top and on the right of the PCA plot correspond to the marginal distributions of each dataset's first two components. b) Violin plots show that the cINN yields spectra that feature a smaller difference to the real data compared to the simulations and the UNIT-generated data. The dashed lines represent the mean difference value, and each dot represents the difference for one wavelength.

Fig. 6: The cINN-transferred spectra are in closer agreement with the real spectra than the simulations and the UNIT-transferred spectra. Spectra for five exemplary organs are shown from 500 nm to 1000 nm. For each subplot, a zoom-in for the near-infrared region (> 900 nm) is shown. The translucent bands represent the standard deviation across spectra for each organ.

Fig. 7: Both domain transfer methods increase the classification performance compared to simulated data, but the cINN outperforms the UNIT. BA, AUROC, and F1-score values were weighted and aggregated to account for class imbalance for both a) artery-vein classification in the case of PAT and b) organ classification in the case of HSI. The zero-baseline corresponds to the reference values of the simulated data.