
1 Introduction

The success of supervised learning methods in the medical domain has led to countless breakthroughs that may be translated into clinical routine and have the potential to revolutionize healthcare [6, 13]. For many applications, however, labeled reference data (ground truth) may not be available for training and validating a neural network in a supervised manner. One such application is spectral imaging, which comprises various non-interventional, non-ionizing imaging techniques that can resolve functional tissue properties such as blood oxygenation in real time [1, 3, 23]. While simulations have the potential to overcome the lack of ground truth, synthetic data is not yet sufficiently realistic [9]. Architectures based on Cycle Generative Adversarial Networks (GANs) are widely used for domain transfer [12, 24] but may suffer from issues such as unstable training, hallucinations, or mode collapse [15]. Furthermore, they have predominantly been used for conventional RGB imaging and one-channel cross-modality domain adaptation, and may not be suitable for imaging modalities with more channels. We address these challenges with the following contributions:

Fig. 1. Pipeline for data-driven spectral image analysis in the absence of labeled reference data. A physics-based simulation framework generates simulated spectral images with corresponding reference labels (e.g. tissue type or oxygenation (sO\(_2\))). Our domain transfer method based on cINNs leverages unlabeled real data to increase their realism. The domain-transferred data can then be used for supervised training of a downstream task (e.g. classification).

Domain Transfer Method: We present an entirely new sim-to-real transfer approach based on conditional invertible neural networks (cINNs) (cf. Fig. 1) specifically designed for data with many spectral channels. This approach inherently addresses weaknesses of the state of the art with respect to the preservation of spectral consistency and, importantly, does not require paired images.

Instantiation to Spectral Imaging: We show that our method can generically be applied to two complementary modalities: photoacoustic tomography (PAT; image-level) and hyperspectral imaging (HSI; pixel-level).

Comprehensive Validation: In comprehensive validation studies based on more than 2,000 PAT images (real: \(\sim \)1,000) and more than 6 million spectra for HSI (real: \(\sim \)6 million) we investigate and subsequently confirm our two main hypotheses: (H1) Our cINN-based models can close the domain gap between simulated and real spectral data better than current state-of-the-art methods regarding spectral plausibility. (H2) Training models on data transferred by our cINN-based approach can improve their performance on the corresponding (clinical) downstream task without them having seen labeled real data.

2 Materials and Methods

2.1 Domain Transfer with Conditional Invertible Neural Networks

Concept Overview. Our domain transfer approach (cf. Fig. 2) is based on the assumption that data samples from both domains carry domain-invariant information (e.g. on optical tissue properties) and domain-variant information (e.g. modality-specific artifacts). The invertible architecture, which inherently guarantees cycle consistency, transfers both simulated and real data into a shared latent space. While the domain-invariant features are captured in the latent space, the domain-variant features can either be filtered (during encoding) or added (during decoding) by utilizing a domain label D. To achieve spectral consistency, we leverage the fact that different tissue types feature characteristic spectral signatures and condition the model on the tissue label Y if available. For unlabeled (real) data, we use randomly generated proxy labels instead. To achieve high visual quality beyond spectral consistency, we include two discriminators \(Dis_{sim}\) and \(Dis_{real}\) for their respective domains. Finally, as a key theoretical advantage, we avoid mode collapse with maximum likelihood optimization. Implementation details are provided in the following.

Fig. 2. Proposed architecture based on cINNs. The invertible architecture transfers both simulated and real data into a shared latent space (right). By conditioning on the domain D (bottom), a latent vector can be transferred to either the simulated or the real domain (left), for which the discriminators \(\text {Dis}_\text {sim}\) and \(\text {Dis}_\text {real}\) calculate the losses for adversarial training.

cINN Model Design. The core of our architecture is a cINN [2] (cf. Fig. 2) comprising multiple scales \(i\), each consisting of \(N_i\) chained affine conditional coupling (CC) blocks [7]. The scales are necessary to increase the receptive field of the network and are realized via Haar wavelet downsampling [11]. A CC block consists of subnetworks that can be freely chosen depending on the data dimensionality (e.g. fully connected or convolutional networks), as they are only evaluated in the forward direction. The CC blocks receive a condition consisting of two parts, the domain label and the tissue label, which are concatenated to the input along the channel dimension. In the case of PAT, the tissue label is a full semantic segmentation map for the simulated data and a random segmentation map for the real data. In the case of HSI, the tissue label is a one-hot encoded vector of organ labels.
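For concreteness, the following is a minimal sketch of a single affine conditional coupling block in PyTorch, assuming one-dimensional (spectral) inputs, a fully connected subnetwork, and a condition vector that already concatenates the domain and tissue labels. Layer sizes, the clamping scheme, and all names are illustrative assumptions rather than the exact configuration used in this work.

```python
# Sketch of one affine conditional coupling (CC) block for 1D spectral inputs.
# The subnetwork is only evaluated in the forward direction, so it can be any
# architecture (fully connected here; convolutional for image-level PAT data).
import torch
import torch.nn as nn


class ConditionalAffineCoupling(nn.Module):
    def __init__(self, n_channels: int, n_cond: int, hidden: int = 128):
        super().__init__()
        self.split = n_channels // 2
        self.subnet = nn.Sequential(
            nn.Linear(self.split + n_cond, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (n_channels - self.split)),
        )

    def forward(self, x, cond):
        x1, x2 = x[:, :self.split], x[:, self.split:]
        s, t = self.subnet(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                       # soft clamping keeps the scales stable
        y2 = x2 * torch.exp(s) + t              # affine transform of the second half
        log_jac = s.sum(dim=1)                  # log |det J| contributed by this block
        return torch.cat([x1, y2], dim=1), log_jac

    def inverse(self, y, cond):
        y1, y2 = y[:, :self.split], y[:, self.split:]
        s, t = self.subnet(torch.cat([y1, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)           # exact inverse of the affine map
        return torch.cat([y1, x2], dim=1)
```

In a full cINN, several such blocks are chained per scale (with channel permutations in between), and Haar wavelet downsampling is applied between scales.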

Model Training. In the following, the proposed cINN with its parameters \(\theta \) is referred to as \(f(x, DY,\theta )\) and its inverse as \(f^{-1}\), for any input \(x\sim p_D\) from domain \(D\in \{D_{sim}, D_{real}\}\) with prior density \(p_D\) and corresponding latent space variable z. The condition DY combines the domain label D and the tissue label \(Y\in \{Y_{sim}, Y_{real}\}\). The maximum likelihood loss \(\mathcal{M}\mathcal{L}\) for a training sample \(x_i\) is then given by

$$\begin{aligned} \underset{D}{\mathcal{M}\mathcal{L}}=\mathbb {E}_{i}\left[ \frac{\Vert f(x_i,DY,\theta )\Vert _2^2}{2} - \log |J_i| \right] \text { with } J_i=\det \left( \left. \frac{\partial f}{\partial x} \right| _{x_i}\right) . \end{aligned}$$
(1)
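With the latent code \(z = f(x_i, DY, \theta )\) and the accumulated log-Jacobian determinant returned by the network, Eq. (1) reduces to a few lines of code; the tensor names in the following sketch are illustrative assumptions.

```python
# Sketch of Eq. (1): z is the latent code, log_jac_det the per-sample sum of
# log |det J| over all coupling blocks (both assumed to be returned by the cINN).
import torch


def max_likelihood_loss(z: torch.Tensor, log_jac_det: torch.Tensor) -> torch.Tensor:
    # 0.5 * ||z||_2^2 corresponds to a standard normal prior on the latent space.
    return torch.mean(0.5 * z.flatten(1).pow(2).sum(dim=1) - log_jac_det)
```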

For the adversarial training, we employ the least squares training scheme [18] for generator \(Gen_D=f^{-1}_D\circ f_{D'}\) and discriminator \(Dis_D\) for each domain with \(x_{D'}\) as input from the source domain and \(x_D\) as input from the target domain:

$$\begin{aligned} \underset{{Gen}_D}{\mathcal {L}}=\underset{x_{D'}\sim p_{D'}}{\mathbb {E}}\left[ (Dis_D(Gen_D(x_{D'}))-1)^2\right] \end{aligned}$$
(2)
$$\begin{aligned} \underset{{Dis}_D}{\mathcal {L}}=\underset{x_D\sim p_D}{\mathbb {E}}\left[ (Dis_D(x_D)-1)^2\right] + \underset{x_{D'}\sim p_{D'}}{\mathbb {E}}\left[ (Dis_D(Gen_D(x_{D'})))^2\right] . \end{aligned}$$
(3)

Finally, the full loss for the proposed model comprises the following:

$$\begin{aligned} \underset{Total_{Gen}}{\mathcal {L}}=\underset{real}{\mathcal{M}\mathcal{L}} + \underset{sim}{\mathcal{M}\mathcal{L}} + \underset{{Gen}_{real}}{\mathcal {L}} + \underset{{Gen}_{sim}}{\mathcal {L}} \text { and } \underset{Total_{Dis}}{\mathcal {L}}=\underset{{Dis}_{real}}{\mathcal {L}} + \underset{{Dis}_{sim}}{\mathcal {L}}. \end{aligned}$$
(4)
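A hedged sketch of Eqs. (2)-(4) in code, assuming the discriminator outputs and the maximum likelihood terms have already been computed (all names are placeholders):

```python
# Least-squares adversarial losses (Eqs. (2)-(3)) and their combination with the
# maximum likelihood terms into the total losses of Eq. (4); inputs are assumed tensors.
import torch


def generator_loss(dis_out_transferred: torch.Tensor) -> torch.Tensor:
    # The generator (f^{-1}_D composed with f_{D'}) tries to drive Dis_D towards 1.
    return torch.mean((dis_out_transferred - 1.0) ** 2)


def discriminator_loss(dis_out_target: torch.Tensor,
                       dis_out_transferred: torch.Tensor) -> torch.Tensor:
    # Dis_D pushes target-domain samples towards 1 and transferred samples towards 0.
    return torch.mean((dis_out_target - 1.0) ** 2) + torch.mean(dis_out_transferred ** 2)


def total_losses(ml_real, ml_sim, gen_real, gen_sim, dis_real, dis_sim):
    # Eq. (4): the cINN is optimized on the generator total, the discriminators on theirs.
    return ml_real + ml_sim + gen_real + gen_sim, dis_real + dis_sim
```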

Model Inference. The domain transfer is done in two steps: 1) A simulated image is encoded in the latent space with conditions \(D_{sim}\) and \(Y_{sim}\) to its latent representation z, 2) z is decoded to the real domain via \(D_{real}\) with the simulated tissue label \(Y_{sim}\): \(x_{sim \rightarrow real}= f^{-1}(\cdot , D_{real}Y_{sim},\theta ) \circ f(\cdot , D_{sim}Y_{sim},\theta )(x_{sim}).\)
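A sketch of this two-step procedure is given below; `cinn`, `make_condition`, and the label arguments are assumed placeholders for the trained model and the concatenated condition described in Sect. 2.1.

```python
# Sim-to-real transfer (sketch): encode with the simulated-domain condition,
# then decode with the real-domain condition while keeping the simulated tissue label.
import torch


@torch.no_grad()
def transfer_sim_to_real(cinn, x_sim, y_sim, d_sim, d_real, make_condition):
    z, _ = cinn(x_sim, make_condition(d_sim, y_sim))          # step 1: encode (filters sim-specific features)
    x_real = cinn.inverse(z, make_condition(d_real, y_sim))   # step 2: decode with real-domain characteristics
    return x_real
```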

Fig. 3. Training data used for the validation experiments. For PAT, 960 real images from 30 volunteers were acquired. For HSI, more than six million spectra corresponding to 460 images and 20 individuals were used. The tissue labels for PAT correspond to 2D semantic segmentations, whereas the tissue labels for HSI represent 10 different organs. For PAT, \(\sim \)1600 images were simulated, whereas around 210,000 spectra were simulated for HSI.

2.2 Spectral Imaging Data

Photoacoustic Tomography Data. PAT is a non-ionizing imaging modality that enables the imaging of functional tissue properties such as tissue oxygenation [22]. The real PAT data (cf. Fig. 3) used in this work are images of human forearms that were recorded from 30 healthy volunteers using the MSOT Acuity Echo (iThera Medical GmbH, Munich, Germany) (all regulations followed under study ID: S-451/2020, and the study is registered with the German Clinical Trials Register under reference number DRKS00023205). In this study, 16 wavelengths from 700 nm to 850 nm in steps of 10 nm were recorded for each image. The resulting 180 images were semantically segmented into the structures shown in Fig. 3 according to the annotation protocol provided in [20]. Additionally, a full sweep of each forearm was performed to generate more unlabeled images, thus amounting to a total of 955 real images. The simulated PAT data (cf. Fig. 3) used in this work comprises 1,572 simulated images of human forearms. They were generated with the toolkit for Simulation and Image Processing for Photonics and Acoustics (SIMPA) [8] based on a forearm literature model [21] and with a digital device twin of the MSOT Acuity Echo.

Hyperspectral Imaging Data. HSI is an emerging modality with high potential for surgery [4]. In this work, we performed pixel-wise analysis of HSI images. The real HSI data was acquired with the Tivita® Tissue (Diaspective Vision GmbH, Am Salzhaff, Germany) camera, featuring a spectral resolution of approximately 5 nm in the spectral range between 500 nm and 1000 nm. In total, 458 images, corresponding to 20 different pigs, were acquired (all regulations followed under study IDs: 35-9185.81/G-161/18 and 35-9185.81/G-262/19) and annotated with ten structures: bladder, colon, fat, liver, omentum, peritoneum, skin, small bowel, spleen, and stomach (cf. Fig. 3). This amounts to 6,410,983 real spectra in total. The simulated HSI data was generated with a Monte Carlo method (cf. algorithm provided in the supplementary material). This procedure resulted in 213,541 simulated spectra with annotated organ labels.

3 Experiments and Results

The purpose of the experiments was to investigate hypotheses H1 and H2 (cf. Sect. 1). As comparison methods, a CycleGAN [24] and an unsupervised image-to-image translation (UNIT) network [16] were implemented fully convolutionally for PAT and in an adapted version for the one-dimensional HSI data. To make the comparison fair, the tissue label conditions were concatenated with the input, and we put significant effort into optimizing the UNIT on our data.

Fig. 4. Qualitative results. In comparison to simulated PAT images (left), images generated by the cINN (middle) resemble real PAT images (right) more closely. All images show a human forearm at 800 nm.

Realism of Synthetic Data (H1): According to qualitative analyses (Fig. 4), our domain transfer approach improves simulated PAT images with respect to key properties, including the realism of skin and background and the sharpness of vessels. A principal component analysis (PCA) performed on all artery and vein spectra of the real and synthetic datasets demonstrates that the distribution of the synthetic data is much closer to that of the real data after applying our domain transfer approach (cf. Fig. 5a). The same holds for the absolute difference, as shown in Fig. 5b). Slightly better performance was achieved with the cINN compared to the UNIT. Similarly, our approach improves the realism of HSI spectra, as illustrated in Fig. 6 for spectra of five exemplary organs (colon, stomach, omentum, spleen, and fat). The cINN-transferred spectra generally match the real data very closely. Failure cases, in which the real data has a high variance (translucent bands), are also shown.
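A minimal sketch of this PCA-based realism check, assuming the artery and vein spectra are available as arrays of shape (n_samples, n_wavelengths); as in Fig. 5a), the embedding is fit on the real spectra only:

```python
# Sketch of the PCA realism check: the embedding is defined by the real spectra,
# and simulated / domain-transferred spectra are projected into it for comparison.
from sklearn.decomposition import PCA


def project_into_real_embedding(real, simulated, transferred, n_components=2):
    pca = PCA(n_components=n_components).fit(real)   # real data defines the embedding
    return {"real": pca.transform(real),
            "simulated": pca.transform(simulated),
            "cINN": pca.transform(transferred)}
```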

Fig. 5. Our domain transfer approach yields realistic spectra (here: of veins). The PCA plots in a) represent a kernel density estimation of the first and second components of a PCA embedding of the real data, which represent about 67% and 6% of the variance in the real data, respectively. The distributions on top and on the right of the PCA plot correspond to the marginal distributions of each dataset’s first two components. b) Violin plots show that the cINN yields spectra that feature a smaller difference to the real data compared to the simulations and the UNIT-generated data. The dashed lines represent the mean difference value, and each dot represents the difference for one wavelength.

Fig. 6. The cINN-transferred spectra are in closer agreement with the real spectra than the simulations and the UNIT-transferred spectra. Spectra for five exemplary organs are shown from 500 nm to 1000 nm. For each subplot, a zoom-in for the near-infrared region (>900 nm) is shown. The translucent bands represent the standard deviation across spectra for each organ.

Benefit of Domain-Transferred Data for Downstream Tasks (H2): We examined two classification tasks for which reference data generation was feasible: classification of veins/arteries in PAT and organ classification in HSI. For both modalities, we used completely untouched real test sets, comprising 162 images in the case of PAT and \(\sim \)920,000 spectra in the case of HSI. For both tasks, a calibrated random forest classifier (sklearn [19] with default parameters) was trained on the simulated, the domain-transferred (by UNIT and cINN), and the real spectra. As metrics, the balanced accuracy (BA), the area under the receiver operating characteristic curve (AUROC), and the F1 score were selected based on [17].
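The evaluation described above might look roughly as follows; the calibration wrapper and the metric choices follow the text, whereas the macro averaging for the F1 score and all variable names are assumptions.

```python
# Sketch of the downstream evaluation: train a calibrated random forest on a given
# training set (simulated, domain-transferred, or real spectra) and score it on the
# untouched real test set. Assumes integer-encoded class labels (0..n_classes-1).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score


def evaluate_downstream(X_train, y_train, X_test_real, y_test_real):
    clf = CalibratedClassifierCV(RandomForestClassifier())   # sklearn default parameters
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test_real)
    proba = clf.predict_proba(X_test_real)
    if proba.shape[1] == 2:   # binary task, e.g. artery vs. vein in PAT
        auroc = roc_auc_score(y_test_real, proba[:, 1])
    else:                     # multi-class task, e.g. 10 organs in HSI
        auroc = roc_auc_score(y_test_real, proba, multi_class="ovr")
    return {"BA": balanced_accuracy_score(y_test_real, pred),
            "AUROC": auroc,
            "F1": f1_score(y_test_real, pred, average="macro")}
```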

Table 1. Classification scores for different training data. The training data refers to real data, physics-based simulated data, data generated by a CycleGAN, by a UNIT without and with tissue labels (\(\text {UNIT}_\text {Y}\)), and by a cINN without (\(\text {cINN}_\text {D}\)) and with (proposed \(\text {cINN}_\text {DY}\)) tissue labels as condition. Additionally, \(\text {cINN}_\text {DY without GAN}\) refers to a \(\text {cINN}_\text {DY}\) without the adversarial training. The best-performing methods, except if trained on real data, are printed in bold.

As shown in Table 1, our domain transfer approach dramatically increases the classification performance for both downstream tasks. Compared to physics-based simulation, the cINN obtained a relative improvement of 37% (BA), 25% (AUROC), and 22% (F1 score) for PAT, whereas the UNIT only achieved a relative improvement in the range of 20%-27% (depending on the metric). For HSI, the cINN achieved a relative improvement of 21% (BA), 1% (AUROC), and 33% (F1 score) and, except for the F1 score, outperformed the UNIT in all metrics. For all metrics, training on real data still yields better results.

4 Discussion

With this paper, we presented the first domain transfer approach that combines the benefits of cINNs (exact maximum likelihood estimation) with those of GANs (high image quality). A comprehensive validation involving qualitative and quantitative measures of the remaining domain gap and of downstream task performance suggests that the approach is well-suited for sim-to-real transfer in spectral imaging. For both PAT and HSI, the domain gap between simulations and real data could be substantially reduced, and a dramatic increase in downstream task performance was obtained, also in comparison to the popular UNIT approach.

The only similar work on domain transfer in PAT used a CycleGAN-based architecture on a single wavelength, with only photon propagation as the PAT image simulator instead of a full acoustic wave simulation and image reconstruction [14]. This potentially leads to spectral inconsistency in the sense that the spectral information is either lost during translation or remains unchanged from the source domain instead of adapting to the target domain. Outside the spectral/medical imaging community, Liu et al. [16] and Grover et al. [10] used variational autoencoders and invertible neural networks, respectively, one per domain, to create a shared encoding. Both combined this approach with adversarial training to achieve high-quality image generation. Das et al. [5] built upon this approach by using labels from the source domain to condition the domain transfer task. In contrast to previous work, which used separate en-/decoders for each domain, we train a single network, as shown in Fig. 2, with a two-fold condition consisting of a domain label (D) and a tissue label (Y) from the source domain, which has the advantage of explicitly aiding the spectral domain transfer.

The main limitation of our approach is the high dimensionality of the parameter space of the cINN, as dimensionality reduction of the data is not possible due to the information- and volume-preserving property of INNs. This implies that the method is not suitable for arbitrarily high dimensions. Future work will comprise the rigorous validation of our method with tissue-mimicking phantoms for which reference data are available.

In conclusion, our proposed approach of cINN-based domain transfer enables the generation of realistic spectral data. As it is not limited to spectral data, it could develop into a powerful method for domain transfer in the absence of labeled real data for a wide range of imaging modalities in the medical domain and beyond.