1 Introduction

Nuclei instance segmentation, which demands both accurate localization and precise boundary delineation of each cell nucleus, plays an essential role in computer-aided digital pathology analysis (Dey, 2010). It captures rich characteristics of cell nuclei clusters, including their spatial distribution and pleomorphic features, to comprehensively represent the properties of the tumor microenvironment, and is thus valuable for various clinical tasks such as cancer identification and grading (Baručić et al., 2022; Dunne & Going, 2001; Veta et al., 2012).

Fig. 1 Examples of histopathology images and cropped regions of different cancer types from the Kumar dataset (Kumar et al., 2017). From left to right: liver cancer, kidney cancer, and colon cancer

Recently, deep learning-based methods have emerged as a popular line of research for nuclei instance segmentation (Fujii et al., 2021; Ling et al., 2020; Liu et al., 2021; Mertanová et al., 2022; Raza et al., 2019). Nevertheless, these methods still suffer from non-negligible weaknesses: they heavily depend on elaborately labeled images for fully-supervised model training (Feng et al., 2021; He et al., 2021), and their performance degrades drastically under data distribution shifts (also known as domain shifts, e.g., changes in imaging modality, staining technique, and cancer type between training and testing data; Hou et al., 2019; Liu et al., 2020b).

Fig. 2 Illustration of the OCDA setting in a benchmark performing domain adaptation from fluorescence microscopy to histopathology images. Note that, unlike multi-target UDA (Saporta et al., 2021), the cancer type of each image patch is unavailable during training

A promising solution is to introduce unsupervised domain adaptation (UDA) methods, which train a model on a labeled source domain and an unlabeled target domain (Kouw & Loog, 2019). UDA has recently gained considerable traction as a potential means to alleviate the domain shift issue while maintaining label-efficiency (Shen et al., 2021). Notably, there have been several attempts at domain adaptive nuclei instance segmentation (Hsu et al., 2021; Liu et al., 2020a, b), which achieve unsupervised nuclei segmentation in histopathology images by exploiting domain-invariant knowledge from another modality (e.g., fluorescence microscopy).

However, the existing approaches treat the target histopathology image domain as homogeneous. They align the target domain integrally with the source domain, neglecting the intra-domain heterogeneity of histopathology images. Due to inconsistent cancer types, histopathology image patches and cropped regions can exhibit diverse patterns and styles at both the global image level and the local instance level, as depicted in Fig. 1. In this case, a conventional UDA method designed for a uniform target data distribution tends to derive a biased alignment in which only target data with distributions similar to the source data are successfully aligned (Park et al., 2020). Moreover, as these methods only regularize the model according to limited training data, they normally suffer from inferior generalization capability, especially in realistic clinical scenarios where testing images may come from divergent cancer types that do not exist in the training set. To transcend the bottlenecks of these conventional single-source-single-target UDA approaches, it is necessary to explicitly model the heterogeneity within the histopathology image domain.

Table 1 Comparison between OCDA and other DA settings

A trivial solution is to partition the whole target domain into several subdomains, following the settings of multi-target DA (Saporta et al., 2021; Zhang et al., 2022). However, such an approach has notable limitations: it requires domain labels indicating the subdomain of each target sample, and it cannot flexibly accommodate the complexity of the target domain (i.e., the number of subdomains).

In this paper, we propose a novel framework from the perspective of open compound domain adaptation (OCDA) (Liu et al., 2020d) to address the intra-domain heterogeneity of the target histopathology dataset. The task in this setting is to transfer knowledge from a labeled source domain to an unlabeled compound target domain, which contains multiple related yet divergent subdomains without domain labels. In addition, the adapted model in OCDA is expected to possess better generalization capability, so that its performance can be maintained when dealing with data from unseen subdomains at test time, as showcased in Fig. 2. An extensive comparison between OCDA and other UDA scenarios is provided in Table 1.

OCDA is a more realistic yet relatively unexplored setting, with only a few works making early attempts to provide a solution (Gong et al., 2021; Liu et al., 2020d; Park et al., 2020). Nevertheless, they focus on downstream tasks like image classification and semantic segmentation, where image-level semantic features are dominant. There is an absence of OCDA frameworks for instance segmentation, where local-level instance features are equally crucial and indispensable. As for technical defects, the current works mostly split the compound target domain according to the style features of each sample extracted by a pre-trained model and assign unchangeable domain labels at the beginning of the training stage. Since style feature extraction is performed via models pre-trained on other tasks, noise inevitably exists in the encoded style representations, which causes the partition of the compound target domain to be inaccurate; the model training at each following step would consequently deteriorate. Another shortcoming of the existing methods is their assumption that an unseen testing subdomain can be constructed as a combination of all seen training subdomains, which does not hold for the histopathology image domain, given its complexity and the countless attributes contributing to subdomain variations. In addition, we observe a lack of morphology-level supervision in the image synthesis frameworks deployed by those methods. As a consequence, the transformed images lose essential nucleus shape details and incur misalignment between images and segmentation annotations.

To this end, we propose a novel two-stage disentanglement framework to tackle nuclei instance segmentation in the OCDA setting. It captures the domain-agnostic semantics (content) and the domain-specific modality/stain/cancer factors (style) separately at both the global image level and the local instance level, so that the two levels complement each other. In the first, image-level disentanglement stage, we present a cross-domain image translation network to transform source images into target-like ones. In the second stage, we conduct feature disentanglement at the local level to further alleviate the cross-domain discrepancy in instance-level representations. Considering the aforementioned defects of existing methods, we contribute four technical insights. In Stage I, we first integrate the learning of style encoding with the image translation task and propose a progressive clustering and separation strategy to facilitate style feature extraction during synthesis task learning. Then, drawing inspiration from recent advances in domain generalization, we introduce the style randomization technique (Jackson et al., 2019) for data augmentation; it strengthens the model's robustness and generalizability so that its performance is maintained on unseen testing subdomains. Furthermore, we impose a dual-branch morphological regularization on top of the image translation network to minimize nucleus deformation and misalignment during translation. In Stage II, we devise a global–local style consistency mechanism to stabilize the instance-level domain-invariant feature generation.

Our key contributions can be summarized as follows:

  • We propose a holistic two-stage disentanglement framework for cross-domain nuclei instance segmentation in the OCDA setting to explicitly address the heterogeneity of histopathology images. To the best of our knowledge, this is the first work to model the heterogeneity of histopathology images in UDA and to design an OCDA framework for instance segmentation.

  • To overcome the limitations of the existing OCDA methods, in the global image-level alignment, a progressive clustering and separation strategy is incorporated to benefit the style feature disentanglement. To enhance the model’s generalization capability for unseen testing subdomains, we introduce style randomization to generate fake histopathology images in arbitrary style for data augmentation.

  • In the local instance-level alignment, we leverage the global–local style consistency to facilitate feature disentanglement and domain-invariant representation learning.

  • We further develop a novel regularization module based on semantic masks and object boundaries to preserve the shape and structural details of nuclei in image translation.

  • We comprehensively evaluate our approach and demonstrate its effectiveness on both cross-modality and cross-stain UDA nuclei instance segmentation. It significantly outperforms the state-of-the-art conventional UDA and OCDA methods for unsupervised domain adaptive nuclei instance segmentation in histopathology images.

2 Related Work

2.1 Unsupervised Domain Adaptation

A prominent barrier hampering the application of deep learning-based methods to healthcare is the scarcity of annotated data (Han et al., 2022; Nie & Shen, 2020; Stepec & Skocaj, 2017). The data collection and labeling process heavily depends on domain knowledge and requires exhaustive participation of physicians; as a result, acquiring sufficient data with high-quality annotations can be prohibitively expensive (Cao et al., 2023). Unsupervised domain adaptation (UDA), which aims to address this challenge by transferring domain-invariant knowledge from source domains with labeled data to unannotated target domains, has advanced rapidly and demonstrated its effectiveness in various applications (Dong et al., 2020; Guan & Liu, 2021). One representative approach to the UDA task is learning domain-agnostic features, which mitigates domain discrepancies by minimizing a specific metric (e.g., MMD) (Chen et al., 2019; Gong et al., 2014; Yan et al., 2017) or performing adversarial feature alignment (Tzeng et al., 2017; Zhao et al., 2021). Alternatively, another line of research takes advantage of deep generative models (Benigmim et al., 2023; Hoffman et al., 2018; Li et al., 2021; Zhao et al., 2022) or transform operations (Araslanov & Roth, 2021; Huang et al., 2021a, b; Yang & Soatto, 2020) to align different domains at the image appearance level. An exemplary pipeline in this stream performs cross-domain visual mapping based on the swapping of disentangled attributes (Lee et al., 2020). With the same insight, follow-up works introduce a set of ancillary constituents, such as collaborative training (Zheng et al., 2019; Zou et al., 2020), non-linear modeling (Lee et al., 2021), and identifiability constraints (Kong et al., 2022), to further enhance the fidelity of disentangled representations. Moreover, considering the complementary nature of feature alignment and appearance transform approaches, integrative solutions have been proposed to combine them into a unified framework (Chen et al., 2019; Dong et al., 2020; Liu et al., 2020a); by encouraging mutual interactions and cooperation between the two perspectives of adaptation, they achieve synergistic adaptation and considerably lift the performance. Despite these appealing efforts, UDA assumes that both the source and target domains strictly follow uni-modal distributions (Zhou et al., 2022). This over-simplified paradigm cannot handle the intra-domain heterogeneity across disparate subpartitions and therefore suffers from inferior robustness in the context of multi-modal distributions (Isobe et al., 2021).

2.2 Unsupervised Domain Adaptation for Nuclei Instance Segmentation

With regard to nuclei instance segmentation in microscopy images, some pioneering works are specifically designed to handle the domain shifts in image appearance and object characteristics. The dominant approach is to first perform image translation with a learning-based generative model and subsequently conduct hierarchical feature alignment (Liu et al., 2020b). Auxiliary modules can be developed on top to further facilitate cross-domain generalization, such as task reweighting (Liu et al., 2020a) and pseudo-labeling (Hsu et al., 2021). However, these methods do not take the inner discrepancy of histopathology images into consideration and instead simplify them into a homogeneous target domain. As a result, a biased adaptation tends to be derived, and only minor subpartitions of the target domain can be reasonably aligned (Park et al., 2020; Wang et al., 2019). In pursuit of a balanced and unbiased adaptation procedure, the one-to-one alignment practiced by conventional UDA methods is inadequate (Gong et al., 2021). Recently, Zhang et al. (2022) attempted to exploit the complementarity between H&E-stained and IHC-stained images and perform multi-target DA. Nevertheless, the proposed method cannot address the cancer- and organ-wise heterogeneity within the histopathology image domain, as patch-wise subpartition labels are inaccessible (Kumar et al., 2017).

2.3 Open Compound Domain Adaptation

Taking a step beyond UDA and its assumption of a uni-modal target data distribution, open compound domain adaptation (OCDA) tackles a more challenging yet practical scenario. It models the target domain as a union of multiple subdomains and has shown appealing promise on several benchmarks in image classification (Liu et al., 2020d) and semantic segmentation (Park et al., 2020). To enhance the characterization of the inner structure of the target domain, different training strategies like curriculum learning (Liu et al., 2020d), meta optimization (Gong et al., 2021), and multi-teacher co-regularization (Pan et al., 2022) have been adopted. However, these methods share serious shortcomings: they disregard local-level instance attributes and suffer from biased style encoding. In this work, we propose a holistic representation decomposition framework to bypass their limitations and pave the way for unbiased cross-domain alignment.

2.4 Learning-Based Nuclei Segmentation

In the current literature, deep learning-based methods have become prevalent in the field of nuclei instance segmentation owing to their strong feature representation capability. These methods can generally be divided into two categories, namely proposal-free and proposal-based methods. Proposal-free methods generally follow a two-stage pipeline: first, similar to the semantic segmentation task, each pixel is assigned a label denoting whether it belongs to nuclei or tissue background; then, by exploiting the spatial arrangement and morphological characteristics of the nuclei clusters, a post-processing technique separates the overlapping nucleus entities. As one exemplar, the deep contour-aware network (DCAN) (Chen et al., 2017) is formulated as a multi-task learning framework integrating contour information with object appearance, which contributes to precise separation of attached nuclei. These methods primarily rely on global semantic characteristics yet pay less attention to local object-level properties, and hence struggle to precisely delineate the borders between touching nuclei. In contrast, proposal-based methods depend on global contextual features to a lesser degree. They adopt a segmentation-by-detection procedure and construct the segmentation branch alongside the classification and box regression branches for simultaneous class, box-offset, and segmentation mask prediction (Chen et al., 2023; Liu et al., 2021). In this regard, the local instance-wise attributes can be emphasized, leading to better inter-nuclei separation.

3 Methods

We propose a two-stage disentanglement framework for heterogeneity-aware unsupervised domain adaptive nuclei instance segmentation from the view of OCDA setting, as demonstrated in Figs. 3 and 5. The model is trained sequentially such that the inference results of Stage I (i.e., synthesized target-like source images) are forwarded to Stage II as inputs. We present the details and overall objective function of each stage in this section.

Fig. 3 Overview of Stage I for the proposed two-stage framework. The main objective of Stage I is to mitigate the significant image appearance discrepancy between different modalities and staining techniques with cross-domain image translation. A DRIT (Lee et al., 2020)-like architecture is employed as backbone with several auxiliary modules to overcome its limitations

3.1 Stage I: Cross-Domain Image Translation with Global Image-Level Disentanglement

To mitigate the large appearance discrepancy across images from different modalities and staining techniques, we first perform cross-domain image translation to synthesize target-like source domain images. A previous work (Liu et al., 2020a) resorted to CycleGAN (Zhu et al., 2017) to achieve appearance-level adaptation. However, we observe that the styles of the images synthesized by CycleGAN are dominated by only one or two specific cancer styles, because CycleGAN does not explicitly model the intra-domain heterogeneity. We therefore enhance the image translation with explicit disentanglement of domain-invariant content features and domain-specific style features for more precise modeling of various cancer types.

3.1.1 Backbone

Inspired by DRIT (Lee et al., 2020), we construct the framework with content encoders \(E_c\), style encoders \(E_s\), image generators G, and domain discriminators \(D_{{\textit{image}}}\) for both the source and target domains, as well as a domain-invariant content discriminator \(D_{{\textit{content}}}\). We follow the network weight sharing strategy employed in Lee et al. (2020), in which the last layer of \(E_c\) and the first layer of G are shared across the two domains. A disentangle-swap-reconstruct pipeline is additionally employed to regularize and guarantee the effectiveness of the feature disentanglement procedure.

To be specific, we denote the images and their corresponding annotations from source domain as \(X_{src}=\left\{ (x_{src}, y_{src}) \right\} \) and the unlabeled compound histopathology target domain as \(X_{tgt}=\left\{ x_{tgt}^i \right\} _{i=1}^{N_t}\), where \(N_t\) indicates the number of sub-target domains which is unknown in practice. Given an image \(x_{tgt}\) from target domain, it is concurrently forwarded to the partially shared content encoder \(E_c^{tgt}\) and the domain-specific style encoder \(E_s^{tgt}\) to characterize its histopathological structure information \(z_{c}^{tgt}\) and appearance variation \(z_{s}^{tgt}\) incurred by its modality, stain, and cancer type. Similarly, images \(x_{src}\) from the labeled source domain are also encoded to extract their content and style features \(\left\{ z_{c}^{src}, z_{s}^{src} \right\} \). Subsequently, these disentangled representations are swapped and forwarded to image generators for cross-domain image reconstruction, i.e. \(X_{src}^{{\textit{swap}}}=G^{src}(z_{c}^{tgt}, z_{s}^{src}), \; X_{tgt}^{{\textit{swap}}}=G^{tgt}(z_{c}^{src}, z_{s}^{tgt})\). To maintain the structure and appearance details, the disentangle-swap-reconstruct procedure is repeated on the synthesized fake images to recover the inputs as a cycle, illustrated as \(X_{src}^{{\textit{cycle}}}=G^{src}(E_c^{tgt}(X_{tgt}^{{\textit{swap}}}), E_s^{src}(X_{src}^{{\textit{swap}}})), \; X_{tgt}^{{\textit{cycle}}}=G^{tgt}(E_c^{src}(X_{src}^{{\textit{swap}}}), E_s^{tgt^{}}(X_{tgt}^{{\textit{swap}}}))\).
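The disentangle-swap-reconstruct pipeline can be summarized by the following minimal PyTorch-style sketch; the encoder and generator handles are illustrative stand-ins for the DRIT-like modules described above, not the released implementation:

```python
def disentangle_swap_reconstruct(x_src, x_tgt,
                                 E_c_src, E_s_src, E_c_tgt, E_s_tgt,
                                 G_src, G_tgt):
    # Disentangle: domain-agnostic content and domain-specific style.
    z_c_src, z_s_src = E_c_src(x_src), E_s_src(x_src)
    z_c_tgt, z_s_tgt = E_c_tgt(x_tgt), E_s_tgt(x_tgt)

    # Swap: cross-domain synthesis, e.g. a target-like source image
    # pairs source content with target style.
    x_src_swap = G_src(z_c_tgt, z_s_src)  # source-styled target content
    x_tgt_swap = G_tgt(z_c_src, z_s_tgt)  # target-styled source content

    # Reconstruct: repeat the procedure on the fakes to close the cycle
    # and recover the original inputs.
    x_src_cycle = G_src(E_c_tgt(x_tgt_swap), E_s_src(x_src_swap))
    x_tgt_cycle = G_tgt(E_c_src(x_src_swap), E_s_tgt(x_tgt_swap))
    return x_src_swap, x_tgt_swap, x_src_cycle, x_tgt_cycle
```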

Fig. 4 Illustration of the progressive clustering and separation strategy. In this module, we enforce intra-subdomain style compactness as well as inter-subdomain style separation to benefit feature disentanglement. Considering that the pseudo subdomain labels are highly noisy, especially in the early training stage, we only compute losses based on reliable samples which have high confidence for clustering results. As the style encoder is gradually trained, more samples will become reliable and consequently the style encodings of all image patches will form clear cluster organizations

3.1.2 Progressive Clustering and Separation of Cancer-Specific Subdomains

Given the drastic inner-variance of the target data distribution, a severe defect of DRIT is that it a priori assumes the target domain to be homogeneous and expects the encoded style attributes of all images to follow a uniform distribution. In this case, cancer-incurred style variations would be neglected, leading to distorted feature disentanglement and cross-domain alignment (Park et al., 2020). To transcend this bottleneck, we introduce a progressive clustering and separation strategy for style feature regularization to explicitly explore the inner discrepancy of histopathology images, as illustrated in Fig. 4.

Images from the compound target domain can be categorized into K subdomains based on their disentangled style feature vectors, which represent cancer type-specific and patch-wise low-level texture characteristics. Here, K is a hyper-parameter indicating the number of subdomains over all patches in the target domain; it is related to, yet in practice often not equal to, the number of cancer types within the target domain, due to sampling variations across patches. We first collect the subdomain-specific attributes of each target image, as modeled by the style encoder, into a memory bank and apply K-means clustering (Kanungo et al., 2002) on top to obtain the centroid of each subdomain. The content of the memory bank and the clustering results are updated continually as training progresses. Then, for well-grounded subdomain structuring and separation, we enforce inter-subpartition disparity and intra-subpartition affinity of the style encodings. Specifically, we encourage the style representation of each instance in the target domain to be adjacent to the centroid of its own subdomain and distant from the centroids of the others. Thereafter, we progressively select reliable samples for style encoder optimization and dynamically update the style centroids. At the early stage of model training, most subdomain splits are noisy, and the corresponding image samples are therefore excluded from optimization. As training progresses, the proportion of incorrect subpartition assignments decreases; samples with high inference confidence are then incorporated into training and further improve the style encoding. In practice, we compute Eq. (1) to measure the style similarity between an image and its subdomain centroid as the clustering confidence metric, and Eq. (2) to attain the loss value:

$$\begin{aligned} S_{ij}&=\frac{1}{\sum _{k=1}^K \left( \frac{\left\| {z_s}_i^{{\textit{img}}}-c_j^{{\textit{img}}} \right\| }{\left\| {z_s}_i^{{\textit{img}}}-c_k^{{\textit{img}}} \right\| } \right) ^ \frac{2}{m-1}}, \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {L}_{{\textit{style-cluster}}}^{{\textit{target}}}&=\frac{1}{N^{{\textit{img}}}}\sum _{i=1}^{N^{{\textit{img}}}} \mathbb {1}\left( S_{ij}>\gamma \right) \cdot \left( \left\| {z_s}_i^{{\textit{img}}}-c_j^{{\textit{img}}} \right\| \right. \nonumber \\&\quad \left. -\frac{1}{K-1}\sum _{k=1,k \ne j}^K\left\| {z_s}_i^{{\textit{img}}}-c_k^{{\textit{img}}} \right\| \right) . \end{aligned}$$
(2)

Here, given an image instance i and a subdomain index j, \({z_s}_i^{{\textit{img}}}\) and \(c_{j}^{{\textit{img}}}\) denote the style encoding of the instance and the centroid of the subdomain, respectively. K and \(N^{{\textit{img}}}\) denote the total numbers of target subdomains and image samples, and \(\mathbb {1}(\cdot )\) is the indicator function. We conduct a thorough analysis of the effect of different latent subdomain numbers in the following sections. m is a parameter regulating the fuzziness of the measurement and is set to 2 for \(l_2\) normalization. \(\gamma \) is the confidence threshold: image tiles with confidence lower than \(\gamma \) are considered spurious and are excluded from model training.
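For illustration, the confidence measure of Eq. (1) and the reliability-gated loss of Eq. (2) could be sketched in PyTorch as follows; the tensor names are illustrative, and the real procedure additionally maintains the memory bank and periodically re-runs K-means to refresh the centroids:

```python
import torch

def style_cluster_loss(z, centroids, m=2.0, gamma=0.5):
    """Progressive clustering-and-separation loss (a sketch of Eqs. 1-2).

    z:         (N, D) style encodings drawn from the memory bank
    centroids: (K, D) current K-means centroids
    """
    dists = torch.cdist(z, centroids)               # (N, K) Euclidean distances
    j = dists.argmin(dim=1)                         # nearest centroid per sample
    d_own = dists.gather(1, j[:, None]).squeeze(1)  # distance to own centroid

    # Eq. (1): fuzzy membership as the clustering confidence S_ij.
    ratio = (d_own[:, None] / dists.clamp_min(1e-8)) ** (2.0 / (m - 1.0))
    confidence = 1.0 / ratio.sum(dim=1)             # (N,)

    # Eq. (2): pull reliable samples toward their own centroid and
    # push them away from the mean of the other centroids.
    K = centroids.size(0)
    d_others = (dists.sum(dim=1) - d_own) / (K - 1)
    reliable = (confidence > gamma).float()         # indicator gate
    return (reliable * (d_own - d_others)).mean()
```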

3.1.3 Shape and Structure Preservation Along Image Translation

Image translation techniques based on generative neural network models (e.g., CycleGAN and DRIT) have shown remarkable success in the histopathology domain at handling image appearance variations caused by discrepant image modalities and staining techniques (de Bel et al., 2021; Liu et al., 2020a). However, due to the lack of supervision explicitly inducing shape and structure consistency, the nuclei in synthesized images suffer from severe deformation, which inevitably results in misalignment between synthesized images and instance segmentation labels (showcased in Sect. 5.3).

To this end, we propose to set up two auxiliary blocks on top of the image translation pipeline for precise preservation of nucleus shape and structural details. The synthesized target-like source image is forwarded to both branches in parallel. In the semantic segmentation branch, we employ an RGB-space feature encoder followed by a binary mask predictor to separate nuclei regions from the background. As for the boundary delineation branch, we first transform the input image from the RGB color space to the HED color space and extract the H-space color map for further processing, so as to exploit the unique characteristics of H&E-stained histopathology images and highlight the nuclei boundaries. Thereafter, a feature encoder and an object boundary predictor are similarly utilized for nuclei boundary prediction. By enforcing the semantic masks and the boundaries of the synthesized images to be consistent with those in the raw image, nucleus over-generation and shape deformation during image translation can be effectively mitigated.
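A minimal sketch of the two branches is given below, assuming the H-space map is obtained with scikit-image's rgb2hed color deconvolution and that both predictors output per-pixel logits; the function names and shapes are illustrative:

```python
import torch.nn.functional as F
from skimage.color import rgb2hed

def haematoxylin_channel(rgb_image):
    """Extract the H channel of the HED color deconvolution, which
    highlights haematoxylin-stained nuclei in H&E-like images.
    rgb_image: (H, W, 3) float array in [0, 1]."""
    hed = rgb2hed(rgb_image)  # (H, W, 3): Haematoxylin, Eosin, DAB
    return hed[..., 0]        # H-space map fed to the boundary branch

def morphological_regularization(mask_logits, boundary_logits,
                                 mask_gt, boundary_gt):
    """Dual-branch consistency: the synthesized image must reproduce
    the semantic mask and object boundaries of the raw source image.
    mask_logits/boundary_logits: (N, 2, H, W); targets: (N, H, W) long."""
    loss_mask = F.cross_entropy(mask_logits, mask_gt)
    loss_boundary = F.cross_entropy(boundary_logits, boundary_gt)
    return loss_mask, loss_boundary
```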

3.1.4 Overall Training Objectives

As demonstrated in Fig. 3, the overall loss function for Stage I cross-domain image alignment is composed of several items:

$$\begin{aligned} \mathcal {L}_{{\textit{DRIT}}}&=\mathcal {L}_{{\textit{adv}}}^{{\textit{content}}}+\mathcal {L}_{{\textit{image-adv}}}^{{\textit{source}}} +\mathcal {L}_{{\textit{image-adv}}}^{{\textit{target}}}\nonumber \\&\quad +\mathcal {L}_{{\textit{cycle}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{cycle}}}^{{\textit{target}}} +\mathcal {L}_{{\textit{style-reg}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{style-reg}}}^{{\textit{target}}},\end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{{\textit{stageI}}}&=\lambda _1 \mathcal {L}_{{\textit{DRIT}}}+\lambda _2 \mathcal {L}_{{\textit{style-cluster}}}^{{\textit{target}}}\nonumber \\&\quad +\lambda _3 \mathcal {L}_{{\textit{mask}}}^{{\textit{fake-target}}}+\lambda _4 \mathcal {L}_{{\textit{boundary}}}^{{\textit{fake-target}}}, \end{aligned}$$
(4)

where the \(\lambda \)s are hyper-parameters controlling the weight of each term. \(\mathcal {L}_{{\textit{DRIT}}}\) includes the typical adversarial and image reconstruction losses designed in DRIT, along with the regularization losses constraining the style vectors to follow the prior Gaussian distribution N(0, 1). \(\mathcal {L}_{{\textit{mask}}}^{{\textit{fake-target}}}\) and \(\mathcal {L}_{{\textit{boundary}}}^{{\textit{fake-target}}}\) denote the cross-entropy losses supervising mask and boundary prediction, respectively.

3.1.5 Style Randomization for Diverse Image Synthesis

At the end of Stage I, we leverage the trained image translation model to generate target-like source images and hence mitigate image-level domain shifts. Motivated by the discovery that style augmentation considerably improves the model’s generalization capability (Jackson et al., 2019), we adopt the style randomization technique for diverse and arbitrary histopathology-like image synthesis. Specifically, instead of synthesizing images conditioned on style features extracted from real histopathology image patches, arbitrary style attribute vectors are sampled from a prior Gaussian distribution N(0, 1) and integrated with content features of source image patches for cross-domain image translation. The aim is to learn domain-invariant visual representations via the augmented images and consequently alleviate the performance drop of the trained model on open testing subdomains.
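A minimal sketch of style randomization, assuming the Stage I content encoder and target-domain generator are available as callables (the names are illustrative; the style dimensionality follows the \(l^{{\textit{style}}} = 32\) setting reported in Sect. 4.2):

```python
import torch

def synthesize_with_random_style(x_src, E_c_src, G_tgt, style_dim=32):
    """Style randomization (a sketch): pair source content with a style
    vector drawn from the prior N(0, 1) instead of one encoded from a
    real target patch, yielding arbitrary histopathology-like styles."""
    z_c = E_c_src(x_src)                    # domain-invariant content
    z_s = torch.randn(x_src.size(0), style_dim,
                      device=x_src.device)  # arbitrary style ~ N(0, 1)
    return G_tgt(z_c, z_s)                  # fake histopathology image
```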

Fig. 5 Overview of Stage II for the proposed two-stage framework. In this stage, the inputs are the target-like source images synthesized with model trained in Stage I. Afterward, cross-domain feature alignment is performed via a Mask RCNN-based instance-level feature disentanglement network for domain adaptive instance segmentation

3.2 Stage II: Cross-Domain Feature Alignment with Local Instance-Level Disentanglement

3.2.1 Backbone

Following the design of previous works (Liu et al., 2020a; Hsu et al., 2021), we build the Stage II model upon the commonly adopted Mask R-CNN (He et al., 2017) architecture, but propose several modifications to achieve instance-level feature disentanglement and consequently suppress the domain-specific factors within the feature representations used for detection and segmentation task learning. All modules are shared between images from the target-like source domain and the real target domain. The detailed framework of the Stage II model is shown in Fig. 5.

3.2.2 Domain-Invariant Feature Alignment via Feature Disentanglement

In the domain adaptive Mask R-CNN framework, the feature representations forwarded to the region proposal network (RPN) and the ROI prediction heads are expected to be agnostic and indistinguishable across domains, so as to prevent the model from overfitting to source domain data. To this end, two feature extractors are deployed in parallel for simultaneous ROI content and style encoding, followed by a feature regenerator that re-fuses the disentangled representations under an ROI feature consistency constraint \(\mathcal {L}_{{\textit{ROI}}}\) to circumvent potential information loss in the feature encoding and disentanglement step. Moreover, two adversarial domain discriminators are deployed for the global features extracted by the backbone network and the disentangled instance-level content features, respectively. Gradient reversal layers (GRLs) are inserted ahead of the discriminators so that they are trained jointly with the main network.
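The gradient reversal layer admits a standard PyTorch implementation, reproduced below as a sketch; the usage line illustrates how it would precede each discriminator:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal: identity in the forward pass, negated (and
    scaled) gradient in the backward pass, so the feature extractor is
    trained adversarially against the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage (illustrative): domain_logits = D_local(grad_reverse(roi_content))
```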

3.2.3 Global–Local Style Consistency

It is noted that, in a feature disentanglement model, the domain-invariant content and domain-specific style attributes are mutually complementary and orthogonal in nature (Wu et al., 2021). In this regard, precise and distinctive modeling of subdomain-specific characteristics plays a vital role in the unbiased encoding of content attributes and is a crucial intermediate step towards ideal disentanglement of instance-level features, especially in the context of OCDA. To regularize the encoded ROI style representation, a straightforward approach is to adopt a mechanism similar to the one employed in Stage I: first assign subdomain labels to each instance based on its style representation, and then encourage intra-subdomain style compactness as well as inter-subdomain style separation. However, given the multi-class nature of cells, nuclei from the same image but of different categories typically possess divergent shapes and spatial distributions. We find in practice that such cross-category heterogeneity inside each subdomain inevitably incurs serious style inconsistency and drastically compromises the accuracy of the clustering-assigned instance-level subdomain labels.

As a result, we design a global–local style consistency mechanism to attain stable and category-agnostic instance-level style representations. In order to restrict the encoded style attribute to subdomain-specific characteristics and exclude pattern variations caused by different nucleus categories, we assign a subdomain label to each local instance based on the global image-level style representation. In detail, the image-level style encoder trained in Stage I is reused to extract global style representations for all images, and K-means clustering is then applied on top to assign a subdomain label to each image. Note that the clustering results obtained here differ from those in Stage I, as the additional images synthesized with style randomization are also integrated. Next, we enforce all instances from images with the same subdomain label to share close style representations, and instances from images with different subdomain labels to possess disparate ones. A style consistency loss is designed for this objective:

$$\begin{aligned} \mathcal {L}_{{\textit{style-cons}}}&=\frac{1}{N^{{\textit{ins}}}}\sum _{i=1}^{N^{{\textit{ins}}}} \left( \left\| {z_s}_i^{{\textit{ins}}}-c_j^{{\textit{ins}}} \right\| \right. \nonumber \\&\quad \left. -\frac{1}{K-1}\sum _{k=1,k \ne j}^K\left\| {z_s}_i^{{\textit{ins}}}-c_k^{{\textit{ins}}} \right\| \right) , \end{aligned}$$
(5)

where \({z_s}_i^{{\textit{ins}}}\) and \(c_j^{{\textit{ins}}}\) respectively denote the style encoding of an ROI instance and its subdomain centroid. \(c_j^{{\textit{ins}}}\) is obtained similarly to Stage I: for each subdomain, a memory bank of instance-level style feature vectors is first maintained, and the centroid is then calculated by averaging all instance style encodings from images with the corresponding subdomain label. K and \(N^{{\textit{ins}}}\) denote the total numbers of target subdomains and ROI instances.
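A PyTorch sketch of the style consistency loss of Eq. (5), assuming each ROI already carries the subdomain label of its parent image from the Stage I clustering (tensor names are illustrative):

```python
import torch

def style_consistency_loss(z_ins, img_labels, centroids):
    """Global-local style consistency (a sketch of Eq. 5).

    z_ins:      (N, D) ROI instance style encodings
    img_labels: (N,) long tensor; subdomain label of the image each ROI
                comes from, given by K-means on image-level style features
    centroids:  (K, D) per-subdomain centroids averaged from the memory bank
    """
    dists = torch.cdist(z_ins, centroids)                    # (N, K)
    d_own = dists.gather(1, img_labels[:, None]).squeeze(1)  # to own centroid
    K = centroids.size(0)
    d_others = (dists.sum(dim=1) - d_own) / (K - 1)          # mean to others
    return (d_own - d_others).mean()
```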

3.2.4 Overall Training Objectives

To summarize, the overall training objective of Stage II is to minimize the following losses:

$$\begin{aligned} \mathcal {L}_{{\textit{stageII}}}&=\lambda _5 \mathcal {L}_{{\textit{MaskRCNN}}}^{{\textit{source}}}+\lambda _6 (\mathcal {L}_{{\textit{adv}}}^{{\textit{global}}}+\mathcal {L}_{{\textit{adv}}}^{{\textit{local}}}) \nonumber \\&\quad +\lambda _7 (\mathcal {L}_{{\textit{ROI}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{ROI}}}^{{\textit{target}}})\nonumber \\&\quad + \lambda _8 (\mathcal {L}_{{\textit{style-cons}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{style-cons}}}^{{\textit{target}}}), \end{aligned}$$
(6)

where \(\mathcal {L}_{{\textit{MaskRCNN}}}^{{\textit{source}}}\) is the standard Mask R-CNN instance segmentation loss for images from the source domain and can be formulated as:

$$\begin{aligned} \mathcal {L}_{{\textit{MaskRCNN}}}^{{\textit{source}}}=\mathcal {L}_{{\textit{proposal}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{mask}}}^{{\textit{source}}} +\mathcal {L}_{{\textit{box}}}^{{\textit{source}}}+\mathcal {L}_{{\textit{class}}}^{{\textit{source}}}. \end{aligned}$$
(7)

\(\mathcal {L}_{{\textit{adv}}}^{{\textit{global}}}\) and \(\mathcal {L}_{{\textit{adv}}}^{{\textit{local}}}\) represent the cross-entropy losses for the domain discriminators at the two levels. \((\mathcal {L}_{{\textit{ROI}}}^{{\textit{source}}} + \mathcal {L}_{{\textit{ROI}}}^{{\textit{target}}})\) and \((\mathcal {L}_{{\textit{style-cons}}}^{{\textit{source}}} + \mathcal {L}_{{\textit{style-cons}}}^{{\textit{target}}})\) respectively denote the L1 reconstruction loss of ROI features and the global–local style consistency loss for both the synthesized target-like source domain and the real target domain.

4 Experiments

4.1 Datasets

To comprehensively verify our method, we consider two representative cross-domain nuclei instance segmentation scenarios, cross-modality and cross-stain.

4.1.1 Cross-Modality Adaptation

For straightforward comparison against previous works focusing on cross-modality adaptation, we choose the fluorescence microscopy BBBC039 dataset (Ljosa et al., 2012) as the source domain, and two histopathology image datasets containing H&E-stained images of multiple cancer types, namely Kumar (Kumar et al., 2017) and CPM17 (Vu et al., 2019), are employed as the target domain.

Following the same data split and preprocessing procedure as the previous work (Liu et al., 2020a), 100 training images and 50 validation images from the BBBC039 dataset are utilized. With common data augmentation techniques including scaling, rotation, and flipping, around 10,000 image patches of size \(256 \times 256\) are extracted from the training set. As for the multi-cancer histopathology image datasets, to fit the OCDA setting and evaluate the model's generalizability on unseen subdomains, images of specific cancer types are excluded from the training set and appear only in the testing set. In the Kumar dataset, which contains in total 30 \(1000 \times 1000\) histopathology images from seven types of cancer, 16 images from liver, kidney, prostate, and breast cancer (four images per cancer type) are mixed as the compound target domain and form the training set, while the remaining 14 images comprising three unseen subdomains (bladder, stomach, colon) are left for testing. The data split strategy is designed to ensure that evident visual discrepancy and data distribution shifts exist between the base and open subdomains, as illustrated in Fig. 6; in this respect, the generalizability of the trained model to the clinical wild can be fairly evaluated. Likewise, in the CPM17 dataset, which consists of 64 \(500 \times 500 \) or \(600 \times 600 \) images from four types of cancer, images from lower grade glioma (LGG) are formulated as the open subdomain for generalization assessment. 24 images from non-small cell lung cancer (NSCLC), head and neck squamous cell carcinoma (HNSCC), and glioblastoma multiforme (GBM) (eight images per cancer type) compose the training set, and 32 images (eight images from each type of cancer) are used for testing, as summarized in Table 2. All the images are randomly cropped into \(256 \times 256 \) patches during preprocessing.

Fig. 6 Example image tiles from the base and open subdomains for two histopathology data collections. It can be observed that data distribution shifts evidently exist across the two sets of subdomains

Table 2 Data split strategy for histopathology image datasets

4.1.2 Cross-Stain Adaptation

In this scenario, we aim to adapt knowledge learned from IHC-stained images to H&E-stained images. DataSeg (Shu et al., 2020) contains 52 images of size \(200 \times 200\), each captured at a core center or at a region center surrounded by a large number of positively stained nuclei. After scaling and stitching, \(256 \times 256\) image patches are generated and form the source domain. As for the target H&E-stained domain, we reuse the Kumar dataset and follow the same data split strategy as described in Sect. 4.1.1.

4.2 Implementation Details

In Stage I framework, all the content encoders, style encoders, image generators, feature discriminators, and image discriminators are implemented with the same architecture as in Lee et al. (2020). The implementation for both mask and boundary prediction branches follows the structure of the semantic segmentation branch in Kirillov et al. (2019) to fuse the multi-scale features and then conduct prediction. In Stage II, we employ the Mask R-CNN with ResNet101 (He et al., 2016) in conjunction with FPN (Lin et al., 2017) as the backbone.

We implement the proposed method with PyTorch and MaskRCNN-Benchmark (Massa & Girshick, 2018). As for hyperparameter configurations, we empirically set the confidence threshold \(\gamma = 0.5\) in Eq. (2), \(N^{{\textit{mem}}} = 10^4, l^{{\textit{style}}} = 32, F^{{\textit{upd}}} = 500, T = 10^5\) in Algorithm 1, \(\lambda _1 = 1, \lambda _2 = 2, \lambda _3 = 5, \lambda _4 = 10\) in Eq. (4), and \(\lambda _5 = 1, \lambda _6 = 1, \lambda _7 = 2, \lambda _8 = 1\) in Eq. (6). The Adam optimizer with a learning rate of \(10^{-4}\) is employed to train the Stage I model, whereas, following the design of Massa and Girshick (2018), the SGD optimizer with an initial learning rate of \(5 \times 10^{-4}\) is used for Stage II training.
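For reference, the reported configuration could be collected as follows; the placeholder models and variable names below are ours for illustration, not from a released implementation:

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in Sect. 4.2.
cfg = dict(
    gamma=0.5,                     # clustering confidence threshold, Eq. (2)
    style_dim=32,                  # l_style
    mem_size=10_000,               # N_mem, memory bank capacity
    centroid_update_freq=500,      # F_upd
    total_iters=100_000,           # T
    lambdas_stage1=(1, 2, 5, 10),  # lambda_1..lambda_4 in Eq. (4)
    lambdas_stage2=(1, 1, 2, 1),   # lambda_5..lambda_8 in Eq. (6)
)

stage1_model = nn.Conv2d(3, 3, 1)  # placeholder for the Stage I network
stage2_model = nn.Conv2d(3, 3, 1)  # placeholder for the Stage II network
opt_stage1 = torch.optim.Adam(stage1_model.parameters(), lr=1e-4)
opt_stage2 = torch.optim.SGD(stage2_model.parameters(), lr=5e-4)
```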

4.3 Evaluation Metrics

For the purpose of fair comparison, we adopt three metrics that are broadly used in previous works to evaluate nuclei instance segmentation performance. Panoptic quality (PQ) is a unified score which integrates detection quality (DQ) and segmentation quality (SQ) (Graham et al., 2019). The consolidated metric simultaneously measures accuracy with respect to both the detection and segmentation tasks and is considered a robust quantification for comprehensive evaluation of instance segmentation results. Additionally, we adopt DICE and AJI (Kumar et al., 2017) for supplemental evaluation at the semantic and instance level, respectively.
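As a sketch of how PQ decomposes into DQ and SQ, assuming ground-truth/prediction pairs have already been matched at an IoU threshold of 0.5 (the dictionary layout is illustrative):

```python
def panoptic_quality(matches):
    """PQ = DQ * SQ from matched instance pairs (a sketch).

    matches: dict with keys 'matched' (list of IoU values for GT/prediction
             pairs with IoU > 0.5), 'n_fp' and 'n_fn' (unmatched counts).
    DQ = TP / (TP + 0.5 FP + 0.5 FN) measures detection quality;
    SQ = mean IoU over matched pairs measures segmentation quality.
    """
    tp = len(matches['matched'])
    if tp == 0:
        return 0.0
    sq = sum(matches['matched']) / tp
    dq = tp / (tp + 0.5 * matches['n_fp'] + 0.5 * matches['n_fn'])
    return dq * sq
```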

Table 3 Performance comparison for cross-modality nuclei instance segmentation on the BBBC039 \(\rightarrow \) Kumar benchmark
Table 4 Performance comparison for cross-modality nuclei instance segmentation on the BBBC039 \(\rightarrow \) CPM17 benchmark

4.4 Evaluation on Cross-Modality Adaptation

We first validate the proposed method on cross-modality domain adaptation, i.e., from fluorescence microscopy to histopathology images. We compare against state-of-the-art methods, including conventional UDA methods, an OCDA method, and a fully supervised method. The details are provided as follows:

  • UDA: DARCNN (Hsu et al., 2021) and PDAM (Liu et al., 2020a): DARCNN and PDAM are state-of-the-art domain adaptation methods for cross-modality nuclei instance segmentation. Advanced techniques such as a self-supervised representation consistency loss and a feature similarity maximization mechanism are exploited in these works.

  • OCDA: DHA (Park et al., 2020), CSFU (Gong et al., 2021), and ML-BPM (Pan et al., 2022): DHA, CSFU, and ML-BPM are the recent efforts proposed for addressing the OCDA challenge in the context of semantic segmentation. We extend those methods to the instance segmentation task by replacing their semantic segmentation branch with a Mask R-CNN module.

  • Supervised: We train a Panoptic FPN (Kirillov et al., 2019) model exploiting imaging data with high-quality annotations from both the source and target domains in a fully-supervised manner to illustrate the performance upper bound of the nuclei analysis task. Comparison methods in the UDA and OCDA settings are not expected to compete with the supervised approach, since they rely on distinctly lower levels of data and annotation.

Fig. 7 Visualization of cross-modality nuclei instance segmentation results on two H&E-stained histopathology datasets. The first column provides four examples of the source fluorescence microscopy images. The images of the top two rows are from Kumar dataset, and the bottom two rows are from CPM17 dataset. Separated nuclei are indicated with different colors. The red rectangles are plotted to highlight the difference of all results (Color figure online)

The quantitative comparison results are presented in Tables 3 and 4, which show that our method outperforms all the DA methods on all metrics. It even demonstrates superior performance compared with the fully supervised method on the Kumar dataset. We also perform a one-tailed paired t-test to evaluate the statistical significance of the difference between our proposed method and the competing UDA and OCDA methods in terms of PQ. All the resulting p-values are under 0.001, except when comparing with PDAM (Liu et al., 2020a) on the BBBC039 to Kumar benchmark, where the p-value is still under 0.01. These results indicate the statistical significance of the achieved improvements.

In particular, compared with DARCNN, which only conducts feature-level alignment and target domain-specific model fine-tuning, our method performs holistic cross-domain alignment and consequently lifts the performance dramatically with the proposed two-stage disentanglement framework. PDAM adopts a similar two-stage model design, yet it solely aims to obtain domain-agnostic representations for downstream tasks and neglects the multi-modal data distribution of histopathology images. In contrast, our method explicitly models the intra-domain heterogeneity of histopathology images by exploiting the rich subdomain-specific characteristics in both image translation and feature alignment. We further design a nucleus shape and structure preserving module to enhance the correspondence between the synthesized nuclei objects and the annotations. As a result, we exceed PDAM on both the Kumar and CPM17 datasets, especially in terms of PQ, where we observe an improvement of \(3{\%}\sim 6{\%}\).

DHA, CSFU, and ML-BPM are frameworks specifically designed for OCDA which tackle the intra-domain heterogeneity by first performing latent domain discovery and subsequently simplifying the OCDA setting into multi-target DA. Despite their success on global-level semantic segmentation, we observe that such approaches generally fail to capture the tissue/cancer-wise pattern variations in the histopathology domain and tend to generate erroneous subdomain labels, which inevitably incurs error accumulation along the following steps. In addition, those frameworks only perform semantic-level adaptation, whereas our proposed method also benefits from instance-level adaptation via debiased representation decomposition. As shown in the quantitative comparison, our method attains considerably better results over all adaptation benchmarks. In comparison with the supervised upper bound, our method even achieves superior results on the BBBC039 \(\rightarrow \) Kumar benchmark, without any requirement for annotated target domain data. It also attains appealing accuracy comparable to the supervised upper bound on the other adaptation benchmarks, which substantiates its promise in data-efficient scenarios.

Aside from the quantitative comparison, we additionally present qualitative comparison results on four image patches of different cancer types in Fig. 7. As can be seen from the red rectangles, all the competing methods suffer seriously from poor modeling of instance-level characteristics. Specifically, they either generate overdense instance predictions, in which one integral nucleus is split into several isolated ones, or cannot precisely separate touching nuclei clusters. This observation explains our finding in the quantitative comparison that, compared with AJI and Pixel-F1, our method surpasses the others in terms of PQ by a larger margin. AJI and Pixel-F1 only measure the overall score of instance segmentation and mainly focus on pixel-level prediction accuracy. In contrast, PQ is calculated by multiplying two instance-level metrics, detection quality (DQ) and segmentation quality (SQ). In other words, it is mostly determined by the prediction accuracy of each individual object rather than global pixel-wise accuracy. Owing to its capability to precisely characterize instance-level and object-specific attributes, our method improves over the previous works remarkably in terms of PQ. Meanwhile, for DHA in particular, the prediction results indicate inferior performance, as it fails to detect several nuclei and accurately delineate their boundaries. This can be attributed to its biased target domain partition and subdomain-wise image translation. In practice, we find that in DHA, as the assigned subdomain labels are drastically noisy, the divided subdomains remain mixed, with multi-modal data distributions. As a consequence, the following subdomain-wise image translation step is prone to synthesizing unrealistic and spurious nucleus texture patterns, which leads to failure in cross-domain nuclei instance segmentation.

Fig. 8 Visualization of cross-stain nuclei instance segmentation results on Kumar dataset. The first column provides two examples of the source IHC-stained images

Table 5 Performance comparison in specific to seen subdomains and unseen subdomains on Kumar dataset

To evaluate the generalization capability to unseen subdomains in the OCDA setting, we further present the quantitative results on testing images from seen and unseen subdomains individually. As shown in Table 5, our method exhibits outstanding robustness to unseen subdomains, which explains why it can outperform the fully supervised upper bound on this dataset: the fully supervised method is vulnerable to the distribution shifts between the training and testing sets, which are relatively serious in Kumar. Since our method is built upon the disentanglement architecture, it is able to acquire domain-invariant representations at both the image level and the instance level. Along with the style randomization technique deployed for diverse data augmentation, our proposed method demonstrates effectiveness in dealing with images from unseen subdomains.

4.5 Evaluation on Cross-Stain Adaptation

We further perform the comparison study on another cross-domain scenario, i.e. cross-stain adaptation, to verify the efficacy and robustness of our method. As mentioned in Sect. 4.1.2, IHC-stained histopathology image dataset DataSeg is set as the source domain, while H &E-stained dataset Kumar is reused as the compound target domain.

The quantitative and qualitative comparison results are shown in Table 6 and Fig. 8, respectively. The overall observation is consistent with our previous findings: our method reaches peak performance and attains even more significant improvements over the previous works. We postulate that this is because the employed DataSeg dataset is rather small, and with the one-to-one image translation model (CycleGAN) adopted in previous works, only limited fake target images can be synthesized, which consequently results in severe over-fitting. In our method, owing to the disentanglement framework and the style randomization technique, a huge amount of images of various styles can be flexibly generated to bypass this obstacle. Moreover, we notice that since IHC-stained histopathology images possess far more complicated texture patterns than fluorescence microscopy ones, a conventional image translation model without nucleus-specific supervision typically loses essential nucleus shape and structural details during translation, which inevitably incurs mismatch between synthesized images and nuclei segmentation labels (showcased in Sect. 5.3). In comparison, with the regularization of the proposed nucleus shape and structure preserving module, images generated by our method showcase promising semantic consistency and provide unbiased supervision for the following steps.

Table 6 Performance comparison for cross-stain nuclei instance segmentation

5 Discussion

To explore and validate the effectiveness of the key modules deployed in our method, we conduct an ablation study on the BBBC039 to Kumar benchmark. The quantitative performance comparison is presented in Table 7, where PCS denotes the progressive clustering and separation module, NSSP denotes the nucleus shape and structure preserving module, ID denotes instance-level disentanglement, and GLSC denotes the global–local style consistency module. Additionally, to support our claim in Sect. 3.2.3, we provide for reference the results when the clustering-based strategy of Stage I is similarly applied in Stage II, denoted as Clust II.

Table 7 Quantitative analysis of key components in our method on BBBC039 to Kumar benchmark

5.1 Effectiveness of the Progressive Clustering and Separation Module

We first evaluate the effectiveness of the progressive clustering and separation module in Stage I. To this end, we set \(\lambda _2\) in Eq. (4) to 0 so that the separation of subdomain-specific characteristics is no longer amplified. In addition, we plot the clustering of the style encodings extracted from images of four divergent cancer types. As shown in Fig. 9, style representations encoded without the proposed module are highly mixed and indistinguishable among different cancer types, whereas with the regularization of the style separation module, the style representations exhibit apparent cluster organization. The observation is corroborated by the quantitative comparison (1st row, Table 7): with strengthened style separation, the instance segmentation results improve under all three metrics.

Fig. 9 Visualization of clustering results. a t-SNE visualization of encoded style representations w/o PCS. b t-SNE visualization of encoded style representations with PCS. c Example images from each type of cancer

5.2 Effect of Different Latent Target Subdomain Numbers

The number of latent target subdomains, K, is an important hyperparameter in the aforementioned progressive clustering and separation module. We further study the effect of different choices of K. Specifically, considering that the training target domain images come from four types of cancer and the training image patches are cropped from 16 image tiles, we vary K from 4 to 16 with an interval of 3 and present the corresponding results in Table 8. Our method is robust to the hyperparameter K, in that it outperforms the w/o PCS baseline under all settings. When \(K=10\), our method achieves the highest performance in terms of PQ and AJI. Example images of each clustered subdomain are shown in Fig. 10. It is noteworthy that although these images come from only four types of cancer, each image patch cluster exhibits its own distinctive style pattern and should be considered an individual subdomain. This finding supports our statement in Sect. 1 that multi-target DA is unsuitable for our task, where subdomain labels cannot be directly assigned according to an image's cancer type. In the extreme case where K is set to the number of cancer types, i.e. \(K=4\), the performance is relatively modest, which indicates that the divergent style characteristics are not fully exploited. On the other hand, when K is set to the number of image tiles, i.e. \(K=16\), we find that some of the clustered subdomains are quite similar to each other. The repetition of discovered subdomains and the degraded performance suggest that similar style patterns are shared among different image tiles and that enforcing image tile-wise style separation is inappropriate. When implementing our method, we set \(K=10\) for all experiments.

Table 8 Quantitative parameter analysis of K on the BBBC039 to Kumar benchmark

Additionally, we have evaluated the performance of a different type of clustering method, the Mean-Shift algorithm (Cheng, 1995), by substituting it for the K-Means algorithm adopted in Sect. 3.1.2. On the BBBC039 \(\rightarrow \) Kumar adaptation benchmark, the Mean-Shift algorithm attains inferior results in terms of all evaluation metrics (\(\textrm{PQ}=0.5194\), \(\textrm{AJI}=0.5523\), \(\textrm{DICE}=0.7917\)) compared with K-Means. We observe that the performance degradation is a consequence of biased cluster split estimation. The Mean-Shift algorithm leverages region searching to automatically determine the number of clusters. However, in the histopathology domain, the data distribution intrinsically lies on a high-dimensional manifold with intricate levels of representation hierarchies. Density-based clustering algorithms like Mean-Shift inevitably suffer from erratic subspace organization and structuring, deteriorating the representativeness of the identified subpartitions.

Fig. 10 Example images of each clustered subdomain when K is set to 10

5.3 Effectiveness of the Nucleus Shape and Structure Preserving Module

In cross-domain nuclei instance segmentation, due to the lack of nucleus object-level supervision, conventional image translation models such as CycleGAN and DRIT suffer from two weaknesses, namely nucleus over-generation and deformation. Firstly, as depicted in Fig. 11c, the histopathology image synthesized with CycleGAN contains spurious nuclei objects which do not exist in the raw fluorescence microscopy image. In previous work (Liu et al., 2020a), an auxiliary object inpainting mechanism is introduced to tackle this issue; however, in this way, substantial background textures with rich semantic information are wiped out, as shown in Fig. 12b. In addition, as noted in Fig. 11d, the synthesized image fails to preserve the nucleus shape and structure details, resulting in boundary deformation.

Fig. 11 Visual comparison of image translation results from fluorescence microscopy to histopathology images. a Source fluorescence microscopy image patch, b corresponding nuclei annotations, images synthesized using c CycleGAN, d our method w/o shape regularization, and e our method with shape regularization, f segmentation masks plotted on top of (e). In (b), different colors are used to distinguish touching nuclei

Fig. 12 Illustration of the detrimental effects of nuclei inpainting and our improvements. a Corresponding nuclei annotations, images synthesized using b CycleGAN with nuclei inpainting and c our method

To overcome these obstacles, we design the nucleus shape and structure preserving module, which enforces consistency on both semantic masks and object boundaries during image translation. Figure 12c shows that our method synthesizes precisely matched histopathology images. As no post-hoc calibration is required, the background texture patterns are left intact. Furthermore, as seen in Fig. 11d–f, the proposed module allows our method to preserve nucleus shape details almost perfectly. The performance drop when the module is removed (2nd row, Table 7) also confirms its importance.
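As a rough illustration of the consistency idea (a simplified stand-in, not our exact formulation), one can penalize disagreement between the source annotation and the mask predicted on the translated image, on both the semantic mask and an approximate boundary map; `pred_mask` below would come from a hypothetical auxiliary segmenter applied to the translated image, and the max-pooling boundary extraction is an illustrative approximation.

```python
# A minimal sketch of a mask/boundary consistency penalty. `pred_mask` and
# `src_mask` are soft masks in [0, 1] of shape (B, 1, H, W).
import torch
import torch.nn.functional as F

def boundary_map(mask: torch.Tensor) -> torch.Tensor:
    """Highlight pixels near object edges: a dilated mask minus the mask
    itself is nonzero only in a thin band around each boundary."""
    dilated = F.max_pool2d(mask, kernel_size=3, stride=1, padding=1)
    return dilated - mask

def shape_preserving_loss(pred_mask: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
    """Dice term on semantic masks plus an L1 term on boundary maps, so the
    translated image keeps the nuclei layout and contours of the source."""
    inter = (pred_mask * src_mask).sum()
    dice = 1.0 - 2.0 * inter / (pred_mask.sum() + src_mask.sum() + 1e-6)
    bnd = F.l1_loss(boundary_map(pred_mask), boundary_map(src_mask))
    return dice + bnd

# Example with dummy tensors:
loss = shape_preserving_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```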

To jointly measure the contributions of the modules in Stage I, we conduct a comparative analysis by replacing our proposed image translation framework with existing disentanglement-based approaches, i.e., DRIT (Lee et al., 2020) and DRANet (Lee et al., 2021). The experimental results are presented in the 6th and 7th rows of Table 7. Our framework outperforms both competing approaches by a large margin in terms of all evaluation metrics, which in turn verifies the efficacy of progressively modeling the inner structure of the heterogeneous histopathology domain and enforcing semantic preservation of nucleus structures.

5.4 Effectiveness of Instance-Level Disentanglement and Style Consistency

In Stage I, disentanglement is conducted at the global level to characterize image-level visual pattern variations, enabling controllable style transfer and the synthesis of diverse target-like patches. In Stage II, disentanglement is performed at the local instance level with the purpose of acquiring domain-invariant feature representations. As shown in the 3rd–4th rows of Table 7, compared with the global-level disentanglement-only lower bound, our method with instance-level disentanglement and the ROI reconstruction consistency constraint raises the average PQ to 0.5287. By additionally introducing the global–local style consistency module to facilitate unbiased modeling of subdomain-specific characteristics, our method achieves an average PQ of 0.5527, improving instance segmentation performance by 4.4% over the w/o instance-level disentanglement variant. This remarkable increase confirms that explicitly formulating precise and distinctive domain-specific instance-level attributes benefits the separation of domain-invariant representations from the highly entangled feature maps and, as a result, fosters cross-domain adaptation. In comparison, as indicated in the 5th row of Table 7, when the clustering-based pseudo subdomain label strategy employed in Stage I is similarly introduced in Stage II, a significant performance drop is observed. This backs up our claim that, owing to category-wise nuclei heterogeneity, simply encouraging instance appearance attributes to form subdomain-wise clusters results in biased feature disentanglement and is detrimental to domain adaptation.
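To sketch the global–local style consistency idea (a simplified stand-in for our actual module), per-instance style codes pooled from ROI features can be pulled toward the style code of their parent image. The two style encoders below are hypothetical placeholders, and all dimensions are illustrative.

```python
# A minimal sketch of global-local style consistency. `boxes` is an (R, 5)
# tensor whose first column is the batch index and whose remaining columns
# are xyxy ROI coordinates at feature-map resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

STYLE_DIM = 8
global_style_enc = nn.Sequential(nn.Conv2d(3, STYLE_DIM, 3, padding=1),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
inst_style_enc = nn.Sequential(nn.Conv2d(64, STYLE_DIM, 3, padding=1),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())

def style_consistency_loss(images, feat_map, boxes):
    """Pull each instance's style code toward the global style code of the
    image it was cropped from."""
    g_style = global_style_enc(images)                     # (B, STYLE_DIM)
    rois = roi_align(feat_map, boxes, output_size=(7, 7))  # (R, 64, 7, 7)
    i_style = inst_style_enc(rois)                         # (R, STYLE_DIM)
    target = g_style[boxes[:, 0].long()]                   # match by batch index
    return F.l1_loss(i_style, target)

# Example with dummy data: 2 images, 3 ROIs.
images, feat_map = torch.rand(2, 3, 64, 64), torch.rand(2, 64, 32, 32)
boxes = torch.tensor([[0, 2, 2, 10, 10], [0, 5, 5, 20, 20], [1, 1, 1, 8, 8]],
                     dtype=torch.float32)
loss = style_consistency_loss(images, feat_map, boxes)
```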

Fig. 13 Sensitivity analysis of the loss weighting terms \(\lambda _{1\sim 8}\) on the BBBC039 to Kumar benchmark

5.5 Sensitivity Analysis on Loss Weighting Terms

To further investigate how the choice of loss weighting terms impacts the overall performance of the proposed method, we perform a sensitivity analysis on those terms. Specifically, for each weighting term defined in Eqs. (4) and (6), we first set it according to the parameter configuration presented in Sect. 4.2 and then scale it by factors of 0.1, 0.3, and 3, respectively. The corresponding quantitative UDA nuclei instance segmentation performance on the BBBC039 to Kumar benchmark is presented in Fig. 13. Weighting terms corresponding to fundamental objectives (e.g., the DRIT adversarial and image reconstruction losses in Stage I, \(\lambda _1\), and the Mask R-CNN instance segmentation losses in Stage II, \(\lambda _5\)) have a more pronounced impact on the final results. The figure also indicates that the overall performance is not sensitive to the specific choices of these weighting terms as long as they are set within a reasonable interval (i.e., the scale factor lies in \(0.3\sim 3\)).
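The sweep itself is straightforward; below is a minimal sketch, assuming a hypothetical `train_and_evaluate` routine that trains both stages under a given weight configuration and returns the average PQ. The names and base values are illustrative, not the configuration from Sect. 4.2.

```python
# A minimal sketch of the loss-weight sensitivity sweep.
base_weights = {"lambda_1": 1.0, "lambda_5": 1.0}  # one entry per term in Eqs. (4) and (6)

def train_and_evaluate(weights: dict) -> float:
    """Hypothetical stand-in: in practice this trains both stages with
    `weights` and returns the average PQ on the target test set."""
    return 0.0  # placeholder

results = {}
for name, base in base_weights.items():
    for factor in (0.1, 0.3, 1.0, 3.0):
        cfg = dict(base_weights)
        cfg[name] = base * factor
        results[(name, factor)] = train_and_evaluate(cfg)
```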

Table 9 Evaluations for cross-stain nuclei classification and instance segmentation on the CoNSeP \(\rightarrow \) PanNuke adaptation benchmark
Table 10 Evaluations for cross-stain nuclei classification and instance segmentation on the GlaS \(\rightarrow \) Dpath adaptation benchmark

5.6 Class-Aware Cross-Stain Evaluations

To verify the effectiveness and robustness of our method under diverse domain-adaptive scenarios and problem formulations, we perform extended evaluations on two cross-stain settings, with the aim of not only delineating the boundary of each nucleus but also identifying its functional type. Specifically, we first consider the adaptation from CoNSeP (Graham et al., 2019) to PanNuke (Gamper et al., 2019). These two datasets were constructed with histopathology imaging data collected from different countries and institutes, with domain shifts inherently present due to inconsistent staining procedures. The PanNuke dataset is composed of histopathology tiles sampled from a broad range of organs and cancers, conforming to our heterogeneity proposition for the target domain. Next, we evaluate the adaptation setting from GlaS to Dpath (Graham et al., 2021), two datasets collected from cohorts in different countries with evident data distribution shifts. The quantitative comparison results are presented in Tables 9 and 10, where we adopt the F1 metric to measure classification accuracy and the class-wise PQ score to indicate segmentation performance for each type of nucleus. Epi., Inf., and Con. correspond to the nuclei of epithelial, inflammatory, and connective cells, respectively; Avg. denotes the class-averaged results. Our proposed method consistently outperforms the competing methods in terms of all evaluation metrics on both tasks. These comparison results justify the effectiveness and general applicability of our method in addressing miscellaneous data distribution shifts in the histopathology domain under different levels of output demands.

5.7 Visualization Results of Style Randomization

In Fig. 14, we present several examples of images generated with randomly sampled style attribute vectors. Compared with the images shown in Fig. 10, which are sampled from the Kumar training set (seen subdomains), these randomly synthesized images exhibit markedly greater visual similarity to the unseen testing ones. This finding explains why our method remarkably surpasses the competing ones in generalization to unseen subdomains, as showcased in Table 5.

Fig. 14 Visualization results of style randomization. The first row shows images generated with randomly sampled style attribute vectors; the second row shows real image patches from unseen testing subdomains (i.e., bladder, stomach, and colon). All images are from the BBBC039 to Kumar benchmark
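A minimal sketch of the sampling procedure follows, with tiny untrained stand-ins for the Stage I content encoder and decoder; `STYLE_DIM` and all shapes are illustrative assumptions rather than our actual architecture.

```python
# A minimal sketch of style randomization: decode one content map under
# style codes drawn from N(0, I); each draw yields a novel style.
import torch
import torch.nn as nn

STYLE_DIM = 8
content_enc = nn.Conv2d(3, 64, 3, padding=1)          # stand-in content encoder
decoder = nn.Conv2d(64 + STYLE_DIM, 3, 3, padding=1)  # stand-in decoder

def generate_with_random_style(source_image: torch.Tensor) -> torch.Tensor:
    """Concatenate a random style code onto the content features and decode
    a target-like image."""
    content = content_enc(source_image)               # (1, 64, H, W)
    z = torch.randn(1, STYLE_DIM, 1, 1).expand(-1, -1, *content.shape[2:])
    return decoder(torch.cat([content, z], dim=1))

samples = [generate_with_random_style(torch.rand(1, 3, 64, 64)) for _ in range(4)]
```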

In addition, we note that generating images similar to unseen subdomains is not a prerequisite for strengthening the model's generalization capability. Yamashita et al. (2021) argued that style transfer with arbitrary style sources, including ones divergent from the task domain, can enhance a model's robustness against domain shifts. They utilized artistic paintings as style sources and performed style transfer to augment histopathology images. Although the synthesized images are clearly dissimilar to real histopathology images, they still attain substantial improvements when the trained model is tested on unseen subsets of histopathology images. This indicates that the key to style augmentation is learning domain-invariant visual representations, rather than generating images similar to unseen subdomains.

5.8 Impacts of Color Normalization

Color normalization is a common approach to alleviating the heterogeneity of histopathology images incurred by inconsistent acquisition, processing, and staining procedures (Švihlík et al., 2015; Farahani et al., 2022). In this section, we apply two color normalization techniques to the Kumar dataset and then evaluate how the proposed method performs on the normalized images.

Specifically, we introduce two widely used color normalization methods, namely RGB histogram specification (Gonzalez & Woods, 2002) and color transfer (Reinhard et al., 2001). We randomly select one image as the reference and normalize all images in the Kumar dataset accordingly. Experiments are then conducted on the BBBC039 to Kumar benchmark. However, we observe that both color normalization methods have negative effects on the results.
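For reference, below is a minimal sketch of the Reinhard color transfer step, matching per-channel mean and standard deviation in a decorrelated color space; OpenCV's LAB conversion is used here as a common approximation of the lαβ space in the original paper.

```python
# A minimal sketch of Reinhard color transfer: match the per-channel LAB
# statistics of a source image to those of a reference image.
import cv2
import numpy as np

def reinhard_color_transfer(source_bgr: np.ndarray, reference_bgr: np.ndarray) -> np.ndarray:
    """Inputs are uint8 BGR images; returns a uint8 BGR image."""
    src = cv2.cvtColor(source_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    ref_mean, ref_std = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))

    # Shift and rescale each channel so its statistics match the reference.
    out = (src - src_mean) / (src_std + 1e-6) * ref_std + ref_mean
    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)

# Usage: normalized = reinhard_color_transfer(cv2.imread("patch.png"),
#                                             cv2.imread("reference.png"))
```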

Fig. 15 Examples of images generated in Stage I with and without color normalization (RGB histogram specification)

First, we find that in Stage I the diversity of generated images is limited, as shown in Fig. 15. Here, image diversity refers not only to the global color distribution but also to other informative cancer-specific characteristics, such as the texture of nuclei and biological tissues. For instance, the nuclei in Fig. 15e, f possess distinct texture patterns compared with (a)–(d), whereas all nuclei generated with color normalization are very similar to each other. This reveals that the cancer-specific attributes are closely entangled with color patterns. Color normalization indeed helps reduce the color discrepancy caused by stain variation, but it also inevitably erases the cancer-specific attributes and is detrimental to the diversity of translated images, consequently leading to poor cross-domain performance. Moreover, no matter how the value of K is chosen, the performance remains inferior: with the cancer-specific attributes removed, the extracted style features can no longer represent the unique characteristics of different cancer types, rendering the subdomain clustering in Sect. 3.1.2 meaningless.

Second, we find that after Stage II, color normalization in fact deteriorates the overall accuracy of cross-domain nuclei instance segmentation. The quantitative results are presented in Table 11.

Table 11 Quantitative analysis on the impacts of color normalization on the BBBC039 to Kumar benchmark

We also notice that the performance drop is especially severe for the image tiles shown in Fig. 16.

Fig. 16 Examples of failure cases with color normalization

The images in the first column (a) are from prostate (a seen subdomain), and the images in the other three columns (b)–(d) are from bladder, stomach, and colon (unseen subdomains). This observation substantiates our analysis above. Without color normalization, our method can synthesize images with various texture patterns, including the unique nuclei texture of the prostate images in Fig. 16a (the corresponding synthesized image is Fig. 15f). The diversity of generated images also contributes to its success on unseen subdomains, as discussed in Sect. 5.7. On the contrary, with color normalization, the synthesized nuclei exhibit uniform texture patterns, which compromises the generalization capability of the Stage II model and results in inferior overall performance.

6 Conclusion

Data distribution heterogeneity across various cancer types and sampled tissues is the major obstacle undermining the potential of applying UDA methods to digital pathology. In this paper, we present the first work to explicitly consider the composite nature of the data distribution in the histopathology domain and thereby develop a holistic framework to rectify the biased alignment procedure during adaptation. To induce a well-regulated decomposition between informative pathological attributes and confounding modality/stain-specific factors, we propose the progressive subdomain partition and cross-scale co-regularization strategies. These key components collaboratively shape an embedding space in which domain-invariant structural content can be decoupled from task-irrelevant distributional variance. To evaluate our method empirically, we perform extensive experiments over a diverse set of cross-modality and cross-stain adaptation benchmarks, verifying its effectiveness and broad applicability. The quantitative and qualitative comparison results demonstrate the superiority of our method over state-of-the-art UDA and OCDA approaches in various evaluation metrics across different tasks.