Introduction

Accurate and robust multi-organ segmentation is in high demand for computer-aided diagnosis, and successive breakthroughs have been achieved with the application of deep learning [1, 2]. To apply deep learning to multi-organ segmentation, one can collect a large-scale dataset with dense annotations from multiple institutes to train a generalizable model [3]. However, realizing such an application is usually restricted in clinical practice. On the one hand, medical datasets cannot be easily shared among medical institutes or clients due to privacy regulations. On the other hand, multi-organ annotations can be incomplete and inconsistent across institutes. For instance, institutes may annotate single or partial organs that do not overlap with others due to different research interests. Another observation in practice is that institutes may leave many images unlabeled since dense annotation is costly. Thus, datasets for multi-organ segmentation are usually distributed, since they cannot be shared and centralized, and imperfect, since they lack the full multi-organ annotations required for fully supervised model training.

Driven by these observations, this work considers the problem of using distributed, partially labeled, and unlabeled samples to train a federated model for multi-organ segmentation. To the best of our knowledge, this problem remains unexplored. Three subproblems must be addressed for this challenge: (1) learning from partially labeled samples, (2) learning from unlabeled samples, and (3) learning from samples distributed across multiple institutes. To this end, this paper proposes FPS-Seg, a practical framework incorporating Federated learning (FL), Partially supervised learning (PSL), and Semi-supervised learning (SSL) modules for multi-organ Segmentation. Briefly, FPS-Seg maintains one central server and several clients. Clients locally train in-house models on partially labeled and unlabeled samples with the PSL and SSL modules. The FL module bridges communication between the clients and the central server to build a global model. Through the collaboration of the three modules, valuable information can be mined from imperfect local datasets and aggregated to develop a generalizable model. The contributions of this work are summarized as follows.

  • A new problem setting, i.e., learning a model from decentralized, partially labeled, and unlabeled samples, is introduced for multi-organ segmentation; it is more challenging and closer to clinical practice than prior settings.

  • A practical framework is designed to address this problem by unifying federated, partially supervised, and semi-supervised learning.

  • The proposed method is extensively validated on several CT datasets. The results show that it offers a promising solution to this challenging problem and generalizes well to downstream segmentation tasks.

Related work

Multi-organ segmentation remains a challenging task whose objective is to concurrently delineate multiple organs or anatomical structures from medical images, e.g., abdominal CT scans. Comprehensive insights into the domain can be gathered from dedicated reviews [1, 2].

The essence of semi-supervised learning (SSL) [4,5,6] is leveraging a small amount of labeled data alongside a much larger set of unlabeled data to train a model. Consistency learning [4, 5], expecting prediction invariance under perturbations, and pseudo-labeling [6], utilizing pseudo-labels for self-training, are two main strategies in SSL. Given the labor-intensive and costly nature of manual annotations in medical image analysis, SSL offers a viable alternative by tapping into the more accessible pool of unlabeled data. Several SSL methods [7, 8] have already been proposed for multi-organ segmentation.

Another key observation in practice is the substantial presence of abdominal CT datasets with only one or a few organs labeled. To use these datasets with inconsistent and partial annotations, a practical paradigm called partially supervised learning (PSL) has been introduced [9, 10]. Although PSL is synonymous with SSL in some machine learning contexts [11], this paper distinguishes between the two paradigms following prior works [9, 10]: SSL uses a mix of fully labeled and unlabeled data, whereas PSL manages datasets in which each sample carries some labels but not a full set.

Federated learning (FL) represents an advanced approach for decentralized data training, which is especially beneficial for sensitive fields like medical imaging, where datasets cannot be easily shared due to data privacy and regulations [12,13,14]. Several methods leveraging FL for multi-organ segmentation have been proposed. Notably, studies like [15, 16] have endeavored to train models on decentralized datasets with only partial annotations, combining PSL and FL for multi-organ segmentation.

Although great progress has been achieved by existing methods for multi-organ segmentation and other tasks in medical image analysis, these methods primarily address a single paradigm, i.e., SSL [7, 8], PSL [9, 10], or FL [13, 14], or pairs of paradigms, e.g., federated semi-supervised learning [17, 18] and federated partially supervised learning [15, 16]. Unlike previous works, this work introduces a more challenging and practical setting for multi-organ segmentation, which aims to learn a federated model from distributed, partially labeled, and unlabeled datasets by unifying SSL, PSL, and FL.

Method

Problem definition

Ideally, training a generalizable model for segmenting m organs requires numerous images \(\textbf{X}\) and the corresponding full annotations \(\textbf{Y}\) spanning \(\left( m+1\right) \) classes, where \(\mathcal {M} = \left\{ 0,1,\ldots ,m\right\} \) denotes the class set with \(\left\{ 0\right\} \) for background and \(\left\{ 1\right\} \) to \(\left\{ m\right\} \) for organs.

However, in clinical settings, datasets are often decentralized, with partial or no annotations. Given K medical institutes \(\left\{ Z_{i}\right\} _{i=1}^{K}\), each holds a dataset \(\mathcal {D}_{i} = \left\{ \mathcal {D}_{i}^{u}, \mathcal {D}_{i}^{l}\right\} \), where \(\mathcal {D}_{i}^{u} = \left\{ \textbf{X}_{i}^{u}\right\} \) contains images \(\textbf{X}_{i}^{u}\) devoid of annotations and \(\mathcal {D}_{i}^{l} = \left\{ \textbf{X}_{i}^{l}, \textbf{Y}_{i}^{l}\right\} \) consists of images \(\textbf{X}_{i}^{l}\) and partial annotations \(\textbf{Y}_{i}^{l}\). This study considers an extreme case where each client only owns annotations for a single organ. Suppose that the label sets of \(\left\{ \textbf{Y}_{i}^{l}\right\} _{i=1}^{K}\) are defined as \(\left\{ \mathcal {E}_{i}\right\} _{i=1}^{K}\), then \(\mathcal {E}_{1} \cap \mathcal {E}_{2} \cap \mathcal {E}_{3} \cap \cdots \cap \mathcal {E}_{K} = \left\{ 0\right\} \), and \(\mathcal {E}_{1} \cup \mathcal {E}_{2} \cup \mathcal {E}_{3} \cup \cdots \cup \mathcal {E}_{K} = \mathcal {M}\). These institutes are expected to utilize distributed, partially annotated, and unlabeled data to train a global model for multi-organ segmentation collaboratively.
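The label-set constraints above can be verified with a minimal sketch; the concrete values below are hypothetical and simply mirror the three-client, three-organ setting used later in this paper:

```python
# A toy check of the label-set constraints: each client annotates one organ,
# the label sets overlap only on the background {0}, and together they cover M.
class_set = {0, 1, 2, 3}                      # M: background, liver, spleen, stomach
client_label_sets = [{0, 1}, {0, 2}, {0, 3}]  # E_1, E_2, E_3

assert set.intersection(*client_label_sets) == {0}
assert set.union(*client_label_sets) == class_set
```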

Fig. 1
figure 1

Overview of proposed framework FPS-Seg

Overview

The proposed framework FPS-Seg is shown in Fig. 1. FPS-Seg simulates a setting in which a central server coordinates three medical institutes (\(K=3\)) to collaboratively train a global model for multi-organ (liver, spleen, and stomach) segmentation. Institutes maintain teacher models \(\left\{ T_{i}\right\} _{i=1}^{K}\) and student models \(\left\{ S_{i}\right\} _{i=1}^{K}\). The teacher models use exponential moving average (EMA) weights of the student models. During the local training phase, the student models learn from partially labeled samples; in addition, consistency is enforced between the outputs of the teacher and student models to take advantage of unlabeled samples. The central server aggregates local student model weights to update the global model G. The FL, PSL, and SSL modules are introduced below.

Federated learning module

The FL module builds the bridge between the local clients and the global server. Namely, it provides global model weight aggregation and local model weight updating functions. Its role is to train a global model \(G(\cdot ; \mathbf {\Theta }^{g})\) to convergence over a total of R rounds without any data sharing that would violate data privacy regulations. At the r-th federated round, each institute of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) downloads the current global weights \(\mathbf {\Theta }^{g}_{(r)}\) from the server and assigns them to its local model \(S_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \), which shares the same architecture as the global model. Afterward, clients fine-tune local models for e epochs using their private datasets. The central server then collects the local weights \(\left\{ \mathbf {\Theta }_{i(r)}^{s}\right\} _{i=1}^{K}\) and aggregates them to obtain the updated global model weights \(\mathbf {\Theta }^{g}_{(r+1)}\). This study adopts the federated averaging algorithm [19] to update the global model:

$$\begin{aligned} \mathbf {\Theta }^{g}_{(r+1)} = \sum _{i=1}^{K} \frac{N_{i}}{\sum _{i=1}^{K} N_{i}} \mathbf {\Theta }_{i(r)}^{s}, \end{aligned}$$
(1)

where \(N_{i}\) denotes the number of images for each dataset \(\mathcal {D}_{i}\) of client \(Z_{i}\).
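As a rough illustration, Eq. (1) can be implemented in PyTorch as a weighted average over client state dicts; the function name and the float cast of non-float buffers are assumptions for the sketch, not part of the paper:

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client state dicts per Eq. (1).

    client_states: list of state_dicts from the K local student models.
    client_sizes:  list of dataset sizes N_i used as aggregation weights.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```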

Fig. 2
figure 2

Illustration of the PSL module, using the training procedure for client \(Z_{1}\) as a representative example. For clarity, four voxels alongside hypothetical probability values are specified from the background, liver, stomach, and spleen to aid explanation

Partially supervised learning module

Assume that the background, liver, spleen, and stomach class indexes are 0, 1, 2, and 3, so the class set \(\mathcal {M}\) is \(\left\{ 0,1,2,3\right\} \), and that institutes \(Z_{1}\), \(Z_{2}\), and \(Z_{3}\), respectively, hold private liver, spleen, and stomach datasets comprising a large proportion of unlabeled samples and a small proportion of labeled samples. The label sets \(\mathcal {E}_{1}\), \(\mathcal {E}_{2}\), and \(\mathcal {E}_{3}\) are, respectively, \(\left\{ 0,1 \right\} \), \(\left\{ 0,2 \right\} \), and \(\left\{ 0,3 \right\} \). The PSL module enables each client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) to train its local model \(S_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \) with partially labeled samples \(\mathcal {D}_{i}^{l} = \left\{ \textbf{X}_{i}^{l}, \textbf{Y}_{i}^{l}\right\} \).

Consider a mini-batch of samples \(\left\{ \textbf{x}_{i}^{l}, \textbf{y}_{i}^{l}\right\} \) fetched from \(\mathcal {D}_{i}^{l}\), where \(\textbf{x}_{i}^{l} \in \mathbb {R}^{B \times C \times H \times W \times D}\) denotes 3D CT volumes, with B, C, H, W, and D indicating the batch size, channel, height, width, and depth, respectively (C is 1 for 3D CT volumes), and \(\textbf{y}_{i}^{l} \in \mathbb {R}^{B \times 2 \times H \times W \times D}\) indicates the corresponding partial annotations in one-hot form for a specific organ. Given the input \(\textbf{x}_{i}^{l}\), \(S_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \) outputs probability maps \(\textbf{p}_{i}^{l} \in \mathbb {R}^{B \times 4 \times H \times W \times D}\). The optimization objective for this module employs the marginal and exclusion losses described in [10]; please refer to [10] for further technical details. On the one hand, all unlabeled organs are treated as background and merged into the original background, and a marginal loss \(\mathcal {L}_\text {marg}\) is then calculated. On the other hand, the natural mutual exclusiveness of organs is added as prior knowledge through a penalization in the form of an exclusion loss \(\mathcal {L}_\text {excl}\).

The training procedure of client \(Z_{1}\), depicted in Fig. 2, is taken as an example; the other clients train their models following the same principle. \(Z_{1}\) holds labeled samples \(\mathcal {D}_{1}^{l}\) with annotations of the liver. The output probability maps of \(S_{1} \left( \cdot ; \mathbf {\Theta }_{1}^{s} \right) \) are denoted as \(\textbf{p}_{1}^{l}\). Since the spleen and stomach are not labeled, their corresponding channels in \(\textbf{p}_{1}^{l}\) are merged into the first channel, yielding new probability maps \(\hat{\textbf{p}}_{1}^{l} \in \mathbb {R}^{B \times 2 \times H \times W \times D}\). \(\hat{\textbf{p}}_{1}^{l}\) and \(\textbf{y}_{1}^{l}\) then have the same channels, and a marginal loss \(\mathcal {L}_\text {marg}\) can be calculated between them. Besides, exclusive labels \(\hat{\textbf{y}}_{1}^{l} \in \mathbb {R}^{B \times 4 \times H \times W \times D}\) are created for \(\textbf{p}_{1}^{l}\) based on \(\textbf{y}_{1}^{l}\). Specifically, for voxels belonging to the liver region in \(\textbf{x}_{1}^{l}\), the corresponding label values in \(\hat{\textbf{y}}_{1}^{l}\) are set to \(\left[ 1,0,1,1 \right] \), while the remaining label values are set to \(\left[ 0,1,0,0 \right] \). An exclusion loss \(\mathcal {L}_\text {excl}\) is enforced between \(\textbf{p}_{1}^{l}\) and \(\hat{\textbf{y}}_{1}^{l}\) to reduce their intersection.
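The channel merging and exclusive-label construction for client \(Z_{1}\) might look as follows in PyTorch; the helper names and tensor layout conventions are assumptions for illustration:

```python
import torch

def merge_unlabeled_channels(p, labeled_ch):
    """Merge channels of unlabeled organs into the background channel.

    p:          (B, 4, H, W, D) softmax probabilities over {bg, liver, spleen, stomach}.
    labeled_ch: index of the single labeled organ (1 for client Z_1's liver).
    Returns (B, 2, H, W, D) maps aligned with the partial one-hot labels.
    """
    unlabeled = [c for c in range(1, p.shape[1]) if c != labeled_ch]
    bg = p[:, 0:1] + p[:, unlabeled].sum(dim=1, keepdim=True)
    return torch.cat([bg, p[:, labeled_ch:labeled_ch + 1]], dim=1)

def make_exclusive_labels(y, labeled_ch, num_classes=4):
    """Build exclusive labels: liver voxels get [1,0,1,1]; all others [0,1,0,0].

    y: (B, 2, H, W, D) partial one-hot labels; channel 1 marks the labeled organ.
    """
    fg = y[:, 1:2]                          # 1 inside the labeled organ
    y_hat = torch.zeros(y.shape[0], num_classes, *y.shape[2:], device=y.device)
    for c in range(num_classes):
        if c == labeled_ch:
            y_hat[:, c:c + 1] = 1 - fg      # the organ's own channel is excluded elsewhere
        else:
            y_hat[:, c:c + 1] = fg          # other channels are excluded inside the organ
    return y_hat
```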

Generally, the training objective \(\mathcal {L}_\text {psl}\) for each client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) is:

$$\begin{aligned} \mathcal {L}_\text {psl} \left( \textbf{p}_{i}^{l}, \hat{\textbf{p}}_{i}^{l}, \textbf{y}_{i}^{l}, \hat{\textbf{y}}_{i}^{l} \right) = \alpha \mathcal {L}_\text {marg}\left( \hat{\textbf{p}}_{i}^{l}, \textbf{y}_{i}^{l} \right) + \beta \mathcal {L}_\text {excl}\left( \textbf{p}_{i}^{l}, \hat{\textbf{y}}_{i}^{l} \right) , \end{aligned}$$
(2)

where

$$\begin{aligned}&\mathcal {L}_\text {marg} \left( \hat{\textbf{p}}_{i}^{l}, \textbf{y}_{i}^{l}\right) = \underbrace{-\sum _{j=1}^{2} \sum _{v=1}^{V} \textbf{y}_{i,j,v}^{l}\log {\hat{\textbf{p}}_{i,j,v}^{l}} }_{\mathcal {L}_\text {ce}} \nonumber \\&\qquad + \underbrace{ \sum _{j=1}^{2} \left( 1-\frac{ 2\sum _{v=1}^{V} \hat{\textbf{p}}_{i,j,v}^{l} \textbf{y}_{i,j,v}^{l}}{\sum _{v=1}^{V} \left( \hat{\textbf{p}}_{i,j,v}^{l}\right) ^{2} + \sum _{v=1}^{V} \left( \textbf{y}_{i,j,v}^{l}\right) ^{2}} \right) }_{\mathcal {L}_\text {dice}}, \end{aligned}$$
(3)

and

$$\begin{aligned}&\mathcal {L}_\text {excl} \left( \textbf{p}_{i}^{l}, \hat{\textbf{y}}_{i}^{l} \right) = \underbrace{\sum _{j=1}^{4} \sum _{v=1}^{V}\hat{\textbf{y}}_{i,j,v}^{l}\log {\textbf{p}_{i,j,v}^{l}} }_{\mathcal {L}_\text {ece}} \nonumber \\&\qquad + \underbrace{ \sum _{j=1}^{4} \frac{2\sum _{v=1}^{V}\textbf{p}_{i,j,v}^{l} \hat{\textbf{y}}_{i,j,v}^{l}}{\sum _{v=1}^{V} \left( \textbf{p}_{i,j,v}^{l}\right) ^{2} + \sum _{v=1}^{V} \left( \hat{\textbf{y}}_{i,j,v}^{l}\right) ^{2}}}_{\mathcal {L}_\text {edice}}, \end{aligned}$$
(4)

where j is the channel index, V is the number of voxels in an image, and v is the voxel index. \(\alpha \) and \(\beta \) are hyperparameters. The combination of cross-entropy (CE) loss \(\mathcal {L}_\text {ce}\) and Dice loss \(\mathcal {L}_\text {dice}\) is adopted as the marginal loss \(\mathcal {L}_\text {marg}\), and the combination of exclusion CE loss \(\mathcal {L}_\text {ece}\) and exclusion Dice loss \(\mathcal {L}_\text {edice}\) is employed as the exclusion loss \(\mathcal {L}_\text {excl}\).
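A minimal sketch of Eqs. (3) and (4) is given below, with sums taken over channels, voxels, and the batch; the clamping constant `eps` is an assumption added for numerical stability and is not part of the equations:

```python
import torch

def marginal_loss(p_hat, y, eps=1e-6):
    """CE + Dice over the merged 2-channel maps, per Eq. (3)."""
    ce = -(y * torch.log(p_hat.clamp_min(eps))).sum()
    inter = (p_hat * y).flatten(2).sum(-1)                      # per image and channel
    denom = (p_hat ** 2).flatten(2).sum(-1) + (y ** 2).flatten(2).sum(-1)
    dice = (1 - 2 * inter / denom.clamp_min(eps)).sum()
    return ce + dice

def exclusion_loss(p, y_hat, eps=1e-6):
    """Exclusion CE + exclusion Dice over the full 4-channel maps, per Eq. (4)."""
    ece = (y_hat * torch.log(p.clamp_min(eps))).sum()
    inter = (p * y_hat).flatten(2).sum(-1)
    denom = (p ** 2).flatten(2).sum(-1) + (y_hat ** 2).flatten(2).sum(-1)
    edice = (2 * inter / denom.clamp_min(eps)).sum()
    return ece + edice
```

Note that both exclusion terms are minimized toward reducing the overlap between \(\textbf{p}_{i}^{l}\) and \(\hat{\textbf{y}}_{i}^{l}\), mirroring the signs in Eq. (4).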

Fig. 3
figure 3

Illustration of the training procedure for the SSL module

Semi-supervised learning module

The SSL module enables every client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) to further leverage its unlabeled samples \(\mathcal {D}_{i}^{u} = \left\{ \textbf{X}_{i}^{u}\right\} \). Inspired by [4], another model \(T_{i}\left( \cdot ; \mathbf {\Theta }_{i}^{t} \right) \) is maintained by each client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\). \(T_{i}\left( \cdot ; \mathbf {\Theta }_{i}^{t} \right) \) and \(S_{i}\left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \) are regarded as the teacher and the student models, respectively. The teacher model shares the same architecture as the student model and uses the student model's EMA weights. Consistency is imposed on their predictions for unlabeled data. In addition, input perturbation similar to [5] is introduced, since consistency regularization under harsher perturbations empirically benefits model generalization.
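The EMA update of the teacher might be implemented as below; the decay value of 0.99 is an assumption, as the paper does not specify it here:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Update teacher weights as an exponential moving average of student weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)
```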

An illustration of the training procedure for the SSL module is shown in Fig. 3. Assuming that a mini-batch of unlabeled images \(\textbf{x}_{i}^{u}\) is fetched at each training iteration, these images are first fed into \(T_{i}\left( \cdot ; \mathbf {\Theta }_{i}^{t} \right) \) and \(S_{i}\left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \) to obtain probability maps \(\tilde{\textbf{p}}_{i}^{u} \in \mathbb {R}^{B \times 4 \times H \times W \times D}\) and \(\textbf{p}_{i}^{u} \in \mathbb {R}^{B \times 4 \times H \times W \times D}\). As in Section “Partially supervised learning module,” the channels of unlabeled organs are then merged into the background for \(\tilde{\textbf{p}}_{i}^{u}\) and \(\textbf{p}_{i}^{u}\) to yield merged probability maps \(\tilde{\textbf{q}}_{i}^{u} \in \mathbb {R}^{B \times 2 \times H \times W \times D}\) and \(\textbf{q}_{i}^{u} \in \mathbb {R}^{B \times 2 \times H \times W \times D}\). Consistency learning regards \(\tilde{\textbf{q}}_{i}^{u}\) as pseudo-targets and calculates an unsupervised loss \(\mathcal {L}_\text {unsup}\) between \(\tilde{\textbf{q}}_{i}^{u}\) and \(\textbf{q}_{i}^{u}\).

Table 1 Details of three datasets used in our study
Table 2 Quantitative results of the proposed method
Table 3 Quantitative validation of various methods under localized, centralized, and federated learning settings, using different combinations of labeled (L) and unlabeled (U) samples
Table 4 Ablation study on FPS-Seg’s components and corresponding hyperparameters

However, \(\tilde{\textbf{q}}_{i}^{u}\) may inevitably contain faulty and noisy predictions, and consistency regularization based on them may accumulate training errors and degrade model performance. Confidence thresholding [5, 20], which involves setting a threshold \(\tau \), offers a practical solution to stabilize training and enhance model performance. It extracts confident predictions from \(\tilde{\textbf{q}}_{i}^{u}\) so that consistency regularization relies solely on them. With confidence thresholding, the training objective on unlabeled samples \(\mathcal {D}_{i}^{u}\) for each client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) is:

$$\begin{aligned} \mathcal {L}_\text {unsup} \left( \textbf{q}_{i}^{u}, \tilde{\textbf{q}}_{i}^{u}\right) = \frac{\sum _{j=1}^{2} \sum _{v=1}^{V} \mathbf {\Gamma }_{i,v} \left\| \textbf{q}_{i,j,v}^{u}-\tilde{\textbf{q}}_{i,j,v}^{u} \right\| ^{2}}{2\sum _{v=1}^{V} \mathbf {\Gamma }_{i,v}}, \end{aligned}$$
(5)

where

$$\begin{aligned} \mathbf {\Gamma }_{i,v} = {\left\{ \begin{array}{ll} 1 &{} \text {if } \max \left( \tilde{\textbf{q}}_{i,1,v}^{u}, \tilde{\textbf{q}}_{i,2,v}^{u}\right) > \tau \\ 0 &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(6)

in which \(\left\| \cdot \right\| ^{2}\) denotes the squared error underlying the mean squared error (MSE) and \(\mathbf {\Gamma }_{i} \in \mathbb {R}^{B \times H \times W \times D}\) denotes the binary masks that restrict consistency regularization to confident predictions. j is the channel index, V is the number of voxels in an image, and v is the voxel index. The threshold \(\tau \) determines the extent of filtering: \(\tau =0\) means all pseudo-target regions are included in the loss calculation, while \(\tau =1\) implies complete exclusion of pseudo-target regions.
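Eqs. (5) and (6) can be sketched jointly as a confidence-masked MSE; stopping gradients through the teacher output is an assumption consistent with standard teacher-student training:

```python
import torch

def unsup_loss(q_s, q_t, tau=0.0, eps=1e-6):
    """Confidence-masked MSE consistency loss, per Eqs. (5) and (6).

    q_s, q_t: (B, 2, H, W, D) merged student / teacher probability maps.
    tau:      confidence threshold; voxels whose maximum teacher probability
              does not exceed tau are excluded from the loss.
    """
    mask = (q_t.max(dim=1).values > tau).float()       # Gamma in Eq. (6)
    sq_err = ((q_s - q_t.detach()) ** 2).sum(dim=1)    # sum over the 2 channels
    return (mask * sq_err).sum() / (2 * mask.sum().clamp_min(eps))
```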

Full training procedure of FPS-Seg

This part summarizes the full training procedure of FPS-Seg. At each federated round, each client first downloads the global model weights \(\mathbf {\Theta }^{g}\) and assigns them to the student models \(\left\{ S_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \right\} _{i=1}^{K}\). During the local training phase, the student models \(\left\{ S_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{s} \right) \right\} _{i=1}^{K}\) learn from labeled samples \(\left\{ \mathcal {D}_{i}^{l} \right\} _{i=1}^{K}\) with \(\mathcal {L}_\text {psl}\) and extract information from unlabeled samples \(\left\{ \mathcal {D}_{i}^{u} \right\} _{i=1}^{K}\) with the help of \(\left\{ T_{i} \left( \cdot ; \mathbf {\Theta }_{i}^{t} \right) \right\} _{i=1}^{K}\) using \(\mathcal {L}_\text {unsup}\). Thus, the total local training objective \(\mathcal {L}_\text {total}\) for each client of \(\left\{ Z_{i}\right\} _{i=1}^{K}\) is:

$$\begin{aligned} \mathcal {L}_\text {total} \left( \textbf{p}_{i}^{l}, \hat{\textbf{p}}_{i}^{l}, \textbf{y}_{i}^{l}, \hat{\textbf{y}}_{i}^{l}, \textbf{q}_{i}^{u}, \tilde{\textbf{q}}_{i}^{u}\right)&= \mathcal {L}_\text {psl} \left( \textbf{p}_{i}^{l}, \hat{\textbf{p}}_{i}^{l}, \textbf{y}_{i}^{l}, \hat{\textbf{y}}_{i}^{l} \right) \nonumber \\&\quad + \gamma \mathcal {L}_\text {unsup} \left( \textbf{q}_{i}^{u}, \tilde{\textbf{q}}_{i}^{u}\right) , \end{aligned}$$
(7)

in which \(\gamma \) is a trade-off hyperparameter. When the local training finishes, the central server will aggregate local student weights \(\left\{ \mathbf {\Theta }_{i}^{s}\right\} _{i=1}^{K}\) to update the global model \(G(\cdot ; \mathbf {\Theta }^{g})\) with Eq. (1). A global model can finally be obtained by repeating the above procedures.
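Putting the modules together, one federated round might look like the following condensed sketch, reusing the hypothetical helpers sketched above (`fedavg`, `merge_unlabeled_channels`, `make_exclusive_labels`, `marginal_loss`, `exclusion_loss`, `unsup_loss`, `ema_update`); the client and loader objects and the hyperparameter names are assumptions:

```python
import torch

for r in range(num_rounds):
    client_states, client_sizes = [], []
    for client in clients:
        client.student.load_state_dict(global_state)      # download Theta_g
        for x_l, y_l, x_u in client.loader:               # e local epochs (e = 1 here)
            # PSL branch on partially labeled samples, Eq. (2)
            p_l = client.student(x_l).softmax(dim=1)
            p_hat = merge_unlabeled_channels(p_l, client.organ_ch)
            y_hat = make_exclusive_labels(y_l, client.organ_ch)
            loss = alpha * marginal_loss(p_hat, y_l) + beta * exclusion_loss(p_l, y_hat)
            # SSL branch on unlabeled samples, Eq. (7)
            with torch.no_grad():
                q_t = merge_unlabeled_channels(
                    client.teacher(x_u).softmax(dim=1), client.organ_ch)
            q_s = merge_unlabeled_channels(
                client.student(x_u).softmax(dim=1), client.organ_ch)
            loss = loss + gamma * unsup_loss(q_s, q_t, tau)
            loss.backward(); client.opt.step(); client.opt.zero_grad()
            ema_update(client.teacher, client.student)
        client_states.append(client.student.state_dict())
        client_sizes.append(client.num_images)
    global_state = fedavg(client_states, client_sizes)    # Eq. (1)
```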

Experiments and results

Experimental settings

Datasets and evaluation metrics

Datasets Three in-house contrast-enhanced abdominal CT datasets, #Set-A, #Set-B, and #Set-C, were used. FPS-Seg was first evaluated on #Set-A, and its generalization ability was then validated by transferring it to downstream tasks on #Set-B and #Set-C. Details of the three datasets are shown in Table 1. For data preprocessing, all volumes were resampled to an isotropic spatial resolution of 1.0 mm along each axis. The intensities were truncated to the range of \(\left[ -1000, 1000 \right] \) Hounsfield units (HU) and then normalized to zero mean and unit variance.
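A minimal preprocessing sketch consistent with this description; resampling via `scipy.ndimage.zoom` with trilinear interpolation is an assumed implementation choice, not stated in the paper:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target=1.0):
    """Resample to 1.0 mm isotropic, clip to [-1000, 1000] HU, normalize.

    volume:  3D numpy array of HU values.
    spacing: (z, y, x) voxel spacing in mm.
    """
    factors = [s / target for s in spacing]
    vol = zoom(volume.astype(np.float32), factors, order=1)  # trilinear resampling
    vol = np.clip(vol, -1000, 1000)
    return (vol - vol.mean()) / (vol.std() + 1e-8)           # zero mean, unit variance
```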

Evaluation metrics The Dice score [%] and 95% Hausdorff distance (95HD) [mm] were used as evaluation metrics. The 95HD is a specific instance of the partial HD [21]. Given a surface point set \(\mathcal {A}\) of the prediction and a surface point set \(\mathcal {B}\) of the ground truth, the sets of directed distances from \(\mathcal {A}\) to \(\mathcal {B}\) and from \(\mathcal {B}\) to \(\mathcal {A}\) are defined as

$$\begin{aligned} \omega (\mathcal {A}, \mathcal {B}) = \left\{ \underset{b \in \mathcal {B}}{\min } \Vert a - b\Vert \mid a \in \mathcal {A} \right\} \end{aligned}$$
(8)

and

$$\begin{aligned} \omega (\mathcal {B}, \mathcal {A}) = \left\{ \underset{a \in \mathcal {A}}{\min } \Vert a - b\Vert \mid b \in \mathcal {B} \right\} , \end{aligned}$$
(9)
Fig. 4
figure 4

Ablation study on aggregating student models and teacher models

respectively, where \(\left\| \cdot \right\| \) denotes the Euclidean norm. The values \(\omega _{\kappa } (\mathcal {A}, \mathcal {B})\) and \(\omega _{\kappa } (\mathcal {B}, \mathcal {A})\) that rank in the \(\kappa \)-th percentile of \(\omega (\mathcal {A}, \mathcal {B})\) and \(\omega (\mathcal {B}, \mathcal {A})\) can then be chosen to calculate the partial HD \(\Omega _{\kappa }(\mathcal {A}, \mathcal {B})\) with

$$\begin{aligned} \Omega _{\kappa }(\mathcal {A}, \mathcal {B}) = \max \left( \omega _{\kappa } (\mathcal {A}, \mathcal {B}), \omega _{\kappa } (\mathcal {B},\mathcal {A}) \right) . \end{aligned}$$
(10)

\(\kappa \) is set to 95 to compute the 95HD.
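A direct implementation of Eqs. (8)-(10) over extracted surface points, assuming both point sets fit in memory for a pairwise distance matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

def hd95(surface_a, surface_b, kappa=95):
    """Partial Hausdorff distance per Eqs. (8)-(10); kappa=95 yields the 95HD.

    surface_a, surface_b: (N, 3) arrays of surface point coordinates in mm.
    """
    d = cdist(surface_a, surface_b)    # pairwise Euclidean distances
    a_to_b = d.min(axis=1)             # omega(A, B)
    b_to_a = d.min(axis=0)             # omega(B, A)
    return max(np.percentile(a_to_b, kappa), np.percentile(b_to_a, kappa))
```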

Fig. 5
figure 5

Qualitative results of outliers. These instances are notable for achieving satisfactory Dice scores yet exhibiting large 95HD values for the liver, spleen, and stomach, respectively, as indicated by red arrows

Fig. 6
figure 6

Qualitative results of various methods. 20 L: training with 20 labeled samples. 20 L + 30 U: training with 20 labeled and 30 unlabeled samples

Fig. 7
figure 7

Comparison of convergence rate and validation accuracy in transfer learning for pancreas and artery segmentation across various methods

Implementation details

Problem simulation One central server and three clients were maintained to simulate the problem. All experiments were conducted with fourfold cross-validation. #Set-A was split into 150/50 for training/validation in each fold. These 150 volumes were split into three sub-datasets (50/50/50) for the three clients, and every sub-dataset was divided into 20 labeled samples and 30 unlabeled samples. Each sub-dataset only used annotations of a single organ.

Experimental setup All experiments were performed on the PyTorch platform. 3D U-Net [22] was chosen as the backbone. An SGD optimizer with a momentum of 0.9 and a weight decay of \(10^{-4}\) was used to train the global model for 600 federated rounds. The local training epoch e was set to 1. A warm-up two-stage training strategy was adopted. Specifically, clients trained models using only labeled samples under a poly learning-rate schedule with an initial learning rate of \(10^{-2}\) for the first 300 rounds and trained models using both labeled and unlabeled samples under a poly learning-rate schedule with an initial learning rate of \(10^{-3}\) for the second 300 rounds. Sub-volumes of size \(256 \times 256 \times 112\) were randomly cropped for training. Random flipping and random rotation were applied as augmentation schemes. For hyperparameter settings, please refer to Section “Ablation studies.” During the testing phase, a sliding-window strategy was applied.
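The poly learning-rate schedule mentioned above can be sketched as follows; the power value of 0.9 is a common default and an assumption here:

```python
def poly_lr(base_lr, current_round, total_rounds, power=0.9):
    """Poly learning-rate decay, applied per 300-round stage.

    base_lr is 1e-2 for the labeled-only stage and 1e-3 for the
    labeled-plus-unlabeled stage described above.
    """
    return base_lr * (1 - current_round / total_rounds) ** power
```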

Experiment results

Quantitative results

Table 2 shows the quantitative results of FPS-Seg with fourfold cross-validation. Table 3 provides a detailed quantitative validation of different methods in localized, centralized, and federated learning scenarios. In the localized learning scenario, each client trained its local model on its private data with single organ annotations. Centralized learning involved training FPS-Seg using centralized datasets, employing PSL and SSL modules while excluding the FL module. The multitask federated learning (MTFL) approach [16] was implemented for comparison in the FL mode. These evaluations were conducted under three data scenarios: 50 L (50 labeled samples), 20 L (20 labeled samples), and 20 L + 30 U (20 labeled and 30 unlabeled samples).

Each method demonstrated its upper bound accuracy with 50 L. The performance of each method obtained with 20 L + 30 U consistently surpassed that with 20 L, validating the efficacy of SSL in utilizing unlabeled data. FPS-Seg consistently improved over localized learning, indicating its capability to exploit local datasets through FL. Additionally, FPS-Seg outperformed MTFL [16] in the FL mode and yielded competitive results comparable to its performance in the centralized learning mode.

Ablation studies

Effects of FPS-Seg's components and their hyperparameters \(\alpha \) and \(\beta \) in Eq. (2) are associated with the PSL module, \(\gamma \) in Eq. (7) relates to the SSL module, and \(\tau \) in Eq. (6) controls confidence thresholding. This ablation study was divided into three sub-steps. Initially, the roles of \(\alpha \) and \(\beta \) were examined with the PSL module using only labeled data. Once optimal values for \(\alpha \) and \(\beta \) were established, the model incorporated unlabeled data by enabling the SSL module with varying \(\gamma \). After determining suitable values for \(\alpha \), \(\beta \), and \(\gamma \), the model applied confidence thresholding with varying \(\tau \). This search process allowed for an assessment of the individual contributions of each component as well as of the corresponding hyperparameters. Results are shown in Table 4. This paper set \(\alpha =4\), \(\beta =1\), \(\gamma =1\), and \(\tau =0\), under which FPS-Seg achieved superior performance.

Evaluating aggregation of student versus teacher models As depicted in Fig. 4, aggregating student models outperformed using teacher models. While teacher models maintain the EMA weights of the student models, this finding suggests that student models, which undergo direct gradient descent, are more effective for global model updating in our study.

Qualitative results

Three distinct instances with outliers are visualized in Fig. 5. These instances are notable for achieving satisfactory Dice scores yet exhibiting large 95HD values for the liver, spleen, and stomach, respectively, as indicated by red arrows. This visualization underscores the challenge FPS-Seg faces in certain instances where segmentation results contain outliers.

The qualitative results of various methods are displayed in Fig. 6. With 20 L, FPS-Seg outperformed localized training, surpassed the FL method MTFL [16], and showed results on par with centralized learning. Moreover, incorporating 30 U into the training further enhanced FPS-Seg’s performance.

Transfer to downstream tasks

Initially pretrained on #Set-A, FPS-Seg was transferred to pancreas and artery segmentation on #Set-B and #Set-C, respectively. Both #Set-B and #Set-C were divided into 60/20 for training/validation. Comparisons were conducted among 3D U-Nets trained from scratch, initialized with weights pretrained on single organs (liver, spleen, or stomach), and initialized with pretrained FPS-Seg. These comparisons were drawn over 300 epochs until convergence was reached. As depicted in Fig. 7, models initialized with pretrained FPS-Seg exhibited faster convergence and superior validation performance compared with those trained from scratch and those pretrained on single organs across the two downstream tasks.

Discussion and conclusion

This paper introduced a challenging multi-organ segmentation problem motivated by the following observations in practice: (1) datasets cannot be easily shared, so a large-scale dataset for training a generalizable model cannot be collected; (2) a large portion of images is unlabeled across institutes since annotation is costly; and (3) only a small number of images may be partially labeled, and annotations are inconsistent across institutes due to different research targets. Training a generalizable model using these distributed, partially labeled, and unlabeled samples is in high demand in clinical practice and remains unexplored.

A practical approach, FPS-Seg, was introduced to tackle this problem. FPS-Seg comprises three key modules: partially supervised learning, semi-supervised learning, and federated learning modules, which, respectively, learn from partially labeled, unlabeled, and distributed samples. This method addresses partially supervised, semi-supervised, and federated learning in a straightforward, unified way. Extensive experiments showed that FPS-Seg successfully solves this challenging problem and generalizes well to downstream segmentation tasks.

The proposed method was evaluated with liver, spleen, and stomach segmentation in CT images. Extending this method to segment additional organs using various modalities is considered an avenue for future work.