Introduction

PET is widely used for in vivo quantification of physiological processes at the molecular level [1]. The introduction of hybrid imaging, in the form of PET/CT, has driven its adoption in the clinical setting, particularly for oncological applications [1]. Corrections for physical degrading factors, mainly linked to the interaction of annihilation photons with matter, such as attenuation and Compton scattering, are needed to achieve the full potential of quantitative PET imaging [2]. During the image formation process, a significant number of annihilation photons undergo photoelectric absorption and multiple Compton interactions with the material along their trajectory (patient body, scanner hardware, etc.) before reaching the PET detectors [2, 3]. Attenuation and scattering interactions result in undetected annihilation events and the recording of anomalous coincidences, respectively [4], leading to a large bias in tracer uptake quantification. It has been reported that around 30–35% of all detected events in 3D brain scanning arise from scattered photons, a fraction that exceeds 50–60% in whole-body scanning [5]. The probability of photon interactions increases with both the traveled distance (patient size) and the electron density of the medium [4]. Hence, effective attenuation/scatter correction (AC/SC) of PET images requires prior knowledge of the attenuation map at 511 keV along the photon paths [4].
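To appreciate the magnitude of this effect, note that the fraction of coincidences surviving along a line of response (LOR) follows the Beer–Lambert law and, since both annihilation photons must escape, depends only on the total path through the attenuating medium and not on the emission point. As an illustrative example (our own numbers, using the commonly quoted linear attenuation coefficient of water at 511 keV, \(\mu \approx 0.096\ {\mathrm{cm}}^{-1}\), and a 20-cm water-equivalent path):

$$\frac{I}{{I}_{0}}=\mathrm{exp}\left(-{\int }_{\mathrm{LOR}}\mu \left(x\right)dx\right)\approx {e}^{-0.096\ {\mathrm{cm}}^{-1}\times 20\ \mathrm{cm}}\approx 0.15 ,$$

i.e., roughly 85% of the coincidences along such a path are lost to attenuation, which is why accurate knowledge of \(\mu\) at 511 keV is indispensable.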

The problem of AC/SC for quantitative PET imaging has been largely resolved following the commercial emergence of the hybrid PET/CT modality, where CT-based correction algorithms are routinely implemented on commercial systems [4, 6]. However, AC and SC remain challenging on PET/MRI and PET-only scanners [7, 8]. Unlike PET/CT, attenuation correction in PET/MRI is not straightforward owing to the lack of a direct relationship between MR signals, which reflect proton density and tissue relaxation properties, and electron density [8]. Hence, various strategies have been devised for MRI-guided AC/SC, including bulk segmentation, atlas-based algorithms, and emission-based techniques. Although these methods improve the quantification accuracy of PET images, they suffer from tissue misclassification (segmentation-based approaches) and from inter/intra-subject variability of MR images during co-registration to the best-fitting atlas model (atlas-based approaches) [8]. Furthermore, for PET-only scanners, emission-based algorithms have been proposed that estimate the attenuation map directly from the emission data, time-of-flight (TOF) information, and anatomical prior knowledge [9, 10].

The past decade has witnessed significant progress in the development and implementation of artificial intelligence (AI)-based methods in different areas of medical image analysis, e.g., detection, segmentation, classification, regression, and outcome prediction [11,12,13,14]. Several AI-based algorithms, in particular deep convolutional neural networks, have been developed to address the limitations of conventional attenuation correction techniques, demonstrating significant benefits in terms of improved image quality and quantitative accuracy of PET imaging [15, 16]. In this context, four main learning-based approaches for AC/SC of PET data have been pursued: (i) generation of synthetic CT from MR images [17], (ii) generation of synthetic CT from non-corrected PET images [18], (iii) prediction of the scattered component from emission information (TOF, event position) in either the image or sinogram domain [10, 19], and (iv) direct generation of AC/SC PET images from non-attenuation/scatter-corrected images [20]. Although a number of studies reported promising performance of deep learning (DL)-based algorithms within an acceptable clinical tolerance, the size of training and testing datasets remains a major limitation of these methods [21]. Building a generalizable and trustworthy DL model requires a large multicenter dataset to tune millions of model parameters [22,23,24]. However, the sensitivity of medical images, and the ensuing ethical/legal considerations and regulations, challenge the gathering of large datasets to feed such data-hungry algorithms [22,23,24]. To address this issue, federated learning (FL), initially developed for mobile technologies, is being increasingly considered in the healthcare domain [4].

A single hospital often cannot provide the number of samples required for successful training of machine learning models with acceptable accuracy, generalizability, and trustworthiness [22,23,24,25]. As such, it may not be feasible to train a high-quality model for AC/SC of PET images from the limited dataset available at a single hospital. Moreover, not all hospitals have the infrastructure and expertise for machine learning model development. One strategy involves collecting data from different hospitals to train a more accurate model. However, this approach is challenged by various privacy regulations and policies on data sharing. FL techniques enable the collaborative training of machine learning models among multiple parties without exchanging local data, thereby preserving privacy and addressing the concerns of data users and data owners [22,23,24].

A typical FL protocol consists of three main components: (i) the manager (e.g., a trusted server), (ii) the participating parties acting as data owners (e.g., hospitals and departments), and (iii) the computation-communication framework used to train the local and global models [26]. Depending on the parties, FL protocols can be divided into two settings: (i) cross-device FL, where the parties are edge devices; and (ii) cross-silo FL, where the parties are reliable organizations (e.g., hospitals). In designing an FL system, three properties of the participating parties need to be considered, namely (a) their computational and storage capacity, (b) their stability and scale, and (c) the data distribution among them [27, 28]. The manager (trusted server or party) supervises the training of the global model and manages the communication between itself and the data owners. To produce an accurate model, the stability and reliability of the server need to be guaranteed [27, 28]. In the cross-device setting, various solutions have been proposed to increase the reliability of the system [29,30,31,32]. Fortunately, in the cross-silo setting, organizations have powerful computational machines, which facilitates FL [29,30,31,32]. Hence, one possible option is to designate one of the organizations as the manager of the FL model [27, 28]. Alternatively, the organizations can operate in a fully decentralized setting, in which all participating parties communicate with each other directly [29,30,31,32].

Collaborative models can thus be trained in a decentralized manner using an FL framework without exchanging data between the different centers/hospitals [27, 28]. In recent years, FL-based DL models have been applied to multi-institutional data for different medical imaging tasks, including image segmentation [33,34,35] and abnormality detection and classification [36,37,38]. The main contribution of the present study is to propose, implement, and assess a robust FL algorithm for attenuation/scatter correction of PET data, achieving a generalizable model from the limited data available at each center without direct data sharing amongst the different centers. We envisage potential applications of this development on standalone CT-less PET scanners or for enhanced quality assurance in PET/CT scanners.

Materials and methods

PET/CT datasets

Non-attenuation-corrected and CT-based attenuation-corrected 18F-FDG PET images of 300 patients were included in this study. The data were acquired at 6 different centers, each contributing 50 patients scanned on different PET scanners using different image acquisition and reconstruction protocols; further information on the datasets is provided in Table 1 [20, 39,40,41,42,43,44,45,46,47]. All images were reviewed to include only high-quality, artifact-free PET images. Both corrected and non-corrected PET images were converted to standardized uptake values (SUVs).

Table 1 Patient demographics and PET/CT image acquisition and reconstruction settings across the six different centers

Image preprocessing

To harmonize the intensity range of PET images across the different centers, the voxel values of both non-ASC and CT-ASC PET images were converted to standardized uptake values (SUV), and both image sets were resampled to the same voxel spacing (3 × 3 × 4 mm3). Subsequently, non-ASC and CT-ASC PET images were normalized by empirical SUV factors of 3 and 9, respectively. In this way, the intensity range of all PET images across the different centers fell between 0 and 5.
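A minimal sketch of this preprocessing chain is given below. It assumes body-weight SUV (decay-correction details omitted) and hypothetical inputs (`nac`, `casc`, `dose_bq`, `weight_g`); it illustrates the steps rather than reproducing the exact pipeline used in this study:

```python
import numpy as np
import SimpleITK as sitk

TARGET_SPACING = (3.0, 3.0, 4.0)  # mm, common voxel spacing used in this study

def resample_to_spacing(img: sitk.Image, spacing=TARGET_SPACING) -> sitk.Image:
    """Resample a PET volume to the common voxel spacing (linear interpolation)."""
    old_size, old_spacing = img.GetSize(), img.GetSpacing()
    new_size = [int(round(sz * osp / nsp))
                for sz, osp, nsp in zip(old_size, old_spacing, spacing)]
    return sitk.Resample(img, new_size, sitk.Transform(), sitk.sitkLinear,
                         img.GetOrigin(), spacing, img.GetDirection(),
                         0.0, img.GetPixelID())

def to_suv(activity_bqml: np.ndarray, dose_bq: float, weight_g: float) -> np.ndarray:
    """Body-weight SUV: concentration (Bq/mL) x body weight (g) / injected dose (Bq)."""
    return activity_bqml * weight_g / dose_bq

# Harmonization with the empirical normalization factors described above
nac_arr  = to_suv(sitk.GetArrayFromImage(resample_to_spacing(nac)),  dose_bq, weight_g) / 3.0
casc_arr = to_suv(sitk.GetArrayFromImage(resample_to_spacing(casc)), dose_bq, weight_g) / 9.0
```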

Global FL training

In typical machine learning problems, the goal is to minimize an appropriate loss function \(F\left(\theta \right)\), where \(\theta \in {\mathbb{R}}^{d}\) denotes the parameters of the model. The loss function \(F\left(\theta \right)\) represents the average of the empirical loss over the available data samples with respect to the model parameters \(\theta\). A common approach to minimizing \(F\left(\theta \right)\) is the iterative stochastic gradient descent (SGD) algorithm. The idea of federated DL originates from the fact that SGD allows parallelization [26, 48,49,50,51,52,53]. Hence, one can optimize a machine learning model using distributed SGD. The framework is as follows.

Consider an FL system with \(K\) parties, where the \(k\)-th party has a local training dataset \({\mathcal{D}}_{k} = {\left\{{X}_{i}, {Y}_{i}\right\}}_{i=1}^{{N}_{k}}\), where \({X}_{i}\) and \({Y}_{i}\) are the feature vector and the ground-truth label vector, respectively, and \({N}_{k}\) is the sample size available at party \(k\in \left\{1, 2, \ldots, K\right\}\). In total, the parties hold \(N=\sum_{k=1}^{K}{N}_{k}\) samples. Let \({F}_{k}\left(\theta \right)\) denote the local objective function of the \(k\)-th client, i.e.:

$${F}_{k}\left(\theta \right) = \frac{1}{{N}_{k}} \sum_{i=1}^{{N}_{k}}\mathcal{L}\left(\theta ;\left({X}_{i}, {Y}_{i}\right)\right) ,$$
(1)

where \(\theta \in {\mathbb{R}}^{d}\) denotes the model parameters to be optimized and \(\mathcal{L}(\cdot;\cdot)\) is the task-specific loss function. As an example, one can consider the mean square error loss \(\mathcal{L}\left(\theta ;\left({X}_{i}, {Y}_{i}\right)\right)=\frac{1}{2}{\Vert {Y}_{i}- {\widehat{Y}}_{i}\Vert }_{2}^{2}\), where \({\widehat{Y}}_{i}\) is the corresponding predicted label and \({\Vert \cdot \Vert }_{2}\) is the \({l}_{2}\)-norm. In this case, the global optimization problem of our FL system reads:

$$\underset{\theta }{\mathrm{min}}\left(F\left(\theta \right)=\sum_{k=1}^{K}\frac{{N}_{k}}{N}{F}_{k}\left(\theta \right) \right) .$$
(2)

In this framework, the local objective function of each center is weighted by the fraction of the data originating from that center. The above optimization problem can be solved with the SGD algorithm: at the \(t\)-th iteration, each party computes local gradients on its own data and sends them back to the manager (server) for aggregation and updating. Let \(\nabla {F}_{k}\left({\theta }_{t}\right)\) denote the local gradient on the local data of the \(k\)-th party at the \(t\)-th iteration, \(\eta\) the learning rate, and \({\theta }_{t}\) the model at the \(t\)-th iteration. The server aggregates the local gradients and updates the model parameters as follows:

$${\theta }_{t+1}\leftarrow {\theta }_{t}- \eta \sum\nolimits_{k=1}^{K}\frac{{N}_{k}}{N}\nabla {F}_{k}\left({\theta }_{t}\right) .$$
(3)

Note that for massive datasets, computing the full local gradients in Eq. (3) becomes prohibitively demanding. Hence, the parameter vector is updated with a stochastic gradient estimate:

$${\theta }_{t+1}= {\theta }_{t}- {\eta }_{t}G\left({\theta }_{t}\right)$$
(4)

where \({\mathbb{E}}\left[G\left({\theta }_{t}\right)\right]= \nabla F\left({\theta }_{t}\right)\), i.e., \(G\left({\theta }_{t}\right)\) is an unbiased estimate of the gradient, computed on a random mini-batch.
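For concreteness, a minimal NumPy sketch of the server-side update of Eqs. (3)–(4) is given below, with the model parameters flattened into a single vector (an illustrative simplification, not the TensorFlow implementation used in this work):

```python
import numpy as np

def server_update(theta, local_grads, sample_sizes, lr):
    """One global step: weighted aggregation of local (stochastic) gradients, Eq. (3)."""
    n_total = float(sum(sample_sizes))
    agg = sum((n / n_total) * g for g, n in zip(local_grads, sample_sizes))
    return theta - lr * agg  # gradient step along the aggregated direction
```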

We evaluated two different training strategies for our federated pipeline. In the first approach, referred to as parallel federated learning (FL-PL), a central server orchestrates the FL workflow, as summarized in Fig. 1: first (step A), the central global model is distributed to the different centers; then (step B), the models are trained in each center separately; and finally (step C), the locally trained models are returned to the central server, which aggregates them into the global model. Steps A–C are repeated until the model is fully trained and converges. In the second approach, referred to as sequential federated learning (FL-SQ), the model meets the data serially, center after center. First (step A in Fig. 1), model training begins in one center for a predefined number of epochs; then (step B), the model passes sequentially through all centers; finally (step C), this process is repeated for a predefined number of rounds to generate the final model. FL-SQ requires a longer training time since the learning procedure is sequential. All FL algorithms and DL models were implemented in TensorFlow 2.6 (details on the DL models are provided below). As in previous studies [36, 54,55,56,57,58,59], the FL process was performed on a server with multiple local GPUs, where each local GPU was considered as a node/center.
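The two strategies can be summarized by the following sketch (the `Center` interface with `train` and `num_samples` is hypothetical; in our experiments each center corresponded to a TensorFlow training loop on its own GPU):

```python
import numpy as np

def fl_parallel(weights, centers, rounds, local_epochs):
    """FL-PL: per round, distribute (A), train locally (B), aggregate (C)."""
    for _ in range(rounds):
        local, sizes = [], []
        for c in centers:
            w = c.train([w_i.copy() for w_i in weights], local_epochs)  # steps A-B
            local.append(w)
            sizes.append(c.num_samples)
        n = float(sum(sizes))
        weights = [sum((s / n) * lw[i] for lw, s in zip(local, sizes))
                   for i in range(len(weights))]  # step C: sample-size-weighted average
    return weights

def fl_sequential(weights, centers, rounds, local_epochs):
    """FL-SQ: the model visits each center in turn, for a number of rounds."""
    for _ in range(rounds):
        for c in centers:
            weights = c.train(weights, local_epochs)
    return weights
```

Note that FL-PL exchanges model weights once per round, whereas FL-SQ forwards the model from center to center, which is why its wall-clock training time is longer.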

Fig. 1

Schematic of the two FL algorithms and network architectures implemented in this study. In parallel federated learning (FL-PL), in the first step (A), the central global model is distributed to the different centers; then (B), the models are trained in each center separately; and finally (C), the locally trained models are returned to the central server, which aggregates them into the global model. Steps (A–C) are repeated until the model is fully trained and has converged. In sequential federated learning (FL-SQ), the model meets the data serially, center after center. First (step A), model training begins in one center for a predefined number of epochs; then (step B), the model passes sequentially through all centers; finally (step C), this process is repeated for a predefined number of rounds to generate the final model. The bottom image depicts our U2-Net architecture; each blue block in the main body (left) consists of a residual U-Net (right)

Deep neural network

In this study, we used a modified U2-Net [60], which employs residual U-blocks in a U-shaped architecture together with a deep supervision strategy, in which the training loss incorporates information at all scales. Deep supervision enables the extraction of both local and global contextual information [60]. Unlike the widely used U-Net, which relies on successive down-sampling of the image and hence gradually loses high-resolution information, the U2-Net does not sacrifice the high-resolution content of the images [60], which is crucial for many image-to-image conversion tasks, such as attenuation/scatter correction.

This is achieved by a nested, two-level U-structure inspired by the classical U-Net [61]: the general U-shape of the U-Net is retained, but each convolutional block is replaced by a structure that itself has a symmetric encoder-decoder U-shape [60]. This block, known as a ReSidual U-block (RSU), extracts intra-stage multi-scale features and provides a mixture of receptive fields of different sizes, which is highly desirable for fine-grained image-to-image tasks [60]. This is equivalent to drastically increasing the number of network layers, but with the important advantage of keeping the computational and memory footprint low and hence the training procedure simple [60]. Note that this nested design differs from the more common strategy of cascading multiple U-Nets, whose computational burden grows in proportion to the number of networks used [60]. The nested structure enables the U2-Net to extract intra-stage multi-scale features as well as aggregated inter-stage multi-level features [60].

The RSU can capture intra-stage features at different scales depending on its depth and kernel size, and its depth can be chosen to obtain various single-level or multi-level nested U-shaped structures [60], although overly deep models may become too complex to implement and deploy in training procedures and real-world applications. In this work, non-attenuation/scatter-corrected images were fed to the modified U2-Net to directly generate attenuation/scatter-corrected PET images. The network was trained in a 2D manner with the Adam optimizer, a learning rate of 0.001, an L2-norm loss, and a weight decay of 0.0001. The schema of the network is depicted in Fig. 1.
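To make the RSU building block concrete, a simplified Keras sketch is given below; the layer widths, depth, and dilated bottleneck are our own illustrative assumptions rather than the exact configuration of [60]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, dilation=1):
    x = layers.Conv2D(filters, 3, padding="same", dilation_rate=dilation)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def rsu_block(x_in, mid_filters=16, out_filters=64, depth=4):
    """Residual U-block: a small encoder-decoder whose output is added back
    to the block input, yielding intra-stage multi-scale features.
    Input height/width must be divisible by 2**depth."""
    x0 = conv_bn_relu(x_in, out_filters)          # input projection
    skips, x = [], x0
    for _ in range(depth):                        # encoder: down-sampling path
        x = conv_bn_relu(x, mid_filters)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
    x = conv_bn_relu(x, mid_filters, dilation=2)  # dilated bottleneck
    for skip in reversed(skips):                  # decoder: up-sampling path
        x = layers.UpSampling2D(2)(x)
        x = conv_bn_relu(layers.Concatenate()([x, skip]), mid_filters)
    x = conv_bn_relu(x, out_filters)
    return layers.Add()([x, x0])                  # residual connection
```

In the full U2-Net, such RSU blocks replace the plain convolutional blocks of an outer U-Net, and deep supervision attaches the loss to side outputs at every scale.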

Evaluation strategy

In this study, we evaluated two federated models, referred to as FL-SQ and FL-PL, and compared their performance with a centralized (CZ) approach, wherein the data are pooled on one server. Moreover, center-based (CB) models were built and evaluated separately using only the training/test datasets from the same center. Each center's data were divided into training (30 patients), validation (10 patients), and test (10 patients) sets. A standard train/validation/test data split was used for training all models, and the results were reported on untouched test sets to avoid the risk of overfitting; there was no overlap between training, validation, and test sets. The same patients were used for evaluating the different non-CB models to facilitate comparison between the various models. For the three non-CB strategies (FL-SQ, FL-PL, and CZ), the models were built using a 180/60 train/validation split, and the results were reported on 60 test sets (the 10 test datasets from each of the six centers). For the CB strategy, six different models were developed using 30/10 train/validation splits, and only the 10 test sets from the same center were employed for model evaluation.

For model performance evaluation, the voxel-wise mean error (ME), mean absolute error (MAE), relative error (RE%), absolute relative error (ARE%), and peak signal-to-noise ratio (PSNR) were computed between the ground-truth CT-based attenuation/scatter-corrected PET images and the predicted corrected PET images, as follows:

$$ME=\frac{1}{vxl}\sum\nolimits_{v=1}^{vxl}\left({PET}_{predicted}(v)-{PET}_{CT-ASC}(v)\right)$$
(5)
$$MAE=\frac{1}{vxl}\sum\nolimits_{v=1}^{vxl}\left|{PET}_{predicted}(v)-{PET}_{CT-ASC}(v)\right|$$
(6)
$$RE(\mathrm{\%})=\frac{1}{vxl}\sum\nolimits_{v=1}^{vxl}\frac{{PET}_{predicted}(v)-{PET}_{CT-ASC}(v)}{{PET}_{CT-ASC}(v)}\times 100\mathrm{\%}$$
(7)
$$ARE(\mathrm{\%})=\frac{1}{vxl}\sum\nolimits_{v=1}^{vxl}\left|\frac{{PET}_{predicted}(v)-{PET}_{CT-ASC}(v)}{{PET}_{CT-ASC}(v)}\right|\times 100\mathrm{\%}$$
(8)
$$PSNR(\mathrm{dB})=10{\mathrm{log}}_{10}\left(\frac{{Peak}^{2}}{MSE}\right)$$
(9)

where PETpredicted denotes the DL-based corrected PET image, PETCT-ASC the reference CT-ASC PET image, and vxl and v the total number of voxels and the voxel index, respectively. Moreover, the structural similarity index (SSIM) was calculated as described in [62].
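A compact sketch of these voxel-wise metrics is shown below; the small `eps` guard against division by zero in RE/ARE is our own assumption (in practice such metrics are typically restricted to body-contour voxels):

```python
import numpy as np
from skimage.metrics import structural_similarity

def voxelwise_metrics(pred, ref, eps=1e-6):
    """ME, MAE, RE%, ARE%, PSNR (Eqs. 5-9) and SSIM between two SUV volumes."""
    diff = pred - ref
    mse = float(np.mean(diff ** 2))
    return {
        "ME":   float(np.mean(diff)),
        "MAE":  float(np.mean(np.abs(diff))),
        "RE%":  100.0 * float(np.mean(diff / (ref + eps))),
        "ARE%": 100.0 * float(np.mean(np.abs(diff) / (ref + eps))),
        "PSNR": 10.0 * float(np.log10(float(ref.max()) ** 2 / mse)),
        "SSIM": structural_similarity(ref, pred,
                                      data_range=float(ref.max() - ref.min())),
    }
```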

Different plots (box, bar, and scatter plots) were generated to enable various comparisons. The two-sample Wilcoxon test (Wilcoxon rank-sum test or Mann–Whitney test) was used for the statistical comparison of image-derived metrics between the different training models. P-values were corrected using the Benjamini–Hochberg procedure to provide adjusted p-values (q-values), with a threshold of 0.05 considered as the significance level for q-values. In addition, joint histogram analysis was used to depict the distribution of voxel-wise PET SUV correlations between the reference CT-based ASC images and the different DL approaches.
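This statistical pipeline can be sketched as follows (the per-patient metric vectors `mae_cz`, `mae_fl_pl`, `mae_fl_sq`, and `mae_cb` are hypothetical placeholders; `fdr_bh` is the Benjamini–Hochberg procedure in statsmodels):

```python
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Pairwise Mann-Whitney tests on a per-patient metric (e.g., MAE)
comparisons = {"CZ vs FL-PL": (mae_cz, mae_fl_pl),
               "CZ vs FL-SQ": (mae_cz, mae_fl_sq),
               "CZ vs CB":    (mae_cz, mae_cb)}
p_values = [mannwhitneyu(a, b).pvalue for a, b in comparisons.values()]

# Benjamini-Hochberg correction; q-values thresholded at 0.05
reject, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```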

Results

Figure 2 shows a representative example of non-ASC, CT-ASC, CB, CZ, FL-SQ, and FL-PL images, along with the corresponding bias maps of the DL models with respect to CT-based ASC (CT-ASC) images. As can be seen, the CZ, FL-SQ, and FL-PL models generated high-quality images. More example images from each center are provided in supplemental Fig. 1.

Fig. 2

Example of non-ASC, CT-ASC, CB, CZ, FL-SQ, and FL-PL images and their corresponding bias maps generated for the DL models with respect to CT-based ASC (CT-ASC) images. Sequential federated learning (FL-SQ), parallel federated learning (FL-PL), centralized (CZ), and center-based (CB)

Figure 3 compares quantitative image quality metrics, i.e., RE (%), ARE (%), ME, MAE, SSIM, and PSNR, calculated on SUV images for the different training strategies with respect to CT-ASC images serving as ground truth. As expected, the CB training strategy performed worst quantitatively, resulting in the highest absolute error (MAE = 0.21 ± 0.07). The performance of the FL-based algorithms is comparable to that of the centralized training strategy, although the CZ method shows a smaller deviation and variance than FL-SQ and FL-PL in terms of MAE (0.10 ± 0.03 versus 0.14 ± 0.07 and 0.14 ± 0.06, respectively). Table 2 summarizes the statistical comparisons of quantitative metrics between these four training strategies. In terms of overall structural similarity, the different approaches demonstrated comparable performance against the ground truth (CZ = 0.93 ± 0.01, FL-SQ = 0.93 ± 0.01, and FL-PL = 0.92 ± 0.03), except for CB, which achieved an SSIM of 0.70 ± 0.04. Table 3 summarizes the statistical comparison of quantitative metrics between the four training strategies separately for each center; the pattern observed in Table 2 is repeated for each center and all metrics across the different frameworks.

Fig. 3

Comparison of quantitative image quality metrics, including RE (%), ARE (%), ME, MAE, SSIM, and PSNR, calculated on SUV images between the different training strategies with respect to CT-ASC images serving as ground truth. Sequential federated learning (FL-SQ), parallel federated learning (FL-PL), centralized (CZ), and center-based (CB)

Table 2 Statistical comparison of quantitative metrics between the four training strategies used in this study: sequential federated learning (FL-SQ), parallel federated learning (FL-PL), centralized (CZ), and center-based (CB)
Table 3 Comparison of various image quality metrics (mean ± SD) for the different training models performed at the different centers

The quantitative performance of the different training strategies, categorized by clinical center, is reported in Fig. 4, while supplemental Fig. 2 depicts the same information for each case in the test dataset. The center-wise relative error of the CZ approach (ARE = 11.16 ± 3.24%) was slightly better than that of the FL-SQ (ARE = 13.51 ± 5.04%) and FL-PL (ARE = 12.83 ± 3.91%) approaches, whereas the ARE of the CB approach was considerably larger (24.22 ± 7.28%). The highest MAE was obtained by the CB method (0.21 ± 0.07), compared to values of 0.10 ± 0.03, 0.14 ± 0.07, and 0.14 ± 0.06 for CZ, FL-SQ, and FL-PL, respectively. For all approaches, the SSIM and PSNR metrics demonstrated consistent behavior across the different centers (0.93 ± 0.02 and 34.0 ± 3.23, respectively), except for CB, which achieved the poorest structural performance, with an SSIM of 0.70 ± 0.04 and a PSNR of 28.66 ± 2.70.

Fig. 4

Quantitative performance of the different training strategies, including sequential federated learning (FL-SQ), parallel federated learning (FL-PL), centralized (CZ), and center-based (CB), categorized by clinical center

Furthermore, the voxel-wise joint histogram analysis depicting the correlation between the predicted images and the CT-ASC images serving as ground truth is illustrated in Fig. 5. The coefficients of determination (R2) achieved by the CB, CZ, FL-SQ, and FL-PL methods were 0.76, 0.94, 0.93, and 0.92, respectively.

Fig. 5

Voxel-wise joint histogram analysis depicting the correlation between the predicted images using the different training approaches and CT-ASC images serving as ground truth

The results of the statistical analysis between the different learning strategies, categorized by center, are summarized in Fig. 6. As illustrated, the CB approach differed significantly from the other algorithms in almost all quantitative metrics, except RE and ME. The performance of the CZ model was consistent with that of the FL algorithms for almost all metrics in the center-based categorization (p-value > 0.05).

Fig. 6

Statistical analysis between the different learning strategies, namely center-based (CB) as well as centralized (CZ) and FL approaches, for the different quantitative metrics, reflecting evaluation on the overall data as well as on data from each center. Blue and red colors indicate p-value < 0.05 and p-value > 0.05, respectively. Abbreviations: sequential federated learning (FL-SQ), parallel federated learning (FL-PL), centralized (CZ), and center-based (CB) learning

Discussion

DL approaches are data-hungry, requiring large, reliable datasets to generate robust and generalizable models [27, 28]. However, the collection of large centralized datasets for training DL models is challenging and not always feasible owing to the sensitivity of clinical datasets, specifically medical images [27, 28]. FL algorithms provide the opportunity to train a model on multicentric datasets without sharing data [27, 28]. In this work, we provided a framework for DL-based AC/SC model generation from PET images of different centers without direct sharing of clinical datasets. Our FL-based DL models provided promising results, improving model generalizability and robustness for AC/SC of PET images without data sharing in multicentric studies.

The quantitative analysis performed on SUV PET images demonstrated highly reproducible performance against intra-/inter-patient variability. In terms of SUV quantification bias, the ARE% metric demonstrated excellent agreement between the FL-SQ (CI: 12.21–14.81%) and FL-PL (CI: 11.82–13.84%) models and the conventional centralized training approach (CI: 10.32–12.00%), while the FL-based algorithms improved model performance in terms of ARE by more than 11% compared to the CB training strategy (CI: 22.34–26.10%). The center-based voxel-wise quantitative analysis and structural indices (Figs. 3 and 4) illustrated the superior performance of the FL-based algorithms compared to the CB approach. Furthermore, the Mann–Whitney tests between the different strategies (Fig. 6) revealed consistency between the CZ and FL-based algorithms, both on the overall dataset and in the center-categorized mode (p-value > 0.05), whereas the CB training approach differed significantly from the others (p-value < 0.05). In addition, the joint histogram analysis (Fig. 5), depicting a voxel-wise comparison between reference CT-ASC and predicted images, exhibited close performance between CZ (R2 = 0.94), FL-SQ (R2 = 0.93), and FL-PL (R2 = 0.92), while the CB model achieved a far lower coefficient of determination (R2 = 0.76). Despite the strong correlation of the CZ and FL-based methods with the reference CT-ASC, a slight underestimation of the predicted tracer uptake was observed. Slightly inferior results were observed for some CB models, which could be attributed to the different numbers of slices used in training (as reflected in Table 1); overall, however, all CB models exhibited very similar quantitative errors.

In a previous study performed in the context of conventional centralized learning [20], direct AC/SC of PET images using a modified ResNet achieved good performance (MAE = 0.22 ± 0.09 and ARE = 11.61 ± 4.25%) on a large dataset (1150 patient images) gathered from a single center and PET scanner [20]. Here, we further improved the performance of our algorithm by developing a modified bi-level nested U-Net architecture inspired by the U2-Net originally applied to object detection in natural images (centralized mode: MAE = 0.10 ± 0.03 and ARE = 11.16 ± 3.24%). In our previous study [20], DL-based attenuation and scatter correction in the image domain was extensively evaluated on 150 clinical cases, including quantitative analysis of radiotracer uptake in 170 lesions/abnormal high-uptake regions (colorectal, head and neck, lung, lymphoma, …). A mean relative SUV error of less than 5% was observed for SUVmax and SUVmean across all lesions/regions. Although quantitative analysis of malignant lesions was not performed in the current study, the voxel-wise SUV error of the CZ and FL algorithms was within the same range as in our previous study [20].

Compared to previous works, Yang et al. [63] reported an average ARE of 16.55 ± 4.43% for AC/SC of whole-body PET images using 3D generative adversarial networks. Dong et al. [64] tested different network architectures (U-Net, GAN, and cycle-GAN), achieving good performance in terms of ME (0.62 ± 1.26%) and normalized mean square error (0.72 ± 0.34%). Van Hemmen et al. [65] developed a modified U-Net architecture for AC/SC of whole-body PET using emission images only, resulting in an average ARE of 28.2% on a small-scale dataset. Hwang et al. [66] compared different emission-based PET attenuation correction approaches, including DL-based μ-map generation from non-attenuation-corrected (NAC) images, improved estimation of μ-maps using maximum likelihood estimation of activity and attenuation (MLAA), and a combination of these two methods. They reported that the MLAA-based DL approach outperformed μ-maps estimated from NAC PET images, whereas combining the two methods brought no further improvement. Apart from direct AC/SC of PET images, a number of studies reported on MRI-guided AC/SC through the generation of pseudo-CT images from PET/MR images, as well as attenuation maps based on tissue classification from PET-only images [67,68,69]. Although the synthesized attenuation map-based approaches demonstrate promising results, they suffer from numerous challenges, including mismatches between anatomical (MRI) and PET images and organ motion [67,68,69]. Conversely, the direct AC/SC approach is less sensitive to noise, metal artifacts, truncation, and local mismatches between anatomical and functional images [70, 71]. Furthermore, this approach is potentially capable of correcting for organ motion and hollow artifacts, provided that the model is trained on clean and accurately corrected PET images [20, 70].

Although direct attenuation/scatter correction in the image domain has a number of advantages, the generation of pseudo-μ-maps (synthetic CT) from non-attenuation-corrected images or MR images provides an explainable attenuation map enabling the verification/detection of errors within the PET attenuation and scatter correction procedure [20, 66, 72, 73]. Suboptimal performance of direct AC approaches cannot easily be discerned from the resulting AC PET images (local under-/overestimation of radiotracer uptake), whereas suboptimal performance of DL-based synthetic CT generation can be visually detected in the resulting synthetic μ-maps, which can be inspected for anatomical defects and/or artifacts prior to PET attenuation correction. Another drawback of direct AC in the image domain is the sensitivity of the models to the quality of the query data, wherein increased noise levels, abnormalities, and minor image artifacts may translate into erroneous signals in the resulting images. Moreover, owing to the black-box nature of DL models, the occurrence of outliers (cases with gross errors) in the output of these models should be carefully monitored.

Several studies have recently assessed the performance of FL approaches in medical image analysis [74]. Feki et al. [36] investigated COVID-19 detection from chest X-ray images using VGG16 and ResNet50 under both federated and centralized frameworks, reporting similar performance for the two settings. In a more recent study [37], an FL-based model, referred to as EXAM, was developed using vital signs, laboratory exams, and chest X-rays to predict the future oxygen requirements of COVID-19 patients across twenty centers. Compared with center-based models (where each center developed and evaluated its model separately), the FL-based model achieved 16% and 38% improvements in average AUC and generalizability, respectively. Gawali et al. [75] compared different privacy-preserving DL methods for chest X-ray classification, reporting an AUROC of 0.95/0.72 and an F1 score of 0.93/0.62 for a DenseNet model trained in a centralized fashion versus their best-performing FL approach. In our study, we evaluated two FL-based models and achieved results superior to the CB models and comparable to the CZ model. Building a generalizable and robust model requires a large dataset, and FL approaches can address the associated privacy concerns without sacrificing model performance. In a more recent study, Shiri et al. [76] proposed an FL-based multi-institutional PET image segmentation framework for head and neck studies; they enrolled 404 patients from eight different centers and reported that the FL-based algorithms outperformed CB models and achieved performance similar to the CZ approach.

The FL paradigm enables the training of machine/deep learning models on multiple decentralized datasets without the need to exchange data. This preserves data privacy, data security, and data access rights while enabling model training on a large-scale heterogeneous database [77]. In the FL framework, the local data are not made available to the other participants or to the server. However, a curious server may infer sensitive data from the exchanged model parameters. Various attacks against machine learning models have been investigated in the literature [78, 79]. For instance, Shokri et al. [80] studied membership inference attacks, whereas Fredrikson et al. [81] addressed model inversion attacks. Possible threats can be classified into three categories, depending on the stage of the FL process. Malicious parties can perform data poisoning attacks at the input of the learning model [82, 83], for instance by modifying the labels of data samples. Alternatively, they can perform model poisoning attacks during the learning process [84, 85], for example by uploading random updates to the global model. Finally, a malicious party can perform inference attacks on the released learnt model [77, 80, 81]; for instance, a curious server may infer sensitive information about the training data from the communicated model parameters.

The CB framework faces generalizability challenges even with large datasets, owing to the large variability across centers in terms of scanner brands, data acquisition and reconstruction protocols, and post-processing schemes. Moreover, in the absence of infrastructure and expertise, it may not be possible to build machine learning models at every center. The CZ training framework is the ideal option for model development, yet it suffers from ethical and legal constraints on data sharing. FL algorithms provide the opportunity to train a model on multicentric datasets without sharing data, and models trained within an FL framework can, in the ideal situation, converge to CZ performance. Overall, CZ model training would lead to the highest accuracy and generalizability; however, when ethical and legal constraints do not allow data sharing, or when a center does not have enough training samples, FL approaches are an attractive solution. Models trained with FL might, in the best-case scenario, approach the performance of the CZ models, whereas CB models were observed to suffer from very poor generalizability.

Data heterogeneity, arising from the use of different scanners and image acquisition and reconstruction protocols, is the main source of error impairing the building of a generalizable model [17]. The heterogeneity of data across centers prevents CB models from working properly on unseen datasets from other centers. To build a generalizable model, data from different centers should be included in the training dataset, which is possible in the CZ and FL pipelines. In a transfer-FL framework, a global model is first built from a portion of the data of the different centers in a federated manner and is then specialized to each center by applying a transfer learning technique [27, 28], as sketched below. This approach could be employed to cope with the heterogeneous data collected from different centers with various acquisition parameters.
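A minimal sketch of this specialization step, assuming a Keras model (`model`, `global_weights`, and `center_train_ds` are hypothetical placeholders, and the fine-tuning hyperparameters are illustrative):

```python
import tensorflow as tf

# Transfer-FL: start from the converged federated (global) model ...
model.set_weights(global_weights)

# ... then briefly fine-tune on the local data of one center
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.fit(center_train_ds, epochs=5)
```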

This study inherently bears a number of limitations. All models were implemented on a server with multiple GPUs, where the different nodes were considered as centers, similar to previous FL studies [35, 36, 54,55,56,57,58,59, 74, 76]. The challenges of FL, such as local computing capacity and communication bottlenecks between centers and the local server, should be considered in real clinical scenarios. Further studies should be performed in real clinical settings using larger training datasets; the current work demonstrates a proof-of-concept, and further investigation with larger cohorts is warranted. Another limitation of FL is data preparation and preprocessing, owing to the decentralized nature of the process; however, for image preprocessing, including normalization, we used a simple method to ensure reproducibility.

Conclusion

AC/SC are key corrections required to enable quantitative PET imaging and remain challenging on CT-less scanners (PET/MRI and standalone PET-only scanners). DL-based models provide very promising results and might outperform conventional attenuation and scatter correction algorithms. At the same time, robust and generalizable DL models require heterogeneous, large, and reliable datasets from multiple centers, yet legal/ethical/privacy considerations hamper the collection of very large datasets. In this work, we developed an FL-based framework for anatomical-knowledge-free (CT-less) AC/SC of PET images, which outperformed center-based models and demonstrated comparable performance with respect to centralized DL. FL-based DL provided promising results, improving model generalizability and robustness for AC/SC of PET images without direct sharing of datasets amongst centers.