Introduction

Combined positron emission tomography/computed tomography (PET/CT) imaging is widely used in clinical practice to provide functional information on organs and tissues, as well as disease abnormalities. Static PET images provide semi-quantitative information regarding the distribution of radiotracer uptake. In oncology, fluorodeoxyglucose (FDG) PET imaging is routinely relied upon for diagnosis, staging, treatment planning, and therapy follow-up [20].

In clinical applications, nuclear medicine physicians carry out qualitative assessments of PET/CT images, which is typically sufficient for detecting and anatomically locating lesions. For radiotherapy treatment planning, radiation oncologists manually draw boundaries on fused PET/CT images to determine the gross target volume (GTV) of a tumor, in order to subsequently deliver a specific dose to the target. The boundary of the target should be defined as accurately as possible to maximize the coverage of the target and minimize the dose delivered to surrounding healthy tissues and nearby organs-at-risk (OAR).

More quantitative assessment of FDG uptake in PET images can also be performed. For instance, radiomics analyses [23] aim at extracting clinically relevant measurements through the calculation of numerous image-derived features (e.g., intensity, shape, and texture features). Such measures can subsequently be used to build models predictive of outcome or to assess changes in tumors before, during, and after treatment in order to better evaluate response to therapy [7, 37, 38]. It has been shown in all image modalities, including PET, that the choice of the segmentation method in this step of the radiomics workflow can significantly affect the extracted features [21, 31, 42, 46, 51]. In addition, it is recognized that, in the absence of fully automated segmentation, this step is a crucial bottleneck and time-consuming part of any radiomics study, preventing such a process from being scaled up to very large datasets [23]. There is therefore a need for a delineation method that is not only accurate and robust, but also fully automated.

There are several challenges pertaining to PET image segmentation [20]. First, PET images suffer from limited spatial resolution (4–5 mm) compared to CT (below 1 mm), leading to partial volume effects (PVE) that blur boundaries between adjacent functional regions and result in underestimated activity in small objects of interest. Second, the signal-to-noise ratio in PET images is inherently low and affected by a vast array of factors, such as scanner sensitivity, temporal resolution, acquisition mode, scan time, quantity and distribution of tracer, applied corrections (e.g., scatter, attenuation, randoms), and reconstruction algorithm type (e.g., resolution recovery, time of flight) and parameters (e.g., number of iterations). All these issues are exacerbated in a multi-center context, i.e., when analyzing PET images acquired using different systems, acquisition protocols, and reconstruction settings. Third, the wide variability in shapes and heterogeneity of lesion uptakes might restrict the applicability of segmentation methods to only some specific cases.

An important aspect in medical image segmentation is that the true boundary of the object of interest (ground truth) is impossible to determine without a complete histopathological analysis of an excised tumor, which can typically be performed only in a small number of cases. In PET, even with a very robust protocol, this approach can only provide approximate co-registration between the histopathology slides and the corresponding 3D PET slices [20]. One way to overcome this is the use of a consensus of several manual segmentations by experts as a surrogate of truth [20]. Unfortunately, manual segmentation is typically a labor-intensive, time-consuming task with low reproducibility, due to the high intra- and interobserver variability [19].

There have been a number of algorithms proposed for PET image segmentation, accounting to different degrees for some of the limitations referred to above [20]. For example, thresholding-based methods, the simplest image segmentation techniques, work on the assumption that different tissue types have specific uptake ranges; therefore, segmentation can be done by comparing individual voxel intensities with a set of thresholds. More advanced methods aim at exploiting statistical differences between uptake regions and surrounding tissues. These include different clustering and classification methods trained on a set of features extracted from PET images, as well as atlas-based [39, 45] and generative models such as Gaussian mixture models (GMM) [2] and the Fuzzy Locally Adaptive Bayesian (FLAB) model [16]. Numerous other common image segmentation algorithms have been evaluated for this task using PET uptake only [20]. For the vast majority of these published methods, it is usually assumed that the tumor has been previously isolated in a volume of interest (VOI), i.e., the input to the algorithm is not the entire PET image but a sub-volume containing the object of interest, usually determined manually after visually detecting the tumor uptake in the whole image. It should be emphasized that numerous approaches have tried to improve PET segmentation by considering both PET and CT modalities together, relying on co-segmentation of co-registered PET and CT images under the assumption of an (almost) perfect correspondence between the tumor functional uptake and its anatomical boundaries as determined on CT [9, 14, 34]. This assumption may not hold, as radiotracer uptake and anatomical boundaries can be uncorrelated. It also makes such methods sensitive to registration issues in PET imaging, especially in body regions affected by motion [20].

Convolutional neural networks (CNNs) have been successfully applied to different medical imaging tasks [35], such as reconstruction [32, 43], denoising [10, 44], segmentation [13, 40], and classification [6]. Most segmentation studies rely on U-Net [47], arguably the most popular network for semantic segmentation, and focus on anatomical modalities such as magnetic resonance imaging (MRI) [8, 40] and CT [41, 48]. The limited number of papers dedicated to PET segmentation usually assume a correspondence between functional and anatomical regions in combined PET/CT or PET/MRI imaging [12, 26, 52,53,54]. The ground truth is usually obtained through manual delineation performed on multimodal images (e.g., training a network to reproduce delineations performed manually by radiation oncologists on fused PET/CT images). Guo et al. [13] included PET imaging within a CNN-based multimodal image segmentation framework using PET, MR (T1 and T2), and CT images of a publicly available soft tissue sarcoma dataset of 50 patients. Gross tumor volumes were manually annotated in all four imaging modalities. Different fusion networks were used for feature-, classifier-, and decision-level fusion, demonstrating improved performance for feature-level fusion [13].

Considerably less attention has been dedicated to processing PET images as a stand-alone modality. Moreover, a majority of studies have used only datasets with small cohorts of patients from one or two centers and manual delineation as a surrogate of truth. Under these circumstances, some previously published results might be less generalizable due to the high heterogeneity of PET images caused, for instance, by scanner type, reconstruction algorithm, and applied post-processing, all of which vary across centers. Huang et al. [26] applied U-Net with minor modifications to head and neck cancer gross tumor volume segmentation on PET/CT images. Results were obtained for a dataset of 22 patients using manual segmentation as a surrogate of truth. Blanc-Durand et al. [3] evaluated U-Net for glioma segmentation on PET images with the fluoroethyl-tyrosine (FET) tracer. Their dataset contained only 37 patients with manually segmented lesions. Leung et al. [33] used a modified U-Net trained on simulated PET images and fine-tuned on a clinical dataset of 160 patients with manual delineations for lung cancer segmentation. In cervical cancer, Chen et al. [5] proposed to combine a 2D CNN with a post-processing step relying on prior anatomical information on the tumor roundness and its position relative to the bladder. The choice of a 2D network was dictated by the limited size of the available dataset, which contained 1176 slices from 50 patients, and the surrogate of truth was also obtained through manual delineation. Within the scope of the recent MICCAI challenge on automatic PET tumor functional volume delineation, a CNN-based method reached the highest score compared to twelve other approaches, among them some of the current state of the art [22].

In this paper, we focused our experiments on cervical cancer. Approximately 570,000 cases of cervical cancer and 311,000 deaths from the disease occurred in 2018, making it the fourth most common cancer in women worldwide [1]. Recently, predictive models relying on textural features extracted from tumor volumes in PET images were able to identify, with clinically relevant accuracy, the subset of patients that will suffer from recurrence after treatment [37, 46]. This obviously requires accurate delineation of the tumor volume in the PET images, which is achieved without the use of the associated CT image. Due to the anatomical proximity between the cervix and the bladder, which generally has similar FDG uptake in PET scans, conventional techniques (e.g., thresholding, region- and boundary-based methods) provide poor results if applied to the whole image without additional prior knowledge or constraints, which is why a VOI excluding physiological uptake usually needs to be provided as input to the method. For instance, the use of FLAB, as described in the radiomics study above [37], requires an expert to first manually define a VOI containing the tumor but excluding the bladder, in which FLAB is then applied to delineate the tumor. However, this step can be quite labor-intensive and time-consuming, especially when the tumor uptake and the bladder are very close to each other, hindering the potential clinical translation of these segmentation tools, and in turn the use of the predictive radiomics-based models. A fully automated segmentation step, without the manual determination of the VOI, is therefore highly desirable in that context.

The purpose of our study was thus to propose a U-Net-based model for the fully automated delineation (i.e., without the need for visual detection of their location and manual determination of a VOI) of 3D functional primary tumor volumes in PET images only, especially in the specific context of cervical cancer where the pathological uptake of interest is located close to a physiological one that should not be included (here, the bladder). A secondary objective was to train the network on a reliable ground truth obtained through accurate and robust PET semi-automated segmentation instead of manual delineation. A final objective was to train and evaluate the performance of our model under standard clinical imaging conditions, considering a multi-center patient cohort without any prior standardization in the data acquisition or image reconstruction processes.

Fig. 1 Proposed encoder-decoder network with residual blocks. The number of output channels is depicted under the blocks of each group

Materials and methods

PET images and training relying on FLAB-derived ground truth

Our first objective is to achieve fully automated determination of the functional uptake boundaries in PET images only, without relying on assumptions regarding their correspondence with anatomical boundaries, and to avoid registration issues, which are important in the case of cervical cancer due to the elastic nature of organs in this body region. We decided to train and evaluate the proposed model exclusively on real clinical images, contrary to recent recommendations by task group 211 of the AAPM (American Association of Physicists in Medicine) dedicated to PET auto-segmentation, which advises relying on a combination of simulated, phantom, and clinical images [20]. Indeed, usually the number of clinical images available for training and validation is small, and the surrogate of truth is questionable when only manual delineations from experts are available. In such a context, results obtained on large datasets of simulated and/or phantom images can indeed increase the confidence in the results obtained on a smaller number of clinical cases with a less reliable surrogate of truth. However, in the present work, we exploited a large dataset of images that were processed by experts using a semi-automated approach (see the section below detailing how the ground truth was generated) for the purpose of radiomics-based outcome modeling studies. In addition, one objective of this work is to evaluate the ability of the proposed approach to deal with fully automated delineation of a tumor uptake located close to a physiological one that should not be included. Simulated or phantom images corresponding to this specific context are currently not available in amounts large enough to train and evaluate a CNN-based approach.

We collected a dataset of 232 FDG PET images of patients from five institutions, all with histologically proven cervical cancer, with clinical stage IB1–IVB (see Supplementary Fig. 1). All images contained the abdominopelvic cavity and were acquired for diagnostic and staging purposes before chemoradiotherapy followed by brachytherapy. Collected images considerably differed in acquisition protocols (scanning duration per bed position, injected radiopharmaceutical dose) and reconstruction (algorithms, use of time of flight information, resolution modeling, voxel and matrix sizes) (see Table 1). Data were corrected for randoms and scatter in all cases. All reconstructed images were corrected for attenuation using the associated low-dose CT.

Table 1 Summary of patients, including the different characteristics of the scanners, and associated reconstruction methods and parameters

An objective of our work is to train the network using a reliable ground truth excluding the bladder uptake. Segmentation of the tumor volumes to generate the ground truth was performed on PET images using the FLAB algorithm [16] applied in a semi-automated manner: first, a VOI containing the tumor uptake and excluding the nearby bladder and other physiological uptakes was manually defined by the user. The FLAB algorithm was then run within that VOI to generate a segmentation mask that was reviewed by the user. If this mask was deemed unsatisfactory, the user had the option to re-run the algorithm after specifying different initialization parameter values in order to obtain a more satisfying result. Finally, the results for all tumors were reviewed and in some cases (< 5%) manually edited before being validated by two experts with more than 15 years of clinical practice. Given that FLAB in such a context has been demonstrated to provide accurate and reliable results in numerous studies [11, 21, 49], including for complex heterogeneous cases [17, 19] and across different scanner models and reconstruction algorithms, especially compared to manual delineation [18], we consider this ground truth sufficiently reliable for the purpose of training and evaluating the proposed approach. Although FLAB was applied only within the manually determined VOI, the resulting segmentation mask was then mapped back onto the entire PET image used as input to the network for training and testing.

Network architecture

The widely used 3D U-Net model [4] served as the basis for our network design. Although not the main objective of the present work, we nonetheless introduced three optimizations beyond the standard U-Net model (a code sketch of the corresponding building blocks is provided after the list):

1. The original U-Net uses, as its basic element, conventional convolutional blocks composed of a 3 × 3 × 3 convolution, a normalization layer (e.g., batch norm), and a ReLU activation function. We instead rely on a residual block with full pre-activation [24], supplemented by a concurrent spatial and channel squeeze and excitation (scSE) module (Fig. 1, gray blocks).

2. An important part of the proposed architecture is the integration of squeeze and excitation (SE) blocks, which compute weights for the feature maps so that the network can pay more or less attention to some of them. We implemented SE blocks within the full pre-activation residual blocks, namely a specific variant called concurrent spatial and channel SE (scSE) that has been shown to perform better for image segmentation tasks [48]. In order to include the scSE module in the residual block, we followed the same approach as in SE-ResNet architectures [25]. Due to the high memory consumption of working with 3D images, we switched from batch norm layers to instance normalization (instance norm), which has been shown to work better in a small-batch regime [50].

3. We replaced the max pooling operations in the encoder of the network with learnable downsampling blocks (Fig. 1, red blocks), which consist of a 3 × 3 × 3 strided convolutional layer, instance norm, ReLU activation, and the scSE module. Similarly, we implemented upsampling blocks in the decoder of the network using a 3 × 3 × 3 transposed convolution instead (Fig. 1, green blocks). To reduce memory consumption and increase the receptive field of the network, the first downsampling block, placed right after the input, uses a kernel size of 7 × 7 × 7. The last convolutional layer, followed by a sigmoid activation function to produce the model output, uses a kernel size of 1 × 1 × 1.
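To make the three modifications above concrete, the sketch below shows, in PyTorch, one possible implementation of the scSE module, the full pre-activation residual block, and the learnable downsampling block. It is a minimal illustration rather than the authors' actual code: the class names, the channel reduction ratio, and the exact placement of the scSE gate (applied to the residual branch, as in SE-ResNet) are our own assumptions.

```python
import torch
import torch.nn as nn


class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation (scSE) for 3D feature maps."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel excitation (cSE): global average pool -> bottleneck -> sigmoid gate
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial excitation (sSE): 1x1x1 convolution -> per-voxel sigmoid gate
        self.sse = nn.Sequential(nn.Conv3d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)


class PreActResBlock(nn.Module):
    """Full pre-activation residual block (norm -> ReLU -> conv, twice) with scSE."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.InstanceNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )
        self.scse = SCSE(channels)

    def forward(self, x):
        # scSE gates the residual branch before the identity addition (SE-ResNet style)
        return x + self.scse(self.body(x))


class DownBlock(nn.Module):
    """Learnable downsampling: strided convolution -> instance norm -> ReLU -> scSE."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride, padding=kernel_size // 2),
            nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True), SCSE(out_ch),
        )

    def forward(self, x):
        return self.block(x)
```

In this sketch, the 7 × 7 × 7 block placed right after the input would simply be instantiated as, e.g., DownBlock(1, 24, kernel_size=7), where the channel width 24 is hypothetical.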

Experimental settings

Data preprocessing

The PET images exhibited a large variability of voxel sizes (see Table 1), which can adversely affect model performance since CNNs cannot natively handle spatial dimensions with different scales. Therefore, we first interpolated all PET images and corresponding segmentation masks to a common resolution of 4 × 4 × 2 mm³ using linear interpolation. A slice thickness of 2 mm was chosen to retain small image details that could be lost at a larger voxel size. Linear interpolation was chosen after comparison with other techniques, including nearest-neighbor and B-spline, which led to decreased performance.
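As an illustration of this resampling step, the helper below (our own sketch based on SimpleITK; the authors' tooling is not specified) resamples a volume to the 4 × 4 × 2 mm³ grid with linear interpolation; using nearest-neighbor interpolation for the masks is our assumption to keep them binary.

```python
import SimpleITK as sitk


def resample_to_spacing(image, new_spacing=(4.0, 4.0, 2.0), is_mask=False):
    """Resample a 3D image to a fixed voxel spacing (mm, SimpleITK x/y/z order)."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    # Keep the physical extent: new size = old size * old spacing / new spacing
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(old_size, old_spacing, new_spacing)]
    # Linear interpolation for PET intensities; nearest neighbor keeps masks binary.
    interpolator = sitk.sitkNearestNeighbor if is_mask else sitk.sitkLinear
    return sitk.Resample(image, new_size, sitk.Transform(), interpolator,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0.0, image.GetPixelID())
```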

PET image intensities can exhibit high variability both within and between images. In order to reduce this variability before feeding the images to the CNN, we applied Z-score normalization to each scan separately, with the mean and standard deviation computed only from voxels with non-zero intensities, corresponding to the body region.
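A minimal sketch of this per-scan normalization, assuming the volume is available as a NumPy array with zero-valued background, could be:

```python
import numpy as np


def zscore_nonzero(volume):
    """Z-score normalize a PET volume using only non-zero (body) voxels for the statistics."""
    body = volume[volume > 0]
    mean, std = body.mean(), body.std()
    return (volume - mean) / (std + 1e-8)  # epsilon guards against a zero std
```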

Data augmentation

Due to the large variability in shapes, sizes, and heterogeneity of tumor uptakes in PET images, data augmentation can play a useful role in model training. To help the model learn features invariant to realistic affine transformations, we applied mirroring in the axial plane, rotations in random directions with the angle uniformly sampled from the range [5, 15] degrees along a random set of axes, and scaling with a random factor between 0.8 and 1.2. In order to increase the diversity of lesion shapes, we relied on elastic deformations. Gamma correction with γ sampled from a uniform distribution between 0.8 and 1.2, and contrast stretching between 0 and 0.8–1.2 of the original range of values, were applied to adjust voxel intensity distributions. To improve model robustness, we also added Gaussian noise to training samples, with a standard deviation equal to 0.1–0.2 of the standard deviation of the training sample. All augmentation methods were applied independently during model training with a probability of 0.2.
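The snippet below illustrates how such transforms can be applied independently with probability 0.2; the parameter ranges follow the description above, but the implementation details (axis conventions, use of scipy.ndimage) are our own assumptions, and elastic deformation, scaling, and contrast stretching are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import rotate


def maybe(p=0.2):
    """Return True with probability p; each augmentation is applied independently."""
    return np.random.rand() < p


def augment(image, mask, p=0.2):
    """Apply a subset of the augmentations described above, each with probability p."""
    if maybe(p):  # mirroring in the axial plane (assuming the last axis is left-right)
        image, mask = image[:, :, ::-1].copy(), mask[:, :, ::-1].copy()
    if maybe(p):  # rotation by 5-15 degrees around a random pair of axes
        angle = np.random.uniform(5, 15) * np.random.choice([-1, 1])
        axes = tuple(np.random.choice(3, size=2, replace=False))
        image = rotate(image, angle, axes=axes, order=1, reshape=False)
        mask = rotate(mask, angle, axes=axes, order=0, reshape=False)
    if maybe(p):  # gamma correction with gamma in [0.8, 1.2] on a [0, 1]-scaled copy
        gamma = np.random.uniform(0.8, 1.2)
        lo, hi = image.min(), image.max()
        image = ((image - lo) / (hi - lo + 1e-8)) ** gamma * (hi - lo) + lo
    if maybe(p):  # additive Gaussian noise, sigma = 0.1-0.2 of the sample's std
        sigma = np.random.uniform(0.1, 0.2) * image.std()
        image = image + np.random.normal(0.0, sigma, image.shape)
    return image, mask
```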

Training procedure

Due to the large size of PET images, we trained the model on randomly extracted patches of 128 × 128 × 64 voxels with a batch size of 2. Since all PET images corresponded to the abdominopelvic cavity with a number of axial slices ranging from 77 to 192, the chosen patch size was large enough to cover a significant part of the input PET image for all patients.

We trained the model for 400 epochs using the Adam optimizer with exponential decay rates β1 = 0.9 and β2 = 0.99 for the moment estimates. We applied a cosine annealing schedule [36], gradually reducing the learning rate from lr_max = 10⁻⁴ to lr_min = 10⁻⁶ over every 25-epoch cycle, with the adjustment performed at each epoch.
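In PyTorch terms, this training configuration could be sketched as follows; interpreting the 25-epoch cycle as cosine annealing with warm restarts is our reading of [36], and the model, data loader, and loss function are placeholders passed in as arguments.

```python
import torch


def train(model, train_loader, dice_loss, epochs=400):
    """Training loop sketch: Adam with cosine annealing and 25-epoch restarts."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
    # Learning rate decays from 1e-4 to 1e-6 over each 25-epoch cycle,
    # with the adjustment applied once per epoch.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=25, eta_min=1e-6)
    for epoch in range(epochs):
        for patches, targets in train_loader:  # random 128x128x64 patches, batch size 2
            optimizer.zero_grad()
            loss = dice_loss(model(patches), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()  # one scheduler step per epoch
```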

Since the Dice similarity coefficient (DSC) is one of the most common evaluation metrics in medical image segmentation, we trained the model with the Soft Dice loss. Following [40], in the case of binary segmentation, the loss function for one training example can be written as

$$ L(y, \hat{y}) = 1 - \frac{2\sum_{i=1}^{N} y_{i} \hat{y}_{i} + 1}{\sum_{i=1}^{N} y_{i}^{2} + \sum_{i=1}^{N} \hat{y}_{i}^{2} + 1} $$
(1)

where \(y_{i} \in \{0,1\}\) is the binary label for the i-th voxel, \(\hat{y}_{i} \in [0, 1]\) is the predicted probability for the i-th voxel, and N is the number of voxels in the training example. Additionally, we applied Laplace (additive) smoothing by adding 1 to the numerator and the denominator of the loss function to avoid division by zero in cases when only one class is represented in the training example.
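A direct translation of Eq. (1), including the +1 smoothing term, might look like the following sketch (the per-sample flattening and batch averaging are our own implementation choices):

```python
import torch


def soft_dice_loss(y_pred, y_true, smooth=1.0):
    """Soft Dice loss for binary segmentation, following Eq. (1).

    y_pred: predicted probabilities in [0, 1] (after the sigmoid), shape (batch, ...).
    y_true: binary ground-truth mask of the same shape.
    """
    y_pred = y_pred.reshape(y_pred.shape[0], -1)   # flatten each sample
    y_true = y_true.reshape(y_true.shape[0], -1)
    intersection = (y_pred * y_true).sum(dim=1)
    denominator = (y_true ** 2).sum(dim=1) + (y_pred ** 2).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return (1.0 - dice).mean()                     # average over the batch
```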

Multi-center cross-validation

Cross-validation is probably the simplest and most widely used method for estimating the expected prediction error of a model on an independent test sample [15]. Importantly, cross-validation is based on the assumption that data samples in the train and test folds are drawn from the same distribution. However, as already mentioned above, no standardization of the acquisition or reconstruction protocols was implemented across the five institutions from which the images were collected. In addition, different PET/CT imaging devices with variable overall performance were used in these centers. Therefore, in order to obtain a more reliable estimate of the model performance, we implemented 5-fold cross-validation where each fold was composed only of samples from one of the 5 centers. This simulated a “real-life scenario” in which data from one or several centers are used for training and evaluating a model that is then used in yet another center. For each cross-validation split of the data, we randomly set aside 20% of the training samples to tune the hyperparameters of the model and to assess the model performance during training.
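The fold construction amounts to leave-one-center-out splitting with an additional 20% validation hold-out drawn from the training pool; a minimal sketch is given below (the random draw of the 20% validation subset is our assumption about how the split was performed).

```python
import numpy as np


def leave_one_center_out(centers, val_fraction=0.2, seed=0):
    """Yield (train_idx, val_idx, test_idx) for leave-one-center-out cross-validation.

    centers: sequence assigning each patient to a center label.
    A fraction of the training pool is set aside for hyperparameter tuning/monitoring.
    """
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers)
    for test_center in np.unique(centers):
        test_idx = np.where(centers == test_center)[0]
        pool = rng.permutation(np.where(centers != test_center)[0])
        n_val = int(round(val_fraction * len(pool)))
        yield pool[n_val:], pool[:n_val], test_idx
```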

Evaluation metrics

Aside from the DSC metric quantifying global volume overlap, we used precision P (a.k.a. positive predictive value) and recall R (a.k.a. sensitivity) to further evaluate model performance, as recommended by TG211 [20]. DSC can be written as the harmonic mean of precision and recall:

$$ DSC = 2 \frac{P\cdot R}{P+R} $$
(2)

Using these metrics, we compared our proposed network to the standard U-Net (StdU-Net) as a baseline model. In addition, a comparison was made with the use of a fixed thresholding method (based on 40% of the maximum standardized uptake within the tumor, denoted from here onwards as T40), still widely used in the literature despite its obvious limitations [20, 22].
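For reference, the three metrics can be computed from binary masks as in the following sketch, where the dsc line reproduces Eq. (2):

```python
import numpy as np


def segmentation_metrics(pred, truth, eps=1e-8):
    """Compute DSC, precision, and recall for binary masks (NumPy arrays of 0/1)."""
    tp = np.logical_and(pred == 1, truth == 1).sum()
    fp = np.logical_and(pred == 1, truth == 0).sum()
    fn = np.logical_and(pred == 0, truth == 1).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    dsc = 2 * precision * recall / (precision + recall + eps)  # harmonic mean, Eq. (2)
    return dsc, precision, recall
```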

Results and discussion

The results of all methods are summarized in Table 2 and Supplementary Table 1. As expected, T40 obtained poor performance across all test folds compared to StdU-Net and the proposed model. On average, our proposed model outperformed its StdU-Net counterpart in terms of DSC (0.80 vs. 0.77), with a slightly smaller spread (0.03 vs. 0.05). The largest difference between the proposed method and its standard counterpart was measured on the “Brest” test fold, where StdU-Net demonstrated relatively poorer performance (0.77 vs. 0.68). However, on the other test folds, both models achieved closer results. The superiority of the proposed model was due to a better recall (0.90 vs. 0.87), whereas the difference in terms of precision was smaller (0.75 vs. 0.74). Kolmogorov-Smirnov and Wilcoxon signed-rank tests both indicated that the difference between the predictions of the two models was significant (α = 0.05) only for all evaluation metrics on the Brest test fold, and for recall on “Nantes” (see Supplementary Table 1).

Table 2 Segmentation results obtained on the different test folds with the use of cross-validation

This finding is in line with previous observations that, when properly tuned, the standard U-Net model can provide highly competitive results in many image segmentation tasks, especially in medical imaging. For example, top-ranked results were obtained in recent segmentation challenges using the ordinary U-Net model [28,29,30]. Under these circumstances, each step in the entire pipeline (data preprocessing, data augmentation, training procedure, etc.) may have a much larger impact on the model performance than a careful or complex re-design of the model architecture. For instance, we observed in our experiments that applying B-spline interpolation for image resampling instead of linear interpolation deteriorated both models’ performance by an average of nearly 8.5% on the test folds.

Both models achieved higher recall (between 0.83 and 0.93 on average) than precision (between 0.61 and 0.81) in all test folds, showing a tendency to overestimate the ground truth rather than underestimate it. One of the most challenging aspects of cervical cancer segmentation in PET images is distinguishing the tumor uptake from the adjacent bladder uptake. In all cases, even when the tumor was very close to the bladder, the proposed approach was capable of addressing this problem independently of the size, location, and shape of the tumor uptake (see examples in Fig. 3). However, the wider spread of results on the two largest test folds (Brest, Liège) (see Fig. 2) could mean that the applied data augmentation techniques are not able to completely mimic all possible variations in the presented PET images, and alternatives should be investigated, such as relying on realistic simulated PET images to add more data for training. In addition, due to our 5-fold evaluation scheme based on clinical centers, the size of the training sets (and, as a result, the variety of encountered examples) varied substantially (e.g., holding out Liège yields 146 training cases whereas holding out Nantes yields 209), which could also contribute to explaining these differences (Fig. 3).

Fig. 2 Box plots of the results on the test folds

Fig. 3 Examples of the model predictions in each test fold (axial slices). First row: input images; second row: input images with ground-truth segmentation; last row: input images with predicted segmentation. Evaluation metrics for whole scans are provided in the format (DSC, precision, recall)

Analyzing the predictions of the models, we identified a number of outliers in each test fold (see examples in Fig. 4). When considering the DSC metric, the total number of outliers in the entire dataset was 15 (12 for StdU-Net) and varied from 2 to 4 across the test folds. In most cases, the model failed to accurately segment images with relatively small tumor regions (see Fig. 4a, d). More specifically, 11 outliers corresponded to cases where the tumor size was less than 200 voxels (i.e., 6.4 cm³), whereas the average across the dataset was 1160 voxels (i.e., 37.12 cm³). The other source of errors in the model predictions is the presence of surrounding tissues with relatively high uptake that can be misclassified as tumor (see Fig. 4b, c, e).

Fig. 4 Examples of outliers in each test fold (axial slices). First row: input images; second row: input images with ground-truth segmentation; last row: input images with predicted segmentation. Evaluation metrics for whole scans are provided in the format (DSC, precision, recall)

The performance was affected by tumor volume (see Supplementary Table 2 and Supplementary Fig. 2): the lowest results were obtained in the first decile group, with DSC = 0.56, compared to the performance obtained on larger tumor volumes (significantly higher, between 0.71 and 0.85). This was driven by precision, which increased steeply with tumor size (0.44 to 0.90), whereas recall remained relatively stable (between 0.81 and 0.94). Examining the impact of tumor contrast, we found that the proposed model demonstrated the worst results on low-contrast images (see Supplementary Table 4 and Supplementary Fig. 4). The decile group with the lowest tumor contrast had DSC = 0.67 and recall = 0.77, which were significantly different from the results on the other groups (0.77 to 0.84, and 0.88 to 0.93, respectively). Investigating the relation between FIGO stages and the model performance, we did not find significant differences, with DSC ranging from 0.75 to 0.82 (see Supplementary Table 4 and Supplementary Fig. 4).

It should be emphasized that although we used a previously well-validated approach to define the ground truth, this remains a surrogate of truth. In the absence of perfectly registered histopathological spatial information, this is the best we can achieve with a single segmentation method, which obviously provides imperfect results in a small number of cases (for instance, highly heterogeneous or very small, low-contrast cases) [22]. An even better approach would consist of generating several manual delineations by experts (at least three) in addition to the results of FLAB (other algorithms with proven good performance [22] could be added too) and computing a statistical consensus of all these segmentation results. This would provide the proposed model with an even more reliable ground truth to learn from, but it would be considerably more time-consuming and tedious, especially for generating the numerous manual delineations. Alternatively, our approach of training the network on rigorously determined ground-truth masks could be reproduced by relying on other semi-automated methods with similar demonstrated levels of performance [20, 22]. Once trained, the proposed network can be applied to new data instantaneously, without the need for user intervention beyond checking and validating the output result.

Unlike [5], we did not rely on any post-processing techniques built on prior anatomical information. First, based on its segmentation results, the proposed model appears able to natively learn the anatomical position of the tumor relative to the bladder from the training samples, without additional a priori guidance. Second, the assumption of tumor roundness contradicts numerous examples in our dataset, especially those with heterogeneous distributions (see Fig. 3b, e). Although in the present case we focused on PET-only delineation, the proposed model can be trained using multiple different modalities as input. It might be beneficial in specific cases, such as dealing with small and/or low-contrast tumors, to extract additional information from associated CT or MRI modalities. For instance, in the context of the MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR), we applied a similar U-Net-based model to delineate lesions in combined PET/CT images, reaching first-rank performance with a DSC of 0.76 [27]. However, the main challenge in that case is to have a reliable ground truth determined on fused multimodal data, which could prove quite difficult in the cervical region due to anatomical deformations and differences between the PET and CT datasets.

With respect to our original objectives, the results obtained with the use of multi-center cross-validation allow us to conclude that the designed model provides similar performance on PET images from different institutions and is robust to variations in scanner types, reconstruction algorithms, and post-processing methods. In addition, it allows for fully automated delineation of the tumor uptake without the need to exclude the bladder uptake, either manually or through the incorporation of additional prior information or constraints.

Conclusion

In this work, we trained a modified U-Net model for fully automatic tumor uptake delineation in PET images in a multi-center context, without the need for additional anatomical information or prior constraints. The ability of the proposed model to learn and perform well for this task was demonstrated in PET images of 232 patients collected from five institutions. The ground-truth labels for all patients were generated by experts with the use of a semi-automated algorithm, to reduce observer-related variability and to avoid relying on manual delineations. We presented a versatile pipeline that includes appropriate data preprocessing and augmentation, design of the model architecture beyond the standard U-Net model, and an optimized training procedure. We mimicked a typical clinical scenario and conducted all experiments in a multi-center context. The designed model obtained good average accuracy for all considered institutions with very small standard deviation (DSC of 0.80 ± 0.03) without requiring any change in the pipeline. It slightly improved accuracy over the standard U-Net model, although both approaches provided good results and largely outperformed the fixed threshold approach. The described approach managed to avoid including the bladder uptake in the resulting segmentation without the need for additional anatomical information (for instance using the CT image) or priors such as shape constraints, and can therefore achieve fully automated delineation of the tumor uptake without the need for any user intervention. It can be implemented with minimal modifications to solve a variety of other segmentation tasks in different medical imaging modalities and could facilitate the deployment of fully automated radiomics pipelines.