Background

Positron emission tomography (PET) and computed tomography (CT) are oncological imaging modalities extensively used for staging and treatment response assessment in lymphoma [1]. Alone and when combined with existing prognostic indicators, quantitative imaging characteristics extracted from PET scans have been shown to improve risk stratification. Baseline metabolic tumor volume (MTV) is a quantitative measure, obtained from [18F]-fluorodeoxyglucose ([18F]FDG) PET scans, which quantifies tumor burden with FDG uptake [2]. Several studies demonstrated that high MTV before starting treatment is significantly correlated with a shorter progression-free survival (PFS) and/or overall survival (OS) [3, 4]. These findings imply that MTV is a promising prognostic factor in tailoring lymphoma therapy. However, quantitative PET measures are susceptible to image quality in varying degrees, including the different PET reconstruction methods [5,6,7]. As such, several papers clearly demonstrated the high sensitivity of intensity measures like the maximal standardized uptake value (SUVmax), SUVpeak and multiple textural features for reconstruction setting [11, 14, 15]. As a result, new image reconstruction methods, such as the point spread function (PSF), pose uncertainties about their impact on the different quantitative PET metrics, including volumetric features like MTV [8, 9]. Despite the auspicious potential clinical value of MTV as a prognostic and response predictive marker in lymphoma, susceptibility to reconstruction setting and thus inability to reliably reflect (changes in) tumor burden precludes any clinical implementation.

In this light, we aim to evaluate whether, and to which extent, MTV is sensitive to reconstruction setting including the impact of segmentation method. Therefore, in this study we assess the variability in MTV using 9 semiautomatic segmentation methods to variations in 3 different image reconstruction methods. Throughout the literature, the ComBat method has been proposed as a solution to reduce the impact of the image preprocessing effect and it is currently used in various contexts [10]. Originally, ComBat has been used in genomics as a harmonization strategy to deal with the alterations caused by batch effects [11]. When applying this method, the batch effect is discarded as all data are realigned in a single space and biological information remains unchanged. In image analysis, we can use ComBat to compensate for the variability among features generated by the scanner/protocol effect. However, ComBat harmonization is not always correctly used. Therefore, in this paper we also analyze the ability of using ComBat to remove variability in MTV and whether different implementations of ComBat are able to further improve the alignment of the data.

Methods

Study population

In this study, we used baseline [18F]FDG PET/CT scans from 19 patients from two different datasets. The first dataset consists of 14 patients scanned at Amsterdam UMC which were retrospectively obtained from ongoing studies with a waiver for informed consent from the Medical Ethics Review Committee of Amsterdam UMC, location VUmc (IRB2018.029). From these 14 patients, 9 patients were diagnosed with DLBCL, 3 were diagnosed with Hodgkin lymphoma, 2 were diagnosed with T cell lymphoma and 1was diagnosed with post-transplant lymphoproliferative disorder (PTLD). The second dataset consists of 5 DLBCL patients which were recruited at the outpatient clinics of the department of Hematology of the Amsterdam UMC, location VUmc, and the outpatient clinics of the department of Hematology of the Amstelland Hospital in Amstelveen (IRB2019.278). These trials enrolled patients aged 18 years or older diagnosed with DLBCL with at least one tumor with a diameter equal to or more than 3 cm. Patients who had undergone chemotherapy in the past 4 weeks showed multiple malignancies, metal implants or pregnant/lactating patients were excluded from the study.

Quality control of scans

The quality control (QC) check of the scans followed the EANM guidelines: The liver SUVmean should be between 1.3 and 3.0, and the plasma glucose should be lower than 11 mmol/L [12]. Furthermore, scans were excluded during the QC if the scans were incomplete and/or the total image activity (MBq) was not between 50 and 80% of the total injected FDG activity and/or any DICOM data were missing as in [2].

Image processing

In order to analyze the impact of reconstruction methods, we used [18F]FDG PET baseline scans derived from three different reconstruction methods: one reconstruction which followed locally clinically preferred protocols (high resolution or HR reconstruction), another reconstruction following EARL1 standards (EARL1 reconstruction) and a third reconstruction following EARL2 standards (EARL2 reconstruction). EARL2 standards were established with the implementation of PSF into the original EARL image reconstruction capabilities [13]. PSF is a resolution modeling algorithm which improves image resolution and contrast [14]. In comparison with EARL standards, the most substantial configuration to the HR reconstruction is a pixel spacing parameter of 2 mm instead of 4 mm and a higher spatial resolution. Table 1 contains a summary of the parameters related to the reconstruction methods  used in this study.

Table 1 Summary of parameters characterizing each reconstruction method

The MTV of lesions was calculated and analyzed using ACCURATE software [15]. ACCURATE enables the calculation of MTV of lesions on PET scans automatically and allows the users to apply multiple segmentation methods or volumes of interest (VOI) [15]. Nineteen lymphoma patients were included in the analysis. For each PET baseline study, 3 different reconstructions were investigated (EARL1, EARL2 and HR). We delineated on average 3 lesions per PET scan, which resulted in a total of 56 lesions across all of the included patients. Nine different semiautomatic segmentation methods were applied to delineate each of these lesions. Since each PET scan consisted of 3 reconstructions, a total of 1512 delineations and MTV measurements were included for the analysis.

For each reconstructed scan, the following segmentation methods were applied: segmentation based on fixed thresholds using standardized uptake value of 4.0 (SUV4.0), and SUV of 2.5 (SUV2.5), 41% of SUVmax (41M), segmentation based on adaptive thresholding using 50% of peak voxel value adapted for local background (A50P), majority vote approaches for segmenting voxels detected by at least 2 (MV2) and 3 (MV3) out of these 4 methods [16], lesional-based methods that identify the optimal method based on SUVmax (L2A, L2B) [17] and a contrast oriented method, Nestle segmentation [18]. For the L2A method, a SUV4.0 contour is used for SUVmax > 10 and MV3 for SUVmax < 10. For L2B, MV2 instead of SUV4.0 in case of SUVmax > 10 was used. The majority vote approaches are based upon the agreements between SUV4.0, SUV2.5, A50P and 41M. A detailed description of the methods can be found in [19].

ComBat harmonization

ComBat harmonization was applied to align the MTV measurements from the three different reconstructions used in this study. As aforementioned, ComBat was first described in the field of genomics to remove batch effects [10, 20]. The ComBat method assumes that the deviation introduced by the batch effect is removed once the means and the variances are standardized across the different batches. The value of the feature Y for a specific VOI j and scanner i is expressed as follows:

$$Y_{ij} = \alpha + \gamma_{i} + \delta_{i} \varepsilon_{ij} ,$$
(1)

where \(\alpha\) represents the mean value of the feature Y, \(\gamma\) represents the additive effect of the scanner, \(\delta\) is the multiplicative effect of the scanner, and \(\varepsilon\) is the error. In this case, the feature Y would be the MTV and the VOI j the delineated lesion. This harmonization method uses the empirical Bayes framework to estimate the batch/scanner effect terms, \(\gamma_{i}\) and \(\delta_{i}\). Subsequently, the corrected Y value \(Y_{ij}^{{{\text{ComBat}}}}\) is calculated in Eq. (2) where \(\hat{\alpha }\), \(\widehat{{\gamma_{i} }}\) and \(\widehat{{\delta_{i} }}\) are estimations of parameters \(\alpha\), \(\gamma_{i}\) and \(\delta_{i},\) respectively.

$$Y_{ij}^{{{\text{ComBat}}}} = \frac{{Y_{ij} - \hat{\alpha } - \widehat{{\gamma_{i} }}}}{{\widehat{{\delta_{i} }}}} + \hat{\alpha }$$
(2)

To understand how the implementation of ComBat is affecting our MTV values, we implemented multiple versions and compared them to the original data. Initially, we applied the regular implementation of ComBat which derives the transformation by aligning the mean and standard deviation of the data groups pertaining to different reconstructions (‘Regular ComBat’). This implementation of ComBat assumes a normal distribution of the data. Since medical data are rarely normally distributed, we also implemented the version of ComBat which applies the logarithmic transformation to attain normal distributions (‘Log-transformed ComBat’). When applying such transformation, the returned values have already been exponentially transformed to be comparable with the rest. Details of these two ComBat versions can be found in Table 2. Another approach to address the non-normal data distribution is to standardize the median and interquartile range instead of the mean and the standard deviation. Furthermore, we investigated whether excluding outliers affects the harmonization of the data. ComBat was applied using R version 4.0.5 based on the code provided by Fortin et al. [21].

Table 2 Description of characteristics of ComBat implementations

Statistical analysis

We first compared the MTV values across the 9 different segmentation methods. For each one of the lesions, we compared the MTVs obtained from EARL2 or HR reconstructions to those from EARL1 using MTV volume ratios. Since EARL1 is used as the reference reconstruction method, in these ratios, EARL1 results are given in the denominator as shown in the following equations:

$${\text{MTV}}\;{\text{Ratio}}\;{\text{EARL}}2 = \frac{{{\text{EARL2 }}\;{\text{MTV}}}}{{{\text{EARL}}1 \;{\text{MTV}}}}$$
(3)
$${\text{MTV}}\;{\text{Ratio }}\;{\text{HR}} = \frac{{{\text{HR}}\;{\text{MTV}}}}{{{\text{EARL1 }}\;{\text{MTV}}}}$$
(4)

Equations (3) and (4) were calculated across all of the 9 segmentations which resulted in a MTV ratio value per lesion for each segmentation method for both EARL2 and HR reconstructions. MTV ratios were used to compare the effect of different reconstructions across multiple segmentations before applying ComBat and after applying ComBat.

Results

The MTV analysis was carried out by calculating the MTV volume ratios (see Eqs. 3 and 4 for reference). In Fig. 1, the MTV ratios are plotted per segmentation method for both EARL2 (a) and HR reconstructions (b). A perfect alignment of MTV values between reconstructions is given by an MTV ratio of 1. Both plots show dissimilarities with EARL1 reconstruction; however, MTV from the HR reconstruction presents larger variability than MTV from EARL2. Differences between reconstructions are readily apparent for segmentation methods 41M, A50P, MV3 and Nestle, where the MTV ratio boxplots stay below 1 (MTV ratio EARL2: median of 0.73, 0.86, 0.80 and 0.82, respectively), indicating that the volume of the lesions segmented under these settings is smaller than the segmented volume with EARL1 reconstructions. These findings are also presented in Table 3, where we displayed the median and IQR values for MTV ratios for each of the segmentation methods. In addition, for SUV2.5, the size and amount of outliers outnumbered the rest of segmentation methods (Additional file 1: Fig. S1). MTVs from the SUV4.0 segmentation method showed the best alignment between reconstructions (MTV ratio EARL2: median = 1.01, interquartile range (IQR) = [0.96, 1.10]). Generally, fixed threshold methods were less sensitive to changes in reconstruction settings (MTV ratio EARL2: median of 0.96 for MV2, L2A and L2B).

Fig. 1
figure 1

MTV ratios across segmentation methods. Each boxplot illustrates the set of MTV ratios obtained with a particular segmentation method: 41M, A50P, L2A, L2B, MV2, MV3, NESTLE, SUV2.5 or SUV4.0. MTV ratios are given for a EARL2 reconstructions and b HR reconstructions. * Implies few outliers not displayed

Table 3 MTV ratios (median and IQR for each segmentation method) for EARL2 and HR reconstructions

ComBat transformation was applied to compensate for differences in reconstruction methods. Different versions of ComBat were implemented. In this paper, we focused on the comparison between the Regular ComBat and Log-transformed ComBat (see Table 2 for reference) because the latter is generally less sensitive to outliers and will, by definition, prevent the generation of negative values. Table 4 illustrates the median, mean, standard deviation (SD) and IQR of the MTVs before and after ComBat for 41M segmentation. SD and IQR are given because the data are not normally distributed. An improved alignment between reconstructions after using ComBat was observed. There is large variability in the data across the 3 reconstruction methods. This is also shown in Fig. 2. In Fig. 2, MTV values per reconstruction setting are presented using 41M segmentation method for all three situations: before ComBat (a), after Regular ComBat (b) and after Log-transformed ComBat (c). In Fig. 2b, negative MTVs were obtained for the HR reconstruction when using Regular ComBat. This was also observed for other segmentation methods such as SUV2.5, MV3 and A50P (Additional file 2: Fig. S2).

Table 4 Transformation of MTV values (mL) with ComBat harmonization
Fig. 2
figure 2

MTVs obtained using 41M segmentation across 3 different reconstructions: EARL1, EARL2 and HR. a MTVs before ComBat harmonization. b MTVs after ComBat using non-transformed distribution (Regular ComBat). c MTVs after ComBat using log-transformed distribution (Log-transformed ComBat). Regular ComBat leads to negative volume values for the clinical reconstruction unlike Log-transformed ComBat which led to positive-only volumes

The transformation of the MTV ratios per segmentation can be found in Fig. 3. Usually, an agreement of MTVs among different reconstructions can be observed post-ComBat harmonization for most segmentation methods. For most of the segmentation methods, the post-ComBat boxplots (dark blue) IQR included the value of 1, especially after applying the Log-transformed ComBat. However, these boxplots have larger IQR in comparison with the boxplots for the values before ComBat (light blue). This shows that ComBat increases the variability of the MTV parameter and consequently worsens the precision. The transformation was considerably better when applying Log-transformed ComBat instead of Regular ComBat. Log-transformed ComBat led to higher accuracy with median values closer to 1 and an acceptable increase in variability when compared to Regular ComBat.

Fig. 3
figure 3

MTV ratio across 9 segmentation methods. MTV ratio calculated by comparing MTVs of EARL2 to those of EARL1, with EARL1 as the denominator for a particular segmentation method: 41M, A50P, L2A, L2B, MV2, MV3, NESTLE, SUV2.5 or SUV4.0. For each segmentation method, 2 boxplots are shown. The light blue boxplot represents data without ComBat harmonization, while the dark blue boxplot is obtained after ComBat harmonization. MTV ratio equal to 1 indicates a perfect alignment between reconstructions. a MTV ratio before and after ComBat using non-transformed distribution (Regular ComBat) b MTV ratio before and after ComBat using log-transformed distribution (Log-transformed ComBat)

Discussion

The aim of this study was to evaluate the impact of image reconstruction methods onto MTV calculations in baseline [18F]FDG PET scans of lymphoma patients. For this study, we focused on two aspects: the reconstruction and the segmentation method. Specifically, we analyzed the interaction of three reconstruction methods with nine different segmentation methods and how these conditions affected MTV. At the moment, there is no consensus about which methods and settings are optimal for PET MTV quantification. However, the scientific community has acknowledged the need to generate robust and reproducible MTV measurements for prognostic and clinical applications [22,23,24,25].

The results of this study present significant inter-reconstruction variability for MTV calculations. The three different reconstruction methods which were evaluated (EARL1, EARL2 and HR) resulted, in some cases, in large differences in MTV values. Volumes derived from EARL2 and HR reconstructions have a tendency to be smaller in size when compared to EARL1. This is in concordance with a previous study where they found that PSF reconstructions led to a decrease in MTV in 83% of the analyzed lesions [26]. Furthermore, accurate MTV quantifications are also influenced by the segmentation method used. To our knowledge, this is the first study validating the effect of segmentation algorithms for different reconstruction methods, but the variation of MTV absolute values among segmentation methods has been previously reported in several studies [11, 25, 27]. Our results show that some segmentation methods are less sensitive to changes in reconstruction methods than others. The most robust (against reconstruction) segmentation method for MTV calculations was SUV4.0. In a recent work on DLBCL subjects, SUV4.0 was found to perform the best in deriving MTV compared to 6 other segmentation methods [19]. Results from MV2 segmentation were comparable to those of SUV4.0. MV2 segmentation tends to provide a fairly accurate segmentation of the lesions as it delineates voxels included in at least two out of four methods: SUV4.0, SUV2.5, A50P or 41% [11].

In the second stage of this study, we implemented ComBat with the aim of removing the variability introduced by the reconstructions. In a multicenter study on breast cancer [18F]FDG PET images, ComBat was successfully used to realign SUV measurements and multiple textural features [11]. Moreover, this approach has been validated for scanner effect removal in other imaging technologies such as CT [28] and MRI [29]. A better alignment in MTV between the different reconstruction methods was indeed accomplished once ComBat was applied. Our data showed very high values which caused large variability within reconstruction methods (Table 4). These extremely large values are generally originated by the presence of bulky tumors and the flooding effect caused by some segmentation methods. To deal with these extreme outliers, we used Log-transformed ComBat which achieves an improved alignment of MTV values between reconstructions compared to Regular ComBat. Some other versions of ComBat were implemented in the attempt to further remove this variability (using the median and the IQR in the transformation); however, these did not show a significant improvement. Despite a better alignment after ComBat, we still observed a decrease in accuracy and a worsened precision when comparing EARL2 and HR to EARL1. Therefore, an upfront harmonization of image quality and use of a consensus segmentation method are highly preferred. Furthermore, use of regular version of ComBat for the transformation resulted in negative MTV values for several segmentation methods particularly when using the HR reconstruction. Bearing this in mind, we believe ComBat should be used with caution. Adjusting the parameters of this method is important in order to avoid incoherent results and to mitigate any possible side effects of the ComBat harmonization.

The overall uncertainty and variability of PET extracted features can often be explained by the technical aspects involved in imaging acquisition. As such, unexpected deviations in volumetric features like MTV have to be carefully considered and should not be hastily adopted for response prediction. Novel technological implementations in reconstruction methods are significantly improving image quality standards; however, they have the effect of generating discrepancies in multicenter studies as not all PET systems can be equally equipped with these technologies. Lack of standardization is, therefore, becoming the main issue in MTV analysis of [18F]FDG PET-CT images. To address this matter, multiple medical societies like the European Association of Nuclear Medicine or the Society of Nuclear Medicine and Molecular Imaging are advocating for the inclusion of harmonized practices which can alleviate the variability and promote robust tumor quantification. Finally and most importantly, the harmonization of these methods is an essential step toward the implementation of MTV as a prognostic factor in clinical practice.

Conclusion

This work corroborates the fact that robustness of MTV depends on both segmentation method and reconstruction methods. We found SUV4.0 to be the recommended method for lesion delineation, showing least sensitivity to image reconstruction settings. Moreover, ComBat was partially able to reduce reconstruction dependent MTV variability, provided a log-transformation to account for the non-normal distribution of MTVs is included. In conclusion, herein we demonstrate the impact of the imaging technical aspects in PET derived MTV and we highlight the importance of standardization in imaging workflows in order to enhance reproducibility for multicenter studies and, ultimately, the implementation of MTV for prognosis in clinical practice.