Introduction

In clinical oncology, medical imaging technologies have evolved over the years from simple diagnostic tools into a source of valuable clinical information [1, 2]. In addition, the emergence of new technologies and the requirements of precision medicine have given rise to the promising field of radiomics [3, 4]. Radiomics is an image data-mining framework that makes it possible to extract a variety of quantitative imaging features from medical images and identify potential relationships with clinical and biological findings. As a result, radiomics may increase the precision of diagnosis, prediction, and prognosis to improve clinical decision-making for many diseases, including lymphoma [1, 5,6,7,8,9,10,11,12].

[18F]FDG or other PET radiopharmaceutical uptake patterns within a tumor have been characterized by identifying imaging features (intensity, heterogeneity, and shape) reflecting biological characteristics, such as cellular density, proliferation rate, hypoxia, necrosis, and angiogenesis [13, 14]. Several attempts have been made to evaluate the relationship between quantitative parameters of [18F]FDG uptake and the treatment response of lymphoma [15,16,17,18,19,20,21,22,23]. Parvez et al. [16] found that metabolic tumor volume (MTV) correlates with response to therapy in a retrospective study of 82 aggressive B-cell lymphoma patients. However, MTV represents the total volume of tumor activity and does not reflect the spatial distribution, heterogeneity, or shape of lesions. Lue et al. [17] and Tatsumi et al. [18] reported that radiomics features of [18F]FDG PET show promising predictive value for treatment response in patients with Hodgkin and follicular lymphoma, respectively. In a retrospective study of 30 patients, Sun et al. [24] found that the standardized uptake value (SUV), the MTV, some texture features, and the tumor location were useful parameters for interim response prediction in primary gastrointestinal diffuse large B-cell lymphoma (DLBCL).

Based on the literature, it remains to be established how important different biomarkers are for predicting outcomes in lymphoma. For instance, in diffuse large B-cell lymphoma, Adams et al. [25] found that the National Comprehensive Cancer Network International Prognostic Index was more accurate at predicting progression-free survival than whole-body total MTV, while Cottereau et al. [26] demonstrated the opposite. Such studies are based on static PET acquisitions that measure radiopharmaceutical uptake heterogeneity at a single time point only. However, knowledge of how regional heterogeneity in the molecular features of cancer cells changes over time can have significant implications for tumor response to treatment and patient outcomes [27].

Alternatively, dynamic PET imaging, employed primarily in the research setting, can track PET radiopharmaceutical biodistribution in the body over time, enabling dynamic analysis, including full kinetic modeling, and potentially enhancing clinical tasks such as therapy response monitoring [28, 29]. As such, dynamic features derived from kinetic maps might contain additional information concerning the behavior of the tumor. However, only a few published papers have evaluated dynamic features, owing to the limitations of dynamic acquisition. In patients with non-small cell lung cancer (NSCLC), two studies investigated the correlation between dynamic and static radiomics features [30, 31]. Tixier et al. [30] analyzed static and parametric PET images with quantitative parameters (MTV, SUVmax, SUVmean, heterogeneity) in 20 therapy-naive NSCLCs. They reported similar correlations and minor differences for metrics such as entropy and zone percentage, which quantify the heterogeneity of the intra-tumor uptake spatial distribution. However, they suggested further validation studies to compare the predictive or prognostic value of static versus parametric images for patient response or overall survival in NSCLC. Noortman et al. [31] evaluated a more extensive feature set (spatial intensity, shape, and texture radiomics features) derived from static and dynamic [18F]FDG PET of thirty-five NSCLC patients. They indicated that dynamic gray-level co-occurrence matrix (GLCM) features contain limited additional information compared to static radiomic features. However, the number of patients in the dataset was limited, and it is difficult to draw a general conclusion. It is noteworthy that the aforementioned studies [30, 31] investigated dynamic features only in lung cancer and did not predict response to therapy; therefore, further investigation is needed to evaluate chemotherapy response prediction using dynamic features in lymphoma patients.

Based on previous reports, certain dynamic features appear to offer more information than static features, which could lead to improved predictions. In the current study, we sought to investigate the performance of dynamic features derived from the dual-time-point (DTP) Ki to develop pre-therapy [18F]FDG PET/CT prediction models for response to chemotherapy in lymphoma patients.

Materials and methods

Figure 1 summarizes the various steps involved in the study design. First, the Ki map was generated from DTP imaging using pre-treatment PET data. Next, radiomics features were extracted from the regions of interest (ROIs) segmented from the SUV PET image and Ki map. Afterward, ComBat harmonization was applied to each feature set to adjust for the batch effect caused by the multi-center dataset. Next, the response to treatment was evaluated according to the post-treatment PET scan. Finally, predictive models were developed to predict the treatment response of lymphoma (Hodgkin and non-Hodgkin) patients.

Fig. 1

Five-step flowchart for the present study. (Step I) The Ki map was generated based on DTP imaging of pre-treatment PET data. (Step II) The SUV image and Ki map were segmented to define the VOIs. (Step III) The LIFEx software was used to extract static and dynamic features. (Step IV) ComBat harmonization was applied to each feature set to correct for the batch effect. A post-treatment PET scan was then used to assess the response to treatment. (Step V) Predictive models were developed to predict treatment outcomes for lymphoma patients, and different classification metrics were reported to evaluate the models

PET/CT imaging protocol and patient selection

We searched for lymphoma patients with PET/CT scans acquired from January 2013 until March 2022. We reviewed the database records of around 4000 patients at two independent institutions, referred to as Centers 1 and 2. Medical records were carefully reviewed to identify patients who had pre- and post-treatment PET/CT scans, with the pre-treatment images acquired using a DTP protocol and a lesion within the field of view (FOV) of the delayed scan. The inclusion and exclusion criteria are presented in Fig. 2. Overall, 26 patients from Center 1 and 19 from Center 2 were included.

Fig. 2

Inclusion and exclusion criteria followed in patient selection. From an initial 3980 patients, 45 patients with a total of 126 lesions (75 responding and 51 non-responding to treatment) were retained. The criteria were: (1) availability of pre- and post-treatment PET/CT scans, (2) DTP PET image acquisition for the initial PET scan, and (3) a visible lesion in the delayed image of the pre-treatment PET

All patients underwent a second PET/CT evaluation after the first line of chemotherapy, specifically the doxorubicin (adriamycin), bleomycin, vinblastine, and dacarbazine (ABVD) regimen in Hodgkin lymphoma, and the rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone (R-CHOP) regimen in non-Hodgkin lymphoma. Response to treatment was evaluated on a lesion basis according to the Deauville criteria reported on the post-treatment PET scan [32]. A total of 126 lesions were individually classified as responding (n = 75) vs. non-responding (n = 51). The clinical characteristics of the patients are reported in Table 1. Before treatment, all patients underwent DTP [18F]FDG PET/CT scans; the key acquisition parameters of the datasets are also presented in Table 1.

Table 1 Summary of clinical characteristics of patients and image acquisition parameters in different centers

Generation of the Ki images

The image of the metabolic uptake rate was generated from the DTP scan using an in-house MATLAB code [33, 34]. In short, the Ki map was defined as the slope of the Patlak equation between two time points, t1 (the routine static image acquired 60 min post-injection) and t2 (the time of the delayed scan), as given in Eq. (1):

$$K_{i} = \frac{\dfrac{C_{\text{PET}}(t_{2})}{C_{\text{P}}(t_{2})} - \dfrac{C_{\text{PET}}(t_{1})}{C_{\text{P}}(t_{1})}}{\dfrac{\int_{0}^{t_{2}} C_{\text{P}}(\tau)\,\text{d}\tau}{C_{\text{P}}(t_{2})} - \dfrac{\int_{0}^{t_{1}} C_{\text{P}}(\tau)\,\text{d}\tau}{C_{\text{P}}(t_{1})}} \tag{1}$$

where CPET(t) and CP(t) denote the radiopharmaceutical concentrations at time t in the tissue and plasma ROIs, respectively. We derived a subject-specific input function for each patient by scaling the population-based input function described by Vriens et al. [35] to the patient’s image-derived blood pool activity measured on the routine static PET image. Spherical VOIs were manually delineated in the left ventricle and atrium at a sufficient distance from the myocardium, with 15 mm and 10 mm diameters, respectively. The activities measured in the two VOIs were then averaged.
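
As a rough illustration of Eq. (1), the two-point Patlak slope can be sketched in Python with NumPy. The function name `dtp_ki`, its arguments, and the trapezoidal integration of the input function are assumptions made for this sketch; it is not the in-house MATLAB code used in the study.

```python
import numpy as np

def dtp_ki(c_pet_t1, c_pet_t2, t_grid, c_p, t1, t2):
    """Two-point Patlak slope of Eq. (1), evaluated voxel-wise.

    c_pet_t1, c_pet_t2 : tissue activity images at t1 and t2 (same shape)
    t_grid, c_p        : sampled plasma input function C_P(t), t_grid[0] == 0
    t1, t2             : scan times, in the same units as t_grid
    """
    t_grid = np.asarray(t_grid, dtype=float)
    c_p = np.asarray(c_p, dtype=float)
    # Running integral of C_P from 0 to each grid time (trapezoidal rule).
    steps = 0.5 * (c_p[1:] + c_p[:-1]) * np.diff(t_grid)
    cum_int = np.concatenate(([0.0], np.cumsum(steps)))
    # Plasma concentration and its integral at the two scan times.
    cp1, cp2 = np.interp([t1, t2], t_grid, c_p)
    int1, int2 = np.interp([t1, t2], t_grid, cum_int)
    numerator = np.asarray(c_pet_t2) / cp2 - np.asarray(c_pet_t1) / cp1
    denominator = int2 / cp2 - int1 / cp1
    return numerator / denominator
```

For tissue that follows the Patlak model exactly, this slope returns the underlying uptake rate; in practice the result depends on how well the scaled population-based input function matches the patient.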

In most cases, the patient was taken off the bed following the whole-body (WB) PET prior to the delayed scan. Repositioning is therefore a possible source of error in DTP evaluations. To mitigate it, tumor-specific rigid registration between the WB and delayed PET images, based on the CT images, was performed to maximize the accuracy of the Ki map.

Image segmentation and feature extraction

A threshold value of 30% of the maximum SUV was used to determine the VOI on the static images [36]. Then, the same VOI was manually delineated on the Ki images and modified by erasing or adding voxels to ensure the entire tumor was included in the VOI. Finally, all VOIs were reviewed by two nuclear medicine specialists. Figure 3 shows examples of segmented tumors on the parametric Ki and SUV images.
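
The fixed-threshold step can be sketched as follows. This is a minimal illustration in Python with NumPy; the function name `threshold_voi`, the use of a rough seed mask around the lesion, and the inclusive comparison are assumptions, since the text does not specify them.

```python
import numpy as np

def threshold_voi(suv, seed_mask, fraction=0.30):
    """Boolean VOI: voxels inside seed_mask with SUV >= fraction * SUVmax.

    suv       : SUV image (any shape)
    seed_mask : rough boolean region around one lesion (hypothetical input)
    fraction  : threshold as a fraction of the lesion SUVmax (30% here)
    """
    suv = np.asarray(suv, dtype=float)
    seed_mask = np.asarray(seed_mask, dtype=bool)
    # SUVmax is taken inside the seed region only, so that a hotter
    # neighbouring structure does not raise the threshold.
    suv_max = suv[seed_mask].max()
    return seed_mask & (suv >= fraction * suv_max)
```

The resulting mask would still need the manual review and voxel-level corrections described above.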

Fig. 3

Examples of SUV images (top) and the corresponding DTP Ki images (bottom) showing segmented lesions

The LIFEx package (version 7.0.15) [37], which is standardized through the image biomarker standardization initiative (IBSI) [38], was used to extract radiomics features from the PET images. First, all the Ki maps were multiplied by 100 to obtain the same scale as the SUV images. Then, the SUV and Ki images were discretized using 64 bins, with the minimum and maximum image intensity values set to 0 and 20. Additionally, the images were resampled to a voxel size of 4 × 4 × 4 mm³. A total of 65 radiomics features, including gray-level co-occurrence matrix (GLCM, seven features), neighborhood gray-level difference matrix (NGLDM, three features), gray-level run length matrix (GLRLM, eleven features), gray-level zone length matrix (GLZLM, eleven features), shape (five features), histogram (four features), conventional (twelve features), and discretized (twelve features) indices, were extracted for each lesion in both the SUV and Ki images. Full details about the features are presented in Table 2.
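
The intensity discretization described above (64 bins between 0 and 20, i.e., a bin width of 20/64 = 0.3125) can be illustrated with a short Python sketch. The function name `discretize`, the clipping of out-of-range values, and the rounding convention are assumptions; the exact convention used internally by LIFEx is not stated in the text.

```python
import numpy as np

def discretize(image, n_bins=64, lo=0.0, hi=20.0):
    """Map intensities in [lo, hi] to integer gray levels 1..n_bins."""
    # Assumption: values outside [lo, hi] are clipped to the range limits.
    image = np.clip(np.asarray(image, dtype=float), lo, hi)
    bin_width = (hi - lo) / n_bins          # 20 / 64 = 0.3125
    levels = np.floor((image - lo) / bin_width).astype(int) + 1
    return np.minimum(levels, n_bins)       # put image == hi in the top bin
```

With this convention, a Ki value of 0.04 is first scaled to 4.0 and then falls in the same gray level as an SUV of 4.0, which is the purpose of the multiplication by 100.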

Table 2 Radiomic features extracted from the SUV and Ki images

Harmonization

Harmonization was performed for all PET parameters using the ComBat harmonization method [39] to eliminate multicenter effects from the radiomics features. ComBat removes batch effects within an empirical Bayes framework by estimating location and scale parameters (mean and variance) for each variable [39,40,41].
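
To make the location-scale idea concrete, the sketch below aligns the per-center mean and variance of each feature in Python with NumPy. It is a deliberately simplified stand-in: it omits the empirical Bayes shrinkage and the covariate terms of ComBat, and the function name `location_scale_harmonize` is hypothetical. It should not be read as the harmonization actually applied in this study.

```python
import numpy as np

def location_scale_harmonize(features, batch):
    """Align per-batch mean and variance of each feature (no EB shrinkage).

    features : array (n_samples, n_features)
    batch    : array (n_samples,) of batch labels, e.g. center IDs
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    grand_mean = features.mean(axis=0)
    # Remove each batch's own mean (location effect).
    resid = np.empty_like(features)
    for b in labels:
        rows = batch == b
        resid[rows] = features[rows] - features[rows].mean(axis=0)
    # Pooled within-batch standard deviation, used as the shared target scale.
    dof = features.shape[0] - labels.size
    pooled_sd = np.sqrt((resid ** 2).sum(axis=0) / dof)
    out = np.empty_like(features)
    for b in labels:
        rows = batch == b
        sd = resid[rows].std(axis=0, ddof=1)
        sd = np.where(sd > 0, sd, 1.0)  # guard against constant features
        out[rows] = resid[rows] / sd * pooled_sd + grand_mean
    return out
```

ComBat differs mainly in that the per-batch location and scale estimates are shrunk toward common priors, which stabilizes them when a center contributes few samples.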

Univariate analysis

We calculated correlation coefficients between static and DTP features using Spearman’s rank method to identify features that might provide additional information. Receiver operating characteristic (ROC) curve analysis was used to assess the predictive power of each radiomics feature before and after ComBat harmonization. The AUCs of DTP and static features, and the AUCs of features before and after ComBat harmonization, were compared using DeLong’s test. All statistical analyses were performed in MedCalc (version 20.0.14; MedCalc Software Bvba). To assess the significance of the features, we also applied the Benjamini–Hochberg (BH) false discovery rate (FDR) correction for multiple comparisons and report q values. A q value of less than 0.05 defined statistical significance.
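
The Benjamini–Hochberg adjustment mentioned above can be sketched in a few lines of Python with NumPy. The study itself used MedCalc; the function name `benjamini_hochberg` is an illustrative assumption.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted q values in the original order of p_values."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p value by m / k.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q
```

A feature would then be called significant when its q value is below 0.05, matching the threshold stated above.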

Multivariate machine learning analysis

We developed various models using the DTP and static features before and after ComBat harmonization. Our models were: (1) H_DTP (harmonized radiomics features extracted from the DTP Ki map), (2) H_Static (harmonized features extracted from the SUV images), (3) H_DTP + Static (combined harmonized features extracted from the DTP Ki map and the SUV images), (4) Non-H_DTP (non-harmonized features extracted from the DTP Ki map), (5) Non-H_Static (non-harmonized features extracted from the SUV images), and (6) Non-H_DTP + Static (combined non-harmonized features extracted from the DTP Ki map and the SUV images).

First, we selected the most effective features by applying the minimum redundancy maximum relevance (mRMR) approach [42] to the input data. This algorithm selects a subset of features that simultaneously has maximum relevance to the patient’s outcome and minimum redundancy among the selected features. Next, the classifiers were built in Python 3.7.4 using the eXtreme Gradient Boosting (XGBoost, version 1.6.1) machine learning algorithm [43]. XGBoost is an ensemble learning algorithm that combines many gradient-boosted decision trees. Finally, three different radiomic models based on the (1) static, (2) DTP, and (3) combined DTP and static PET features were established to predict therapy response in lymphoma patients.
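
The greedy logic behind mRMR can be illustrated with the following Python sketch. It is a simplified variant that uses absolute Pearson correlation for both relevance and redundancy, whereas the original mRMR formulation is based on mutual information; the function name `mrmr_select` and the difference-form score are assumptions, and this is not necessarily the implementation used in the study.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR-style selection using absolute Pearson correlation.

    X : array (n_samples, n_features), y : array (n_samples,), k : subset size
    Returns the list of selected column indices, in order of selection.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Relevance: |correlation| between each feature and the outcome.
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    # Redundancy: |correlation| between every pair of features.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        # Score = relevance minus mean redundancy with already chosen features.
        scores = [relevance[j] - redundancy[j, selected].mean()
                  for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

In this scheme, a near-duplicate of an already selected feature is penalized heavily, so a less correlated but still informative feature tends to be chosen instead.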

We randomly divided the data into two groups: 80% for model training and internal validation and 20% for testing. The test data were not used during model development. Within the training set, 80% was used to derive the models and the remaining 20% was used for validation. We repeatedly trained a bootstrapped model with 1000 repetitions to find the optimal hyperparameters of the models based on the random search method and the AUC. Then, the optimal model was tested on the remaining 20% of the dataset (unseen during model training). This process was repeated 100 times to ensure that the results were repeatable for the different models. The mean ROC and the mean, standard deviation, and 95% confidence interval (CI) of the AUC, accuracy (ACC), sensitivity (SEN), and specificity (SPE) were used to assess the predictive performance of the models. We used the Mann–Whitney test to determine significant differences between the models.
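
The summary statistics reported over the repeated splits can be sketched as below in Python with NumPy. The helper names `sens_spec_acc` and `summarize` are hypothetical, and the normal-approximation confidence interval of the mean is an assumption, since the text does not state how the 95% CI was computed.

```python
import numpy as np

def sens_spec_acc(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels (1 = responder)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / y_true.size

def summarize(values):
    """Mean, sample SD, and normal-approximation 95% CI of the mean."""
    v = np.asarray(values, dtype=float)
    mean, sd = v.mean(), v.std(ddof=1)
    half_width = 1.96 * sd / np.sqrt(v.size)
    return mean, sd, (mean - half_width, mean + half_width)
```

Calling `summarize` on the per-repetition values of one metric, for example the 100 test-set AUCs of a model, would give its mean, standard deviation, and interval.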

Results

Univariate analysis

Spearman’s correlation matrix of static and DTP radiomics features is shown in Fig. 4. Using the Spearman’s correlation coefficient (ρ), the features with low (ρ < 0.5), moderate (0.5 < ρ < 0.7), and high (ρ > 0.7) correlation are reported in Table 3. DTP features with ρ < 0.7 contain additional information compared to static ones.

Fig. 4

Spearman correlation matrix of dynamic and static features. Dynamic features with ρ < 0.7 contain additional information compared to static ones

Table 3 Correlation of static and DTP features using Spearman’s correlation coefficient (ρ)

The AUC, p value, and q value for each DTP and static feature before and after harmonization are reported in Additional file 1: Fig. S1. The ROC curves of DTP and static features, before and after harmonization, were compared using the DeLong test, with significance defined by the false discovery rate (FDR) q value (< 0.05) obtained with the Benjamini–Hochberg (BH) procedure, as shown in Additional file 1: Fig. S2. Table 4 shows the number of features whose performance (as AUC) significantly increased, significantly decreased, or did not change after harmonization for both DTP and static features. No significant difference was observed between the ROC curves of DTP and static radiomics features. When comparing the ROC curves before and after harmonization, most of the harmonized features showed neither a decrease nor an increase in performance relative to the non-harmonized features.

Table 4 Results of the Delong test comparing AUCs of the DTP and static features with and without ComBat harmonization

Multivariate analysis

The mRMR algorithm selected ten of the 65 features for each of the static and DTP models. For the combined static + DTP model, the mRMR algorithm was applied to the 20 features formed by the top ten DTP and top ten static features, and ten features were selected. All of the selected features for each model are presented in Table 5.

Table 5 Ten top features selected by mRMR algorithms for each model

Figure 5 shows a heat map of the AUC, accuracy (ACC), sensitivity (SEN), and specificity (SPE) of the DTP, static, and DTP + static models for predicting treatment response, before and after harmonization. The confidence intervals (CI), means, and standard deviations (mean ± STD) of AUC, ACC, SEN, and SPE for these models are summarized in Table 6. Figure 6 presents the ROC curves of these models for the test set. All models achieved their highest AUCs after harmonization. Before and after harmonization, the mean AUC of the DTP model was 0.76 ± 0.02 and 0.87 ± 0.03, respectively. For the static model, these values were 0.79 ± 0.02 and 0.88 ± 0.01, respectively, and for the DTP + static model, they were 0.81 ± 0.03 and 0.97 ± 0.02, respectively. Among the models, the combination of harmonized DTP and static features significantly improved performance, with AUC = 0.97 ± 0.02, ACC = 0.89 ± 0.05, SEN = 0.92 ± 0.09, and SPE = 0.88 ± 0.05. The 95% CIs for these parameters were 0.96–0.97, 0.88–0.90, 0.90–0.93, and 0.87–0.89, respectively. The p values comparing the models in terms of AUC, ACC, SEN, and SPE are shown in Fig. 7. The majority of model comparisons showed significant differences (p < 0.05).

Fig. 5

Heatmap of the performance of the DTP, static, and DTP + static models with and without ComBat harmonization; ACC: accuracy, AUC: area under the curve, SEN: sensitivity, SPE: specificity

Table 6 Mean, STD, and confidence interval (CI) of the area under the curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE) in the test set for the different models studied
Fig. 6

The ROC curves of the different models for prediction of response to therapy (a) before and (b) after ComBat harmonization. Solid lines are the mean ROC and the shaded regions represent one standard deviation around the average

Fig. 7

p Values for the comparison between the different models concerning the area under the curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE)

Discussion

Accurate prediction of response will improve treatment strategies and therefore optimize therapeutic results. In this study, we developed radiomics models for predicting the response of lesions to chemotherapy in lymphoma patients using the XGBoost classifier, based on static and DTP PET features selected by the mRMR algorithm. To this end, we extracted radiomics features from the SUV image and the DTP Ki map, namely static and DTP features, respectively, and compared their performance in predicting treatment response. The present study investigated the potential information that DTP features may add to traditional features derived from static PET images in 126 lesions of 45 lymphoma patients. Several studies have shown the significant potential of DTP imaging for generating parametric Ki images [33, 34, 44]. In the absence of list mode data, Van den Hoff et al. [34] proposed a novel method to determine the metabolic uptake rate utilizing DTP images. Based on that study, we generated the Ki map by determining the slope between the two time points. Only a few studies have investigated the performance of dynamic features. Tixier et al. [30] evaluated several parameters (SUVmax, SUVmean, and MTV) and heterogeneity quantification in NSCLC. They reported high correlations for all parameters between SUV and parametric images, which indicates that heterogeneity quantification on parametric images does not offer additional information compared to static SUV images. However, in another study, Noortman et al. [31] found that certain dynamic GLCM radiomics features carry different information than traditional radiomics features in patients with NSCLC. In our study, 12 dynamic features contained additional information compared to static ones (see features with ρ < 0.5 in Table 3).

On the other hand, moderately correlated features provide a small amount of additional information (see features with 0.5 < ρ < 0.7 in Table 3). In agreement with the studies of Tixier et al. [30] and Noortman et al. [31], most dynamic features showed moderate and high correlations with static ones. However, the feature correlations reported in those studies are not directly comparable to our results, because different types of lesions and acquisition protocols were investigated. We estimated the Ki map using the DTP method to achieve a simple and clinically feasible approach for deriving dynamic features. Several studies evaluated conventional PET metrics (SUV, MTV, and TLG) and showed their predictive value for treatment response in lymphoma patients [20,21,22, 45,46,47,48].

In addition, some studies investigated the role of PET radiomics features in predicting treatment response in lymphoma. Lue et al. [17] reported that wavelet HIR_GLRMPET and RLNU_GLRMCT are independent predictive factors for treatment response in patients with Hodgkin lymphoma. Tatsumi et al. [18] demonstrated that LGZE might help predict the treatment response of follicular lymphoma.

The univariate analysis in our study showed that some radiomics features might be predictive. For harmonized DTP features, the highest AUCs were achieved for GLCM_Energy, GLCM_Entropy, and uniformity (AUC = 0.73, p value = 0.0001, q value < 0.0005). Among static features, GLRLM_RLNU (AUC = 0.75, p value = 0.0001, q value = 0.0007) was found to be the most predictive feature. Based on the univariate results, there was no significant difference between the performance of most DTP and static radiomics features.

Several studies have developed radiomic models to predict response to therapy in lymphoma patients. In a retrospective study that included 57 bulky malignant lymphoma patients, Bouallègue et al. [23] presented a model incorporating static PET texture and shape features that achieved the highest predictive value, with an ROC AUC of 0.82 and 80% accuracy, compared with other factors, including MTV and histology. Coskun et al. [19] developed a logistic regression model with cross-validation to predict treatment response using static PET features in DLBCL. They reported an accuracy of 0.87 and an AUC of 0.81. Finally, Jimenez et al. [15] proposed a radiomics model to predict ibrutinib response in lymphoma patients using static PET features, trained by repeated cross-validation nested with the Gentle AdaBoost ensemble algorithm. They achieved an AUC of 0.86 (sensitivity, 92.9%; specificity, 81.4%; p < 0.001). Our study showed AUC = 0.88 for static features when taking advantage of ComBat harmonization.

Since performing dynamic acquisition has limitations in clinical practice, the predictive value of dynamic features was not considered previously. We used clinically feasible DTP PET imaging to obtain the Ki map. Our study sheds light on the possibility of predicting treatment response using dynamic features derived by the DTP method. The results showed that the DTP model yielded classification performance (AUC = 0.87) similar to that of the static model (AUC = 0.88). Moreover, since some DTP and static features had low and moderate correlations, they could serve as different markers. Previous studies reported improved performance when combining different markers, such as PET features and clinical data [9, 49]. Although adding clinical data was beyond the scope of the present investigation, we took the further step of building a novel model by combining DTP features with static ones. We found that this integrated model predicted treatment response with the highest AUC value (0.97). These results indicate that the H_DTP + Static model provided more accurate information and improved performance over the other models we tried. Also, the performance of the multivariate models was improved compared to the univariate radiomics analysis. Due to the dual-center nature of our study, we used ComBat harmonization to address the plausible batch effect. The univariate AUCs of most DTP and static features did not differ significantly before and after harmonization, and the AUCs of some features decreased. However, as shown in Fig. 6, we observed higher AUCs and improvements in the predictive power of all multivariate models after harmonization, which is congruent with previous studies [50].

There were some limitations in this study. Foremost, the study cohort is relatively small; we used datasets from only two centers, and external validation on data from other centers was lacking. Although we used the bootstrap technique to evaluate our models in order to address the limited sample size, further clinical studies are needed to verify our results with more extensive clinical databases. Moreover, obtaining full-time input function information for the standard Patlak method requires either arterial blood sampling or a long scan covering the early time points of the blood pool. We used a scaled population-based input function for Patlak analysis to overcome this challenge, although the lack of ground truth information might have influenced the results. Another limitation of this study was the lack of multiple segmentations to assess the effect of segmentation variability on the extracted features. Finally, clinical data (patients' history and demographics, laboratory tests) were not considered in the model as the focus was on imaging features.

Conclusion

Our results indicate the potential of combining dynamic and static features from FDG PET images to predict the treatment response in lymphoma patients. We used the dual-time-point framework to obtain the Ki maps and extract dynamic features, an approach that can be applied in routine clinical practice. We demonstrated that the highest predictive performance of the XGBoost classifier with the mRMR algorithm was achieved when DTP and static features from FDG PET images were combined. We also demonstrated that ComBat harmonization significantly improved the performance of the static, DTP, and combined static and DTP-based radiomics models in predicting therapy response in lymphoma patients.