Introduction

The rapid development of immune checkpoint inhibitor (ICI) agents targeting programmed cell death protein 1 (PD-1) or programmed cell death protein ligand 1 (PD-L1) has granted immunotherapy a key role in the treatment of advanced lung cancer in the past ten years [1]. The National Comprehensive Cancer Network (NCCN) has recommended PD-1/PD-L1 blockade therapy for locally advanced and metastatic non-small-cell lung cancer (NSCLC) without targetable genetic mutations [2]. Nonetheless, according to previous literature, only 15–40% of patients benefit from this therapy [3, 4]. Identifying more efficient predictive biomarkers of immunotherapy response is therefore crucial.

PD-L1 expression status has been clinically used to select candidates for PD-1/PD-L1 ICIs, but its efficacy as a predictive biomarker is controversial, partly because of nonuniform quantification and intratumoral heterogeneity [5,6,7,8,9]. In addition, its assessment requires an invasive procedure that is not suitable for all patients. Therefore, a noninvasive biomarker is still needed for the precise stratification of patients receiving immunotherapy.

Poor prognosis has been associated with several clinical features, such as late lines of immunotherapy and the presence of distant metastases before treatment, but those findings were based on relatively small cohorts [10, 11]. Computed tomography (CT) image-based radiomics has shown promising results in evaluating tumor responses to immunotherapy, but challenges remain in aggregating lesion-level predictions into a patient-level outcome [12]. A typical lesion-wise radiomic analysis assigns the same patient response to all lesions, yet it fails to account for effects induced by the unique immune-related response, specifically, dissociated responses [13]. Other patient-level approaches include selecting the largest lesion or averaging features across lesions, but the outcomes are far from satisfactory [11, 14]. Chang et al. used tumor volume-weighted radiomic features to establish the patient-level outcome in brain metastases with preliminary success [15]. However, this weighting strategy does not consider factors other than tumor size. Recent studies show that an attention-based multiple instance learning (a-MIL) technique may help build predictive models by assigning greater weight to the subregions most relevant to a specific classification task [16,17,18,19]. Therefore, we aimed to explore whether using the adaptively weighted sum of radiomic features from multiple intrapulmonary lesions on pre-treatment CT scans can enhance the performance of radiomic models for predicting long-term progression-free survival (PFS ≥ 6 months) benefit of immunotherapy. Furthermore, we wanted to examine whether clinical variables complement the radiomic model.

Materials and methods

Patient population and clinical data collection

This single-center study retrospectively reviewed a total of 309 patients at our hospital from June 2015 to November 2020 with pathologically confirmed advanced lung cancer treated with at least one cycle of either PD-1 or PD-L1 ICI therapy. The inclusion and exclusion criteria are detailed in Fig. 1. The final 233 patients were randomly split into a training cohort (n = 185) and an internal test cohort (n = 48). Patient clinical data were collected from electronic medical records and are displayed in Table 1. The endpoint of our study was the classification of patients into the durable clinical benefit (DCB: PFS ≥ 6 months) or the non-DCB (NDCB: PFS < 6 months) group. PFS was defined as the time from the initiation of ICI therapy to the progression of lung cancer or the death of the patient. The progression status was retrieved from the medical records and confirmed by a senior pulmonologist (M.C.) and a senior radiologist (X.S.) according to the response evaluation criteria in solid tumors (RECIST) version 1.1. Detailed methods for quantifying PD-L1 expression are described in Supplementary Note 1.
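The endpoint definition above amounts to a simple thresholding rule. The following is a minimal sketch, assuming PFS is recorded in months; the function name `label_response` and the example values are hypothetical and are not taken from the study data, and the handling of censored follow-up is not addressed here.

```python
DCB_THRESHOLD_MONTHS = 6.0  # cut-off stated in the text: PFS >= 6 months


def label_response(pfs_months: float) -> str:
    """Return 'DCB' for PFS >= 6 months, otherwise 'NDCB'."""
    return "DCB" if pfs_months >= DCB_THRESHOLD_MONTHS else "NDCB"


# Hypothetical PFS values in months, for illustration only
example_pfs = [19.1, 5.0, 6.0, 2.3]
labels = [label_response(p) for p in example_pfs]
print(labels)
```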

Fig. 1

Flow chart shows inclusion/exclusion and data split for the single-center cohort. Abbreviations: PD-1, programmed cell death protein-1; PD-L1, programmed cell death protein ligand-1; RECIST, response evaluation criteria in solid tumors

Table 1 The definitions and scoring rules for clinical characteristics

Image acquisition and lesion delineation

Contrast-enhanced chest CT scans were carried out from the lung apex to the lung base using various sets of CT scanners as shown in Table 2. The contrast-enhanced scans were acquired at 35 s after the injection of 80–100 mL of nonionic contrast material (Ultravist 300, Bayer Schering Pharma AG; or Ioversol 320, Hengrui Pharmaceutical Co., Ltd) intravenously at a rate of 2.5 mL/s.

Table 2 Parameters of CT scanners

The anonymized original Digital Imaging and Communications in Medicine (DICOM) images were normalized and standardized before being imported into the Dr. Wise research platform, on which lesions were automatically delineated using algorithms described in the previous literature [20]. The results were confirmed and modified on axial images slice by slice with mediastinal (width: 450 HU, level: 50 HU) and lung (width: 1200 HU, level: −600 HU) window settings by a senior thoracic radiologist (L.S.) without knowledge of response status. The volume of interest (VOI) was drawn based on the criteria described in Supplementary Note 2 and Supplementary Fig. 1. A maximum of five lesions were retained for each patient based on their longest diameters. For the purpose of reliability analysis, 30 randomly selected cases were also delineated by a second senior radiologist (W.S.).

Feature extraction and aggregation methods

A total of 1454 radiomic features were extracted for each VOI using the PyRadiomics (version 3.0.1) package in Python (version 3.8). More details can be found in Supplementary Note 3.

All features were normalized before being aggregated to form the patient-level radiomic features in the following three ways:

  a. Largest lesion approach (LL): radiomic features of the largest lesion (by 3D diameter).

  b. Average-multiple-lesion approach (ML): mean of the radiomic features across up to five lesions.

  c. Weighted-multiple-lesion approach (WL): sum of the adaptively weighted radiomic features of up to five lesions. The weight coefficients were determined by an a-MIL model that was developed in the training cohort. The weight coefficients for lesions in the test cohort were generated by the attention-based algorithm embedded in the a-MIL model. A detailed illustration of this technique is given in Supplementary Note 4, and the Python code is available on GitHub at https://github.com/zhjtwx/immunity_WL for reproducibility purposes.
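The three aggregation schemes can be sketched as follows. This is an illustrative sketch only, not the code from the repository above: the per-lesion attention scores are assumed to be supplied by an already trained a-MIL model, and normalizing them with a softmax is an assumption borrowed from common attention-based MIL formulations. All function names and the toy values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def largest_lesion(features: Sequence[Vector], diameters: Sequence[float]) -> Vector:
    """LL: feature vector of the lesion with the largest 3D diameter."""
    idx = max(range(len(features)), key=lambda i: diameters[i])
    return list(features[idx])


def mean_lesion(features: Sequence[Vector]) -> Vector:
    """ML: element-wise mean of the feature vectors over all lesions."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw attention scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_lesion(features: Sequence[Vector], attention_scores: Sequence[float]) -> Vector:
    """WL: attention-weighted sum of the per-lesion feature vectors.

    The scores are assumed to come from a trained a-MIL model; the softmax
    normalization is an assumption of this sketch.
    """
    weights = softmax(attention_scores)
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*features)]


# Toy patient with two lesions and two normalized features (hypothetical values)
features = [[1.0, 2.0], [3.0, 4.0]]
diameters = [30.0, 12.0]             # lesion 0 is the larger one
scores = [0.0, math.log(3.0)]        # softmax gives weights of about 0.25 and 0.75

ll = largest_lesion(features, diameters)
ml = mean_lesion(features)
wl = weighted_lesion(features, scores)
print(ll, ml, wl)
```

With equal attention scores the WL vector reduces to the ML vector, so the averaging approach is a special case of the weighted one in this sketch.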

Feature selection and model development

The feature selection and model development were conducted in the training cohort. The patient-level radiomic features underwent three preliminary procedures in sequential order, as shown in Fig. 2d. Firstly, the features that were robust to variations in contour delineation (intraclass correlation coefficient (ICC) of inter-observer variability > 0.8) were retained. Secondly, features that were significantly related to the clinical outcome (p value < 0.05 in the Mann–Whitney U test) were retained. Lastly, Pearson’s correlation coefficient (PCC) was calculated for each pair of features, and for each correlated pair (PCC ≥ 0.85), the feature with the lower p value was retained. Both the radiomic and clinical features were then fed to an L2-regularized (ridge) logistic regression to select the representative features (coefficient > 0.001) that were associated with the clinical outcome. Unsupervised hierarchical clustering was performed on these representative radiomic features and visualized as heatmaps to examine their structure (Fig. 2e).
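The three preliminary filtering steps can be sketched as below. This is a simplified illustration under stated assumptions: the ICC and the Mann–Whitney U p value of every feature are taken as precomputed inputs, correlated pairs are resolved greedily in order of increasing p value, and the absolute value of the PCC is compared with the threshold. None of these implementation choices is confirmed by the text, features are assumed to be non-constant, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_features(
    values: Dict[str, List[float]],   # feature name -> values over patients
    icc: Dict[str, float],            # precomputed inter-observer ICC per feature
    p_value: Dict[str, float],        # precomputed Mann-Whitney U p value per feature
    icc_min: float = 0.8,
    alpha: float = 0.05,
    pcc_max: float = 0.85,
) -> List[str]:
    # Step 1: keep features that are robust to contour variation (ICC > 0.8)
    kept = [f for f in values if icc[f] > icc_min]
    # Step 2: keep features associated with the outcome (p < 0.05)
    kept = [f for f in kept if p_value[f] < alpha]
    # Step 3: visit features from lowest to highest p value and drop any
    # feature that is highly correlated with one already selected
    selected: List[str] = []
    for f in sorted(kept, key=lambda name: p_value[name]):
        if all(abs(pearson(values[f], values[g])) < pcc_max for g in selected):
            selected.append(f)
    return selected


# Hypothetical toy data: five features measured on five patients
values = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly collinear with f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],    # fails the ICC filter
    "f4": [1.0, 3.0, 2.0, 5.0, 4.0],    # fails the p-value filter
    "f5": [3.0, 1.0, 4.0, 1.0, 5.0],    # weakly correlated with f1
}
icc = {"f1": 0.95, "f2": 0.90, "f3": 0.60, "f4": 0.85, "f5": 0.90}
p_value = {"f1": 0.01, "f2": 0.03, "f3": 0.001, "f4": 0.20, "f5": 0.04}

selected = select_features(values, icc, p_value)
print(selected)
```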

Fig. 2

Workflow of model construction. a Multiple lesions were delineated on the Deep-wise labeling system online. b Radiomic features were extracted from multiple lesions. c The three patient-level analyses were performed with (1) radiomic feature from the largest lesion (LL), (2) average radiomic features from multiple lesions per patient (ML), and (3) weighted sum of radiomic features from multiple lesions per patient (WL). See the “Materials and methods” section for details. d The robust and uncorrelated radiomic features were then selected through three preliminary steps. The L2-ridge regression was the last feature selection step to reduce redundancy. e Representative features were displayed using unsupervised hierarchical clustering. f The model was cross-validated in the training cohort and then validated in the independent test cohort. g Receiver operating characteristic (ROC) curves were drawn. *: denotes the p value of radiomic features between the durable clinical benefit (DCB) and the non-DCB group in the training cohort examined by the Mann–Whitney U test. Abbreviations: ICC, intraclass correlation coefficient; PCC, Pearson’s correlation coefficient

The radiomic model was then built based on the selected radiomic features using five machine learning classifiers: logistic regression (LR), support vector machines (SVM), extreme gradient boosting (Xgboost), multilayer perceptron (MLP), and linear discriminant analysis (LD). Fivefold cross-validation was applied: in each of the five iterations, the model was fitted on four sub-datasets and evaluated on the remaining one, and the performance averaged over the five iterations on the four fitting sub-datasets and on the held-out sub-dataset was reported as the training and validation performance, respectively. Similar approaches were used to build the integrated model with the additional selected clinical features. All models were validated in the test cohort. The entire workflow is depicted in Fig. 2.
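The fivefold partition of the 185 training patients can be sketched as follows. This is a minimal, unstratified sketch; whether the folds in the study were stratified by outcome is not stated, and the function name `k_fold_indices` and the seed are hypothetical choices made for illustration.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal parts
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


# 185 training patients: each iteration fits on 148 and validates on 37
splits = k_fold_indices(185, k=5)
for train, validation in splits:
    print(len(train), len(validation))
```

The training and validation metrics would then be averaged over the five `(train, validation)` pairs.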

Statistical analysis

Differences in all variables between the DCB and the NDCB groups were assessed using the Mann–Whitney U test for continuous variables and the chi-square test or Fisher’s exact test for categorical variables as appropriate. The reliability of segmentation was analyzed using the Dice similarity coefficient, and for radiomic features, the ICC and Bland-Altman plots were used. Gradient-weighted Class Activation Mapping was used to visualize the representative radiomic features.
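The Dice similarity coefficient used for the segmentation reliability analysis is defined as 2|A ∩ B| / (|A| + |B|) for two segmentations A and B. A minimal sketch, assuming each segmentation is represented as a set of voxel coordinates; the function name and the toy masks are hypothetical.

```python
from typing import Iterable, Tuple

Voxel = Tuple[int, int, int]


def dice_coefficient(mask_a: Iterable[Voxel], mask_b: Iterable[Voxel]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Two hypothetical readers who agree on three of their four voxels
reader_1 = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
reader_2 = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}
dice = dice_coefficient(reader_1, reader_2)
print(dice)
```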

The diagnostic performance was evaluated by classification sensitivity, specificity, accuracy, F1 score, positive predictive value, negative predictive value, and area under the curve (AUC). A two-sided 95% confidence interval for AUC was constructed following the approach of Hanley and McNeil [21]. Performance among different models was compared with the DeLong test. Calibration curves and decision curve analysis were used to evaluate the predictive accuracy and clinical utility of the models.
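The AUC and its Hanley and McNeil confidence interval can be computed as sketched below. The AUC is estimated as the Mann–Whitney probability that a randomly chosen DCB patient scores higher than a randomly chosen NDCB patient, and the standard error uses the Hanley and McNeil (1982) approximation with Q1 = A/(2 − A) and Q2 = 2A²/(1 + A). The function names and example scores are hypothetical; the study itself used R, so this Python sketch is only illustrative.

```python
import math
from typing import Sequence, Tuple


def auc_mann_whitney(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hanley_mcneil_ci(auc: float, n_pos: int, n_neg: int, z: float = 1.96) -> Tuple[float, float]:
    """Two-sided confidence interval for the AUC (z = 1.96 for 95%)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    # Clip to the valid AUC range [0, 1]
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Hypothetical predicted probabilities for four DCB and four NDCB patients
dcb_scores = [0.9, 0.8, 0.7, 0.4]
ndcb_scores = [0.6, 0.5, 0.3, 0.2]

auc = auc_mann_whitney(dcb_scores, ndcb_scores)
lower, upper = hanley_mcneil_ci(auc, len(dcb_scores), len(ndcb_scores))
print(auc, lower, upper)
```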

The Kaplan–Meier (K-M) survival curve method and Cox proportional hazards model were used to analyze PFS. To generate a binary classification, the cutoff thresholds for the prediction probabilities generated by all models were established using the maximum Youden index in the training cohort. Different curves were compared using the log-rank test. A subgroup analysis was performed to evaluate the model’s stratification ability in patients using pembrolizumab. A two-sided p value < 0.05 was used to indicate statistical significance throughout the study. All statistical analyses were performed with the R statistical package.
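The cutoff selection by the maximum Youden index (J = sensitivity + specificity − 1) can be sketched as follows. This is an illustrative sketch, assuming labels coded as 1 for DCB and 0 for NDCB, a patient being classified as DCB when the score is at or above the threshold, and ties in J being resolved in favor of the lowest threshold; the function name and the example data are hypothetical.

```python
from typing import Sequence, Tuple


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):   # every observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:              # strict '>' keeps the lowest threshold on ties
            best_j, best_t = j, t
    return best_t, best_j


# Hypothetical training-cohort probabilities: 1 = DCB, 0 = NDCB
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, j_stat = youden_threshold(scores, labels)
print(threshold, j_stat)
```

The threshold chosen in the training cohort would then be applied unchanged to the test cohort to form the two groups compared by the log-rank test.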

Results

Patient cohort

Patients with DCB accounted for 62.9% and 64.9% of the patients in the training and test cohorts, respectively. The median PFS in the entire cohort was 7.7 months (training: 7.6 months; test: 8.4 months). The data for the PD-L1 expression level were available in 89 patients (38% of the entire cohort). The clinical and demographic characteristics of the patients in our analysis are summarized in Tables 3 and 4. There were no significant differences in the demographic and clinical characteristics between the two cohorts (p > 0.05). Early-line therapy, Kirsten rat sarcoma 2 viral oncogene homologue (KRAS) mutation, and combination with chemotherapy were significantly associated with DCB, while the presence of bone metastasis before immunotherapy was associated with NDCB. For chemotherapy agents, compared with the single-drug regimen, the dual-drug regimen was significantly associated with better clinical outcomes in the test cohort (Supplementary Table 1). Nevertheless, no single chemotherapy regimen showed remarkable superiority over another. Elevated expression of PD-L1 was associated with epidermal growth factor receptor (EGFR) wild-type status and KRAS mutation (Supplementary Table 2).

Table 3 Demographic characteristics and the therapy regimen of patients in the analysis
Table 4 Immunopathologic features and metastasis statuses of tumor before the initiation of immunotherapy

Representative features

There was high agreement between the segmentations drawn by the two radiologists (Dice coefficient of 0.89 [95% CI 0.87–0.91] for the largest lesion and 0.90 [95% CI 0.87–0.93] for multiple lesions). Five representative clinical features were identified: age (≤ 60 or > 60), clinical stage (III or IV), bone metastasis, line of therapy (first, second, or third+), and the use of pembrolizumab. Nineteen, twenty-one, and twenty-five radiomic features were selected using the LL, ML, and WL approaches, respectively. The number of features that remained at every selection step is shown in Supplementary Fig. 2. ICCs and the Bland-Altman plots showed excellent robustness of the selected features (Supplementary Table 3 and Supplementary Fig. 3). The unsupervised clustering analysis of all representative features resulted in three clusters. Features showed differential expression between the DCB and NDCB cases in both cohorts (Supplementary Fig. 4). A complete list of the representative features and their coefficients in the integrated models with different feature construction methods is shown in Supplementary Table 4. In the WL-based integrated model, the most influential clinical feature was the line of therapy, and among the radiomic features, GLCM_Correlation and GLDM_SDHGLE showed the largest coefficients in the negative and positive directions, respectively (Supplementary Fig. 5).

Comparison of model performance

Among the three aggregation methods, the AUCs of the WL-based models were superior to those of the other two approaches in all classifiers but MLP, in which the performance of the WL-based radiomic model was not significantly better than that of the LL-based radiomic model in the test cohort (Fig. 3).

Fig. 3

Comparison of the area under the curves (AUCs) of different predictive models based on three patient-level analyses. a, b The AUCs of the radiomic models for discerning DCB (PFS ≥ 6 months) from NDCB (PFS < 6 months) in the cross-validation (a) and the test cohort (b). c, d The AUCs of the integrated models discerning DCB from NDCB in the cross-validation (c) and the test cohort (d). The Y axis represents AUCs and the X axis represents different classifiers. The bar in green denotes the model’s performance based on WL-based radiomic features. P values were obtained by comparing the AUC of the integrated model with the AUCs of the other two models [LL-(blue) and ML-(orange) based radiomic models] using the DeLong test. Note: * denotes p value < 0.05, ** denotes p value < 0.01, *** denotes p value < 0.001. Abbreviations: LL, largest-lesion approach; ML, average-multiple-lesion approach; WL, weighted-multiple-lesion approach; DCB, durable clinical benefit; NDCB, non-durable clinical benefit; LR, logistic regression; SVM, support vector machines; Xgboost, extreme gradient boosting; MLP, multilayer perceptron; LD, linear discriminant analysis

The WL-based radiomic model with the logistic regression classifier yielded the most balanced performance in discerning DCB from NDCB, with AUCs of 0.87 [0.84–0.89], 0.75 [0.68–0.82], and 0.80 [0.68–0.92] in the training, validation, and test cohorts, respectively (Supplementary Table 5). With the addition of five clinical characteristics, the WL-based integrated model reached a significantly better AUC than the radiomic model and the clinical model, as presented in Table 5 and Fig. 4. The calibration and decision curve analysis curves for the above models are shown in Supplementary Figs. 6 and 7. The performance of the a-MIL model for differentiating DCB from NDCB is given in Supplementary Note 5.

Table 5 The performance of the logistic regression-based integrated model was compared with the clinical and the radiomic model in each of three feature construction approaches
Fig. 4

Displays of the receiver operating characteristic (ROC) curves of clinical, radiomic, and integrated models. a–c ROC curves of clinical (red), radiomic (blue) and integrated (green) logistic regression model built with the LL-based radiomic features (a), ML-based radiomic features (b) and WL-based radiomic features (c) for differentiating DCB from NDCB in the cross-validation cohort. d–f ROC curves of clinical (red), radiomic (blue) and integrated (green) logistic regression model built with the LL-based radiomic features (d), ML-based radiomic features (e), and WL-based radiomic features (f) for differentiating DCB from NDCB in the test cohort

Figure 5 illustrates the discriminability of the log GLDM (SDHGLE) feature and the wavelet GLCM (correlation) feature for representative DCB and NDCB patients before ICI therapy. We observed a higher textural heterogeneity pattern on lesions of the DCB patient compared with the NDCB patient. In addition, heavier weights were attributed to the smaller lesion in most circumstances.

Fig. 5

Baseline chest contrast-enhanced CT (CECT) images and class activation heatmaps for two lung cancer patients with durable clinical benefit (DCB) and non-DCB (NDCB). a This DCB case used penpulimab (PFS = 19.1 months). b This NDCB case used sintilimab (PFS = 5 months). L1 and L2 each represent an individual tumor lesion identified on CECT scans. Both lesions shrank significantly at follow-up 6 weeks after the first course of therapy. (i, iv) represent two original tumor lesions in the mediastinal window of CECT. (ii, v) represent GLDM_SmallDependenceHighGrayLevelEmphasis (SDHGLE) feature heatmaps with corresponding tumor lesions of CECT. (iii, vi) represent GLCM_correlation feature heatmaps with corresponding tumor lesions of CECT. (vii) shows the attributed weight coefficients of corresponding radiomic features of L1 and L2 in the LL (largest lesion), ML (average-multiple-lesion) and WL (weighted-multiple-lesion) approaches, respectively. Note: GLDM-SDHGLE measures the joint distribution of small dependence with higher gray-level values, and a greater value indicates a smaller dependence of higher gray-level values and less homogeneous textures; GLCM-Correlation measures the linear dependency of gray-level of neighbouring pixels, and a higher value indicates a less smooth gradient of the pattern in the image

Stratified pretreatment PD-L1 expression as a predictor of durable PFS

As illustrated in Supplementary Table 6, the positivity rate for PD-L1 expression was 66% (59 out of 89) if the cut-off was 1%, with an accuracy of 61.8% (55 of 89) and an AUC of 0.57 (95% CI: 0.44–0.70) in differentiating DCB from NDCB. If the cut-off was set at 50%, the positivity rate reached 26% (23 out of 89), with an accuracy of 50.6% (45 of 89) and an AUC of 0.61 (95% CI: 0.49–0.73) in differentiating DCB from NDCB. More than 46% (41 of 89) of patients with low expression of PD-L1 (tumor proportion score < 50%) experienced DCB.

Predictive ability for PFS of different predictive models

As illustrated in Fig. 6 and Supplementary Table 7, the integrated model showed better performance for predicting PFS than the other two models (HR = 2.90 [95% CI: 2.15–3.84], p = 0.014 in the test cohort). In the pembrolizumab subgroup analysis, a higher score stratified by the radiomic and integrated models was significantly associated with a longer PFS (Supplementary Fig. 8). The results of Cox regression and K-M analysis for the pembrolizumab subgroup are displayed in Supplementary Table 8.

Fig. 6

Kaplan-Meier (KM) progression-free survival (PFS) curve analyses. a–c KM curves on the cross-validation cohort for model scores generated by (a) logistic regression (LR)-based clinical model, (b) LR-based radiomic model and (c) LR-based integrated model. d–f KM curves on the test cohort for scores generated by (d) LR-based clinical model, (e) LR-based radiomic model, and (f) LR-based integrated model. All radiomic and the integrated models displayed here were built with the weighted-multiple-lesion (WL)-based patient-level radiomic features. The cutoff threshold of the clinical, radiomic, and integrated model for the PFS risk stratification is 0.51, 0.43, and 0.67 respectively

Discussion

In this study, we collected a relatively large cohort of advanced lung cancer patients and constructed models to identify patients who were more likely to obtain durable clinical benefits from PD-1/PD-L1 targeted therapies. We also explored weighting the sum of radiomic features from multiple intrapulmonary lesions to construct the predictive models and found that it discriminated DCB from NDCB better than the conventional approaches. Furthermore, an integrated predictive model was constructed using the WL-based radiomic features and five clinical features, reaching AUCs of 0.86 in both the cross-validation and test cohorts.

Considering the presence of immunotherapy-specific unconventional response patterns, the patient-level radiomic analysis that incorporates features from multiple lesions is gaining more attention [12, 14, 22]. MIL is a useful tool to aggregate features from multiple imaging patches (instances) that represent one bag-level characteristic [23]. In 2020, Zhang et al. adopted an MIL-based support vector machine to identify the survival-related high-risk subregions in magnetic resonance imaging (MRI) scans for glioblastoma [16]. More recently, Li et al. proposed an attention-based MIL framework to compute weights for each segmented patch in a large number of chest CT images and identified regions that were most correlated with the assessment of COVID-19 severity [17]. Another study in histopathology by Lu et al. aggregated patch-level features into slide-level representations and assigned scores to each patch to represent its significance to the collective slide-level representation for a specific classification (e.g., clear cell renal carcinoma) [18]. Here, we treated each delineated lesion as an instance and adaptively weighted them to represent the patient-level clinical outcome. As illustrated in Fig. 5, the difference in radiomic features between the DCB and NDCB groups is distinctive in the smaller lesion, to which a higher weight was attributed. This supports the strength and validity of our method.

A later line of immunotherapy was associated with poor prognosis. A similar finding was reported in the study by Tunali et al., who argued that multiple systemic treatments induced an “immune-desert” microenvironment that compromised the efficacy of immunotherapy [10, 24]. Currently, using pembrolizumab or atezolizumab as first-line therapy for metastatic lung cancers with high expression of PD-L1 has gained increasing acceptance [2, 25]. However, our study found that a significant proportion of patients who had low expression of PD-L1 (< 50%) could still reach DCB from PD-1/PD-L1 targeted immunotherapies. This highlights the need to identify a more precise predictive biomarker. Given that our predictive model only requires pre-treatment CT images and basic clinical information, it can serve as an alternative and noninvasive biomarker to guide personalized immunotherapy regimens, especially for patients with unknown PD-L1 status.

GLDM_SDHGLE and GLCM_Correlation are the two textural features from our integrated model that had the largest coefficients in the positive and negative directions respectively. By visualizing them on the heatmaps, we identified a ring structure surrounding the tumor margin. Previous literature suggested that peritumoral texture features are associated with tumor infiltrating lymphocytes that can predict tumor response to immunotherapy [26,27,28]. The identified marginal characteristics in our study, though not being biologically validated, may be correlated with the recruitment of active lymphocytes.

The positive association between KRAS-mutated status and the efficacy of immunotherapy was identified in our study and in previous literature as well. Chen et al. argued that KRAS mutation induced an inflammatory tumor microenvironment that may result in the elevation of tumor burden [29, 30]. Other studies indicate that this environment also triggers the elevated expression of PD-L1, although no consensus has been reached yet [31, 32]. Our finding suggests the prognostic value of KRAS mutation in PD-1/PD-L1 targeted therapies in lung cancer. Nonetheless, further studies in a larger KRAS-mutated cohort are needed to confirm this finding.

We acknowledge the limitations of our study. The first is the retrospective nature of this single-center study. Although an internal independent test cohort was adopted, further external validation in a prospective cohort is warranted. Second, we used a relatively small sample size to train a deep learning-based model, although precise annotations and radiomic features were used to reduce the network complexity. Third, the peritumoral region was not included in our radiomic analysis, which may result in the loss of useful information related to the distribution of tumor-infiltrating lymphocytes around the tumor. Fourth, PD-L1 expression data were unavailable for most patients in our cohort. Combining it with our radiomic signature may enhance the predictive performance of the models. Fifth, we aggregated the radiomic features to give more weight to the most relevant lesion but did not capture inter-lesion heterogeneity and differential patterns of response within a patient. Lastly, biological validation is lacking due to the retrospective nature of our data. Further studies are warranted to help explain the biological significance of the radiomic biomarker.

Conclusion

Our noninvasive predictive model based on the weighted sum of radiomic features from multiple intrapulmonary lesions holds considerable promise as a new approach to bring substantial survival benefits to lung cancer patients who are candidates for immunotherapy.