Introduction

Lung cancer is a fatal disease and the second most common cancer worldwide [1]. As per Global Cancer Statistics 2020 (GLOBOCAN 2020), lung cancer is the most common cause of cancer-related death worldwide [2]. Lung cancer alone accounts for 2,093,876 (11.6%) new cases and 1,761,007 (18.4%) deaths worldwide every year [2]. Non-small-cell lung cancer (NSCLC) accounts for 85% of lung cancers globally [3]. The prognosis of the disease and the survival of patients depend heavily on the stage of the disease at diagnosis. Staging is performed based on the tumor (T), node (N), and metastasis (M) status (TNM stage) of the disease [4]. TNM staging is a complex system that depends on imaging, pathology, and clinical assessment. As a consequence, it is sometimes difficult to stage the disease accurately, resulting in poor treatment outcomes.

With the advent of genomic biology and other technical developments, identification of disease subgroups has become more accurate, and survival has improved significantly. For example, gene sequencing by polymerase chain reaction (PCR) assays is a widely used method for identifying epidermal growth factor receptor (EGFR) mutations in NSCLC patients [3, 5,6,7,8,9]. Detection of an EGFR mutation in NSCLC patients opens the option of targeted therapy with tyrosine kinase inhibitors (TKIs), which improves overall survival in EGFR-mutated patients [9]. Several driver gene mutations, such as EGFR, BRAF, KRAS, MET, ALK, and ROS1, have also been identified and may be druggable targets. Among NSCLC patients, about 32% worldwide and 38% of Asians harbor EGFR mutations [3,4,5,6,7,8,9,10]. The prevalence of EGFR mutation is higher in females (female vs. male: 43.7% vs. 24.0%) [3] and in non-smokers (non-smokers vs. past or current smokers: 49.3% vs. 21.5%) [3]. Mutations such as EGFR, ALK, ROS1, BRAF, NTRK, MET, and RET define several NSCLC subgroups, and these patients are treated with targeted, personalized therapy [4, 5]. Although these biomarker-guided targeted therapies have improved survival significantly, they occasionally fail. In these circumstances, patient selection for these expensive targeted therapies becomes crucial, and radiomics-based prediction models may be helpful, as shown in various retrospective studies [10].

Radiomics is a workflow that extracts high-throughput data, called radiomic features, from medical images. Radiomic features may correlate strongly with treatment outcome. Many publications have demonstrated the role of radiomic features in the diagnosis and prognosis of many cancer types [11,12,13,14,15,16,17], and many researchers have demonstrated their role in prediction model development and treatment outcome prediction [10]. Aerts et al. demonstrated the potential of radiomic features to predict overall survival in NSCLC patients [18]. He et al. showed the use of radiomic features to predict progression-free survival in lung cancer [19]. In a similar study, Tunali et al. developed a prediction model for local recurrence [20]. While several studies have shown the importance of radiomic signatures in predicting various clinical endpoints, many have also raised concerns about the stability of radiomic features [20,21,22]. Stability is usually assessed by measuring the similarity of feature values in repeatability (test–retest) and reproducibility studies, and it is of utmost importance for generalizing radiomics-based prediction models across clinics worldwide. The stability of radiomic features depends on factors such as differences in imaging equipment, imaging parameters or protocols, image reconstruction algorithms, tumor delineation, and the pre-processing steps of radiomic feature extraction. Feature instability has been identified as a key obstacle to generalizing radiomics-based prediction models [20,21,22], and several studies have sought to identify robust features among the many extracted from medical images [20,21,22].
In our earlier repeatability and reproducibility study, we identified robust radiomic signatures in phantom and clinical cohorts [21]. In this study, we aim to develop and validate prediction models based on those robust radiomic signatures for overall survival prediction in non-small cell lung cancer patients.

Materials and Methods

The study was approved by the Institutional Ethics Committee (IEC) (IEC-2) of our hospital as a retrospective study. A consent form waiver was granted by the same IEC as per institutional policy. All patient data were kept confidential.

Patients

TMH Dataset

Two hundred non-small cell lung carcinoma (NSCLC) patients who underwent treatment with a combination of surgery, chemotherapy, and radiotherapy in our hospital from January 2012 to January 2017 were included in this study. Their pre-treatment CT or PET/CT scans were extracted from the hospital PACS, and clinical data were extracted from the hospital information system (HIS). Patients’ demographic data are shown in Table 1.

Table 1 Demographic data of patient population used in this study

External Validation Set

The Cancer Imaging Archive (TCIA) open-source data: 100 NSCLC patients with CT images, radiotherapy structures (GTV-1), and survival data from the NSCLC-Radiomics collection were downloaded from the TCIA portal [18, 23]. The CT scans and GTVs were used to extract radiomic features.

Pre-Processing of Data

Clinical data extracted from the HIS were cleaned and converted into a form amenable to machine learning. CT or PET/CT scans were checked for completeness, and the contrast-enhanced CT series of the PET/CT or CT studies were selected for this study.

Based on the median overall survival (OS) in both datasets, 2-year OS was selected as the clinical endpoint (Table 1). For both datasets, OS was binarized at the 2-year cutoff [(OS < 2 years) = 1 and (OS ≥ 2 years) = 0].
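The binarization rule can be sketched in a few lines (a sketch assuming OS is recorded in years and that ties at exactly 2 years fall in the longer-surviving, label-0 group):

```python
import numpy as np

# Binarize overall survival (in years) at the 2-year cutoff, matching the
# labeling in the text: OS < 2 years -> 1 (event), OS >= 2 years -> 0.
def binarize_os(os_years):
    os_years = np.asarray(os_years, dtype=float)
    return (os_years < 2.0).astype(int)

labels = binarize_os([0.8, 1.9, 2.0, 3.5])
print(labels)  # -> [1 1 0 0]
```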

PET/CT Imaging Procedure

TMH Dataset

Pre-treatment PET/CT scans were performed using Gemini TF16 or Gemini TF64 PET/CT scanners (Philips Medical Systems, Netherlands). The CT component of the PET/CT scans was acquired after injection of 60 to 80 ml of non-ionic contrast using the protocol described in Supplementary Table s1. CT images were reconstructed using the filtered back projection (FBP) reconstruction algorithm.

TCIA External Validation Set

Pre-treatment CT scans were performed using a Gemini CT scanner (Philips Medical Systems, Netherlands). The CT scans were acquired after the intravenous injection of 80 ml of non-ionic contrast using the protocol described in Supplementary Table s1. CT images were reconstructed using the filtered back projection (FBP) reconstruction algorithm.

From both cohorts, CT data were extracted in Digital Imaging and Communications in Medicine (DICOM) format for radiomic extraction.

Radiomic Extraction

Internal Dataset

The CT series of the PET/CT scans were loaded on the Intellispace Discovery Portal (research-only build; Philips Medical Systems, Eindhoven, The Netherlands), and primary tumor delineation was performed with 3D contouring software by experienced medical physicists (more than 15 years of experience) and saved as a radiotherapy structure (DICOM series: RT structure) named gross tumor volume (GTV). The GTVs were checked and approved by experienced (more than 20 years) nuclear medicine physicians and radiologists. Subsequently, the DICOM images and GTVs were transferred to a research computer, where radiomic features were extracted using the in-house developed PyRadGUI software, which combines the Plastimatch [24] and Pyradiomics [25] packages. The following pre-processing steps were performed in PyRadGUI. Image conversion: DICOM images and RT structures were converted into NRRD format using the Plastimatch package. Resampling: images were resampled to isotropic 2 × 2 × 2 mm voxels. Filtering and image transformation: three sets of images were generated by applying Laplacian of Gaussian (LoG) filters with sigma values of 1, 2, and 3 mm, and eight sets of wavelet-transformed images were generated using the eight combinations of high-pass and low-pass wavelet filters [25]. Finally, a total of 1093 radiomic features were extracted from the 12 image sets (1 set of original images, 3 sets of LoG images, and 8 sets of wavelet images) and the corresponding GTVs [25].
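The pre-processing above can be sketched as a Pyradiomics-style parameter set; the text does not state the exact PyRadGUI configuration, so the interpolator and the details of the dictionary layout here are assumptions:

```python
# Hedged sketch of a Pyradiomics-style parameter set mirroring the
# pre-processing described in the text: 2 x 2 x 2 mm resampling, LoG filters
# with sigma 1-3 mm, and all eight wavelet decompositions.
# The interpolator choice is an assumption (not stated in the text).
params = {
    "setting": {
        "resampledPixelSpacing": [2, 2, 2],  # isotropic 2 mm voxels
        "interpolator": "sitkBSpline",       # assumed interpolator
        "sigma": [1.0, 2.0, 3.0],            # LoG filter widths in mm
    },
    "imageType": {
        "Original": {},  # 1 image set
        "LoG": {},       # 3 image sets (one per sigma)
        "Wavelet": {},   # 8 image sets (HHH ... LLL combinations)
    },
}

# 1 original + 3 LoG + 8 wavelet = 12 image sets, as stated in the text
n_image_sets = 1 + len(params["setting"]["sigma"]) + 8
print(n_image_sets)  # -> 12
```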

External Validation Set

The TCIA dataset contains CT images and RT structures (GTV) in DICOM format. We performed the same operations as described in the previous section, and 1093 radiomic features were extracted for each patient.

Data Balancing

It is generally assumed that balanced endpoints are more appropriate for training most machine learning algorithms for prediction model development [26]. Clinical endpoints, however, are most often imbalanced, violating this assumption and requiring data balancing. Data balancing was performed using the synthetic minority oversampling technique (SMOTE).
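A minimal sketch of the SMOTE idea, assuming Euclidean nearest neighbours and uniform interpolation; the study's actual balancing would use a library implementation (e.g. imbalanced-learn's SMOTE), so this only illustrates the principle:

```python
import numpy as np

# SMOTE sketch: each synthetic minority sample is an interpolation between
# a randomly chosen minority sample and one of its k nearest minority
# neighbours. A library implementation would be used in practice.
def smote_sketch(X_min, n_new, k=3, rng=None):
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1 : k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_minority, n_new=4, rng=0)
print(X_new.shape)  # -> (4, 2)
```

Because each synthetic point lies on the segment between two existing minority samples, the oversampled data stays inside the minority class's convex hull.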

Prediction Algorithm Used

Several radiomic studies have shown that the random forest classifier (RFC), support vector classifier (SVC), and gradient boosting classifier (GBC) are among the most efficient classification algorithms for predicting treatment response and outcome events in radiomics-based analyses of several cancer types [28,29,30]. Hence, in this study, we used RFC, SVC, and GBC for overall survival prediction. Additionally, we developed a deep learning (DL) multilayer perceptron model.

Radiomic Feature Selection

We opted for a multistep process to select the best radiomic features for OS prediction from the 1093 radiomic features extracted from CT images. First, we selected 121 stable radiomic features based on our earlier radiomic stability study [21]. Next, the top 50 features were selected using the Chi-squared test. Finally, the top 10 features were selected by applying recursive feature elimination with a random forest (RFE-RF). Python 3.9.0 was used for the feature selection process.
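The chi-squared filter and RFE-RF steps can be sketched with scikit-learn; the synthetic data, min-max scaling, and default random forest settings here are assumptions standing in for the real feature matrix of 121 stable features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 121 stable radiomic features of 200 patients.
X, y = make_classification(n_samples=200, n_features=121, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

# Step 1: chi-squared filter down to the top 50 features.
chi2_filter = SelectKBest(chi2, k=50).fit(X, y)
X_50 = chi2_filter.transform(X)

# Step 2: recursive feature elimination with a random forest down to 10.
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=10)
X_10 = rfe.fit_transform(X_50, y)
print(X_10.shape)  # -> (200, 10)
```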

Prediction Model Development and Validation

The prediction models were developed using random forest (RF), support vector (SV), and gradient boosting (GB) algorithms in Python 3.9.0. Hyperparameters of these prediction algorithms were tuned using nested cross-validation, and the same parameters were used to develop all the prediction models. Subsequently, 10-fold cross-validation was performed to assess model performance on the internal dataset. In the next step, a train-test split (80:20) was performed for model development and validation. The three prediction algorithms were used to develop a total of six prediction models utilizing the original and balanced training sets: RF models (RF-Model-O on the original training data and RF-Model-B on the balanced training data), SV models (SV-Model-O and SV-Model-B), and GB models (GB-Model-O and GB-Model-B) were developed on the internal dataset and validated on the test dataset. These models were also validated using the bootstrap (1000 iterations) method on the test dataset and on the external validation cohort.
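The validation workflow (10-fold cross-validation, an 80:20 split, and bootstrap validation on the held-out test set) can be sketched as follows; the synthetic data, default hyperparameters, and reduced bootstrap count are assumptions, not the study's tuned setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic stand-in for the 10 selected radiomic features of 200 patients.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80:20 split

models = {
    "RF": RandomForestClassifier(random_state=0),
    "SV": SVC(probability=True, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    # 10-fold cross-validation on the training portion
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=10, scoring="roc_auc")
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    # bootstrap validation on the test set (100 iterations here for brevity)
    boot = [
        roc_auc_score(yb, model.predict_proba(Xb)[:, 1])
        for Xb, yb in (resample(X_te, y_te, random_state=i) for i in range(100))
        if len(set(yb)) == 2  # skip resamples containing a single class
    ]
    print(f"{name}: cv={cv_auc.mean():.2f} test={test_auc:.2f} "
          f"boot={np.mean(boot):.2f}")
```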

Two deep learning models (simple-DL, a 7-layer perceptron model without a dropout layer, and dropout-DL, a 7-layer perceptron model with a dropout layer) were also developed using the internal train-test dataset. Both DL models were validated using the internal test dataset and the external dataset.
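A hedged sketch of the two perceptron variants: the paper specifies only "7-layer" models with and without a dropout layer, so the layer widths, dropout rate, and activation choices below are assumptions (PyTorch is used for illustration):

```python
import torch
import torch.nn as nn

# Build a 7-layer perceptron (input layer, 5 hidden layers, output layer);
# widths (64) and dropout rate are assumptions not stated in the paper.
def make_mlp(n_features=10, dropout=0.0, width=64):
    layers = [nn.Linear(n_features, width), nn.ReLU()]
    for _ in range(5):  # five hidden layers
        layers += [nn.Linear(width, width), nn.ReLU()]
        if dropout > 0:
            layers.append(nn.Dropout(dropout))
    layers += [nn.Linear(width, 1), nn.Sigmoid()]  # binary 2-year OS output
    return nn.Sequential(*layers)

simple_dl = make_mlp(dropout=0.0)   # simple-DL: no dropout layer
dropout_dl = make_mlp(dropout=0.3)  # dropout-DL: with dropout regularization

x = torch.randn(4, 10)              # batch of 4 patients, 10 features
print(dropout_dl(x).shape)  # -> torch.Size([4, 1])
```

Dropout randomly zeroes hidden activations during training, which is the usual remedy for the overfitting that the external validation results suggest.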

Using the random forest, support vector, and gradient boosting algorithms, three further models, RF-Model-V, SV-Model-V, and GB-Model-V, were developed to predict 2-year overall survival with tumor volume as the single feature.

Statistical Tests

For all statistical tests, packages of the open-source Python 3.9.0 software were used. Descriptive statistics were computed to understand the distribution of patients across categories. The demographic data of the internal and external cohorts were compared using the t-test. Hierarchical clustering using Pearson’s correlation and z-scores, together with Chi-squared tests, was performed for feature reduction. Recursive feature elimination using a random forest algorithm was performed to select the most significant features for model development. The features from both cohorts were compared using a t-test and violin plots. The receiver operating characteristic area under the curve (AUC), accuracy, precision, recall, and f1-score were calculated for all prediction models on the internal and external validation datasets.
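The evaluation metrics listed above can be computed with scikit-learn; the toy labels, predicted probabilities, and 0.5 decision threshold below are illustrative assumptions:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground-truth labels and predicted probabilities for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold

scores = {
    "auc": roc_auc_score(y_true, y_prob),       # uses probabilities
    "accuracy": accuracy_score(y_true, y_pred), # the rest use hard labels
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
print({k: round(v, 2) for k, v in scores.items()})
```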

Results

The descriptive statistics of the demographic data and the comparison of the two cohorts are shown in Table 1. The heatmap with hierarchical clustering and the z-score heatmap of the 121 stable radiomic features show several feature clusters (Fig. 1). Subsequently, based on hierarchical clustering and the multivariate Chi-squared test, the top 50 significant features were selected. Finally, the RFE technique was applied using the random forest algorithm, and the 10 most significant radiomic features were selected for model development. The significance of the 10 selected features on the internal dataset per the Chi-squared test and the comparison of the distributions of feature values on the internal and external datasets are shown in the bar chart (Fig. 2A) and violin plot (Fig. 2B), respectively. The comparison of the ten significant radiomic features between the internal and external cohorts is shown in Table 2. The violin plots and t-test show similar feature distributions in the internal and external datasets for the majority of the selected features, with a few exceptions (Fig. 2).

Fig. 1
figure 1

Heatmap of 121 radiomic features based on A Pearson’s correlation test and B z-score. Hierarchical clustering shows the clusters of radiomic features

Fig. 2
figure 2

The feature significance of the ten most important features on the Chi-squared test (A), the distribution of min–max scaled feature values of the top 10 significant features for the TMH cohort and the validation cohort (B)

Table 2 The results of the unpaired t-test showing the relation between the features of the two datasets
Fig. 3
figure 3

The ROC-AUC curve in 10-fold cross-validation, internal validation, and external validation. AUC curves of the random forest model, gradient boosting model, and support vector model are shown in the first, second, and last rows, respectively

The 10-fold cross-validation on the institutional (TMH) dataset showed good prediction performance, with accuracy and AUC of 0.73 ± 0.08 and 0.77 ± 0.08 for RF-Model-O, 0.69 ± 0.12 and 0.76 ± 0.09 for SV-Model-O, and 0.73 ± 0.00 and 0.79 ± 0.08 for GB-Model-O, respectively. With the original training set, the accuracy of the models in train-test internal validation ranged from 0.76 (SV-Model-O) to 0.80 (RF-Model-O/GB-Model-O) and the AUC from 0.81 (RF-Model-O/GB-Model-O) to 0.82 (SV-Model-O) (Fig. 3). The training and test prediction scores for all models developed on the original dataset were found to be the same (Supplementary Table s2). The prediction models developed using the original and balanced training sets were found to be comparable (Table 3). On the external validation cohort, the accuracy ranged from 0.57 (SV-Model-O) to 0.68 (RF-Model-O/GB-Model-O) and the AUC from 0.61 (SV-Model-O) to 0.72 (RF-Model-O/GB-Model-O) (Fig. 3). The test and external validation prediction scores were comparable to those of the corresponding bootstrap validations. The internal and external validation results for all models are shown in Table 4. The post-calibration model accuracy in internal validation was also found to be the same. The ROC curves of all models for cross-validation, internal validation, external validation, and bootstrap validation are shown in Fig. 3. The test ROC curves and confusion matrices for the prediction models developed on the original and balanced training sets are shown in Supplementary Figs. s1–s3, and the corresponding test prediction scores in Table 3. The detailed prediction scores of the models in internal and external validation are shown in Table 4. The confusion matrices of internal and external validation are shown in Supplementary Fig. s4. The calibration plots of all three models are shown in Fig. 4.
The deep learning models (simple-DL and dropout-DL) also performed well in internal validation, with an accuracy of 0.76 and an AUC of 0.72. However, both models failed in external validation, with an accuracy of 0.55. The detailed model performance scores are shown in Table 4.

Table 3 The comparison of the prediction models developed on the original and balanced datasets
Table 4 The performance of prediction models in internal and external validation
Fig. 4
figure 4

The calibration plots and Brier scores of the prediction models with and without calibration in the internal validation set. The calibration plot and Brier score of the random forest, gradient boosting, and support vector models are shown in the first, second, and last rows, respectively

The accuracies of the tumor volume-based models, RF-Model-V, SV-Model-V, and GB-Model-V, in the internal validation set were found to be 0.57, 0.50, and 0.51, respectively. The detailed internal validation scores of these volume-based models are shown in the supplementary material (Supplementary Table s3 and Fig. s5).

Discussion

In the last few years, radiomics has been a major area of research in oncology for developing digital phenotypes of various cancers [10,11,12,13,14,15,16,17,18,19]. Several radiomics-based prediction models have been developed, validated, and reported in the literature for various endpoints and cancer types, and the role of radiomic features in predicting clinical endpoints in lung cancer in particular has been investigated widely [10,11,12,13,14,15,16,17,18,19, 27, 28, 30, 32,33,34,35]. However, the generalization of these radiomic models has raised concern in the radiomics community. The high number of radiomic features extracted from medical images of tumors leads to a data explosion, which raises several challenges: feature repeatability and reproducibility, feature redundancy, feature insignificance, and high dimensionality, the biggest challenges facing the radiomics community. To overcome these challenges, model development needs to proceed through several steps: identification of stable features, removal of redundant features, selection of the most significant features, and finally selection of the best prediction algorithm [29]. In this study, we implemented a multistep feature selection process to identify the most suitable features for overall survival prediction in NSCLC patients. Based on our earlier study of the repeatability and reproducibility of CT radiomic features, we selected 121 robust radiomic features out of the 1093 extracted features [21] to minimize the drawbacks related to feature reproducibility. To reduce redundancy, we used Chi-squared tests and hierarchical clustering with Pearson’s correlation and z-score analysis to identify and remove redundant features.
Subsequently, we used a random forest-based recursive feature elimination (RFE) technique to identify the radiomic features most suitable for predicting 2-year overall survival. In our investigation, these radiomic features were able to stratify patients into two risk groups and predict overall survival in this cohort, showing a strong correlation with 2-year overall survival. We used the three most common machine learning algorithms, i.e., RFC, SVC, and GBC, and a deep learning multilayer perceptron model to develop prediction models. The average accuracies of the RFC and GBC models were similar (around 0.80) and better than that of the SVC model in internal validation. The training and test prediction scores for all models were comparable (Supplementary Table s2), indicating a reduced probability of model overfitting. Several studies in the past have reported similar findings, and our results affirm them [26, 27]. Similarly, the fact that the GBC and RFC models performed equally well on 10-fold cross-validation and bootstrap validation suggests robust prediction models. In external validation, the accuracies of the GBC and RFC models were also comparable and equally good (accuracy > 0.70). The calibration plots of GBC and RFC likewise show similarity between internal and external validation, with Brier scores of around 0.15. The slight reduction in prediction metrics in external validation may be attributed to differences between the two cohorts, as shown in Table 1; some features also have different distributions in the two cohorts, as shown by the t-test and violin plots. However, bootstrap validation on the internal test set and the external cohort shows the stability of the radiomics-based prediction models in NSCLC. The deep learning models also showed good accuracy (around 0.72) in internal validation, but both failed in external validation, suggesting overfitting of the models.

As our development and internal validation cohort includes NSCLC patients of stages I–IV, the model can predict the event rate across disease stages. The external validation cohort had several dissimilarities from the development cohort: it consisted of stages I–III, the median age differed by around 13 years, and the median overall survival and treatments offered also differed. Nevertheless, the prediction model performed well, which strengthens the claim that this prediction model is generalizable.

There is concern in the radiomics community regarding feature stability and the predictive performance of radiomic models in external validation, and several studies have addressed the stability of radiomic features [21, 22]. In our earlier study, we performed a rigorous analysis of a human cohort and a phantom study to identify the most robust radiomic features [21]. In the present study, we were able to demonstrate two aspects of a good radiomic feature: (1) predictive power and (2) the stability of a robust radiomics-based prediction model in external validation. The results of the prediction models in internal validation confirm the predictive potential of these robust radiomic features. We chose the top 10 radiomic features from the 121 robust features that offered the highest accuracy in RFE with a random forest algorithm. While the success of these prediction models, especially the random forest and gradient boosting models, in external validation is encouraging, radiomics-based prediction models may only be generalized if feature stability is thoroughly and accurately assessed. Le et al. showed the ability of radiomic features to predict overall survival using a Cox model; similar to our study, the authors selected ten significant radiomic features for model development [35]. A similar study by the same authors demonstrated the power of a radiomics-based prediction model to discriminate patients into high- and low-risk groups as well as to predict overall survival in multiorgan cancer [36]. With our study, we have demonstrated the predictive capability of robust radiomic features for 2-year overall survival in lung cancer, validated on an external dataset. However, the radiomic features selected in our study differ from those of Le et al., which may be because of the features included at the start of feature selection and the feature selection techniques and prediction algorithms used.

Several studies have suggested a role for decision support systems (DSS) in clinical decision-making [32,33,34,35]. Prediction of overall survival is an important clinical question in oncology that a DSS can help answer, and our study can contribute significantly to the development of a DSS for overall survival prediction in lung cancer.

The current study has a number of limitations, including its retrospective nature, limited sample size, and heterogeneous cohort. To address the small sample size, we employed cross-validation and bootstrap validation for model validation. In future research, our goal is to validate this model retrospectively in a multicentric study. Subsequently, the model will be trained on a large retrospective dataset and validated on a prospective dataset from a multicentric study. The ultimate objective of this research is to validate this prediction model in multicentric prospective clinical trials and to implement decision support systems in clinics.

Conclusion

Robust radiomic features have shown promising results for the prediction of 2-year overall survival in non-small cell lung cancer. Compared with the SVC model, the RFC and GBC models performed better in internal and external validation. Although this is an early study with small development and validation datasets, the external validation results suggest that the radiomics-based prediction model may eventually be generalized.