Multi-task deep learning-based radiomic nomogram for prognostic prediction in locoregionally advanced nasopharyngeal carcinoma

Purpose Prognostic prediction is crucial to guide individual treatment for locoregionally advanced nasopharyngeal carcinoma (LA-NPC) patients. Recently, multi-task deep learning was explored for joint prognostic prediction and tumor segmentation in various cancers, resulting in promising performance. This study aims to evaluate the clinical value of multi-task deep learning for prognostic prediction in LA-NPC patients.
Methods A total of 886 LA-NPC patients from two medical centers were enrolled, with clinical data, [18F]FDG PET/CT images, and follow-up of progression-free survival (PFS). We adopted a deep multi-task survival model (DeepMTS) to jointly perform prognostic prediction (DeepMTS-Score) and tumor segmentation from FDG-PET/CT images. The DeepMTS-derived segmentation masks were leveraged to extract handcrafted radiomics features, which were also used for prognostic prediction (AutoRadio-Score). Finally, we developed a multi-task deep learning-based radiomic (MTDLR) nomogram by integrating DeepMTS-Score, AutoRadio-Score, and clinical data. Harrell's concordance indices (C-index) and time-independent receiver operating characteristic (ROC) analysis were used to evaluate the discriminative ability of the proposed MTDLR nomogram. For patient stratification, the PFS rates of high- and low-risk patients were calculated using the Kaplan–Meier method and compared with the observed PFS probability.
Results Our MTDLR nomogram achieved C-indices of 0.818 (95% confidence interval (CI): 0.785–0.851), 0.752 (95% CI: 0.638–0.865), and 0.717 (95% CI: 0.641–0.793) and areas under the curve (AUC) of 0.859 (95% CI: 0.822–0.895), 0.769 (95% CI: 0.642–0.896), and 0.730 (95% CI: 0.634–0.826) in the training, internal validation, and external validation cohorts, respectively, which showed a statistically significant improvement over conventional radiomic nomograms. Our nomogram also divided patients into significantly different high- and low-risk groups.
Conclusion Our study demonstrated that the MTDLR nomogram can perform reliable and accurate prognostic prediction in LA-NPC patients and enables better patient stratification, which could facilitate personalized treatment planning.
Supplementary Information The online version contains supplementary material available at 10.1007/s00259-023-06399-7.


Introduction
Nasopharyngeal carcinoma (NPC) is an epithelial malignancy arising from the nasopharyngeal mucosal lining [1], with high prevalence rates in east and southeast Asia [2]. About 70%-80% of NPC patients are categorized as locoregionally advanced NPC (LA-NPC) (Tumor-Node-Metastasis (TNM) stage III or IVa) according to the 8th edition of the American Joint Committee on Cancer (AJCC)/Union for International Cancer Control (UICC) staging system [3]. The primary therapeutic regimen for NPC is radiation therapy (RT) with or without chemotherapy due to its radiosensitivity [4]. However, despite the improvement in treatment, due to locoregional recurrences and distant metastasis, the 5-year survival rate of LA-NPC patients remains a persistent problem, usually ranging from 10 to 40% [5]. Under these circumstances, pretreatment prognosis is a major concern for LA-NPC patients, as it helps guide the individualized therapeutic regimen. Specifically, based on the pretreatment prognosis, patients could be stratified into different risk groups with different therapeutic regimens applied, and this has been reported to potentially improve the patients' overall survival outcomes [6].
Bingxin Gu, Mingyuan Meng and Mingzhen Xu contributed equally to this work.
The TNM staging system is widely used for prognostic prediction and patient stratification [7][8][9]. However, although patients with the same TNM stage receive the same treatment, large variations in prognosis exist due to the heterogeneous nature of the tumor microenvironment [10]. Image-derived biomarkers, such as the standardized uptake value (SUV) and metabolic tumor volume (MTV) derived from [18F]-fluorodeoxyglucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT), can provide promising prognostic information for NPC [11,12]. Nevertheless, these factors are limited in clinical practice because they can hardly represent intra-tumor information such as tumor texture, intensity, heterogeneity, and morphology. Therefore, a reliable and accurate prognostic prediction model is needed to predict patients' progression-free survival (PFS) and to distinguish high-risk from low-risk patients. Such prediction will ultimately facilitate the formulation of therapeutic regimens and improve patients' overall survival outcomes.
Radiomics is a widely recognized computational method for prognostic prediction, which extracts high-dimensional handcrafted features from medical images to characterize intra-tumor information and then models the relevance between the features and prognostic outcomes through statistical methods [13,14]. Radiomics has been widely used for prognostic prediction in various cancers including NPC [15][16][17]. However, the extraction of radiomics features requires tumor segmentation masks as guidance, which inevitably brings an additional segmentation step into the radiomics pipeline. In addition, radiomics features are extracted from the segmented regions, which are usually limited to primary and metastatic lesions [5,18]. This suggests that the extracted radiomics features may have difficulties in representing the prognostic information outside of malignant lesions (e.g., adjacent tissue invasion). There have been attempts at leveraging lymph node segmentation for radiomics analysis [19][20][21]. However, lymph node segmentation is intractable and adjacent tissue invasion has not been considered yet. This limitation is more critical for LA-NPC patients, as many vital tissues and organs adjacent to the nasopharynx (e.g., brain, ethmoidal sinus, and orbit) might have already been invaded by LA-NPC [22].
Deep learning is an alternative approach to prognostic prediction and is becoming popular in the literature [15,23,24]. Deep survival models based on deep learning usually adopt convolutional neural networks (CNNs) to extract image features and then perform end-to-end prediction from medical images, where tumor segmentation masks are often not required [25]. Without tumor masks as constraints, deep survival models may potentially leverage the prognostic information existing within the entire images. Deep survival models have demonstrated the potential to outperform conventional radiomics-based prognostic prediction models [26][27][28]. However, performing end-to-end prediction without using tumor masks introduces interference from irrelevant background information and incurs difficulties in extracting tumor-specific information. Recently, multi-task deep survival models were explored to perform prognostic prediction jointly with tumor segmentation [29][30][31], which implicitly guided the model to extract tumor-related information while not discarding out-of-tumor information. However, the value of multi-task deep learning for prognostic prediction in LA-NPC has not been validated with large patient cohorts. In addition, deep survival models are limited by their 'black box' nature [32], which undermines their generalizability in clinical practice.
Nomograms serve as a common tool for guiding individualized treatments as they can simplify complicated prognostic models to a numerical estimate of survival probability and provide a clear visual illustration of the factors leading to the prediction [33,34]. Zhang et al. [5] developed a multiparametric magnetic resonance imaging (MRI)-based radiomic nomogram, which provides an illustrative example of precision medicine and prognostic prediction. Peng et al. [15] developed a deep learning FDG-PET/CT-based nomogram that may act as an indicator for individualized induction chemotherapy (IC) in advanced NPC. Pan et al. [3] developed a radiomic nomogram with better prognostic performance than the 8th edition of the AJCC/UICC staging system. Nevertheless, it has been reported with an external validation cohort that Pan et al.'s nomogram underestimated the 5-year overall survival (OS) of LA-NPC patients [35]. Therefore, a more reliable and accurate prognostic nomogram is still needed for LA-NPC patients.
In this study, we aim to evaluate the value of multi-task deep learning for prognostic prediction in LA-NPC patients with a large database acquired from two medical centers. We adopted the state-of-the-art deep multi-task survival model (DeepMTS) [29] for joint prognostic prediction and tumor segmentation from pretreatment FDG-PET/CT images, which predicted a survival risk score (DeepMTS-Score) and a tumor segmentation mask for each LA-NPC patient. The DeepMTS-Score can be directly used for prognostic prediction, while the predicted tumor masks were leveraged for prognostic prediction through radiomics analysis (AutoRadio-Score). We further developed a multi-task deep learning-based radiomic (MTDLR) nomogram by integrating DeepMTS-Score, AutoRadio-Score, and clinical data, so as to improve the accuracy and interpretability of prognostic prediction. Compared with conventional radiomic nomograms, our MTDLR nomogram achieved better prognostic performance and enabled better patient stratification, which demonstrated the potential to facilitate personalized treatment planning.

Patients
Between May 2009 and May 2019, the medical records of 903 NPC patients were collected from Fudan University Shanghai Cancer Center (FUSCC) and Shanghai Proton and Heavy Ion Center (SPHIC). The inclusion criteria were as follows: (1) histologically confirmed LA-NPC (TNM stage III or IVa); (2) received concomitant systemic treatment with intensity-modulated radiotherapy (IMRT); (3) underwent pretreatment FDG-PET/CT scans; and (4) available clinical data and FDG-PET/CT images. Patients with previous chemotherapy/radiotherapy or other malignant tumors were excluded. Finally, 652 patients from FUSCC and 234 patients from SPHIC were enrolled in this study. Patients from FUSCC were randomly divided into a training cohort (n = 522) and an internal validation cohort (n = 130) with a 4:1 ratio, while patients from SPHIC (n = 234) were used as an external validation cohort solely for evaluation.
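The 4:1 random split of the FUSCC patients can be sketched as follows. This is only an illustration: the patient identifiers, the helper name `split_cohort`, and the fixed random seed are hypothetical, and the source does not state how the randomization was implemented.

```python
import random

def split_cohort(patient_ids, train_fraction=0.8, seed=0):
    """Randomly split patient IDs into training and internal validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers standing in for the 652 FUSCC patients.
fuscc_ids = [f"FUSCC-{i:03d}" for i in range(652)]
train_ids, internal_val_ids = split_cohort(fuscc_ids)
print(len(train_ids), len(internal_val_ids))
```

With 652 patients and a training fraction of 0.8, this yields cohorts of 522 and 130 patients, matching the cohort sizes reported above.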
After completion of initial treatment, each patient was followed up every 3 months in the first 2 years, then every 6 months in the third to fifth year, and annually thereafter. The follow-up endpoint of this study is PFS, defined as the time from randomization to the date of disease progression or death from any cause. The median follow-up time was 50 months (ranging from 44 to 120 months) for FUSCC and 49 months (ranging from 44 to 97 months) for SPHIC. The Ethical Committees of FUSCC and SPHIC approved this retrospective study, with informed consent obtained from all enrolled patients.

PET/CT imaging
FDG-PET/CT images were obtained on a Siemens Biograph 16HR PET/CT scanner (Knoxville, Tennessee, USA). The FDG-PET/CT data acquisition procedure is detailed in the Online Resource.
For quantitative analysis, the maximum and mean standardized uptake values (SUV) normalized to body weight and the metabolic tumor volume (MTV) were manually computed for tumor lesions by drawing a 3-dimensional volume of interest (VOI). Meanwhile, total lesion glycolysis (TLG) was calculated according to the formula TLG = SUVmean × MTV, where SUVmean and MTV were recorded at the SUV threshold of 2.5.
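As a minimal sketch of the formula above, the following computes SUVmax, SUVmean, MTV, and TLG from the SUV values of the voxels inside a VOI. The function name `pet_parameters`, the flat list of voxel values, the voxel volume, and the choice of "greater than or equal to" at the threshold are assumptions for illustration; the source only specifies the threshold of 2.5 and TLG = SUVmean × MTV.

```python
def pet_parameters(suv_values, voxel_volume_ml, threshold=2.5):
    """Return SUVmax, SUVmean, MTV (mL), and TLG for the voxels of one VOI."""
    suv_max = max(suv_values)
    # Voxels at or above the SUV threshold; ">=" rather than ">" is an assumption.
    tumor = [v for v in suv_values if v >= threshold]
    if not tumor:
        return {"SUVmax": suv_max, "SUVmean": 0.0, "MTV": 0.0, "TLG": 0.0}
    suv_mean = sum(tumor) / len(tumor)
    mtv = len(tumor) * voxel_volume_ml      # metabolic tumor volume
    tlg = suv_mean * mtv                    # TLG = SUVmean x MTV
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MTV": mtv, "TLG": tlg}

# Toy VOI with four voxels of 0.5 mL each (illustrative values only).
params = pet_parameters([1.0, 3.0, 5.0, 2.5], voxel_volume_ml=0.5)
print(params)
```

In this toy example three voxels reach the threshold, giving SUVmean = 3.5, MTV = 1.5 mL, and TLG = 5.25.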

Multi-task deep learning-based radiomics analysis
The workflow of multi-task deep learning-based radiomics analysis is illustrated in Fig. 1, which presents a three-step pipeline including multi-task deep learning model construction, automatic radiomics analysis, and nomogram construction. We adopted a deep multi-task survival model (DeepMTS) [29] for joint prognostic prediction and tumor segmentation from FDG-PET/CT images. We preprocessed FDG-PET/CT images with resampling, SUV conversion (for PET only), affine registration, region-of-interest (ROI) cropping, and intensity normalization (detailed in the Online Resource). The preprocessed PET and CT images were concatenated and fed into the DeepMTS as input, while the manual segmentation masks of primary tumors were used as ground truth labels for training only. The DeepMTS is a CNN consisting of a Unet-based segmentation backbone [36] and a DenseNet-based cascaded survival network (CSN) [37]. The Unet is a U-shaped encoder-decoder CNN with skip connections between its contracting encoder and expanding decoder [36]. The DenseNet is a CNN consisting of multiple dense blocks with dense connections between layers, which enables feature reuse to enhance the capacity to generalize to unseen data [37]. The segmentation backbone is hard-shared by the prognostic prediction and tumor segmentation tasks, which implicitly guides the model to extract features related to tumor regions. The outputs of the segmentation backbone are fed into the CSN as a supplementary input (together with FDG-PET/CT images), which further leverages the global tumor information (e.g., tumor size, shape, and locations) for prognostic prediction. Deep features derived from both the segmentation backbone and the CSN are used for prognostic prediction via two fully-connected layers.
After training, DeepMTS can predict the survival risk scores of patients (DeepMTS-Score) and the segmentation masks of tumor regions. The DeepMTS-Score is relevant to PFS and can be directly used for prognostic prediction, while the predicted tumor masks were further leveraged in the following automatic radiomics analysis. The architecture of DeepMTS is detailed in [29] and its implementation code is publicly available at https://github.com/MungoMeng/Survival-DeepMTS. We also provide more training details in the Online Resource. For comparison, we also built a single-task deep survival model for prognostic prediction, following Qiang et al.'s study [38], and its output scores are denoted by SingleTask-Score.
With the tumor masks predicted by DeepMTS, we extracted 1456 handcrafted radiomics features from FDG-PET/CT images via PyRadiomics [39], including 720 PET features, 720 CT features, and 16 shape features based on the 3D shape of tumors (detailed in the Online Resource). The extracted features were analyzed by a Lasso-Cox model [40], whose output scores are denoted by AutoRadio-Score. We refer to this radiomics process as automatic radiomics, which differentiates it from conventional radiomics based on manual segmentation. For comparison, we also performed the same radiomics analysis based on manual segmentation masks and refer to the output scores as ManualRadio-Score.
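To make the Lasso-Cox step more concrete, the sketch below evaluates the generic objective that such a model minimizes: the negative Cox partial log-likelihood plus an L1 penalty on the coefficients. It is not the authors' implementation. The function names, the Breslow-style handling of tied times, the absence of any 1/n scaling, and the use of the linear predictor as the radiomics score are all assumptions made for illustration.

```python
import math

def lasso_cox_objective(beta, features, times, events, lam):
    """Negative Cox partial log-likelihood plus lam * ||beta||_1."""
    # Linear predictor for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in features]
    neg_log_pl = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        neg_log_pl += math.log(risk_sum) - eta[i]
    return neg_log_pl + lam * sum(abs(b) for b in beta)

def radiomics_score(beta, feature_row):
    """Linear predictor used here as a risk score (an assumption)."""
    return sum(b * x for b, x in zip(beta, feature_row))

# Toy data: three patients, two features, all with observed progression.
X = [[0.2, 1.0], [0.5, -0.3], [1.1, 0.4]]
t = [1.0, 2.0, 3.0]
e = [1, 1, 1]
print(lasso_cox_objective([0.0, 0.0], X, t, e, lam=0.1))
```

With all coefficients at zero, the objective reduces to log(3) + log(2) + log(1) = log(6) for this toy data. The L1 penalty drives most coefficients to exactly zero, which is why a Lasso-Cox model can reduce 1456 features to a small set of predictors.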
After the DeepMTS-Score and AutoRadio-Score were derived, we developed a multi-task deep learning-based radiomic (MTDLR) nomogram by combining the DeepMTS-Score, AutoRadio-Score, and clinical data. Univariate and multivariate analyses were performed for all clinical data and prediction scores via Cox proportional hazards regression, so as to screen out the prognostic indicators with significant relevance to PFS and build the nomogram. For comparison, we also built a conventional radiomic nomogram and a single-task deep learning-based radiomic nomogram by combining the ManualRadio-Score and SingleTask-Score with clinical data.

Statistical analysis
Continuous parameters were described using median or mean with range, while categorical variables were described using frequency with percentage. Differences among the training, internal validation, and external validation cohorts were analyzed using the Mann-Whitney test, χ² test, or Fisher's exact test.
Univariate and multivariate Cox analyses were performed using SPSS (version 26.0; IBM Inc., New York, NY, USA). All radiomic nomograms were developed based on the multivariate analyses. Calibration curves with the Hosmer-Lemeshow goodness-of-fit test were applied to evaluate the consistency between the observed PFS proportion and the predicted survival probability.
The prognostic performance of nomograms was evaluated using Harrell's concordance indices (C-index), time-independent receiver operating characteristic (ROC) curve, and area under curve (AUC). The statistical significance between AUCs was tested via DeLong's method using R packages (version 3.6.3, http://www.R-project.org). Survival analyses based on the Kaplan-Meier method were performed for risk group stratification. Patients with scores higher/lower than the cutoff value calculated by ROC were stratified into high/low-risk groups, and then a log-rank test was applied for comparisons. All tests were two-sided, and a P value < 0.05 was considered to indicate statistically significant differences.
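For readers unfamiliar with Harrell's C-index, a minimal sketch of its definition is given below. It is a plain pairwise implementation for illustration, not the software used in this study; the function name is hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplifying assumption.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ordered correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk progressed earlier
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: follow-up in months, event flag (1 = progression), predicted risk.
months = [5, 10, 15, 20]
progressed = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.4, 0.1]
print(harrell_c_index(months, progressed, risk))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and values below 0.5 indicate inverted ranking.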

Patient characteristics
The demographic and clinical characteristics of patients are presented in Table 1. The median age was 45 years (range 15-83 years), 48 years (range 14-79 years) and 48 years (range 14-74 years) for the training cohort, internal validation cohort, and external validation cohort, respectively. Among these three cohorts, no statistically significant difference was observed in age, gender, EBV DNA, T stage, N stage, and TNM stage, whereas BMI, LDH, histology, and PET parameters were statistically significantly different. At the end of the follow-up, the PFS ratio was 75.67% (395/522), 81.54% (106/130), and 80.77% (189/234) in the training, internal validation, and external validation cohorts, and there was no significant difference in PFS distribution among these cohorts (P = 0.163).

Establishment of MTDLR nomogram
Among the clinical and conventional PET parameters, only TNM stage was significantly associated with PFS in univariate analysis for the training cohort (P = 0.031, Table 2). However, none of these parameters showed a significant correlation with PFS in the internal and external validation cohorts. Notably, the DeepMTS-Score, SingleTask-Score, AutoRadio-Score, and ManualRadio-Score were all significantly associated with PFS in univariate analysis for the training, internal validation, and external validation cohorts. In multivariate analysis, the DeepMTS-Score and AutoRadio-Score served as independent factors for predicting disease progression in all three cohorts (Table 3).

Survival analysis for risk group stratification
The conventional radiomic nomogram (ManualRadio-Score + TNM), single-task deep learning-based radiomic nomogram (SingleTask-Score + TNM), and our MTDLR nomogram were used to stratify patients into high- and low-risk groups by cutoff values calculated with ROC curves. The Kaplan-Meier curves of the high- and low-risk patient groups are shown in Fig. 4. For comparison, the commonly used TNM stage was also adopted to stratify patients according to stage III or IVa, where the patients with stage IVa had significantly poorer prognosis than the patients with stage III in the training cohort (hazard ratio (HR): 1.541, 95% CI: 0.991-2.397, P = 0.029). However, the TNM stage failed to stratify patients into significantly different groups in the internal and external validation cohorts (HR: 1.457, 95% CI: 0.582-3.647, P = 0.381 and HR: 1.839, 95% CI: 0.861-3.928, P = 0.059, respectively). Figure 4 also shows that all three nomograms stratified patients into significantly different groups in all three cohorts (P < 0.001). Nevertheless, our MTDLR nomogram differentiated the high- and low-risk groups with the highest HR value among these three nomograms (HR: 10.250, 95% CI: 6.853-15.340, in the training cohort; HR: 7.519, 95% CI: 2.339-24.170, in the internal validation cohort; and HR: 4.812, 95% CI: 2.291-10.100, in the external validation cohort). In addition, the Kaplan-Meier curves of the patient groups stratified by ManualRadio-Score, SingleTask-Score, DeepMTS-Score, and AutoRadio-Score are presented in Online Resource Fig. 2.
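The stratification and Kaplan-Meier estimation described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the cutoff is passed in rather than derived from a ROC curve, and patients whose score equals the cutoff are assigned to the low-risk group, a detail the source does not specify.

```python
def stratify(risk_scores, cutoff):
    """Label each patient as high or low risk relative to a given cutoff."""
    return ["high" if s > cutoff else "low" for s in risk_scores]

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve = []
    survival = 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)                  # still followed at t
        failures = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - failures / at_risk                       # product-limit update
        curve.append((t, survival))
    return curve

# Toy cohort: follow-up in months and event flag (1 = progression, 0 = censored).
months = [1, 2, 3, 4]
progressed = [1, 0, 1, 1]
curve = kaplan_meier(months, progressed)
groups = stratify([0.8, 0.2, 0.5, 0.9], cutoff=0.5)
print(curve, groups)
```

In practice the estimator would be applied separately to the high- and low-risk groups, and the two resulting curves compared with a log-rank test.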
LA-NPC patients can be stratified into low- and high-risk groups, where the high-risk group was characterized by worse PFS rates than the low-risk group. The TNM staging system, focusing on anatomical and locational information, has been widely used in clinical studies [7][8][9] but, unfortunately, was not an independent prognostic factor in our study (Table 3). Nevertheless, we identified that combining TNM stage with other prognostic scores still improved the prognostic performance, which is consistent with the findings reported in previous studies [8,28,35]. FDG-PET/CT images, given their capability to provide metabolic and anatomical information about tumors, have also been widely used for prognostic prediction [41][42][43]. However, the conventional FDG-PET/CT-derived parameters (SUV, MTV, and TLG) did not serve as effective prognostic indicators in our univariate analysis (Table 2). To further leverage the prognostic information in FDG-PET/CT images, radiomics or deep learning were adopted and showed superiority over conventional parameters [28,44]. Nevertheless, the prognostic performance varied with different radiomics or deep learning models, which suggests that the prognostic information in FDG-PET/CT images cannot be easily accessed and should be carefully leveraged with well-developed models.
Currently, there is a dilemma in extracting prognostic information from medical images. As discussed, conventional radiomics can characterize the intra-tumor information well, but it is limited to the segmented tumor regions. Deep learning can access the prognostic information in the entire images. However, it has difficulties in extracting tumor-specific information. In this study, we adopted a deep multi-task survival model (DeepMTS) [29] to address this dilemma. It has been demonstrated that, through jointly learning the tumor segmentation task with a hybrid multi-task architecture, DeepMTS can effectively extract prognostic information from tumor regions while also capturing the out-of-tumor prognostic information, which enables DeepMTS to outperform existing radiomics- or deep learning-based prognostic prediction models [29]. Nevertheless, we noticed that the segmentation output of DeepMTS was not fully leveraged for prognostic prediction and the prognostic information within tumor regions could be further explored. Therefore, we used the DeepMTS-segmented tumor masks for automatic radiomics analysis, which further explored the intra-tumor prognostic information and removed the reliance of conventional radiomics on manual segmentation. For tumor segmentation, the DeepMTS achieved a Dice Similarity Coefficient (DSC) of 0.826, 0.775, and 0.765 on the training, internal validation, and external validation cohorts, which demonstrates great consistency with the manually delineated segmentation masks. It has been reported that automatic segmentation improved the objectiveness [45] and resulted in significantly better prognostic prediction performance than manual segmentation [46], which potentially enables better radiomics analysis and facilitates the final prognostic prediction [47,48].
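The Dice Similarity Coefficient reported above measures the overlap between a predicted mask A and a manual mask B as DSC = 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened binary masks is given below; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity between two binary masks given as flat sequences."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # both masks empty; treating this as perfect overlap is a convention
    return 2.0 * intersection / total

# Toy masks of four voxels each (illustrative values only).
predicted = [1, 1, 0, 0]
manual = [1, 0, 1, 0]
print(dice_coefficient(predicted, manual))
```

Here one voxel overlaps and each mask contains two foreground voxels, so the DSC is 2 × 1 / (2 + 2) = 0.5.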
The prognostic scores from DeepMTS and automatic radiomics were combined with clinical data to build the MTDLR nomogram, which leveraged both FDG-PET/CT and clinical information and also improved the interpretability of prediction. Our MTDLR nomogram achieved the best prognostic performance among all compared prognostic scores and nomograms (Table 4), which could be attributed to three factors. First, the DeepMTS produced more discriminative prognostic scores (DeepMTS-Score) than the commonly used single-task deep survival model (SingleTask-Score). Second, the automatic radiomics also produced more discriminative prognostic scores (AutoRadio-Score) than conventional radiomics (ManualRadio-Score). Finally, the DeepMTS-Score and AutoRadio-Score were combined to achieve better prognostic prediction. The strategy of combining multi-task deep learning and radiomics has been adopted for prognostic prediction in head and neck cancer [48] and achieved one of the top prognostic performances in the HEad and neCK TumOR segmentation and outcome prediction (HECKTOR 2022) challenge [49]. Our study further validated this strategy with a large database of NPC patients.
We divided patients based on our MTDLR nomogram and found that the MTDLR nomogram effectively stratified LA-NPC patients into significantly different risk groups, which is potentially beneficial for individualized treatment regimens. Induction chemotherapy (IC) plus concurrent chemoradiotherapy (CCRT) is recommended as 2A-level evidence according to the National Comprehensive Cancer Network (NCCN) guidelines [4]. However, this remains controversial, as a portion of LA-NPC patients do not benefit from IC. Qiang et al. [38] developed a prognostic system to explore whether high-risk or low-risk patients benefit more from IC + CCRT than from CCRT only. Zhong et al. [16] developed a deep learning-based radiomic nomogram to predict the prognosis of NPC patients with different regimens and accordingly recommend an optimal treatment regimen. These studies demonstrated the necessity of stratifying LA-NPC patients into different risk groups so as to optimize treatment regimens.
Our study has several limitations. First, the completeness and homogeneity of our data had deficiencies due to its retrospective nature. EBV status was missing for about 15% of patients, which might limit the accuracy of statistical analysis. Second, our study was conducted in endemic areas and thus only included patients with TNM stage III and IVa. Therefore, the MTDLR nomogram should be further validated with more extensive databases in future studies. However, it should be noted that we have validated our MTDLR nomogram in a large database (886 patients) with two validation cohorts, which supports the effectiveness of the MTDLR nomogram in LA-NPC.

Conclusion
In this study, we evaluated the value of multi-task learning for prognostic prediction in LA-NPC patients. To achieve this, we adopted a deep multi-task survival model (DeepMTS) and developed a multi-task deep learning-based radiomic (MTDLR) nomogram that combines TNM stage, DeepMTS-Score, and AutoRadio-Score. Compared to the conventional and single-task deep learning-based radiomic nomograms, the MTDLR nomogram extracted more heterogeneous and prognostic information to better predict the prognosis of LA-NPC patients. We validated our MTDLR nomogram with a large LA-NPC database with two (internal/external) validation cohorts, which supports the effectiveness of the MTDLR nomogram and its potential contributions to clinical decision making.

Fig. 1 Workflow of multi-task deep learning-based radiomics analysis

Fig. 2 Nomogram and calibration curves. a An integrated MTDLR nomogram was built with TNM stage, DeepMTS-derived prognostic prediction score (DeepMTS-Score), and DeepMTS-derived automatic radiomics score (AutoRadio-Score) to predict 3-year and 5-year PFS probability. For calculating the 3-year and 5-year PFS probability with the nomogram, firstly, we locate the patient's TNM stage and draw a line straight upward to the "Points" axis to determine the points associated with the corresponding TNM stage. Then, we repeat the process for DeepMTS-Score and AutoRadio-Score, and sum the total points achieved for the three covariates. Lastly, we locate this sum on

Fig. 3 ROC curves for comparison among different clinical, conventional, and deep learning-based radiomics scores/nomograms on the training (a), internal validation (b), and external validation (c) cohorts

Table 1
Demographic and clinical characteristics of patients. WHO World Health Organization, BMI body mass index, LDH lactate dehydrogenase, IMRT intensity-modulated radiation therapy, IC induction chemotherapy, CCRT concurrent chemoradiotherapy, AC adjuvant chemotherapy, SUV standardized uptake value, MTV metabolic tumor volume, TLG total lesion glycolysis, PFS progression-free survival

Table 2
Univariate Cox proportional hazard regression analysis for PFS on the training, internal validation, and external validation cohorts. P values less than 0.05 are in bold. PFS progression-free survival, HR hazard ratio, CI confidence interval, EBV Epstein-Barr virus, BMI body mass index, LDH lactate dehydrogenase, SUV standardized uptake value, MTV metabolic tumor volume, TLG total lesion glycolysis

Table 3
Multivariate Cox proportional hazard regression analysis for PFS on the training, internal validation, and external validation cohorts. P values less than 0.05 are in bold. PFS progression-free survival, HR hazard ratio, CI confidence interval

Table 4
C-index and AUC of different clinical, conventional, and deep learning-based radiomic scores/nomograms evaluated on the training, internal validation, and external validation cohorts. * Com1 means the P value was for the comparison of C-index, and Com2 was for AUC. The best result in each cohort is in bold. AUC area under the curve, CI confidence interval