Introduction

Nasopharyngeal carcinoma (NPC) is an epithelial malignancy arising from the nasopharyngeal mucosal lining [1], with high prevalence in East and Southeast Asia [2]. About 70%-80% of NPC patients are diagnosed with locoregionally advanced NPC (LA-NPC) (Tumor-Node-Metastasis (TNM) stage III or IVa) according to the 8th edition of the American Joint Committee on Cancer (AJCC)/Union for International Cancer Control (UICC) staging system [3]. Because NPC is radiosensitive, the primary therapeutic regimen is radiation therapy (RT) with or without chemotherapy [4]. However, despite improvements in treatment, the 5-year survival rate of LA-NPC patients remains a persistent problem because of locoregional recurrence and distant metastasis, usually ranging from 10 to 40% [5]. Under these circumstances, pretreatment prognostic prediction is a major concern for LA-NPC patients because it helps guide individualized therapeutic regimens. Specifically, based on the pretreatment prognosis, patients can be stratified into risk groups that receive different therapeutic regimens, which has been reported to potentially improve patients’ overall survival outcomes [6].

The TNM staging system is widely used for prognostic prediction and patient stratification [7,8,9]. However, although patients with the same TNM stage receive the same treatment, large variations in prognosis exist due to the heterogeneous nature of the tumor microenvironment [10]. Image-derived biomarkers, such as the standardized uptake value (SUV) and metabolic tumor volume (MTV) derived from [18F]-fluorodeoxyglucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT), can provide promising prognostic information for NPC [11, 12]. Nevertheless, these biomarkers are limited in clinical practice because they struggle to represent intra-tumor information such as tumor texture, intensity, heterogeneity, and morphology. Therefore, a reliable and accurate prognostic prediction model is needed to predict patients’ progression-free survival (PFS) and to distinguish high-risk from low-risk patients. Such prediction will ultimately facilitate the formulation of therapeutic regimens and improve patients’ overall survival outcomes.

Radiomics is a widely recognized computational method for prognostic prediction, which extracts high-dimensional handcrafted features from medical images to characterize intra-tumor information and then models the relevance between the features and prognostic outcomes through statistical methods [13, 14]. Radiomics has been widely used for prognostic prediction in various cancers including NPC [15,16,17]. However, the extraction of radiomics features requires tumor segmentation masks as the guidance, which inevitably brings an additional segmentation step into the radiomics pipeline. In addition, radiomics features are extracted from the segmented regions, which are usually limited to primary and metastatic lesions [5, 18]. This suggests that the extracted radiomics features may have difficulties in representing the prognostic information outside of malignant lesions (e.g., adjacent tissue invasion). There have been attempts at leveraging lymph node segmentation for radiomics analysis [19,20,21]. However, lymph node segmentation is intractable and the adjacent tissue invasion has not been considered yet. This limitation is more critical for LA-NPC patients, as many vital tissues and organs adjacent to the nasopharynx (e.g., brain, ethmoidal sinus, and orbit) might have already been invaded by LA-NPC [22].

Deep learning is an alternative approach to prognostic prediction and is becoming popular in the literature [15, 23, 24]. Deep survival models usually adopt convolutional neural networks (CNNs) to extract image features and then perform end-to-end prediction from medical images, where tumor segmentation masks are often not required [25]. Without tumor masks as constraints, deep survival models can potentially leverage the prognostic information present in the entire image. Deep survival models have demonstrated the potential to outperform conventional radiomics-based prognostic prediction models [26,27,28]. However, performing end-to-end prediction without tumor masks introduces interference from irrelevant background information and makes it difficult to extract tumor-specific information. Recently, multi-task deep survival models were explored to perform prognostic prediction jointly with tumor segmentation [29,30,31], which implicitly guides the model to extract tumor-related information without discarding out-of-tumor information. However, the value of multi-task deep learning for prognostic prediction in LA-NPC has not been validated with large patient cohorts. In addition, deep survival models are limited by their ‘black box’ nature [32], which undermines their generalizability in clinical practice.

Nomograms are a common tool for guiding individualized treatment because they simplify complicated prognostic models into a numerical estimate of survival probability and provide a clear visual illustration of the factors leading to the prediction [33, 34]. Zhang et al. [5] developed a multiparametric magnetic resonance imaging (MRI)-based radiomic nomogram, which provides an illustrative example of precision medicine and prognostic prediction. Peng et al. [15] developed a deep learning FDG-PET/CT-based nomogram that may act as an indicator for individualized induction chemotherapy (IC) in advanced NPC. Pan et al. [3] developed a radiomic nomogram with better prognostic performance than the 8th edition of the AJCC/UICC staging system. Nevertheless, an external validation cohort showed that Pan et al.’s nomogram underestimated the 5-year overall survival (OS) of LA-NPC patients [35]. Therefore, a more reliable and accurate prognostic nomogram is still needed for LA-NPC patients.

In this study, we aim to evaluate the value of multi-task deep learning for prognostic prediction in LA-NPC patients with a large database acquired from two medical centers. We adopted the state-of-the-art deep multi-task survival model (DeepMTS) [29] for joint prognostic prediction and tumor segmentation from pretreatment FDG-PET/CT images, which predicts a survival risk score (DeepMTS-Score) and a tumor segmentation mask for each LA-NPC patient. The DeepMTS-Score can be used directly for prognostic prediction, while the predicted tumor masks were leveraged for prognostic prediction through radiomics analysis (AutoRadio-Score). We further developed a multi-task deep learning-based radiomic (MTDLR) nomogram by integrating the DeepMTS-Score, AutoRadio-Score, and clinical data, so as to improve the accuracy and interpretability of prognostic prediction. Compared with conventional radiomic nomograms, our MTDLR nomogram achieved better prognostic performance and enabled better patient stratification, demonstrating its potential to facilitate personalized treatment planning.

Materials and methods

Patients

Between May 2009 and May 2019, the medical records of 903 NPC patients were collected from Fudan University Shanghai Cancer Center (FUSCC) and Shanghai Proton and Heavy Ion Center (SPHIC). The inclusion criteria were as follows: (1) histologically confirmed LA-NPC (TNM stage III or IVa); (2) received concomitant systemic treatment with intensity-modulated radiotherapy (IMRT); (3) underwent pretreatment FDG-PET/CT scans; and (4) available clinical data and FDG-PET/CT images. Patients with previous chemotherapy/radiotherapy or other malignant tumors were excluded. Finally, 652 patients from FUSCC and 234 patients from SPHIC were enrolled in this study. Patients from FUSCC were randomly divided into a training cohort (n = 522) and an internal validation cohort (n = 130) at a 4:1 ratio, while patients from SPHIC (n = 234) served as an external validation cohort and were used solely for evaluation.
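The 4:1 random split of the FUSCC patients can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the patient identifiers, the fixed seed, and the function name `split_cohort` are assumptions introduced here.

```python
import random


def split_cohort(patient_ids, train_fraction=0.8, seed=0):
    """Randomly split patient IDs into training and internal validation lists."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # assumed seed; the study does not report one
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)  # 4:1 ratio -> 80% for training
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers standing in for the 652 enrolled FUSCC patients.
fuscc_ids = [f"FUSCC-{i:03d}" for i in range(652)]
train_ids, internal_val_ids = split_cohort(fuscc_ids)
print(len(train_ids), len(internal_val_ids))  # 522 130
```

With 652 patients, an 80% training fraction rounds to 522, which leaves 130 for internal validation and matches the cohort sizes reported above.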

After completion of initial treatment, each patient was followed up every 3 months in the first 2 years, then every 6 months in the third to fifth years, and annually thereafter. The follow-up endpoint of this study was PFS, defined as the time from randomization to the date of disease progression or death from any cause. The median follow-up time was 50 months (range 44 to 120 months) for FUSCC and 49 months (range 44 to 97 months) for SPHIC. The FUSCC and SPHIC Ethical Committees approved this retrospective study, and informed consent was obtained from all enrolled patients.

PET/CT imaging

FDG-PET/CT images were obtained on a Siemens Biograph 16HR PET/CT scanner (Knoxville, Tennessee, USA). The FDG-PET/CT data acquisition procedure is detailed in the Online Resource.

For quantitative analysis, the maximum and mean standardized uptake values (SUVmax and SUVmean), normalized to body weight, and the metabolic tumor volume (MTV) were manually computed for tumor lesions by drawing a 3-dimensional volume of interest (VOI). Total lesion glycolysis (TLG) was then calculated as TLG = SUVmean × MTV, where SUVmean and MTV were recorded at an SUV threshold of 2.5.
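As a rough illustration of how these conventional PET parameters relate to one another, the sketch below computes them from a flat list of SUV values inside a VOI. It makes several assumptions that the text does not specify: voxels are treated as tumor when their SUV is greater than or equal to the threshold, SUVmax is taken over the whole VOI, and the function name and voxel volume are hypothetical.

```python
def pet_parameters(suv_values, voxel_volume_ml, threshold=2.5):
    """Return SUVmax, SUVmean, MTV (mL), and TLG for the voxels of one VOI."""
    suv_values = list(suv_values)
    if not suv_values:
        raise ValueError("VOI contains no voxels")
    # Assumption: a voxel counts as tumor when its SUV reaches the threshold.
    tumor = [v for v in suv_values if v >= threshold]
    suv_max = max(suv_values)
    mtv = len(tumor) * voxel_volume_ml
    suv_mean = sum(tumor) / len(tumor) if tumor else 0.0
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MTV": mtv, "TLG": suv_mean * mtv}


# Toy VOI with five voxels of 0.5 mL each; three of them exceed the threshold.
params = pet_parameters([1.0, 2.0, 3.0, 5.0, 7.0], voxel_volume_ml=0.5)
print(params)
```

In this toy case the three supra-threshold voxels give SUVmean = 5.0 and MTV = 1.5 mL, so TLG = 5.0 × 1.5 = 7.5.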

Multi-task deep learning-based radiomics analysis

The workflow of multi-task deep learning-based radiomics analysis is illustrated in Fig. 1, which presents a three-step pipeline including multi-task deep learning model construction, automatic radiomics analysis, and nomogram construction.

Fig. 1

Workflow of multi-task deep learning-based radiomics analysis

We adopted a deep multi-task survival model (DeepMTS) [29] for joint prognostic prediction and tumor segmentation from FDG-PET/CT images. We preprocessed FDG-PET/CT images with resampling, SUV conversion (for PET only), affine registration, Regions-of-Interest (ROIs) cropping, and intensity normalization (detailed in Online Resource). The preprocessed PET and CT images were concatenated and fed into the DeepMTS as input, while the manual segmentation masks of primary tumors were used as ground truth labels for training only. The DeepMTS is a CNN consisting of a Unet-based segmentation backbone [36] and a DenseNet-based cascaded survival network (CSN) [37]. The Unet is a U-shape encoder-decoder CNN with skip connections between its contracting encoder and expanding decoder [36]. The DenseNet is a CNN consisting of multiple dense blocks with dense connections between layers, which enables feature reuse to enhance the capacity to generalize to unseen data [37]. The segmentation backbone is hard-shared by prognostic prediction and tumor segmentation tasks, which implicitly guides the model to extract features related to tumor regions. The outputs of the segmentation backbone are fed into the CSN as a supplementary input (together with FDG-PET/CT images), which further leverages the global tumor information (e.g., tumor size, shape, and locations) for prognostic prediction. Deep features derived from both segmentation backbone and CSN are used for prognostic prediction via two fully-connected layers. After training, DeepMTS can predict the survival risk scores of patients (DeepMTS-Score) and the segmentation masks of tumor regions. The DeepMTS-Score is relevant to PFS and can be directly used for prognostic prediction, while the predicted tumor masks were further leveraged in the following automatic radiomics analysis. The architecture of DeepMTS is detailed in [29] and its implementation code is publicly available at https://github.com/MungoMeng/Survival-DeepMTS. 
We also provide more training details in Online Resource. For comparison, we also built a single-task deep survival model for prognostic prediction, following Qiang et al.’s study [38], and its output scores are denoted by SingleTask-Score.

With the tumor masks predicted by DeepMTS, we extracted 1456 handcrafted radiomics features from FDG-PET/CT images via Pyradiomics [39], including 720 PET features, 720 CT features, and 16 shape features based on 3D shape of tumors (detailed in Online Resource). The extracted features were analyzed by a Lasso-Cox model [40], whose output scores are denoted by AutoRadio-Score. We refer to this radiomics process as automatic radiomics, which differentiates it from conventional radiomics based on manual segmentation. For comparison, we also performed the same radiomics analysis based on manual segmentation masks and refer to the output scores as ManualRadio-Score.
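For readers unfamiliar with the Lasso-Cox model, its standard formulation can be written as below. The notation is introduced here for illustration and is not taken from the study: x_i is the radiomics feature vector of patient i, δ_i is the event indicator (1 for progression or death), R(t_i) is the set of patients still at risk at time t_i, and λ ≥ 0 is the penalty strength. The form shown assumes no tied event times, and how λ was selected is not stated in this section.

```latex
\hat{\beta} = \arg\max_{\beta}
\left[
  \sum_{i:\,\delta_i = 1}
  \left(
    x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\left(x_j^{\top}\beta\right)
  \right)
  - \lambda \lVert \beta \rVert_1
\right],
\qquad
\mathrm{Score}(x) = x^{\top}\hat{\beta}
```

The L1 penalty drives many coefficients to exactly zero, so the resulting score is a weighted sum of a small subset of the 1456 features.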

After the DeepMTS-Score and AutoRadio-Score were derived, we developed a multi-task deep learning-based radiomic (MTDLR) nomogram by combining the DeepMTS-Score, AutoRadio-Score, and clinical data. Univariate and multivariate analyses were performed for all clinical data and prediction scores via Cox proportional hazards regression, so as to identify the prognostic indicators significantly associated with PFS and to build the nomogram. For comparison, we also built a conventional radiomic nomogram and a single-task deep learning-based radiomic nomogram by combining clinical data with the ManualRadio-Score and the SingleTask-Score, respectively.
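A nomogram built from a multivariate Cox model is a graphical form of a simple calculation: each covariate contributes its coefficient times its value to a linear predictor, and the survival probability at a fixed time is the baseline survival raised to the exponential of that predictor. The sketch below illustrates this relationship only. The coefficients, covariate values, and baseline survival are placeholders chosen for illustration, not fitted values from this study.

```python
import math

# Hypothetical log hazard ratios; NOT the coefficients fitted in this study.
EXAMPLE_COEFFICIENTS = {
    "tnm_stage_iva": 0.4,    # 1 for stage IVa, 0 for stage III
    "deepmts_score": 1.2,
    "autoradio_score": 0.9,
}


def linear_predictor(covariates, coefficients):
    """Sum of coefficient * covariate value, as in a Cox model."""
    return sum(coefficients[name] * covariates[name] for name in coefficients)


def pfs_probability(covariates, coefficients, baseline_survival):
    """Cox relation S(t | x) = S0(t) ** exp(linear predictor)."""
    return baseline_survival ** math.exp(linear_predictor(covariates, coefficients))


# Two illustrative patients and an assumed baseline 3-year PFS of 0.85.
low_risk = {"tnm_stage_iva": 0, "deepmts_score": -0.5, "autoradio_score": -0.3}
high_risk = {"tnm_stage_iva": 1, "deepmts_score": 0.8, "autoradio_score": 0.6}
p_low = pfs_probability(low_risk, EXAMPLE_COEFFICIENTS, baseline_survival=0.85)
p_high = pfs_probability(high_risk, EXAMPLE_COEFFICIENTS, baseline_survival=0.85)
print(round(p_low, 3), round(p_high, 3))
```

The "Points" axes of a nomogram rescale each coefficient-times-value term to a common scale, so summing points is equivalent to summing the terms of the linear predictor.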

Statistical analysis

Continuous parameters were described using median or mean with range, while categorical variables were described using frequency with percentage. Differences among the training, internal validation, and external validation cohorts were analyzed using the Mann–Whitney test, χ2 test, or Fisher’s exact test.

Univariate and multivariate Cox analyses were performed using SPSS (version 26.0; IBM Inc., New York, NY, USA). All radiomic nomograms were developed based on the multivariate analyses. Calibration curves with the Hosmer–Lemeshow goodness-of-fit test were used to evaluate the consistency between the observed PFS proportion and the predicted survival probability.

The prognostic performance of the nomograms was evaluated using Harrell's concordance index (C-index), the time-independent receiver operating characteristic (ROC) curve, and the area under the curve (AUC). The statistical significance of differences between AUCs was tested via DeLong’s method using R packages (version 3.6.3, http://www.R-project.org). Survival analyses based on the Kaplan–Meier method were performed for risk group stratification. Patients with scores higher or lower than the cutoff value calculated from the ROC curve were stratified into high- or low-risk groups, respectively, and the groups were compared with a log-rank test. All tests were two-sided, and a P value < 0.05 was considered statistically significant.
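To make the C-index concrete, the following sketch implements Harrell's definition in plain Python. It is a simplified illustration rather than the software used in the study: pairs with tied follow-up times are skipped, and tied risk scores count as half-concordant, which is one common convention.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ordered correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an event strictly
            # before patient j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk progressed earlier
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up in months, event flag (1 = progression/death), risk score.
times = [10, 20, 30, 40]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.6, 0.7, 0.1]
c_index = harrell_c_index(times, events, risk_scores)
print(c_index)  # 0.8
```

In the toy data, five pairs are comparable and four of them are ordered correctly, giving a C-index of 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.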

Results

Patient characteristics

The demographic and clinical characteristics of patients are presented in Table 1. The median age was 45 years (range 15–83 years), 48 years (range 14–79 years), and 48 years (range 14–74 years) for the training, internal validation, and external validation cohorts, respectively. Among these three cohorts, no statistically significant difference was observed in age, gender, EBV DNA, T stage, N stage, or TNM stage, whereas BMI, LDH, histology, and PET parameters differed significantly. At the end of follow-up, the PFS ratio was 75.67% (395/522), 81.54% (106/130), and 80.77% (189/234) in the training, internal validation, and external validation cohorts, respectively, and there was no significant difference in PFS distribution among these cohorts (P = 0.163).

Table 1 Demographic and clinical characteristics of patients

Establishment of MTDLR nomogram

Among the clinical and conventional PET parameters, only TNM stage was significantly associated with PFS in univariate analysis for the training cohort (P = 0.031, Table 2). However, none of these parameters showed a significant correlation with PFS in the internal and external validation cohorts. Notably, the DeepMTS-Score, SingleTask-Score, AutoRadio-Score, and ManualRadio-Score were all significantly associated with PFS in univariate analysis for the training, internal validation, and external validation cohorts. In multivariate analysis, the DeepMTS-Score and AutoRadio-Score served as independent factors for predicting disease progression in all three cohorts (Table 3).

Table 2 Univariate Cox proportional hazard regression analysis for PFS on the training, internal validation, and external validation cohorts
Table 3 Multivariate Cox proportional hazard regression analysis for PFS on the training, internal validation, and external validation cohorts

Based on the multivariate analysis, we built the MTDLR nomogram with TNM stage, AutoRadio-Score, and DeepMTS-Score (Fig. 2a). The C-index of the nomogram was 0.818 (95% confidence interval (CI): 0.785–0.851, P < 0.001), 0.752 (95% CI: 0.638–0.865, P < 0.001), and 0.717 (95% CI: 0.641–0.793, P < 0.001) in the training, internal validation, and external validation cohorts, respectively. Furthermore, the calibration curves showed that the 3-year and 5-year PFS probabilities predicted by the nomogram were highly consistent with the observed PFS probabilities (Hosmer–Lemeshow test: P > 0.05, Fig. 2b and c).

Fig. 2

Nomogram and calibration curves. a An integrated MTDLR nomogram was built with TNM stage, DeepMTS-derived prognostic prediction score (DeepMTS-Score), and DeepMTS-derived automatic radiomics score (AutoRadio-Score) to predict 3-year and 5-year PFS probability. For calculating the 3-year and 5-year PFS probability with the nomogram, firstly, we locate the patient’s TNM stage and draw a line straight upward to the “Points” axis to determine the points associated with the corresponding TNM stage. Then, we repeat the process for DeepMTS-Score and AutoRadio-Score, and sum the total points achieved for the three covariates. Lastly, we locate this sum on the “Total Points” axis, and draw a line straight down to determine the probability of 3-year and 5-year PFS. b The 3-year and c 5-year PFS calibration curves of the integrated MTDLR nomogram in the training, internal validation, and external validation cohorts. The actual PFS probability is plotted on the y-axis, while nomogram predicted probability is plotted on the x-axis. The P value of calibration was calculated by Hosmer–Lemeshow goodness-of-fit test, and P value > 0.05 indicates the good match between the actual and predicted PFS probability

Performance of radiomic nomograms

To evaluate the prognostic performance of our MTDLR nomogram, the conventional radiomic nomogram (ManualRadio-Score + TNM) and the single-task deep learning-based radiomic nomogram (SingleTask-Score + TNM) were compared (Online Resource Fig. 1 and Table 1). Table 4 shows that our DeepMTS-Score exhibits better prognostic performance than the SingleTask-Score in the training (C-index and AUC: 0.780 and 0.819; Fig. 3a), internal validation (0.731 and 0.750; Fig. 3b), and external validation cohorts (0.695 and 0.702; Fig. 3c). Furthermore, the AutoRadio-Score also shows better prognostic performance than the ManualRadio-Score in these three cohorts (C-index: 0.728, 0.702, and 0.669; AUC: 0.751, 0.706, and 0.704). Moreover, the MTDLR nomogram combining TNM stage, DeepMTS-Score, and AutoRadio-Score achieved the best prognostic performance among all prognostic scores and nomograms in all three cohorts (C-index: 0.818, 0.752, and 0.717; AUC: 0.859, 0.769, and 0.730).

Table 4 C-index and AUC of different clinical, conventional, and deep learning-based radiomic scores/nomograms evaluated on the training, internal validation, and external validation cohorts
Fig. 3

ROC curves for comparison among different clinical, conventional, and deep learning-based radiomics scores/nomograms on the training (a), internal validation (b), and external validation (c) cohorts

Survival analysis for risk group stratification

The conventional radiomic nomogram (ManualRadio-Score + TNM), single-task deep learning-based radiomic nomogram (SingleTask-Score + TNM), and our MTDLR nomogram were used to stratify patients into high- and low-risk groups by cutoff values calculated from ROC curves. The Kaplan–Meier curves of the high- and low-risk patient groups are shown in Fig. 4. For comparison, the commonly used TNM stage was also adopted to stratify patients according to stage III or IVa, where patients with stage IVa had significantly poorer prognosis than patients with stage III in the training cohort (hazard ratio (HR): 1.541, 95% CI: 0.991–2.397, P = 0.029). However, TNM stage failed to stratify patients into significantly different groups in the internal and external validation cohorts (HR: 1.457, 95% CI: 0.582–3.647, P = 0.381 and HR: 1.839, 95% CI: 0.861–3.928, P = 0.059, respectively). Figure 4 also shows that all three nomograms stratified patients into significantly different groups in all three cohorts (P < 0.001). Nevertheless, our MTDLR nomogram differentiated the high- and low-risk groups with the highest HR among these three nomograms (HR: 10.250, 95% CI: 6.853–15.340, in the training cohort; HR: 7.519, 95% CI: 2.339–24.170, in the internal validation cohort; and HR: 4.812, 95% CI: 2.291–10.100, in the external validation cohort). In addition, the Kaplan–Meier curves of the patient groups stratified by ManualRadio-Score, SingleTask-Score, DeepMTS-Score, and AutoRadio-Score are presented in Online Resource Fig. 2.

Fig. 4

Kaplan–Meier curves of risk group stratification based on TNM stage, ManualRadio-Score + TNM, SingleTask-Score + TNM, and MTDLR nomogram on the training, internal validation, and external validation cohorts

Discussion

In this study, we constructed a multi-task deep learning-based radiomic (MTDLR) nomogram to predict the PFS of LA-NPC patients. The prognostic prediction and risk stratification performance of the MTDLR nomogram was superior to the conventional radiomic nomogram and single-task deep learning-based radiomic nomogram. LA-NPC patients can be stratified into low- and high-risk groups, where the high-risk group was characterized by worse PFS rates than the low-risk group.

The TNM staging system, which focuses on anatomical and locational information, has been widely used in clinical studies [7,8,9] but was not an independent prognostic factor in our study (Table 3). Nevertheless, we found that combining TNM stage with other prognostic scores still improved the prognostic performance, which is consistent with the findings reported in previous studies [8, 28, 35]. FDG-PET/CT images, which provide both metabolic and anatomical information about tumors, have also been widely used for prognostic prediction [41,42,43]. However, the conventional FDG-PET/CT-derived parameters (SUV, MTV, and TLG) did not serve as effective prognostic indicators in our univariate analysis (Table 2). To further leverage the prognostic information in FDG-PET/CT images, radiomics or deep learning has been adopted and has shown superiority over conventional parameters [28, 44]. Nevertheless, the prognostic performance varied across radiomics and deep learning models, which suggests that the prognostic information in FDG-PET/CT images is not easily accessed and should be leveraged carefully with well-developed models.

Currently, there is a dilemma for extracting prognostic information from medical images. As discussed, conventional radiomics can well characterize the intra-tumor information while it is limited to the segmented tumor regions. Deep learning can access the prognostic information in the entire images. However, it has difficulties in extracting tumor-specific information. In this study, we adopted a deep multi-task survival model (DeepMTS) [29] to address this dilemma. It has been demonstrated that, through jointly learning tumor segmentation task with a hybrid multi-task architecture, DeepMTS can effectively extract prognostic information from tumor regions while also capturing the out-of-tumor prognostic information, which enables DeepMTS to outperform existing radiomics- or deep learning-based prognostic prediction models [29]. Nevertheless, we noticed that the segmentation output of DeepMTS was not fully leveraged for prognostic prediction and the prognostic information within tumor regions could be further explored. Therefore, we used the DeepMTS-segmented tumor masks for automatic radiomics analysis, which further explored the intra-tumor prognostic information and removed the reliance of conventional radiomics on manual segmentation. For tumor segmentation, the DeepMTS achieved a Dice Similarity Coefficient (DSC) of 0.826, 0.775, and 0.765 on the training, internal validation, and external validation cohorts, which demonstrates great consistency with the manually delineated segmentation masks. It has been reported that automatic segmentations improved the objectiveness [45] and resulted in significantly better prognostic prediction performance than manual segmentation [46], which potentially enables better radiomics analysis and facilitates the final prognostic prediction [47, 48].
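For reference, the Dice Similarity Coefficient reported above measures the overlap between a predicted mask and a reference mask as twice the number of shared foreground voxels divided by the total number of foreground voxels in both masks. The sketch below uses flattened binary masks; the function name and toy masks are illustrative, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_coefficient(predicted, reference):
    """Dice Similarity Coefficient between two flattened binary masks."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    # Assumed convention: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 3 foreground voxels each, 2 of them shared.
predicted_mask = [1, 1, 1, 0, 0, 0]
manual_mask = [0, 1, 1, 1, 0, 0]
dsc = dice_coefficient(predicted_mask, manual_mask)
print(round(dsc, 3))  # 0.667
```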

The prognostic scores from DeepMTS and automatic radiomics were combined with clinical data to build the MTDLR nomogram, which leveraged both FDG-PET/CT and clinical information and also improved the interpretability of the prediction. Our MTDLR nomogram achieved the best prognostic performance among all compared prognostic scores and nomograms (Table 4), which can be attributed to three factors. First, DeepMTS produced more discriminative prognostic scores (DeepMTS-Score) than the commonly used single-task deep survival model (SingleTask-Score). Second, automatic radiomics also produced more discriminative prognostic scores (AutoRadio-Score) than conventional radiomics (ManualRadio-Score). Finally, the DeepMTS-Score and AutoRadio-Score were combined to achieve better prognostic prediction. The strategy of combining multi-task deep learning and radiomics has been adopted for prognostic prediction in head and neck cancer [48] and achieved one of the top prognostic performances in the HEad and neCK TumOR segmentation and outcome prediction (HECKTOR 2022) challenge [49]. Our study further validated this strategy with a large database of NPC patients.

We divided patients based on our MTDLR nomogram and found that it effectively stratified LA-NPC patients into significantly different risk groups, which is potentially beneficial for individualized treatment regimens. Induction chemotherapy (IC) plus concurrent chemoradiotherapy (CCRT) is recommended with 2A-level evidence according to the National Comprehensive Cancer Network (NCCN) guidelines [4]. However, this recommendation remains controversial because a portion of LA-NPC patients do not benefit from IC. Qiang et al. [38] developed a prognostic system to explore whether high-risk or low-risk patients benefit more from IC + CCRT than from CCRT alone. Zhong et al. [16] developed a deep learning-based radiomic nomogram to predict the prognosis of NPC patients under different regimens and to recommend an optimal treatment regimen accordingly. These studies demonstrated the necessity of stratifying LA-NPC patients into different risk groups so as to optimize treatment regimens.

Our study has several limitations. First, the completeness and homogeneity of our data were limited by the retrospective design. EBV status was missing for about 15% of patients, which might limit the accuracy of the statistical analysis. Second, our study was conducted in endemic areas and only included patients with TNM stage III and IVa. Therefore, the MTDLR nomogram should be further validated with more extensive databases in future studies. However, it should be noted that we validated our MTDLR nomogram in a large database (886 patients) with two validation cohorts, which supports the effectiveness of the MTDLR nomogram in LA-NPC.

Conclusion

In this study, we evaluated the value of multi-task learning for prognostic prediction in LA-NPC patients. To achieve this, we adopted a deep multi-task survival model (DeepMTS) and developed a multi-task deep learning-based radiomic (MTDLR) nomogram that combines TNM stage, DeepMTS-Score, and AutoRadio-Score. Compared with the conventional and single-task deep learning-based radiomic nomograms, the MTDLR nomogram extracted more heterogeneous prognostic information to better predict the prognosis of LA-NPC patients. We validated our MTDLR nomogram on a large LA-NPC database with two (internal and external) validation cohorts, which supports the effectiveness of the MTDLR nomogram and its potential contribution to clinical decision making.