Background

Papillary thyroid cancer (PTC) is one of the most common types of differentiated thyroid cancer (DTC), accounting for more than 90% of DTC. Although the mortality rate for PTC patients is low, 10–30% of PTC patients will still experience recurrence or metastasis after initial treatment [1], which is the main cause of death in PTC patients. Therefore, accurate risk stratification and individualized treatment and follow-up strategies are essential for detecting recurrent disease early and improving the prognosis of PTC patients.

The 2009 American Thyroid Association (ATA) guidelines proposed a three-category system to estimate the likelihood of DTC patients developing structural recurrence during postoperative follow-up [2], and a revised ATA risk stratification system was proposed in 2015 [3]; this system has been widely used and validated in clinical practice. Although the ATA system is flexible and can easily estimate risk based only on surgical/histological findings, several prognostic factors in the system do not specify the cutoff values for risk stratification, especially considering heterogeneity among PTC patients; for example, thyroglobulin (Tg) is a specific product of thyroid follicular cells, and it has been demonstrated that a high level of postoperative thyroid stimulating hormone (TSH)-suppressed Tg is associated with a high risk of recurrent disease and mortality, with a wide range of suppressed Tg cutoff values [3]. Moreover, the prognostic value of several prognostic factors remains controversial in PTC, such as metastatic lymph node (LN) features (i.e., the LN metastasis ratio and extranodal extension) [4,5,6], inflammation-based markers (i.e., the neutrophil-to-lymphocyte ratio [NLR], the platelet-to-lymphocyte ratio [PLR], the lymphocyte-to-monocyte ratio [LMR], and the prognostic nutritional index [PNI]) [7,8,9], and metabolic-related markers (i.e., obesity and dyslipidemia) [10, 11], which therefore need to be confirmed.

In the age of precision medicine, there is considerable enthusiasm for estimating prognosis by relying on models that can simultaneously consider many factors and provide an estimate of absolute risk [12]. Machine learning (ML) provides a novel approach to achieve this goal and has advantages in incorporating a larger number of multidimensional variables with more dynamic interactions than traditional prognostic tools [13]. Briefly, ML accounts for partial, nonlinear relationships, or multiple coexistent states between variables and outcomes, and each variable in an ML model can have a variable weight according to the changes in other variables; therefore, it can realize more individual predictions [14].

ML approaches have already proven to be effective predictive tools for various types of tumors [15,16,17]; thus, ML models may also prove useful in PTC risk stratification. However, to the best of our knowledge, only a few studies have developed ML models for predicting death or recurrence in patients with thyroid cancer [18,19,20,21,22,23]. In addition, previous studies were partially limited by the lack of large datasets, single algorithms, inadequate variables, and incomplete model evaluation [18,19,20,21,22,23], which hindered clinicians from better understanding the application of ML in prognosis prediction for PTC. Thus, this study aimed to develop and validate multiple ML models to predict structural recurrence in PTC patients based on a large sample of PTC patients with comprehensive clinical variables.

Methods

Study design and population

The electronic medical records of patients with thyroid cancer treated at West China Hospital, Sichuan University, were fully screened to retrospectively review all PTC patients treated and followed up. We restricted our analyses to PTC patients who underwent thyroid surgery (with or without lymph node dissection) and radioiodine (131I) therapy at West China Hospital, Sichuan University between January 2009 and December 2018 (n = 6220). We excluded patients with unresected tumors (n = 275), initial distant metastasis (n = 152), or other malignancies combined (n = 46), patients with unavailable information about the 8th edition of the AJCC TNM staging system [24] or the 2015 ATA risk stratification [3] (n = 1081), patients with a positive TgAb (> 40 IU/mL) [25] or missing data on postoperative non-stimulated Tg (TSH < 30 µIU/ml) (n = 1592), and patients with a follow-up period shorter than 1 year (n = 830). Finally, a total of 2244 patients were included in the prediction models (Fig. 1A).

Fig. 1
figure 1

Flowchart of the study (A) and the analytic steps (B). PTC, papillary thyroid cancer; Tg, thyroglobulin; TgAb, anti-thyroglobulin antibody; LASSO, least absolute shrinkage and selection operator; LR, logistic regression; XGBoost, eXtreme gradient boosting; SVM, support vector machine; RF, random forest; NN, neural network; AUCs, area under the receiver operating characteristic curves

This study was conducted following the basic principles of the Declaration of Helsinki and was approved by the West China Hospital Clinical Trials and Biomedical Ethics Committee, Sichuan University (Approval in 2020, No. 678). Written informed consent was waived given the retrospective nature of the study.

Potential input variables

To include as many predictive variables as possible and ensure the availability and representation of variables from the database. We used 29 potential input variables consisting of four dimensions in this study: (1) demographic characteristics and comorbidities, including age, sex, race, smoking status, alcohol drinking status, comorbidity of diabetes status, comorbidity of hypertension status, and comorbidity of Hashimoto’s thyroiditis; (2) tumor-related variables, including histology, tumor diameter, tumor foci, tumor location, external thyroid invasion (ETE), BRAF (V600E) mutation, and postoperative non-stimulated Tg; (3) LN-related variables, including LN dissection, number of LNs dissected, extranodal extension (ENE), lymph node metastasis ratio (LNR), and N stage; and (4) metabolic and inflammatory markers, including body mass index (BMI), triglyceride, cholesterol, low-density lipoprotein (LDL), high-density lipoprotein (HDL), neutrophil–lymphocyte ratio (NLR), platelet–lymphocyte ratio (PLR), lymphocyte–monocyte ratio (LMR), and prognostic nutritional index (PNI).

The missing values of potential input variables are represented as a category when the missing value was ≥ 20%, and multiple imputations were used when the missing value was < 20%, as previously reported [26, 27] (Supplementary Data 1). Receiver operating characteristic (ROC) curve analysis was performed, and the Youden index was used to determine the cutoff value for continuous variables [28]. The details of the potential input variables are shown in Supplementary Data 2.

Definition of structural recurrence

The outcome of this study was structural recurrence that occurred during follow-up after initial thyroid surgery. According to the 2015 ATA guidelines [3], patients were considered to have structural recurrent disease if any of the following conditions were met: (1) structural disease confirmed by cytology/histology; (2) highly suspicious lymph nodes or thyroid bed nodules on neck ultrasound; or (3) highly suspicious metastatic disease on whole-body 131I scintigraphy, 18fluorodeoxyglucose positron emission tomography scans, or other cross-sectional imaging. The last follow-up date of this study was July 31, 2021.

Dataset split and resampling

Figure 1B shows the analysis steps of this study. Patients were randomly divided into a training dataset and a test dataset according to a 4 to 1 ratio. The training dataset was used to develop and optimize the models (n = 1795), and the test dataset was used to validate and compare the models (n = 449). To ensure that the models can effectively predict the outcome (minority class), one-sided selection (OSS) under-resampling method was used to establish a balanced training dataset [29] (Supplementary Data 3).

Variable selection

Spearman correlation analysis was used to evaluate the correlation of all variables with each other. Spearman’s correlation coefficients range from -1 to 1, and values of coefficients close to -1 or 1 represent stronger relationships than values closer to zero. Then, the least absolute shrinkage and selection operator (LASSO) method was used to select the input variables. LASSO formulates curve fitting as a quadratic programming problem, where the objective function penalizes the absolute size of the regression coefficients based on the value of a tuning parameter λ involved in the maximum AUC value [30]. Thus, LASSO can perform automatic variable selection by driving the coefficients of irrelevant variables to zero.

Development and optimization of the models

Five-fold cross-validation was used to avoid training overfitting, and the training dataset was divided into 5 equal parts, this process was repeated 5 times. In the first step, the first part was used for validation, and the remaining parts were used for training. Similarly, the second part was used for validation in the second fold, and this process was continued for the rest of the folding. Five popular ML algorithms were applied to develop models based on selected variables, including logistic regression (LR), support vector machine (SVM) [31], eXtreme gradient boosting (XGBoost) [32], random forest (RF) [33], and neural network (NN) [34]. The hyperparameters for the five models were optimized via Bayesian optimization (BO), and the models were trained on a training set for optimization and validated on a validation set for each hyperparameter configuration. The ideal parameter setup provided the highest AUC values [35] (Supplementary Data 4). Finally, the relative importance of the variables of each model was ranked, which can reflect the contribution of each variable when predicting structural recurrence (Supplementary Data 5).

Evaluation and comparison of the models

We evaluated the predictive performance of each model in the test dataset (Fig. 1B). First, we evaluated the discrimination of the models by using the area under the receiver operating characteristic (AUC-ROC) curve [12]. We also compared the AUC values of the ML models and the AUC values of the ATA risk stratification by using the DeLong test [36], and a 2-tailed test with P < 0.05 was considered to indicate statistical significance. We used the Youden index as the threshold to calculate the sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 score [37]. Second, we evaluated the calibration of the models by using calibration curves, which represent the accuracy of the absolute recurrent risk estimates of the models [12]. Finally, we analyzed the interpretability of the ML models by using the rank of variable importance. Statistical analyses were performed using python (version 3.6.10, https://www.python.org) and R software (version 4.2.2, https://www.r-project.org/).

Results

Characteristics of the study population

The median age of the 2244 patients was 42.0 years, and 1489 (66.4%) patients were female. During a median follow-up of 45.5 months (range: 12.0 to 142.7 months), 179 patients (8.0%) experienced structural recurrence (Table 1). The median tumor diameters of ≤ 10 mm, 10 to 20 mm, 20 to 40 mm, and > 40 mm were documented in 1093 (48.7%), 777 (34.6%), 322 (14.3%), and 52 (2.3%) patients, respectively. Multifocality was observed in 577 (25.7%) patients. Extensive ETE was observed in 584 (26.0%) patients. The median non-stimulated Tg was 0.24 ng/mL with median TSH was 0.46 µIU/ml. A total of 1537 (68.5%) patients and 698 (31.1%) patients underwent central LN dissection and lateral LN dissection, respectively. A total of 1173 (52.3%) and 564 (25.1%) patients had N1a and N1b disease, respectively. The median LNR was 28.57%. A total of 328 (14.6%), 1334 (59.4%), and 582 (25.9%) patients were classified as low risk, intermediate risk, and high risk, respectively, according to the ATA risk stratification.

Table 1 Characteristics of the patients in this study

Patients with recurrent disease were more likely to be older, male, be smokers, be drinkers, have comorbidities of diabetes, hypertension and Hashimoto’s thyroiditis, have larger, multifocal and bilateral tumors, have ETE, have higher levels of Tg, have undergone lateral dissection with a higher number of LN dissected and LNR, have ENE and more advanced N stage, have higher levels of BMI, cholesterol, LDL and NLR, and have lower levels of triglyceride, PLR, LMR and PNI.

Performance of the models

The heatmap of the Spearman correlation analysis showed no significant or weak correlation between the majority of the input variables (Supplementary Data 6). The LASSO method selected nine variables for developing prediction models (Supplementary Data 7), including Tg, LN variables (LN dissection, number of LNs dissected, LNR, and N stage), comorbidities and metabolic-related variables (comorbidity of hypertension, comorbidity of diabetes, BMI, and LDL).

As shown in Table 2; Fig. 2, five models had adequate discrimination in differentiating patients at greater risk of recurrence from those at lower risk, and the AUCs of the five models ranged from 0.738 to 0.767 in the test dataset (LR: AUC = 0.738, 95% CI = 0.636–0.820; SVM: AUC = 0.752, 95% CI = 0.666–0.841; XGBoost: AUC = 0.741, 95% CI = 0.609–0.840; RF: AUC = 0.766, 95% CI = 0.702–0.845; NN: AUC = 0.767, 95% CI = 0.675–0.843). All models showed better discrimination than did the ATA risk stratification (AUC = 0.620, 95% CI = 0.534–0.670; DeLong test: P < 0.01; Supplementary Data 8). The SVM, XGBoost, and RF model showed greater sensitivity (0.568, 0.595, 0.676), specificity (0.903, 0.857, 0.784), accuracy (0.875, 0.835, 0.775), positive predictive value (PPV) (0.344, 0.272, 0.219), negative predictive value (NPV) (0.959, 0.959, 0.964), and F1 score (0.429, 0.373, 0.331) than did the ATA risk stratification (sensitivity = 0.432, specificity = 0.770, accuracy = 0.742, PPV = 0.144, NPV = 0.938, F1 score = 0.216). The calibration curves are shown in Fig. 3. Although all models overestimated the recurrence risk of patients to varying degrees, which may have resulted in a higher false-positive rate if the models were applied in clinical practice, the RF model had generally consistent calibration.

Table 2 Predictive performance of the models in the test dataset
Fig. 2
figure 2

ROC curves of the models and the ATA risk stratification. LR, logistic regression; XGBoost, eXtreme gradient boosting; SVM, support vector machine; RF, random forest; NN, neural network; ATA, American Thyroid Association

Fig. 3
figure 3

Calibration curves of the models. LR, logistic regression; XGBoost, eXtreme gradient boosting; SVM, support vector machine; RF, random forest; NN, neural network

Relative importance of variables in the models

Although slight differences were shown in the importance of variables among those models (Fig. 4), the Tg and the LNR were the top 2 important variables in all the models, the N stage was the top 5 important variables in all the models. The importance of variables in RF model was as follows: Tg, LNR, and N stage, comorbidity of hypertension, LDL, BMI, number of LNs dissected, comorbidity of diabetes, and LN dissection.

Fig. 4
figure 4

Relative importance of variables. LNR, lymph node metastasis ratio; Tg, thyroglobulin; LDL, low-density lipoprotein; BMI, body mass index; LR, logistic regression; XGBoost, eXtreme gradient boosting; SVM, support vector machine; RF, random forest; NN, neural network

Discussion

In this study, based on a large dataset of PTC patients with comprehensive predictive variables (demographic characteristics and comorbidities, tumor-related variables, LN-related variables, and metabolic and inflammatory markers), we developed and validated five ML models to predict structural recurrence in PTC patients. In the test dataset, the SVM, XGBoost, and RF model showed better discrimination than the ATA risk stratification according to the AUC values and corresponding indices, and the RF model generally had consistent calibration compared with the other models. Thus, ML models may aid in treatment decision making and improve postoperative prognosis for PTC patients by accurately estimating the likelihood of structural recurrence and identifying patients at high risk of recurrence. Overall, we suggested that the RF model, which showed overall good performance and interpretability, could be used to predict structural recurrence in patients with PTC.

Nine of 29 variables were selected by the LASSO method and used to develop models in this study, including the Tg, LN variables (LN dissection, number of LNs dissected, LNR, and N stage), comorbidities and metabolic markers (comorbidity of hypertension, comorbidity of diabetes, BMI, and LDL). Further variable importance analysis revealed that the Tg, LNR, and N stage were the three most important variables across all the models. Tg is the most important tumor marker in PTC, and the ATA risk stratification revealed abnormally elevated postoperative suppressed Tg as one of the high-risk predictors; however, it did not specify a cutoff value or include postoperative negative Tg in the postoperative recurrence risk assessment. In this study, a non-stimulated Tg value of 1.08 ng/mL was set as a cutoff by using the ROC curve, and patients with a higher level of non-stimulated Tg had a greater risk of recurrence than those with a lower level of Tg. Consistently, previous studies have used suppressed Tg values > 1 ng/mL to define a biochemical incomplete response to therapy in patients treated with total thyroidectomy and 131I ablation, and approximately 20% of these patients were likely to develop structural disease [3, 38].

LN-related variables were also considered important contributing predictors by the ML models. According to the 2015 ATA risk stratification system, the N stage and the size of the metastatic LN were proposed as key predictors for structural recurrence. In this study, we selected the number of dissected LNs, the LNR, and the N stage to develop models. The LNR was among the 2 most important variables, and the N stage was among the 5 most important variables according to all the models. A higher LNR (≥ 22.70%), greater number of LNs dissected (≥ 21), and advanced N stage were strongly associated with a high risk of recurrence. Several studies have reported various optimal cutoff values for LN-related variables [4,5,6]. For instance, in a recent study in which five ML models were constructed to predict recurrence among patients diagnosed with PTC, the LNR (cutoff = 0.24) and LN metastasis were identified as important variables [21]. Another study determined the predictive cutoff values for the number of metastatic LNs (4 and 13) and the LNR (0.28 and 0.58) using the K-means clustering algorithm [39]. The optimal predictive cutoff may depend on the extent of LN dissection, the number of LNs dissected, the annual number of surgeries performed by physicians, etc. Thus, more evidence is needed before combining LN-related variables with newly developed risk stratification or staging systems for PTC patients.

The comorbidities and metabolic-related markers (comorbidities of hypertension and diabetes, BMI, and LDL) were included in our models and showed potential predictive value. Although the current risk stratification or staging system for PTC does not include these predictors, a few studies have reported that hypertension [40], diabetes [41], a high level of BMI [11] and LDL [42] were significantly associated with the aggressiveness of PTC. The underlying mechanism between metabolism-related predictors and poor prognosis in PTC patients is less clear. Increasing insulin, insulin-like growth factor or TSH were associated with the aggressiveness of PTC in obese patients [43]. The LDL receptor played an important role in the RAS/RAF/MAPK (MEK)/ERK signaling cascade, and synergy between LDL-mediated receptor uptake and BRAF may lead to a worse prognosis in thyroid cancer patients [42].

Compared with existing studies on prognosis prediction for PTC [18,19,20,21,22], our study has several strengths. First, we developed ML models for predicting structural recurrence based on a large dataset of PTC patients. By using multiple ML algorithms, 5-fold cross-validation and Bayesian optimization, more reliable and robust predictions can be achieved. Second, benefitting from the simultaneous consideration of multiple predictors and the use of LASSO, we identified nine variables that were strongly associated with the risk of recurrence to develop models. Our results provided comprehensive evidence for the interpretation and clinical application of these predictors. Third, to help clinicians making optimal use of the models, we evaluated the discrimination, calibration, and interpretability of the ML models; however, these items were underreported in the published literature addressing PTC prognosis prediction [12, 44].

This study has several limitations. First, the retrospective nature of the study might have resulted in selection bias. Second, the ML models we developed were based on data from a single institution, and more studies covering wider populations are warranted for validation. Third, our models were built on data from patients diagnosed with PTC and treated with thyroidectomy and 131I; thus, they were unlikely to be accurate for patients whose tumor behavior was considerably different, such as children and adolescents with PTC [45] or patients who undergo lobectomy [46]. Fourth, our sample data used for developing the models were unbalanced due to the low incidence of recurrence in PTC patients (only 8.0% of patients experienced recurrence in our study). Unbalanced data may typically affect model training; thus, we performed resampling to minimize this effect. Advanced methods for handling imbalanced data have been proposed recently and need to be applied [47]. Fifth, 61.1% of the BRAF mutations were missing, which was used as a categorical variable and may affect the risk estimating of this variable. Finally, a median follow-up period of 45.5 months might be insufficient for the assessment of outcomes in PTC patients; thus, our models were mainly used to estimate short-term recurrence risk.

Conclusion

This study demonstrated that the RF model achieved the expected prediction performance with generally good discrimination, calibration and interpretability. It is likely that ML approaches could improve the accuracy of the existing risk stratification for PTC as well as assist physicians in better understanding how ML approaches can be applied to optimize treatment and follow-up decisions.