Application of machine learning algorithm in prediction of lymph node metastasis in patients with intermediate and high-risk prostate cancer

Purpose This study aims to establish the best prediction model of lymph node metastasis (LNM) in patients with intermediate- and high-risk prostate cancer (PCa) through machine learning (ML), and to provide clinicians with guidance for accurate diagnosis and precise treatment. Methods A total of 24,470 patients with intermediate- and high-risk PCa were included in this study. A multivariate logistic regression model was used to screen for independent risk factors of LNM. Six algorithms, namely random forest (RF), naive Bayesian classifier (NBC), xgboost (XGB), gradient boosting machine (GBM), logistic regression (LR) and decision tree (DT), were used to establish risk prediction models. A prediction model was then built on the best-performing ML algorithm, and its performance was evaluated from three aspects: area under the curve (AUC), sensitivity and specificity. Results In multivariate logistic regression analysis, T stage, PSA, Gleason score and bone metastasis were independent predictors of LNM in patients with intermediate- and high-risk PCa. Comparing prediction performance across the training set and the test set, the GBM model performed best (F1 score = 0.838, AUROC = 0.804). Finally, we developed a preliminary calculator model that can quickly and accurately calculate the probability of regional LNM in patients with intermediate- and high-risk PCa. Conclusion T stage, PSA, Gleason score and bone metastasis were independent risk factors for LNM in patients with intermediate- and high-risk PCa. The prediction models established in this study perform well, and the GBM model performs best.


Introduction
According to the global cancer statistics for 2020, PCa ranks sixth in incidence and seventh in mortality in China (Cao 2020). Pelvic lymph node metastasis (PLNM) is found in about 15% of all newly diagnosed PCa patients and is related to biochemical recurrence (BCR) and distant metastasis (DM) after treatment (von Bodman et al. 2010; Wilczak et al. 2018). Gervasi et al. reported that the 10-year risk of DM in lymph node positive patients was 83%, and the 10-year risk of death from PCa was 57% (Wagner et al. 2008). Extended pelvic lymph node dissection (ePLND) has become an integral part of radical prostatectomy (RP). The American Urological Association (AUA) and the European Association of Urology (EAU) recommend that low-risk patients do not need ePLND, whereas ePLND is an option for patients with intermediate- and high-risk PCa whose probability of LNM, as predicted by the Briganti nomogram, is greater than 5% (Engel et al. 2010; Lestingi et al. 2021). Therefore, the clinical staging of PCa is the key to precision medicine, and accurate identification of PLNM in PCa patients is crucial for determining the appropriate treatment plan (Hou et al. 2021; Mottet et al. 2017).
At present, many studies have reported that non-invasive imaging techniques can be used to predict LNM of PCa before treatment. CT and MRI, the techniques most commonly used in the clinic, assess the status of pelvic lymph nodes by examining their size. Neither has a clear advantage over the other, with a sensitivity of about 40% and a specificity of about 82% (Créhange et al. 2012; Hövels et al. 2008). Von Below et al. showed that multiparametric MRI (mpMRI) is more sensitive and specific than MRI in detecting tumors and lymph nodes, but it is prone to signal loss and image distortion in the DWI sequence (von Below et al. 2016). Similarly, PSMA PET/CT has been widely used to detect PCa in the prostate, soft tissue and bone; however, its detection rate for lymph node invasion of 2-5 mm is only about 60% (Hofman et al. 2018; van Leeuwen et al. 2017). In addition, new imaging technologies are being developed, such as MR lymphography with superparamagnetic iron oxide (SPIO) nanoparticles and targeted positron emission tomography (PET) imaging (Muteganya et al. 2018). Their efficacy in predicting LNM is still unclear.
Recently, scientists have made great efforts to explore different methods for evaluating the risk of LNM more accurately. However, medical data are complex: the various factors are closely interconnected, and models differ in their calculation methods. Machine learning (ML) has therefore become a powerful tool for improving clinical strategies in medical research (Mirza et al. 2019; Oliveira 2019). Compared with traditional regression analysis, ML algorithms have significant advantages in prediction performance on large databases (Bi et al. 2019; Wang et al. 2020; Li and Zhou et al. 2022).
To our knowledge, there is no effective ML model for predicting the risk of LNM in PCa. Therefore, in this study, we established a new model for predicting the risk of LNM in patients with intermediate- and high-risk PCa using six ML methods, based on clinical and histopathological parameters in the SEER database that are closely related to the prognosis of PCa.

Study population
The training set and test set were drawn from the SEER database and comprised patients diagnosed with intermediate- and high-risk PCa from 2000 to 2019. Patients diagnosed with intermediate- and high-risk PCa at Gansu Provincial Hospital from 2012 to 2018 served as the validation set. Inclusion criteria were as follows: (1) confirmed primary prostate cancer; (2) at least one of PSA ≥ 10 ng/ml, Gleason score ≥ 7 or T stage ≥ T2b; (3) complete clinical and pathological data and survival period. Exclusion criteria were: (1) incomplete clinicopathological data or survival period; (2) PSA < 10 ng/ml, Gleason score < 7 and T1-T2a. Since the study was retrospective and the data were from an open database, informed consent was not required. The detailed screening process is shown in Fig. 1.
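The risk criterion in item (2) can be written as a small predicate. The sketch below is illustrative only: the function name, the argument names and the simplified list of T categories are assumptions, not the SEER*Stat query used in the study.

```python
# Simplified, assumed ordering of clinical T categories (lowest to highest).
# Only the position relative to "T2b" matters for the criterion.
T_STAGE_ORDER = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3a", "T3b", "T4"]


def is_intermediate_or_high_risk(psa_ng_ml, gleason_score, t_stage):
    """Return True if at least one inclusion threshold is met:
    PSA >= 10 ng/ml, Gleason score >= 7, or T stage >= T2b."""
    # .index raises ValueError for a T category missing from the assumed list.
    t_rank = T_STAGE_ORDER.index(t_stage)
    return (
        psa_ng_ml >= 10
        or gleason_score >= 7
        or t_rank >= T_STAGE_ORDER.index("T2b")
    )


# Hypothetical patients, for illustration only.
print(is_intermediate_or_high_risk(5.0, 6, "T2a"))   # all three below threshold
print(is_intermediate_or_high_risk(12.0, 6, "T1c"))  # PSA alone qualifies
```

A patient is excluded as low risk only when all three conditions fail at once, which matches exclusion criterion (2).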

Establishment of predictive model
In this study, we compared the pathological characteristics of the SEER cohort and the external validation set, and analyzed the risk factors for LNM using univariate analysis. Multivariate logistic regression analysis was then used to evaluate the variables and identify independent predictors of LNM. Next, we built six common ML prediction models for LNM in intermediate- and high-risk PCa: random forest (RF), naive Bayesian classifier (NBC), xgboost (XGB), gradient boosting machine (GBM), logistic regression (LR) and decision tree (DT). The SEER dataset was split 70:30: 70% was used for training and 30% for testing, and the hospital cohort served as a separate external validation set. During training, each model underwent 10-fold cross-validation to keep it stable, and the best hyperparameters were selected by random search. The F1 score, AUROC, sensitivity and specificity of each model were evaluated together, the performance of the models was compared, and the model with the highest accuracy according to the comprehensive score was selected as the final model. Finally, the accuracy and generalization of the selected model were further verified on the independent external validation set.
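The data partitioning and search procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the random seed and the hyperparameter names in the example search space are hypothetical, and the study itself reports using scikit-learn rather than this code.

```python
import random


def split_70_30(indices, seed=0):
    """Shuffle sample indices and split them 70:30 into train and test."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def sample_hyperparameters(space, n_iter, seed=0):
    """Random search: draw n_iter candidate settings from a discrete space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


train_idx, test_idx = split_70_30(range(1000))
folds = list(k_fold(train_idx, k=10))
# Assumed, illustrative search space for a boosting model.
space = {"n_trees": [100, 200, 400], "learning_rate": [0.01, 0.05, 0.1]}
candidates = sample_hyperparameters(space, n_iter=5)
print(len(train_idx), len(test_idx), len(folds), len(candidates))
```

Each candidate setting would be scored by its mean validation performance over the ten folds, and the test split would be touched only once, after the best setting is chosen.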

Assessment of prediction model
We used the area under the curve (AUC) to evaluate the accuracy of each model. To account for possible overfitting or underfitting, we also considered the sensitivity and specificity of each model together with the F1 score. In addition, we used decision curve analysis to test the prediction accuracy of the model.
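For reference, the metrics named here can be computed from binary labels and predictions as in the sketch below. The function names are assumptions, labels are taken to be 1 for LNM and 0 otherwise, and the F1 score is computed in its standard form as the harmonic mean of precision and recall (sensitivity).

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision and F1 from binary labels (1 = LNM)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

The pairwise AUROC is quadratic in the sample size; it is shown for clarity rather than efficiency.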

Statistical analysis
We used SEER*Stat statistical software to extract the training and test sets from the SEER database. Hospital patients served as the external validation set. All patient data were analyzed with SPSS V.25.0. Continuous variables are presented as median and interquartile range (IQR), and categorical variables as counts and proportions. The Wilcoxon rank sum test was used for continuous variables, and the chi-square test or Fisher's exact test for categorical variables. Univariate and multivariate logistic regression were used to analyze the risk factors for lymph node metastasis in intermediate- and high-risk PCa. P values lower than 0.05 were considered statistically significant. Adjusted odds ratios (ORs) and corresponding 95% confidence intervals (95% CI) were calculated. The modeling process was implemented with the scikit-learn library (version 0.19.2) in Python (version 3.7.1). RF, NBC, XGB, GBM, LR and DT were fitted on the training set to establish the prediction models, and the relative importance of each input variable in each model was analyzed. We used 10-fold cross-validation and ROC curve analysis on the training set to test the performance of the models. Finally, the prediction accuracy of the GBM model was further verified by decision curve analysis.
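Decision curve analysis rests on the net benefit of acting on a model at a given threshold probability. A minimal sketch of the standard net benefit formula follows; the function names and toy data are assumptions and do not reproduce the study's analysis.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted probability is at or
    above the threshold: TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy that assumes every patient has LNM."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data, for illustration only. Treating no one has net benefit 0.
y_true = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.7, 0.2]
for pt in (0.1, 0.5, 0.8):
    print(pt, net_benefit(y_true, probs, pt), net_benefit_treat_all(y_true, pt))
```

A model is useful at a threshold when its net benefit exceeds both reference strategies, treating all patients and treating none.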

Baseline characteristics
A total of 24,470 patients with intermediate- and high-risk PCa were included in this study, including 24,359 from the SEER database and 111 from our hospital's external validation set. Patients were divided into two groups according to whether they had LNM. There were significant differences between the two groups (patients with or without LNM) in terms of grade (p < 0.001), T stage (p < 0.001), M stage (p < 0.001), Stage (p < 0.001), Gleason (p < 0.001), PSA (p < 0.001), bone metastasis (p < 0.001), liver metastasis and lung metastasis (p < 0.001) (Table 1).

Screening and validation of the best machine learning model
With lymph node status as the outcome variable, the four factors that were significant (p < 0.05) in the above logistic regression analysis entered the models as variables. In the training set, the ML algorithms RF, NBC, XGB, GBM, LR and DT were run to establish the prediction [...] prediction ability, AUROC = 0.82 (Fig. 4). The AUROC of all models in the test set was > 0.7. The F1 score is suitable for evaluating prediction performance on imbalanced samples. In the test set, GBM had the best prediction performance, significantly better than RF (GBM: F1 score 0.838, sensitivity (recall) 0.877, specificity 0.783; RF: F1 score 0.798, sensitivity (recall) 0.857, specificity 0.709). Based on these results, GBM was selected as the best model for predicting LNM (Table 3). Furthermore, decision curve analysis (Fig. 5) shows the accuracy of the GBM model.

Permutation feature importance
Across the six models, the relative importance order of the input variables differs slightly. T stage, PSA and Gleason score rank among the top three indicators in almost every model, and bone metastasis ranks lower (Fig. 6). In the GBM model, the order of relative importance of the variables from high to low is T stage, PSA, Gleason score and bone metastasis.
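Permutation importance measures how much a model's score drops when one input column is shuffled. The sketch below illustrates the idea with a toy threshold rule standing in for a fitted model; the function names, the toy data and the use of accuracy as the score are assumptions, not the study's GBM or its features.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            permuted = [tuple(r[:j]) + (v,) + tuple(r[j + 1:])
                        for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in model: it uses feature 0 only, so feature 1 is irrelevant.
def toy_predict(row):
    return int(row[0] > 0.5)


rows = [(i / 10, ((i * 7) % 10) / 10) for i in range(10)]
labels = [toy_predict(r) for r in rows]
print(permutation_importance(toy_predict, rows, labels))
```

Because the toy rule ignores the second feature, shuffling that column never changes a prediction and its importance is exactly zero, while shuffling the first feature lowers accuracy.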

Preliminary calculator model
The GBM model performs best among the six models. Accordingly, we established a preliminary calculator model to promote the clinical application of this prediction model (Fig. 7).

Fig. 5
The decision curve analysis of the GBM model. In the figure, the red curve represents the predicted performance of the GBM model. In addition, there are two lines, which represent two extreme cases. The gray vertical line represents the hypothesis that all patients have LNM; the black horizontal line represents the hypothesis that no LNM occurs. The curve shows that LNM could be discriminated when this GBM predictive model was used to make clinical decisions at LNM probabilities between 0.1 and 0.9 in the training set.

Discussion
LNM is a paramount prognostic factor for patients with PCa and has been shown to be an important predictor of BCR-free survival, metastasis-free survival and overall survival in PCa (Engel et al. 2010; Wilczak et al. 2018). [...] lymph node analysis on 68Ga-PSMA-11 PET. Compared with the previously used clinical nomograms, this model remarkably improved the positive rate of LNM among patients selected for ePLND (Ferraro et al. 2020). In this study, we used the large sample size of the SEER database and ML algorithms to develop six prediction models for LNM in patients with intermediate- and high-risk PCa. Logistic regression analysis showed that T stage, Gleason score, PSA and bone metastasis were independent risk factors for pelvic LNM in intermediate- and high-risk PCa. Among the six models, the AUC of the GBM model was the highest, and the prediction accuracy of the other models for LNM was about 80%. The RF model showed the best prediction performance before and after data balancing, with the obvious advantages of high precision and fast speed; however, it also has the disadvantage of overfitting. The F1 score, which is the harmonic mean of precision and recall, was the final parameter for evaluating each model. According to the evaluation results on the test set, the prediction performance of the GBM model was better than that of the RF model. The RF model may have overfitted during training, which made it less suitable for the data in the test set, while the GBM model had the best prediction performance.
To make this model easier to apply, we developed a calculator to evaluate the individual probability of LNM in patients with intermediate- and high-risk PCa.
The results of this study showed that T stage, PSA, Gleason score and bone metastasis were the most important predictors in patients with intermediate- and high-risk PCa. As an important indicator of tumor progression, T stage is positively correlated with LNM in a large number of tumors (Barriera-Silvestrini et al. 2021). The large body of data in this study shows that a high PSA level increases the rate of lymph node invasion, which is contrary to the results of previous studies. A possible reason is that PSA may be more meaningful in D'Amico risk stratification. An increase in Gleason score also increases the risk of lymph node invasion (Turk et al. 2018). Bone metastasis is significantly related to LNM of PCa, which suggests a direction for follow-up research: considering metastasis at other sites as a factor before patients develop LNM.
The EAU guidelines use Briganti's nomogram prediction model to select patients for ePLND. An advantage of this study is that it compares several models head-to-head with the nomogram model. The sensitivity, specificity and AUC of the nomogram are 0.882, 0.705 and 0.80, respectively, while those of GBM are 0.877, 0.783 and 0.813, respectively. This shows that, among the six predictive models, GBM has the best predictive value for LNM in patients with intermediate- and high-risk PCa. To further facilitate clinical application, we designed a preliminary calculator model that can quickly calculate the probability of LNM.
Of course, this study has several limitations. First, it is a retrospective study, which may introduce some selection bias. Second, the SEER database lacks data such as tumor volume, percentage of positive tissue cores and testosterone level. In addition, the external validation set is small, and a larger sample is needed to test the effectiveness of the model. Finally, although we corrected the sample imbalance of the SEER dataset as far as possible, this problem may still interfere with the results and affect the generalization ability of the model.

Fig. 7
Calculator based on GBM model to predict LNM of intermediate- and high-risk PCa

Conclusion
This study developed and validated six prediction models using ML algorithms, of which the GBM model had the best performance. Based on this algorithm, a preliminary calculator model was designed, so that the probability of local LNM in patients with intermediate- and high-risk PCa can be predicted individually from existing clinical characteristics. This can help clinicians quickly and accurately assess the risk of LNM and, ultimately, deliver precise therapy.
Author contributions WXR and ZXX: carried out the research and design. ZM: conducted research and collected and analyzed data. LHP: conceived this research and helped shape language. LXP and LY: provided suggestions. All authors have contributed to the article and approved the submitted version.
Funding This study was supported by Gansu Natural Science Foundation (22JR5RA670 and 22JR11RA271) and Gansu Provincial Hospital (17GSSY3-4).

Data availability
The data on which the study is based are available from the repository and can be downloaded at the following link (https://seer.cancer.gov). Relevant information will be provided upon reasonable request.

Conflict of interest
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

Ethics approval
The SEER database is an open, de-identified public database, so it does not require approval by an institutional review board or informed consent. The single-center data were collected retrospectively, and basic patient information is not involved in the study, so approval of the ethics committee is not required.
Consent to participate Informed consent was obtained from all individual participants included in the study.

Consent to publish All authors consent to publish this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.