In silico development and experimental validation of a novel 7-gene signature based on PI3K pathway-related genes in bladder cancer

Although bladder cancer (BLCA) is the 10th most common tumor worldwide, particularly practical markers and prognostic models that might guide therapy are needed. We used a non-negative matrix factorization algorithm to classify PI3K pathway-related genes into molecular subtypes. A weighted gene co-expression network analysis (WGCNA) was generated to identify co-expression modules. Univariate Cox regression, least absolute shrinkage sum selection operator-Cox regression, and multivariate Cox regression were utilized to develop a prognostic score model. Kaplan–Meier analysis and receiver operating characteristics were utilized to measure the model’s effectiveness. A nomogram was constructed to improve the predictive ability of the model based on clinical parameters and risk. Decision curve analysis (DCA) was used to evaluate the nomogram. To evaluate the immune microenvironment, an estimate algorithm was used. Drug sensitivity was identified using the R package “pRRophetic.” UM-UC-3 cell line was used to measure the effect of CDK6 in Western blotting, proliferation assay, and 5-ethynyl-20-deoxyuridine assay. Based on PI3K pathway-related genes, The Cancer Genome Atlas (TCGA)-BLCA and GSE32894 patients were divided into two subtypes. Twenty-five co-expression modules were established using the WGCNA algorithm. A seven-gene signature (CDK6, EGFR, IGF1, ITGB7, PDGFRA, RPS6, and VWF) demonstrated robustness in TCGA and GSE32894 datasets. Expression levels of CDK6 and risk positively correlated with M2 macrophages and IgG. Cisplatin, gemcitabine, methotrexate, mitomycin C, paclitaxel, and vinblastine are sensitive to different groups based on the expression of CDK6 and risk. Functional experiments suggested that CDK6 promotes the proliferation of UM-UC-3 cells. We constructed a seven-gene prognostic signature as an effective marker to predict the outcomes of BLCA patients and guide individual treatment. Supplementary Information The online version contains supplementary material available at 10.1007/s10142-022-00884-2.


Introduction
The world's 10th leading cause of tumor-related mortality and morbidity is bladder cancer (BLCA) (Shinde-Jadhav et al. 2021), divided into muscle and non-muscle-invasive BLCA. Despite treatment advances, the diagnostic yield, outcomes, and 5-year survival rate have hardly changed (Berdik 2017). BLCA carries the highest recurrence rate, 50-70% of patients with incomplete cystectomy relapse within 5 years of treatment (Ainsworth 2017). Therefore, it is critical to identify markers to improve the diagnostic yield and outcome.
The PI3K signaling pathway is associated with the proliferative capacity of tumor cells (Pollak 2018). PI3K pathway-related genes were found to be significantly altered in many cancers, and these altered genes can promote cancer growth, apoptosis, and metastasis in pancreatic cancer (Banh et al. 2020), breast cancer (Garcia-Martinez et al. 2021), melanoma (Hamm et al. 2021), and BLCA (Hsieh et al. 2011). Studies focused on how the PI3K pathway affects the proliferation ability of BLCA cells as a downstream pathway regulated or modified by upstream molecules. However, there is no study on the effects of the PI3K pathway on treatment and outcome. Linhui Wang and Yutao Wang contributed equally to this paper.
The rapid development of gene sequencing and the widespread availability of public databases have enabled the collection of gene expression data and the identification of biomarkers.
In the current study, TCGA (https:// cance rgeno me. nih. gov/) and the Gene Expression Omnibus (GEO, http:// www. ncbi. nlm. nih. gov/ geo/) databases were used to obtain gene expression data and clinical information from BLCA patients. Using a univariate Cox proportional hazard regression model, genes that predict outcome were selected, and a prognostic model for BLCA was established using a least absolute shrinkage and selection operator (LASSO) regression. This model was subsequently validated in four cohorts.

PI3K pathway-related genes
The list of genes associated with the human PI3K pathway was downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto 2000), and 275 genes were included.

Obtaining and processing TCGA and GEO data
The BLCA cohort downloaded from TCGA included 430 samples: 411 tumor samples and 19 normal tissues. The gene expression matrix (TPM) and clinical information for each sample were applied for subsequent analysis. All samples in the TCGA-BLCA cohort served as a training set to construct the prognostic model. RNA expression profiles were normalized by log2(exp + 1). The GSE32894 (Sjödahl et al. 2012) cohort with 224 samples was obtained from the GPL6947 platform from GEO. Data from TCGA and GEO cohorts were normalized together by the "Combat" function, using the R package "sva." All clinical information for patients in the TCGA and GSE32894 cohorts was presented in Table 1.

Molecular typing based on PI3K pathway-related genes
Fifty-eight genes were not found when extracting the expression of 275 PI3K pathway-related genes from the TCGA cohort and GSE32894 cohort; therefore, the expression of 217 genes was used for subsequent studies. To conduct a univariate Cox analysis, the coxph function was utilized. The "brunet" criterion and the non-negative matrix factorization (NMF) algorithm with 50 iterations were used to cluster BLCA samples. The clusters' k number was between two and 10. To determine the average profile width of the common membership for each subclass with a minimum membership of 10, the R package NMF was utilized. The clusters' optimal numbers were determined using comprehensive consideration of dispersion, cophenetic, and silhouette.

WGCNA
Using the R package "WGCNA," a co-expression algorithm was employed to identify co-expressed genes and divide the genes into multiple co-expression modules based on the protein-coding genes' expression in the TCGA-BLCA (Langfelder and Horvath 2008). We constructed a scale-free coexpression network with the soft threshold = 5 and R 2 = 0.87. The minimum module's number of genes was set to 30.

Establishment and verification of the prognostic score model
The R package "survival" was used to perform a univariate Cox regression to identify genes linked with outcome in TCGA-BLCA. The genes' numbers in the model were then reduced using the LASSO-Cox algorithm (Tibshirani 1997) using the R package "glmnet." Multivariate Cox regression was used to establish a prognostic score model, using the R package "survival." We divided TCGA-BLCA patients into high-or low-risk subgroup based on the median of risk values. Based on the median of the risk values obtained from the TCGA-BLCA cohort, we divided GSE32894 patients into high-or low-risk subgroup. The predictive ability of our model was validated using Sangerbox (http:// sange rbox. com/), a tool for bioinformatic data analysis based on the R language.

Establishment of nomogram and DCA
The R package "rms" and "regplot" (Zhang and Kattan 2017) was used to establish a nomogram combined with clinical data and risk score. The nomogram can predict BLCA patients' 1-, 3-, and 5-year survival rates and improve the predictive ability of the constructed model. The predictive ability of the nomogram was evaluated by calibration and DCA, using the R package "ggDCA."

Immune microenvironment analysis
The ESTIMATE algorithm (Yoshihara et al. 2013) is used to transform gene expression data from each patient into the fractions of stromal and immune cells, thereby obtaining stromal and immune scores. To determine the correlation of risk score and gene expression with tumor purity, stromal score, immune score, and various inflammatory factors, the R package "heatmap" was used. Correlations between immune cells and risk score or gene expression were calculated using the R package "pheatmap."

Drug sensitivity analysis
Gene expression data were transformed into drug sensitivity data. Then, using the R package "pRRophetic," relationships were determined between risk score and gene expression or drug sensitivity. The R package "ggpubr" was used to draw boxplots to display results.

Cell culture and transfection
The Chinese Academy of Sciences Cell Bank (China) provided the UM-UC-3 human bladder cell line. UM-UC-3 cells were cultured in high-glucose DMEM (Hyclone) with 10% fetal bovine serum (Gibco), at 37 ℃ and 5% CO 2 . The small interfering RNAs (siRNA) that reduce CDK6 expression were acquired from JTSBIO Co. (China). The sequences of Si1-CDK6 were as follows: sense: AGU UAG UUU GGU UUC UCU GUC; anti-sense: CAG AGA AAC CAA ACU AAC UUU. The sequences of Si2-CDK6 were as follows: sense: AAC ACU AAA GUU AGU UUG GUU; anti-sense: CCA AAC UAA CUU UAG UGU UUG.

Western blotting
Radioimmunoprecipitation assay (RIPA) buffer was used to take protein from cells. A bicinchoninic acid assay kit measured the concentrations of protein. Different protein bands were separated by 10% sodium dodecyl sulfate-polyacrylamide gel electrophoresis, then transferred to polyvinylidene fluoride membranes. After blocking the membranes, they were incubated overnight with primary antibodies. The membranes were washed and then incubated with secondary antibodies for 1 h. Finally, membranes were rewashed, and the enhanced chemiluminescence detected the protein expression. ImageJ analyzed the results of pictures.

Cell proliferation assay
UM-UC-3 cells were plated in 96-well plates. Cell Counting Kit-8 assay reagent (Bimake, USA) was added to each pore to be measured according to the manufacturer's instructions. Absorbance was measured on an automated reader (Bio-Rad) at 450 nm.

EdU assay
UM-UC-3 cells were plated in 24-well plates. EdU assay reagent (Beyotime Biotechnology, China) was added to each pre according to the manufacturer's instructions. A fluorescent microscope (Olympus Corporation, Japan) was used to obtain images, and the number of proliferating cells was counted using ImageJ software.

Molecular subtype identification using the NMF algorithm
The "survival" R package was used to conduct a single-factor Cox analysis. The 55 genes with P < 0.01 associated with BLCA outcome were obtained. The "Brunet" criterion and the NMF algorithm with 50 iterations were used to cluster BLCA samples. The clusters' k numbers were between two and 10. The average profile width of the common membership matrix was determined using the R package NMF, with a minimum membership of 10 for each subclass. The cluster groups' optimal number (k = 2) was determined using cophenetic dispersion and silhouette (Fig. 1a,  b). The expression levels of PI3K pathway-related genes for samples in each group are shown in Fig. 1c. The outcome was worse in the C2 group than in the C1 group (Fig. 1d). The NMF algorithm was validated in the GSE32894 cohort. In the GSE32894 cohort, the cluster groups' optimal number was 2 (Fig. S1a), and the C2 group had a worse outcome than the C1 group (Fig. S1b).

WGCNA and functional analysis of co-expression modules
The samples were clustered using hierarchical clustering based on the protein-coding genes' expressions in TCGA-BLCA (Fig. 2a), and a topological overlap matrix was established (β = 5) (Fig. 2b). The co-expressed genes were grouped into a module using dynamic tree shearing; 25 co-expression modules were established (Fig. 2c).
The correlation between each module and futime, fustat, stage, gender, age, cluster 1 and cluster 2 is shown in Fig. 2d. The highest correlations were between the brown and green modules and clusters. R software was used to perform KEGG pathway enrichment analysis and GO analysis on the genes of brown and green modules.

Establishment of a prognostic risk model
Genes associated with the TCGA-BLCA outcomes among PI3K pathway-related genes were selected using univariate Cox regression, and we identified 55 genes with P < 0.01. The LASSO-Cox algorithm was performed on 55 genes to reduce the number of the constructed risk score prognostic model (Fig. 3a, b); 11 genes (CDK6, CRTC2, EGFR, IGF1, IKBKB, IL7, ITGB7, LAMA2, NGF, PDG-FRA, PPP2R3A, RPS6, and VWF) was screened. Finally, a risk model was constructed using multivariate Cox regression from seven genes. The signature formula for the seven mRNAs was as follows: RiskScore = 0.210687 307280298 × expCDK6 + 0.25722712750568 × e x p E G F R + 0 . 1 6 2 2 3 0 2 9 9 6 4 6 6 3 5 × e x p I G F 1 -0.436384884423358 × expITGB7 + 0.14593082164 The patients were divided into two risk groups. Survival curves were drawn to compare survival. Outcomes were worse in the high-risk group than in the low-risk group (P < 0.001) in Fig. 3c. Receiver operating characteristic (ROC) curves were drawn to determine the model's applicability. Its 1-, 3-, and 5-year areas under the curve (AUCs) were 0.74, 0.76, and 0.75, respectively (Fig. 3d). Survival status, risk score, and expression of the seven genes for each sample are shown in Fig. 3e.

Robust verification of the risk signature
To assess the model's applicability, risk scores of GSE32894 were calculated based on the risk model. The patients were divided into high-and low-risk groups. The survival curve of the external cohort was drawn to determine the model's efficiency. Outcomes were worse in the high-risk group than in the low-risk group (P < 0.05) (Fig. 5a). In the GSE32894, the 1-, 3-, and 5-year AUCs were 0.74, 0.74, and 0.76, respectively (Fig. 5b). The risk, survival status distribution, and patients' gene expression in the external cohort are shown in Fig. 5c. Univariable and multivariable Cox regression analysis showed that age, stage and risk score could independently predict the prognoses of BLCA patients (Table 2).

Analysis between risk models and clinical characteristics
Patients were grouped according to their stage, age, and gender. To determine the model applicability, KM curves were drawn. TCGA findings demonstrated that the model predicted outcomes with stage I + II, stage III + IV, age ≤ 60, age > 60, male and female in Fig. 6a-f (P < 0.05 for all). The analysis of the GSE32894 cohort demonstrated that the model could predict outcomes with Ta-T2 (P < 0.05) and female (P < 0.05) (Fig. S3a-f).

Construction and predictive ability of nomogram
Combining clinical parameters (grade, gender, age, and stage) and risk score, a prognostic nomogram was constructed with BLCA patients' survival probability (Fig. 7a). The 1-, 3-, and 5-year calibration were used to evaluate the predictive discrimination of the nomogram, and the result showed that the nomogram  had the best predictive discrimination for 1-year overall survival (Fig. 7b). The AUC value (0.751) of the nomogram was better than the constructed model (0.727), age (0.667), gender (0.476), grade (0.531), and stage (0.640), as shown in Fig. 7c. The DCA result showed that the net benefit of the nomogram was greater than the single independent clinical feature (Fig. 7d).

Immune environment evaluation
The heat map showed that increased CDK6 expression and risk score were associated with survival status, higher grade, estimate, stromal and immune scores, and inflammatory factor expression (Fig. 8a, b). The R package "pheatmap" was used to determine the relationship between immune cells and CDK6 expression and risk score. The number of M2 macrophages increased with higher CDK6 expression and risk score (Fig. 8c, d).

Identification of sensitive drugs
After converting gene expression into drug sensitivity data using the R package, patients were divided into high-or low-expression levels or high-or low-risk according to CDK6 and risk score. Then sensitive therapeutic agents were sought. The high-CDK6 expression group's sensitivities to cisplatin (Fig. 9a), gemcitabine (Fig. 9b), mitomycin C (Fig. 9d), paclitaxel (Fig. 9e), and vinblastine (Fig. 9f) were higher than those of the low-expression group (P < 0.001 for all). There is no different sensitivity to methotrexate between the high-and low-CDK6 expression group (Fig. 9c). The high-risk group's sensitivities to cisplatin (Fig. 9g), gemcitabine (Fig. 9h), and paclitaxel (Fig. 9k) were higher than those of the low-risk group (P < 0.001 for all). For methotrexate (Fig. 9i), the low-risk group was more sensitive (P < 0.001). The high-and low-risk groups have no different sensitivity to mitomycin C (Fig. 9j) and vinblastine (Fig. 9l).

CDK6 functional experiments
For validation of the bioinformatics results in vitro, we selected the risk factor CDK6. Changes in BLCA cell proliferation were observed by decreasing CDK6 expression using siRNA (Fig. 10a). In the CCK8 assay, the proliferation of cells from the CDK6 knockdown group was lower than the normal group (Fig. 10b). The results in the EdU experiment were consistent with those of the CCK8 experiment (Fig. 10c).

Discussion
Bladder urothelial cancer is a heterogeneous malignancy with a high likelihood of incidence and recurrence (Robertson et al. 2017;Lindskrog et al. 2021). Despite developing comprehensive treatment strategies for BLCA, there is still a lack of markers that can effectively diagnose it and guidance for molecular targeted therapy and individualized treatment. Based on PI3K pathway-related genes in this study, we divided BLCA patients into two subtypes using the NMF algorithm. Twenty-five co-expression modules were then The biological functions and pathways that could be affected by the two modules were then analyzed. We developed a prognostic score model from transcriptome data using TCGA-BLCA and validated its stability in an external cohort using univariate Cox, LASSO-Co, and multivariate Cox analyses. This model included seven prognostic factors (CDK6, EGFR, IGF1, ITGB7, PDGFRA, RPS6, and VWF). The patients were divided into subgroups according to their stage, age, and sex to test whether the model would determine the outcome. The nomogram based on clinical parameters and risk had better predictive ability than risk and independent clinical parameters. We found that M2 macrophages highly correlated with CDK6 and risk, and CDK6 and risk were associated with status, T-stage, estimate score, immune score, stromal score, and various inflammatory factors. A prognostic model of five lncRNAs was established using univariate and multivariate Cox proportional hazard regression and has an excellent predictive ability (Liu et al. 2022). After screening prognostic factors using univariate regression analysis, an eight-m5C-related LncRNA risk model was constructed using the LASSO-Cox regression method to predict outcomes (Yuan et al. 2021). Univariate and LASSO-Cox regression analyses were used to establish an 11-lncRNA signature for predicting outcomes, and then, the signature was validated in all tests (Gao et al. 2021). A novel immune-related lncRNA signature was also constructed using the univariate and LASSO-Cox regression method. The AUCs of the signature were high, demonstrating the excellent ability to predict outcomes . The method of our constructed model was similar to the methods mentioned above. Our model was established using multivariate Cox regression analysis after screening genes associated with outcomes using univariate and LASSO-Cox regression analysis. A series of evaluations of the model showed that it also has an excellent ability to predict patients' prognoses.
Immune cell expression analysis indicated that CDK6 expression level and risk positively correlated with M2 macrophages, which promote tumor development, progression (Zhao et al. 2020), and poor outcome (e.g., in colon cancer (Cheng et al. 2018)). The inflammatory response can accelerate cancer progression (Colaprico et al. 2020). IgG mediates inflammatory responses, promoting tumor metastasis (Stamatiades et al. 2016) (e.g., in pancreatic cancer (Chen et al. 2019)). Excessive release of pro-inflammatory cytokines promotes the occurrence and metastasis of BLCA (Luo and Xu 2020). These findings suggest that higher CDK6 expression and risk are associated with higher IgG quantity. Our results agree with these findings. The effect of CDK6 on the proliferation of UM-UC-3 cells was measured using a CCK-8 assay. c The effect of CDK6 on the proliferation of UM-UC-3 cells was observed using an EdU assay. To determine whether there was a significant difference between groups, the t test was used and was expressed as the mean ± standard deviation Cisplatin and gemcitabine are first-line chemotherapeutic agents for BLCA and improve overall survival (Moufarij et al. 2003;Boulikas and Vougiouka 2004;Crabb and Douglas 2018;Hayashi et al. 2020). Methotrexate, vinblastine, and cisplatin are safe and effective adjuvant therapeutic agents for muscle-invasive BLCA (Plimack et al. 2014). Mitomycin C treats superficial BLCA and prevents recurrence and progression (Ayyildiz et al. 2007). Paclitaxel is used to treat patients with metastatic BLCA (Jacobs et al. 2010). Cisplatin, gemcitabine, mitomycin C, paclitaxel, and vinblastine may be more effective for patients with high-CDK6 expression. Cisplatin, gemcitabine, and paclitaxel may be more effective for high-risk BLCA, while methotrexate may be effective for low-risk BLCA.
Although this study relied on transcriptome data from many samples and experimental validation, it has limitations. Previous researchers provided the data, and comprehensive, in-depth clinical studies are needed before clinical application. In addition, more cohorts are needed to validate the model's stability.

Conclusions
We constructed a robust seven-gene prognostic risk model and validated it using an external dataset. The biological impact of genes on BLCA was identified. The number of M2 macrophages and IgG levels positively correlated with the expression of CDK6 and risk. BLCA patients were grouped based on the expression of CDK6 and risk, and drugs that may be more sensitive in different groups were identified. The proliferation ability of BLCA cells was reduced when the expression of CDK6 was reduced. CDK6 is a potential biomarker that is involved in the proliferation of BLCA. To predict outcomes in BLCA, we recommend this seven-gene signature.