1 Introduction

Bladder cancer (BLCA) is the most common malignant tumor of the urinary system, with an estimated 549,393 new cases and approximately 199,922 deaths globally each year [1]. In 2023, BLCA ranked as the fourth most common cancer in men, representing approximately 6% of new cancer diagnoses and 4% of cancer-related deaths [2]. Various risk factors contribute to the incidence of BLCA, including smoking, occupational exposure, advanced age, exposure to carcinogens, race, and specific medications. Smoking stands out as a major cause of bladder malignancy, while the use of medications containing cyclophosphamide, pioglitazone, phenazopyridine, as well as conditions such as bladder stones, urinary stasis, and chronic cystitis, also elevate the risk of BLCA [3]. BLCA typically originates in the epithelial tissues of the bladder and other urinary organs, with urothelial carcinoma being the most common form, accounting for approximately 80–90% of cases. Depending on the ability to invade the muscle layer, BLCA is classified as either muscle-invasive (MIBC) (approximately 30%) or non-muscle-invasive (NMIBC) (approximately 70%) [4]. MIBC tends to invade the detrusor muscle and has a higher propensity to spread to lymph nodes and other organs, while NMIBC includes in situ carcinoma, papillary tumors invading the lamina propria, and non-invasive papillary tumors [5]. Most BLCA patients are diagnosed due to painless gross hematuria; 10–20% of bladder cancer patients experience hematuria and 2–5% patients presenting with microscopic hematuria. Symptoms such as urinary frequency, urgency, and dysuria also raise suspicion of BLCA. The gold standard for initial diagnosis is cystoscopy, although pathological evaluation is necessary to differentiate BLCA from other benign conditions [2].The treatment of bladder cancer hinges on the histopathology of bladder cancer tissue (NMIBC or MIBC). Histopathology plays a critical role in guiding treatment decisions and understanding tumor biology [6]. NMIBC can be managed with adjuvant intravesical therapy and transurethral resection, with enhanced cystoscopy aiding in reducing recurrence risk. MIBC requires more aggressive treatments such as urinary diversion surgery and radical cystectomy to reduce the risk of disease-specific death and cancer metastasis [7]. Both NMIBC and MIBC have high recurrence rates, with up to 70% for NMIBC, some of which can progress to MIBC. The metastatic stage survival rate of MIBC is less than 5% [8]. Therefore, BLCA patients require frequent monitoring and treatment incurring high treatment costs. However, the current methods for monitoring BLCA recurrence including imaging test, cystoscopy and cytology, cannot adequately address this problem. Thus, there is a pressing need for effective early detection methods to enhance disease diagnosis, treatment, and prognosis [9].

DNA methylation involves the covalent addition of a methyl group to the 5th carbon atom of cytosine-adenine (CpG) dinucleotides [10]. It commonly leads to alterations in gene expression and serves as a crucial epigenetic mechanism regulating gene expression [11]. Studies have revealed early increased methylation of tumor suppressor genes in various cancers, with changes in DNA methylation patterns being among the initial detectable alterations associated with tumorigenesis [9, 12]. Consequently, DNA methylation markers are widely utilized in the diagnosis and prognosis of common cancers [13,14,15,16].

We aim to investigate genes containing DNA methylation sites associated with prognosis in BLCA. We downloaded bulk transcriptome sequencing data, methylation data, and single-cell sequencing data of BLCA from public databases. Initially, we conducted Cox survival analysis for each methylation site, identifying genes with prognostic significance with more than 10 methylation sites to form the bladder cancer prognostic-related methylation gene set. Subsequently, we overlapped these genes with epithelial cell marker genes from single-cell sequencing analysis to obtain epithelial cell-related methylation genes of prognostic significance. Utilizing ten machine learning algorithms with various combinations, we selected key genes (METRNL, SYT8, COL18A1, TAP1, MEST, AHNAK, RPP21, AKAP13, RNH1) based on the C-index from multiple validation sets. By performing univariate and multivariate Cox analyses on clinical characteristics and model genes, we identified independent prognostic factors (HNAK, RNH1, TAP1, Age, and Stage) for constructing a Nomogram model, which exhibited robust diagnostic accuracy, prognostic prediction ability, and clinical decision-making utility. The expression profiles of model genes varied across different clinical features. Further, we assessed the correlation between immune cell scores and nomogram scores using seven immune cell infiltration prediction algorithms. Finally, we conducted drug sensitivity analysis and molecular docking of Nomogram model genes using the CMap database. This study furnishes valuable insights and a theoretical framework for prospectively evaluating prognosis, selecting appropriate drugs, and understanding the implications of DNA methylation alterations on BLCA prognosis.

2 Material and methods

2.1 Data preparation and preprocessing

Data were from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) database, including GSE130001 (scRNAseq of 2 muscle-invasive bladder cancer samples), GSE13507 (microarray analysis of 165 primary bladder cancer samples, 23 recurrent non-muscle invasive tumor tissues, 58 normal looking bladder mucosae surrounding cancer, and 10 normal bladder mucosae), GSE32894 (gene expression profiles of 308 urothelial cancer samples), GSE48276 (gene expression profiles of 116 urinary tract epithelial cancer samples), GSE70691 (expression data and survival outcome data of 22 NAC-MVAC and 27 adjuvant MVAC treated bladder cancer), and GSE31684 (93 bladder cancer patients managed by radical cystectomy). Immunohistochemistry data of normal and tumor cells were obtained from The Human Protein Atlas (HPA, https://www.proteinatlas.org/).

The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov) database was utilized for downloading DNA methylation data, transcriptome data, and clinical data pertaining to BLCA. Following single-factor Cox analysis of 357,834 methylation sites, we selected 714 genes with 10 or more methylation sites showing prognostic significance (P < 0.05). Data processing from public databases in this study included the removal of samples with missing information and the averaging of data for genes with multiple entries in the expression matrix. Furthermore, all public databases allowed unrestricted access and utilization without the need for additional ethical approval. Data acquisition and analysis procedures adhered to relevant regulations.

2.2 Single-cell sequencing data analysis

We conducted quality control and data cleaning on the GSE130001 dataset to ensure the accuracy and reliability of subsequent analyses, based on criteria including nCount_RNA > 1000, 100 < nFeature_RNA < 5000, and percent.mt < 10. Single-cell sequencing analysis in this study was performed using the “Seurat” package for single-cell sequencing data. We utilized PC = 6, Resolution = 1.2 as parameters and employed Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction of single-cell data. Cell subpopulations were manually annotated and visualized using TopGene (http://www.topgene.cn/page/home/home.html), and the expression profiles of specific markers were depicted in the form of heatmaps.

2.3 Initial gene screening and enrichment analysis

We intersected the BLCA prognostic methylation gene set with epithelial cell markers. The resulting genes underwent single-variable Cox analysis to identify those with significant prognostic value (P < 0.05), which were then visualized in forest plots. Subsequently, utilizing the “ssgsea” function, we conducted Single-sample gene set enrichment analysis (ssGSEA) on the refined gene set to generate comprehensive scores for each sample. By stratifying samples into high and low scoring groups based on the median score, we compared Overall Survival (OS) between the two groups and illustrated the predictive capability of the scores through Kaplan–Meier (KM) curves. Further analysis involved examing differential gene expression between the high and low scoring groups and conducting Gene Set Enrichment Analysis (GSEA) based on the list of differentially expressed genes. The top 3 pathways with the highest and lowest enrichment scores were visually represented using line graphs.

2.4 Gene selection through combination of machine learning methods

In order to enhance prediction accuracy and robustness, we employed ten machine learning algorithms, including stepwise Cox, survival-SVM, Least Absolute Shrinkage and Selection Operator (LASSO), SuperPC, CoxBoost, Random Survival Forests (RSF), Enet, Ridge, Partial Least Squares Regression Generalized Linear Models (plsRcox), and Gradient Boosting Machine (GBM), to conduct 101 combinations for gene selection. TCGA-BLCA was utilized as the training set, while GSE13507, GSE32894, GSE48276, and GSE70691 served as validation sets. Consistency indices (C-index) and average C-index were computed for each algorithm combination within each dataset. We selected the algorithm combinations with C-index ≥ 0.6 across all datasets and the highest average C-index for further gene selection. The performance of the genes was evaluated through receiver operating characteristic (ROC) curve analysis.

Subsequently, we visualized the expression level differences of the selected genes in various cell subgroups using Uniform Manifold Approximation and Projection (UMAP) plots and violin plots.

2.5 Construction, calibration, and validation of risk model

We initially conducted univariate Cox analysis integrating the aforementioned genes with clinical-pathological features such as Age, Stage, and Grade. Furthermore, multivariate Cox analysis was further conducted to screen for independent prognostic factors, with significance set at P < 0.05. A nomogram model predicting prognosis was constructed based on the identified independent prognostic factors. The model’s risk score is the sum of the product of each model factor’s score and its corresponding coefficient, as illustrated below:

$$Risk score=\sum_{i=1}^{n}\left\{{Score}_{{factor}_{i}}*\right.\left.{\alpha }_{i}\right\}$$

Here, \({Score}_{{factor}_{i}}\) represents the score of model factors, while \({\alpha }_{i}\) represents the coefficients corresponding to model factors. Calibration curves at three time points (one year, three years, and five years) were used to assess their predictive accuracy. We assessed samples from TCGA-BLCA with a nomogram, categorizing them into high and low score groups using the median, and analyzed the variations in prognosis levels between the two groups through Kaplan–Meier (KM) plots and cumulative risk factor plots. ROC curves were also utilized to assess the model's diagnostic performance at three time points (1 year, 3 years, and 5 years), while Decision Curve Analysis (DCA) was utilized to assess the clinical decision benefit at 1-year, 3-year, and 5-year time points.

To further validate the effectiveness and generalizability of the prognostic model, we conducted external validation using GSE31684 and GSE32894 datasets, evaluating the prognostic value, diagnostic performance, and clinical decision benefit of the model through KM plots, ROC plots, and DCA.

2.6 Immune infiltration analysis

In our analysis of immune cell infiltration, we employed seven immune infiltration algorithms (XCELL, TIMER, QUANTISEQ, MCPCOUNTER, EPIC, CIBERSORT-ABS, CIBERSORT). Through spearman correlation analysis, we calculated the correlation coefficient between immune scores and nomoscores. The correlation coefficients between immune cells and nomoscores, as determined by various algorithms, were visualized using scatter plots. Additionally, the top and bottom four correlation coefficients were displayed in scatter plots.

2.7 Analysis of gene expression levels and drug prediction

We visualized the expression level differences of genes in the nomogram model across different pathological stages, tissue grades, and between normal and tumor tissues using box plots. Additionally, we retrieved immunohistochemical samples of three genes from the HPA database to observe their expression distinctions between normal and tumor tissues. By comparing gene expression profiles from the nomogram model with reference gene expression profiles from the Connectivity Map, we calculated correlation scores and selected the top ten drugs with the lowest scores for visualization. Subsequently, we conducted molecular docking of the three drugs with the lowest scores and genes, illustrating their shared amino acids and hydrogen bonds.

2.8 Statistical analysis

All statistical analyses were conducted using R software (version 4.1.3). COX regression analysis was implemented using the “survival” and “survminer” packages. Differential expression analysis was conducted using the “limma” package. The “ggplot2” package was our primary tool for visualization. A threshold of P < 0.05 was considered statistically significant (*P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001).

3 Results

3.1 Analysis of single-cell data

We conducted quality control and data cleaning on the GSE130001 dataset to ensure the accuracy and reliability of subsequent analyses, and analyzed the single-cell sequencing data using the "Seurat" package. We employed the UMAP algorithm for dimensionality reduction of the single-cell sequencing data, followed by annotation into 5 cell subtypes based on different marker genes: Epithelial cells, NK cells, Endothelial cells, CD8 + T cells, and Fibroblasts (Fig. 1A, B). Subsequently, we visualized the expression of specific marker genes for each cell subtype using a heatmap: In the Epithelial cells subtype, genes AC015849.16, CLIC3, GUCA2A, CCL5, and FXYD4 exhibited higher expression levels; In the NK cells subtype, genes RRM2, DSCC1, MNS1, RAD54L, and ESCO2 showed elevated expression levels; In the Endothelial cells subtype, genes CENPA, NDC80, ASPM, DEPDC1, and IFI27 displayed higher expression compared to other genes; In the CD8 + T cells subtype, genes IFI27, RARRES1, APOL4, CYBA, PDK4, and CRABP2 exhibited elevated expression levels; In the Fibroblasts subtype, genes IFI27, RARRES1, CYBA, COL1A1, COL5A1, NTM, ITGA11, and SFRP4 showed higher expression compared to other genes (Fig. 1C).

Fig. 1
figure 1

Single cell sequencing analysis. A Distribution of cellular subtypes in the GSE130001 dataset. B Distribution of different cell types. C Expression profiles of marker genes in each cellular subtype

3.2 Screening of bladder cancer prognosis-associated methylation genes

We intersected the methylation gene set related to BLCA prognosis with the specific marker genes identified on the cell subtypes (Supplement Table 1), resulting in the selection of 45 genes (Fig. 2A). Subsequently, we conducted single-variable Cox analysis on the obtained genes, identifying 13 genes with statistically significant effects on prognosis (Fig. 2B). Subsequently, based on the selected genes, we performed ssGSEA gene set scoring on all patients, dividing samples into high and low scoring groups based on the median score, and visualized Kaplan–Meier curves. The results indicated that the survival rates of both high and low scoring groups decreased over time, with the low scoring group significantly higher than the high scoring group (P < 0.05), demonstrating that the scores of the 13 selected genes are prognostic risk factors (HR = 1.37, P = 0.038, Fig. 2C). Furthermore, we conducted GSEA analysis on genes differentially expressed between the two groups. We identified the top three categories with the highest Enrichment Score (ES): Cytoplasmic Ribosomal Proteins, Eukaryotic Translation Elongation, and Formation of Tubulin Folding Intermediates By CCT Tric; as well as the three with the lowest ES: Mir5093p Alteration of Yap1ecm AXIS, Creation of C4 and C2 Activators, and Inflammatory Response Pathway, visualizing the results (Fig. 2D–I).

Table 1 Coefficients of nomogram factors
Fig. 2
figure 2

Gene screening and enrichment analysis. A The Venn plot of BLCA methylation genes and Epithelial cell markers. B Forest plot of 13 genes with p < 0.05 in univariate Cox analysis. C Overall survival of high score group and low score group. D GSEA analysis of cytoplasmic ribosomal proteins. E GSEA analysis of eukaryotic translation elongation. F GSEA analysis of formation of tubulin folding intermediates by CCT tric. G GSEA analysis of mir5093p alteration of Yap1ecm axis. H GSEA analysis of creation of C4 and C2 activators. I GSEA analysis of inflammatory response pathway

3.3 Screening key genes using multiple machine learning algorithms

We employed multiple machine learning models to identify key genes from the screened gene set, utilizing TCGA-BLCA as the training set and GEO as the validation set. Among all models, algorithms based on the combination of survivalSVM consistently exhibited higher average C-index values (C-index > 0.6). Notably, the StepCox[both] + survivalSVM algorithm demonstrated the highest C-index values in GSE13507 (C-index = 0.611), GSE32894 (C-index = 0.746), and the average C-index (C-index = 0.655), while also achieving favorable results in GSE48276 (C-index = 0.648) and GSE70691 (C-index = 0.617). Therefore, we selected the StepCox[both] + survivalSVM algorithm combination for gene screening (Fig. 3A).

Fig. 3
figure 3

Comparison of algorithms for selecting model genes and validation of their efficacy. A Calculation of the C-index and average C-index for 101 combinations of machine learning methods in the GSE13507, GSE32894, GSE48276, and GSE70691 datasets B ROC curves of METRNL. C ROC curves of SYT8. D ROC curves of COL18A1. E ROC curves of TAP1. F ROC curves of MEST. G ROC curves of AHNAK. (I) ROC curves of RPP21. J ROC curves of AKAP13. K ROC curves of RNH1

Subsequently, nine genes were identified: METRNL, SYT8, COL18A1, TAP1, MEST, AHNAK, RPP21, AKAP13, and RNH1. ROC curves for each gene indicated that all nine genes had an Area Under Curve (AUC) above 0.6, demonstrating their robust diagnostic performance (Fig. 3B–J). The expression distribution of these nine genes was visualized in a UMAP plot, revealing that AHNAK, MEST, and TAP1 exhibited higher expression levels across various cell subtypes compared to other genes (Fig. 4A).

Fig. 4
figure 4

Characteristics and differences in the distribution of genes selected through multiple machine learning algorithms among single-cell subtypes. A UMAP plots depicting the molecular expression of 9 genes across cell subtypes. B Violin plots illustrating the expression levels of the 9 genes across various cell types

Furthermore, violin plots were utilized to visualize the expression levels of each gene across different cell subtypes. AHNAK showed higher expression in endothelial cells compared to other genes, while AKAP13, COL18A1, and RNH1 exhibited elevated expression in fibroblasts. MEST, METRNL, and SYT8 demonstrated relatively lower expression levels across all genes, whereas RPP21 showed significantly higher expression in NK cells, and TAP1 displayed higher expression in NK cells compared to other genes (Fig. 4B).

3.4 Construction and validation of nomogram model

We conducted single-variable Cox analysis and multivariable Cox analysis, incorporating nine genes along with clinical features including Age, Stage, Grade, and Gender (Supplement Table 2), to identify five independent prognostic factors (P < 0.05): AHNAK, RNH1, TAP1, Age, and Stage (Fig. 5A, B). Among these factors, Stage (HR = 1.451) was identified as prognostic risk factors, while the remaining genes exhibited less pronounced effects on prognosis. Subsequently, the identified independent prognostic factors were utilized to construct a nomogram prognostic model for predicting the survival rates of patients at 1, 3, and 5 years (Fig. 5C, Table 1).

Fig. 5
figure 5

Identifying independent prognostic factors and constructing a nomogram prognostic model. A Univariate COX regression analysis of nine genes and clinical pathological features. B Multivariate COX regression analysis of nine genes and clinical pathological features. C Nomogram model to forecast patient’s outcome in 1, 3 and 5 years. D The 1-year calibration plots for nomogram. E The 3-year calibration plots for nomogram. F The 5-year calibration plots for nomogram

We assessed the accuracy of the nomogram prognostic model at 1, 3, and 5 years using calibration curves. Our model demonstrated close agreement between calculated and observed values, indicating its accuracy in predicting survival at these time points, with the highest accuracy observed at 1 year (Fig. 5D–F). Employing the nomogram, we stratified samples from TCGA-BLCA into high and low-risk groups based on median scores and predicted prognosis using Kaplan–Meier curves. The results revealed a time-dependent decrease in survival rates for both high and low-risk groups, with the low-risk group significantly outperforming the high-risk group (P < 0.001). The nomogram risk model emerged as a prognostic risk factor (HR = 3.29, Fig. 6A). The AUC exceeded 0.7 at 1, 3, and 5 years, indicating good discriminatory performance of the model at these time points (Fig. 6B). In the cumulative risk factor plot, an increase in Risk Score correlated with a gradual rise in the number of deceased patients in both high and low-risk groups, accompanied by a decrease in survival time. RNH1, AHNAK, Stage, and Age exhibited higher scores in the high-risk group, while TAP1 demonstrated the highest scoring in the low-risk group (Fig. 6C). Clinical decision curves illustrated superior clinical decision benefits for patients based on the nomogram model compared to decisions based on individual factors alone at 1, 3, and 5 years (Fig. 6D–F). To further validate the effectiveness and applicability of the prognostic model, external validation was conducted using GSE31684 and GSE32894 datasets. Evaluation via Kaplan–Meier plots, ROC curves, and decision curve analysis (DCA) validated the model's strong prognostic predictive capability, excellent discriminatory performance, and superior clinical decision-making advantages. (Fig. 7A–J).

Fig. 6
figure 6

Evaluation of predictive models. A Kaplan–Meier survival analysis of groups with low and high scores. B ROC curves of the nomogram at 1-year, 3-year, and 5-year time points. C Cumulative risk factor plots and heatmaps of model factor scores for low and high score groups. D Decision Curve Analysis of the nomogram and factors at the 1-year time point. E Decision Curve Analysis of the nomogram and factors at the 3-year time point. F Decision Curve Analysis of the nomogram and factors at the 5-year time point

Fig. 7
figure 7

The validation of nomogram model in verification sets. A Survival curves between two score groups in GSE31684 cohort. B ROC curves at 1-year, 3-year, and 5-year time points in GSE31684 cohort. C Decision Curve Analysis of nomogram and factors at 1-year time point in GSE31684 cohort. D Decision Curve Analysis of nomogram and factors at 3-year time point in GSE31684 cohort. E Decision Curve Analysis of nomogram and factors at 5-year time point in GSE31684 cohort. F Survival curves between two score groups in GSE32894 cohort. G ROC curves at 1-year, 3-year, and 5-year time points in GSE32894 cohort. H Decision Curve Analysis of nomogram and factors at 1-year time point in GSE32894 cohort. I Decision Curve Analysis of nomogram and factors at 3-year time point in GSE32894 cohort. J Decision Curve Analysis of nomogram and factors at 5-year time point in GSE32894 cohort

3.5 Immunological analysis

We employed seven immune infiltration algorithms (XCELL, TIMER, QUANTISEQ, MCPCOUNTER, EPIC, CIBERSORT-ABS, CIBERSORT) to score immune cells in samples from TCGA-BLCA. Spearman correlation analysis was used to calculate the correlation coefficient between immune scores and nomogram scores. Among these, Cancer-associated fibroblast and Macrophage showed relatively high positive correlations with nomoscore across multiple algorithms (Fig. 8A). Specifically, the four with the highest positive correlations with nomoscore were: Cancer-associated fibroblast_MCPCOUNTER (R = 0.410, P < 0.001), Cancer-associated fibroblast_EPIC (R = 0.341, P < 0.001), Cancer-associated fibroblast_XCELL (R = 0.332, P < 0.001), Macrophage_TIMER (R = 0.295, P < 0.001); the four with the highest negative correlations with nomoscore were: T cell CD8 + _XCELL (R = − 0.313, P < 0.001), T cell CD8 + _CIBERSORT (R = − 0.299, P < 0.001), T cell follicular helper_CIBERSORT (R = − 0.276, P < 0.001), T cell CD4 + _EPIC (R = − 0.273, P < 0.001) (Fig. 8B–I).

Fig. 8
figure 8

Immune infiltration analysis. A Correlation coefficients between the scores of each immune cell and nomogram scores under 7 immune infiltration algorithms. B Spearman correlation analysis between cancer-associated fibroblast_MCPCOUNTER and nomoscores. C Spearman correlation analysis between cancer-associated fibroblast_EPIC and nomoscores. D Spearman correlation analysis between cancer-associated fibroblast_XCELL and nomoscores. E Spearman correlation analysis between Macrophage_TIMER and nomoscores. F Spearman correlation analysis between T cell CD8 + _XCELL and nomoscores. G Spearman correlation analysis between T cell CD8 + _CIBERSORT and nomoscores. H Spearman correlation analysis between T cell follicular helper_CIBERSORT and nomoscores. I Spearman correlation analysis between T cell CD4 + _EPIC and nomoscores

3.6 Analysis of clinical characteristics and drug prediction based on nomogram model genes

We visualized the expression differences of nomogram model genes (AHNAK, TAP1, RNH1) across different clinical pathological factors (Pathologic stage, Histologic grade, Normal vs Tumor) using box plots. Among these, AHNAK showed higher expression levels in higher grades of Pathologic stage (P < 0.001) and Histologic grade (P < 0.01), with higher expression in normal tissues compared to tumor tissues (P < 0.001); TAP1 exhibited higher expression in higher grades of Histologic grade (P < 0.001), with lower expression in normal tissues compared to tumor tissues (P < 0.05); RNH1 showed higher expression levels in higher grades of Histologic grade (P = 0.09), with higher expression in normal tissues compared to tumor tissues (P < 0.001) (Fig. 9A–I). Immunohistochemical analysis was also conducted to observe their expression differences in normal and tumor cells; AHNAK exhibited higher expression levels in normal individuals compared to cancer patients (Fig. 9J); TAP1 showed higher expression levels in normal individuals compared to cancer patients (Fig. 9K); RNH1 displayed higher expression levels in normal individuals compared to cancer patients (Fig. 9L).

Fig. 9
figure 9

Expression levels in different clinical features and immunohistochemistry images of model genes. A Box plots showing the expression of AHNAK in different pathologic stages. B Box plots showing the expression of AHNAK in different histologic grades. C Box plots showing the expression of AHNAK in normal and tumor tissues. D Box plots showing the expression of TAP1 in different pathologic stages. E Box plots showing the expression of TAP1 in different histologic grades. F Box plots showing the expression of TAP1 in normal and tumor tissues. G Box plots showing the expression of RNH1 in different pathologic stages. H Box plots showing the expression of RNH1 in different histologic grades. I Box plots showing the expression of RNH1 in normal and tumor tissues. J Immunohistochemistry images of AHNAK in normal and tumor tissues. K Immunohistochemistry images of TAP1 in normal and tumor tissues. L Immunohistochemistry images of RNH1 in normal and tumor tissues

Subsequently, we conducted drug sensitivity analysis on the nomogram model genes based on the Cmap database, selecting the top ten drugs with the lowest scores for visualization. We chose the three drugs with the lowest scores: SB-216763, Evodiamine, Androstenol (Fig. 10A) for molecular docking with the model genes. Molecular docking analysis revealed that in AHNAK, amino acid PHE-97 interacted with the targeted drug SB-216763 through hydrogen bonds at distances of 3.3 Å and 2.6 Å, with a binding energy of − 7.8 kcal/mol; amino acid PHE-97 also interacted with the targeted drug Evodiamine through a hydrogen bond at a distance of 3.1 Å, with a binding energy of − 9.4 kcal/mol; amino acid LEU-95 interacted with the targeted drug Androstenol through a hydrogen bond at a distance of 2.4 Å, with a binding energy of − 7.8 kcal/mol. In TAP1, the targeted drug SB-216763 interacted with amino acids THR-232, SER-233, DC-33, DG34, DC-35 through hydrogen bonds at distances ranging from 1.7 to 3.4 Å, with a binding energy of − 8.5 kcal/mol; amino acid G-8 interacted with the targeted drug Evodiamine through a hydrogen bond at a distance of 2.3 Å, with a binding energy of − 8.4 kcal/mol; amino acid G-6 and C-7 interacted with the targeted drug Androstenol through hydrogen bonds at distances of 2.4 Å and 2.1 Å respectively, with a binding energy of − 7.1 kcal/mol. In RNH1, amino acids GLN-605 and THR-604 interacted with the targeted drug SB-216763 through hydrogen bonds at distances ranging from 1.9 to 3.0 Å, with a binding energy of − 6.9 kcal/mol; amino acid ARG-655, GLN-589, ALA-583, and GLN-552 interacted with the targeted drug Evodiamine through hydrogen bonds at distances of 2.4 Å, 3.1 Å, 3.2 Å, and 2.3 Å respectively, with a binding energy of − 7.6 kcal/mol, demonstrating the favorable targeting effects of small molecule drugs on the above proteins (Fig. 10B).

Fig. 10
figure 10

Drug prediction. A Top ten drugs with the lowest scores for three genes under the Cmap algorithm. B Molecular docking results of the top three drugs with the lowest scores from the above analysis for the three genes

4 Discussion

BLCA, as one of the most prevalent malignant tumors in the urinary system, is characterized by high heterogeneity, rapid progression, and a notable recurrence rate [17]. Data indicates that NMIBC patients face a recurrence risk exceeding 50% and a progression risk exceeding 20% within five years after 1 to 3 years of treatment [3]. In case progressing to MIBC, radical cystectomy (RC) is typically the primary treatment approach, however, around half of postoperative patients still encounter recurrence or metastasis [18]. Hence, there is a pressing need for the development of effective biomarkers to aid in early disease identification and diagnosis, offer personalized treatment targeting different stages of disease, and facilitate prognosis prediction to reduce recurrence rates and improve outcomes.

Aberrant methylation has been implicated in various cancers, where DNA methylation plays a crucial role in regulating the expression of target genes through transcriptional processes, thereby influencing different cellular behaviors [19]. Based on the stable presence of DNA methylation gene products in blood and their ease of detection, we believe that leveraging methylation genes to construct predictive models for early BLCA diagnosis and prognosis holds great promise [20].

Utilizing single-cell sequencing data from the GEO database (GSE130001), we noted a predominance of epithelial cells displaying a scattered distribution, whereas CD8 + T cells, endothelial cells, NK cells, and fibroblasts exhibited a more clustered arrangement. This indicates a high proportion of cancer cells in tumor tissue, with some presence of immune and fibroblast cells. There is considerable heterogeneity in gene expression patterns within tumor cells, and further exploration of the gene expression characteristics of each cell component and the labeling of cell subpopulation states is crucial for the development of highly targeted therapies.

After performing single-factor Cox analysis for each methylation site, we selected genes with prognostic significance (P < 0.05) containing 10 or more methylation sites as methylation genes relevant to BLCA prognosis in this study. Intersection of the malignant marker of epithelial cells and the DNA methylation gene set yielded 45 methylation genes related to BLCA. Through single-factor Cox analysis, we identified 13 genes associated with BLCA prognosis, which was validated by ssGSEA algorithm analysis of OS scores in high and low scoring groups. We found that while each gene in these 13 genes had limited impact on prognosis individually (HR value close to 1), their combined scoring demonstrated high prognostic value, with patients scoring higher showing worse prognosis, suggesting that prognosis-related methylated epithelial genes may synergistically exert oncogenic effects.

Furthermore, our GSEA revealed an upregulation of crucial biological processes such as protein synthesis and folding, signal transduction, and gene expression regulation in the high scoring group. In contrast, pathways associated with signal transduction, immune system regulation, and inflammation response were found to be elevated in the low scoring group. Cells in the high scoring group exhibited an active state of material synthesis and growth signal coordination, often linked to malignant tumor progression, conversely, the low-scoring group exhibited a more active immune status, indicating superior tumor immune function and thus, a more favorable prognosis.

In our comprehensive comparison of 101 machine learning methods, we identified an algorithm that combines StepCox with survivalSVM as the most valuable. Following refinement, we identified 9 genes: METRNL, SYT8, COL18A1, TAP1, MEST, AHNAK, RPP21, AKAP13, and RNH1. Through ROC curve validation, we found their strong diagnostic efficiency (AUC > 0.6), suggesting the potential for these genes to serve as diagnostic test criteria. Using umap plots, we observed that the expression levels of AHNAK, TAP1, MEST, and RNH1 are higher than other genes, suggesting these genes may be more active in BLCA tissues, possibly playing certain biological roles requiring further research. Violin plots showed that several genes are expressed at higher levels in NK cells and fibroblasts, indicating these genes are detrimental to tumor immunity and beneficial for maintaining the tumor microenvironment by fibroblasts. These hypotheses necessitate additional experimental validation to substantiate their implications.

We combined clinical and pathological factors to perform univariate and multivariate Cox analyses on the 9 genes, identifying AHNAK, RNH1, TAP1, age, and stage as independent prognostic factors, and constructed a nomogram prediction model based on these factors. AHNAK, recognized as a nucleoprotein, is a Protein Coding gene encoding a substantial (700 kDa) structural scaffold protein featuring a central domain containing repeat sequences of 128 amino acids. This protein is implicated in various processes including blood–brain barrier formation, cell structure and migration, regulation of cardiac calcium channels, and tumor metastasis [21]. Presently, it is believed that AHNAK mainly regulates the cell cycle and cell growth by bolstering the transcriptional activity of receptor-regulated Smads (R-Smads) and amplifying the signal transduction of transforming growth factor β (TGFβ) [22]. Silencing AHNAK has been associated with promoting nasopharyngeal carcinoma progression by upregulating ANXA2 protein, although its precise role in BLCA remains unresolved [23]. RNH1, alternatively known as Angiogenin Inhibitor 1, serves as a ribonuclease within the protein cytoplasmic RNase inhibitor family, exerting its effects through binding to RNases in various tissues [24]. Given the pivotal role of angiogenesis in cancer development, RNH1 can inhibit tumor growth by impeding cell proliferation, migration, and invasion through the inhibition of epithelial-mesenchymal transition [25]. RNH1 has been shown to play a regulatory role in the malignant progression and resistance in various cancers, such as gastric cancer and breast cancer [26, 27]. However, its role in BLCA requires further study. TAP1, an ABC transporter protein linked to antigen processing, forms a complex with TAP2 to facilitate the unidirectional translocation of peptide antigens from the cytosol to the endoplasmic reticulum (ER) for loading onto MHC class I molecules [28, 29]. Disruption of MHC class I antigen presentation due to TAP1 gene alterations can enable cancer cells to evade immune surveillance [30]. For example, in melanoma, miR-26b-5p and miR-21-3P bind to the 3’ untranslated region (UTR) of TAP1 transcript and downregulate TAP1 expression [31]. Our prediction model exhibited robust diagnostic efficiency, prognostic prediction value, and clinical-decision making benefit in both training and validation cohorts. With these promising results, we believe this model hold considerable potential for accurate prognostic predictions.

Based on predictive modeling, we employed seven immune infiltration algorithms for scoring and conducted correlation analysis. The results revealed that among the top four positively correlated factors, three were associated with cancer-associated fibroblasts (CAFs), suggesting a crucial role of CAFs in BLCA. Further investigation into CAFs is warranted. Conversely, the top four negatively correlated factors were all related to T-cell immunity, indicating a lower T-cell immune level in more malignant BLCA cases. Activation and restoration of T-cell immunity in the tumor microenvironment represent important avenues for improving tumor progression. Analysis of the expression levels of AHNAK, TAP1, and RNH1 showed higher expression levels of AHNAK and RNH1 in normal tissues compared to tumor tissues, whereas RNH1 exhibited the opposite trend, which was confirmed by immunohistochemistry results. However, we observed contradictory findings where AHNAK and RNH1 exhibited higher expression levels in higher stages or grades, potentially due to sample size limitations introducing bias or other factors influencing their biological effects despite their possible anti-cancer roles. As such, our hypotheses necessitate rigorous experimental validation to elucidate the underlying mechanisms further.

Through drug sensitivity analysis, we found that Evodiamine exhibited the highest average binding affinity with three genes, followed by SB 216763. However, SB 216763 showed more binding sites with the proteins of the three genes. Both drugs demonstrated strong binding capabilities with target proteins and characteristics of low off-target effects, indicating their potential as targeted therapeutic agents. Further experiments are needed to select the most suitable targeted drugs based on toxicity, efficacy, and other factors. Our analysis provides insights for the development of targeted drugs.

However, there are some limitations to this study. The conclusions drawn are primarily based on bioinformatics analysis and require further in vitro and in vivo validation. Additionally, the samples involved in this study were retrospective, necessitating comprehensive and in-depth clinical research for potential application.

5 Conclusion

We conducted an in-depth investigation into genes harboring DNA methylation sites in BLCA patients, aiming to explore their relationship with patient prognosis. Through comprehensive data analysis from public databases, we successfully identified a set of DNA methylation sites associated with BLCA prognosis and subsequently formulated a prognostic-associated methylation gene collection. Utilizing various machine learning algorithms, we we curated a set of pivotal prognosis-linked genes. We also developed a Nomogram model to predict the prognosis of BLCA patients, which demonstrated good predictive performance and could provide robust support for clinical decision-making. Additionally, our study revealed differences in gene expression profiles across distinct clinical attributes, alongside shedding light on the potential influence of immune cell infiltration and drug sensitivity on prognosis. In summary, this study provides a comprehensive theoretical framework and practical directives for the evaluation of prognosis, personalized therapeutic interventions, and precision medicine in BLCA, thus significantly enhancing patient survival quality and treatment outcomes.