1 Introduction

Prostate cancer is one of the most common malignant tumors in men, accounting for approximately 29% of cancer cases in men [1]. Although its five-year survival rate is relatively high, once the tumor becomes invasive, it can quickly metastasize and become difficult to treat. Currently, the main treatments for prostate cancer include surgical resection, radiation therapy, chemotherapy, and hormone therapy. Although these treatments can control the growth and spread of the tumor to some extent, they still have limitations, such as unstable treatment effects and the development of drug resistance [2]. Therefore, a deep understanding of the mechanisms underlying the occurrence and development of prostate cancer is of great significance for improving treatment effectiveness and developing new treatment strategies.

While cell proliferation and differentiation in normal tissues are stringently controlled, tumors exhibit abnormal alterations in this regulatory process, which also shapes the immune response and thereby helps tumor cells evade immune attack. The difficulty of treating tumors arises in part from this diversity in differentiation: different cells exhibit different phenotypes and functions and therefore respond differently to treatment [3,4,5].

Single-cell transcriptomics offers significant advantages over bulk transcriptomics by providing insights into gene expression at the individual cell level. Unlike bulk transcriptomics, which averages gene expression across a population of cells, single-cell approaches allow for the analysis of gene expression variations among distinct cells. This high-resolution perspective is crucial for understanding cellular heterogeneity and identifying subtle differences that might be masked in bulk analyses [8, 9]. Additionally, single-cell transcriptomics is instrumental in discovering novel cell types. The ability to profile the transcriptome of individual cells allows for the identification of previously uncharacterized cell types and states. This capability is particularly valuable in expanding our knowledge of cellular diversity and function, leading to new insights into biological processes and potential therapeutic targets [10]. Furthermore, single-cell transcriptomics facilitates the study of diseases and individual differences at a finer scale. By examining gene expression patterns in individual cells, researchers can uncover how specific cell types contribute to disease progression or exhibit unique characteristics in different individuals. This detailed resolution helps in understanding the molecular mechanisms underlying various conditions and enables more precise identification of disease-associated cellular subpopulations [11,12,13].

Although single-cell transcriptomics has been used to explore the heterogeneous expression characteristics of prostate cancer in depth, normal epithelial cells themselves exist in different states of differentiation, and these normal differentiation characteristics may be confused with heterogeneous tumor characteristics. In this study, we combined non-negative matrix factorization (NMF) clustering and graph-based clustering to jointly confirm the reliability of the clustering results. Additionally, we used an anchor-based integration algorithm to identify epithelial cells in peritumoral tissues whose differentiation state is consistent with that of tumor cells, providing a benchmark for subsequent research to detect aberrant expression among tumor cell subpopulations more accurately.

2 Patients and methods

2.1 Preprocessing of single-cell data

Single-cell prostate data were obtained from the GEO public database (GSE193337), a dataset of single-cell and bulk RNA sequencing of human prostate adenocarcinoma and normal benign prostate samples [14].

The data were imported into R (version 4.3.1) and converted to Seurat objects using the Seurat R package (version 4.3.0.1) [15]. The gene expression matrix was normalized to the total read count of each cell with the NormalizeData function and scaled by linear regression with the ScaleData function, and the IntegrateData function was used to perform anchor-based integration of the single-cell data. Quality filtering was then performed to remove cells with fewer than 201 or more than 8000 detected genes, total counts greater than 30,000, or more than 20% of Unique Molecular Identifiers (UMIs) derived from the mitochondrial genome.
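The filtering thresholds above can be illustrated with a minimal Python sketch. This is not the Seurat code used in the study; the `CellQC` record and the `passes_qc` helper are hypothetical names introduced here, and the reading of the 30,000 threshold as total counts per cell is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell quality metrics (hypothetical record, for illustration only)."""
    n_genes: int      # number of detected genes in the cell
    n_counts: int     # total counts in the cell (assumed meaning of the 30,000 cutoff)
    pct_mito: float   # percentage of UMIs from the mitochondrial genome


def passes_qc(cell, min_genes=201, max_genes=8000,
              max_counts=30000, max_pct_mito=20.0):
    """Return True if the cell survives the thresholds described in the text."""
    return (min_genes <= cell.n_genes <= max_genes
            and cell.n_counts <= max_counts
            and cell.pct_mito <= max_pct_mito)


def filter_cells(cells):
    """Keep only the cells that pass every threshold."""
    return [c for c in cells if passes_qc(c)]
```

A cell is removed as soon as it violates any one of the criteria, which mirrors the "or" wording of the filtering rule.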

The processed expression matrix was further regressed against potential confounding factors (such as "S.Score", "G2M.Score", and "percent.mt") and then subjected to principal component analysis (PCA) using the RunPCA function to reduce the dimensionality of the data. The RunTSNE function was then used to perform t-SNE dimensionality reduction on the first 15 principal components, which balanced the expression characteristics of the cells against background noise. A neighbor graph was constructed for clustering analysis, and a resolution parameter of 0.5 was selected to identify small differences between cell groups.

2.2 Cell type identification

The preprocessed single-cell data were compared to the reference dataset using the SingleR package [16], and the cell type annotation of each single-cell sample was obtained by correcting the SingleR results according to the characteristic genes of each cluster. Because it is difficult to distinguish normal epithelium from tumor epithelium in tumor tissues precisely by algorithm, and because normal epithelium within tumor tissues may exhibit characteristics similar to tumor cells under the influence of the tumor microenvironment, we cautiously defined normal epithelial cells and tumor-associated epithelial cells (TAECs) based on the source of the samples. In the single-cell dataset, monocytes/macrophages (CD14, CD68, CD163), T cells (CD3D, CD3E, CD4), and B cells (CD19, CD79A, CD79B) were isolated and uploaded to Azimuth (http://azimuth.satijalab.org/app/azimuth), where cell types were identified based on gene expression similarity and feature alignment, followed by manual proofreading, yielding a more comprehensive cell annotation. The proportion of each cell type was then calculated.
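The final step, computing cell type proportions per sample, can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the `(sample_id, cell_type)` input layout are assumptions, not the authors' code.

```python
from collections import Counter, defaultdict


def cell_type_proportions(labels):
    """Compute per-sample cell type proportions.

    labels: iterable of (sample_id, cell_type) pairs, one pair per cell.
    Returns {sample_id: {cell_type: fraction of that sample's cells}}.
    """
    counts = defaultdict(Counter)
    for sample, cell_type in labels:
        counts[sample][cell_type] += 1
    return {
        sample: {t: n / sum(c.values()) for t, n in c.items()}
        for sample, c in counts.items()
    }
```

Within each sample the fractions sum to 1, so samples with different cell numbers can be compared directly.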

2.3 Heterogeneous clustering of TAECs and preliminary enrichment analysis

Using the NMF R package (version 0.26) [17], 3434 TAECs from 4 tumor patients underwent unsupervised NMF clustering. With k = 4 as the number of factors for each sample, 16 distinct metagene expression patterns were obtained across the four tumor samples. Four sets of characteristic genes were then manually identified by hierarchical clustering and comparison of these 16 metagene expression patterns. For each set of characteristic genes, the loadings of repeated genes were summed and the original loadings of unique genes were retained. The second characteristic gene set contained characteristic genes of T cell subsets (CD4, CD8) and was associated with a low number of cells (164/3434), so these cells were considered doublets. The remaining three characteristic gene sets were ultimately retained as heterogeneous subsets of prostate TAECs. The differences in the expression of the TAECs subsets in each sample were then visualized, and a t-SNE map was constructed to visualize the expression of the characteristic genes of the cell subsets. Differential genes between each of the three subpopulations and the total normal epithelial cells were calculated, and functional enrichment analysis was performed based on these differential genes.
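To make the factorization step concrete, the following is a minimal Python sketch of NMF, which approximates a non-negative expression matrix V (genes × cells) as the product of a loading matrix W (genes × k metagenes) and a coefficient matrix H (k × cells). It uses the Lee and Seung multiplicative updates for the Frobenius objective, which is an assumption chosen for brevity and is not necessarily the algorithm or the settings used by the NMF R package in this study; all function names are illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (m x n) into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h
```

Each row of H can be read as the activity of one metagene across cells, and each column of W as the gene loadings of that metagene; running the factorization with k = 4 on each of four samples would give the 16 metagene patterns that are then compared across samples.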

2.4 Cell trajectory analysis, matching of TAECs to normal epithelial cells, and functional enrichment analysis of TAECs subsets

The Monocle2 R package (v2.8.0) [18] was used to construct CDS (Cell Data Set) objects from the Seurat objects, select highly variable genes with average expression greater than 10%, construct pseudotime trajectories, and order and visualize the cells along pseudotime.

Based on anchor-based integration, paired normal epithelial cells occupying the same position in the t-SNE neighbor graph were identified for each of the three TAECs subpopulations. These three normal epithelial clusters were used as controls for the three subpopulations in the same differentiation state. Subsequently, the differential genes between each TAECs subpopulation and its paired normal epithelium in the same differentiation state were calculated. For these differential genes, the "fgsea" function [19] was used for enrichment analysis to find the co-expressed genes of the enrichment pathways common to the three paired samples.

2.5 Cell communication analysis

Cell communication analysis was performed using the "CellChat" R package [20] (http://www.cellchat.org/). Using network analysis and pattern recognition techniques, we deduced and examined intercellular communication networks from scRNA-seq data, illustrating the varied ligand-receptor interactions and their signaling routes in normal tissue epithelial cells.

2.6 Transcription factor regulatory network analysis

SCENIC analysis (version 1.3.1, corresponding to RcisTarget 1.2.0.0 and AUCell 1.22.0) [21, 22] was performed using the RcisTarget motif databases and GRNBoost with default parameters to construct the transcription factor regulatory network. Transcription factor activity scores were calculated for the three TAECs subtypes relative to their corresponding normal epithelial cells. Upstream regulatory factors of the co-expressed differential genes of the common enrichment pathways (P53 and apoptosis signaling pathways) of the three paired samples (TAECs1-N1, TAECs2-N2, TAECs3-N3) were identified.

2.7 Immune infiltration analysis

TCGA-PRAD data were downloaded from the UCSC database [23], and transcriptome data of 482 malignant tumor samples with complete clinical information were retained for subsequent analysis. The proportions of 22 tumor-infiltrating immune cell types in each sample were estimated using the 'CIBERSORT' R package, namely: B cells naive, B cells memory, Plasma cells, T cells CD8, T cells CD4 naive, T cells CD4 memory resting, T cells CD4 memory activated, T cells follicular helper, T cells regulatory (Tregs), T cells gamma delta, NK cells resting, NK cells activated, Monocytes, Macrophages M0, Macrophages M1, Macrophages M2, Dendritic cells resting, Dendritic cells activated, Mast cells resting, Mast cells activated, Eosinophils, Neutrophils. The correlation between the expression levels of the target genes (YY1, EHF, NKX3-1, ATF3) and tumor immune cell infiltration was then calculated.
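The correlation step can be sketched as follows. The text does not state which correlation coefficient was used, so this Python sketch shows both Pearson and Spearman coefficients as assumptions; the inputs would be a gene's expression values and one immune cell type's estimated fractions across the same samples, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A rank-based coefficient is often preferred for estimated cell fractions, which are bounded and frequently skewed, but either choice yields a signed value whose sign indicates the direction of the association.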

2.8 Pan-cancer analysis

Batch-corrected and normalized pan-cancer RNA sequencing gene expression data were downloaded from the UCSC database. The data were processed using R software, and Wilcoxon tests were performed to compare the expression of YY1, EHF, NKX3-1, and ATF3 between tumor tissues and adjacent tissues in various cancers. Univariate Cox proportional hazards regression was used to evaluate the hazard ratios for disease-free interval (DFI), overall survival (OS), and progression-free interval (PFI) associated with differences in the expression of YY1, EHF, NKX3-1, and ATF3 in each cancer type.
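As a reading aid for the hazard ratios reported in Section 3.7, the following Python sketch shows how a hazard ratio (HR), its 95% confidence interval, and a Wald P value relate to the Cox regression coefficient. It assumes a normal (Wald) approximation on the log scale, which is how such intervals are commonly reported but is not confirmed by the text; the function names are illustrative.

```python
import math

Z_975 = 1.959964  # standard normal quantile for a two-sided 95% interval


def hr_with_ci(beta, se):
    """Hazard ratio and 95% CI from a Cox coefficient and its standard error."""
    return (math.exp(beta),
            math.exp(beta - Z_975 * se),
            math.exp(beta + Z_975 * se))


def wald_from_hr(hr, ci_low, ci_high):
    """Back-calculate coefficient, standard error, z, and two-sided P value
    from a reported hazard ratio and its 95% CI (Wald approximation)."""
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_975)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p
```

Applied to the YY1 overall survival estimate in prostate cancer (HR 11.9, 95% confidence interval 1.76−81.2), `wald_from_hr(11.9, 1.76, 81.2)` gives a P value close to the reported 0.0112, with small differences attributable to rounding of the published values.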

3 Results

3.1 Machine annotation combined with manual annotation used to identify cell types

A total of 7 cell clusters, namely epithelial cells (EPCAM), fibroblasts (DCN), vascular endothelial cells (VWF), T cells (CD3E), B cells (CD79B), mast cells (KIT), and monocytes/macrophages (CD68), were tentatively identified according to the gene expression profile of the cells and typical marker genes (NKG7 was used to validate the NK cells identified subsequently). Azimuth was then used to obtain a more comprehensive annotation (Fig. 1A–C). A range of lymphocyte subtypes was identified among the immune cells, including T cells, B cells, and monocytes/macrophages. Furthermore, NK cells, dendritic cells, and a few other blood cells were isolated from these cells (Fig. 1A–C). Finally, the cell count ratios of the various cell types were calculated (Fig. 1D), and the expression of characteristic genes representative of specific cell types was depicted (Fig. 1E) (Supplementary Figure 1A). The expression of these characteristic genes supports the reliability of the cell type classification (Fig. 1E) (Supplementary Figure 1A). Although the distribution of cell types varied across samples, each sample contained a relatively comprehensive range of cell types (Fig. 1C, D).

Fig. 1

A–C t-distributed stochastic neighbor embedding (t-SNE) plots of cells from 4 patients, colored by cell type (A), fine cell subpopulations within specific cell types (B), and sample source (C). D Relative proportions of cell types per prostate cancer patient. E t-SNE plots of marker genes for each cell subset.

3.2 Heterogeneous clustering of TAECs and their general characteristics

TAECs were extracted from the epithelial cells, and the 16 metagene expression patterns obtained by non-negative matrix factorization (NMF) were grouped into four gene expression patterns (Fig. 2A), of which one pattern with low confidence was eliminated. Three heterogeneous TAECs subpopulations with different gene expression patterns (TAECs1, TAECs2, and TAECs3) were finally determined (Fig. 2A, B). Hierarchical clustering revealed marked variation among the three subpopulations, with the TAECs2 and TAECs3 subpopulations being more closely related (Fig. 2A). Both the t-SNE map and the hierarchical clustering diagram showed clear divisions among these subpopulations, indicating that the clustering results effectively captured the diversity present in TAECs (Fig. 2A, B). Furthermore, each sample contained all three subsets of TAECs, indicating that this heterogeneity was not derived from individual differences (Fig. 2C).

Fig. 2

A Heat map depicting pairwise correlations of 16 tumor metagene expression patterns from 4 tumor patients. Clustering identified 4 coherent feature sets across tumor samples. B t-distributed stochastic neighbor embedding (t-SNE) plots of tumor cells from 4 patients, colored by TAECs subpopulation. C Relative proportions of TAECs types per prostate cancer patient. D t-SNE plots of characteristic gene expression for the three heterogeneous TAECs subpopulations. E GSEA enrichment analysis of the TAECs1 subset versus total normal epithelial cells. F GSEA enrichment analysis of the TAECs2 subset versus total normal epithelial cells. G GSEA enrichment analysis of the TAECs3 subset versus total normal epithelial cells.

By examining the expression of some characteristic genes, we found that the TAECs2 subpopulation overexpressed the basal cell marker genes TP63, KRT5, KRT14, and KRT19 (Fig. 2D), whereas the TAECs1 subpopulation overexpressed hormone- and secretion-related genes associated with terminally differentiated epithelial cells, such as AR, NPY, KLK3, and KLK4 (Fig. 2D). This indicates that the three subpopulations might have time-dependent differentiation features similar to those of normal epithelial cells. A gene expression heatmap was used to further characterize the expression features of the three subgroups (Supplementary Figure 1B).

Enrichment analysis was performed for each subpopulation compared to total normal epithelial cells, revealing that the TAECs1 subpopulation was enriched in the androgen receptor pathway (Fig. 2E); the TAECs2 subpopulation was enriched in the epithelial-mesenchymal transition and extracellular matrix pathways (Fig. 2F); and the TAECs3 subpopulation was enriched in the interferon pathway, various antigen processing and presentation pathways, and vesicle transport pathways (Fig. 2G). Notably, the three subpopulations may simply exhibit biological characteristics associated with different stages of normal epithelial differentiation, rather than true tumor behavior.

3.3 Pseudo-time analysis and paired enrichment analysis

Analysis of cell differentiation and developmental trajectories using Monocle 2 supported the hypothesis that TAECs follow a trajectory from TAECs2 through TAECs3 to TAECs1 (Fig. 3A–C), with the TAECs3 subpopulation closely related to the TAECs2 subpopulation (Fig. 3A–C). The characteristic gene expression trends of the three subpopulations were consistent with the inferred differentiation trajectories (Fig. 3D, E).

Fig. 3

A–C TAECs developmental trajectories inferred using Monocle 2, colored by TAECs subpopulation (A), inferred differentiation state (B), and pseudotime (C). D Curve plots showing expression changes of characteristic genes related to differentiation state along pseudotime. Point colors correspond to the cell state colors in B. E Distribution of TAECs along pseudotime and heatmap showing dynamic expression changes of selected genes and related pathways along pseudotime; genes were clustered according to their dynamic expression changes along pseudotime. F Differential enrichment analysis (GSEA) of the TAECs1 subpopulation versus its matched N1 normal epithelium. G Differential enrichment analysis (GSEA) of the TAECs2 subpopulation versus its matched N2 normal epithelium. H Differential enrichment analysis (GSEA) of the TAECs3 subpopulation versus its matched N3 normal epithelium.

Given the possible problems with the previous enrichment analysis, clusters of normal epithelial cells in the same differentiation state as the TAECs (N1, N2, N3) were identified on the basis of the dimensionality reduction and clustering obtained from anchor-based integration (Supplementary Figure 2A-C). Paired enrichment analysis was then conducted on the three TAECs subtypes and their matched normal epithelial cells, with the subpopulations in each pair at the same stage of differentiation (Fig. 3F–H). The results indicated that each of the three subtypes had stronger anti-apoptotic properties than its corresponding normal epithelial cells, as shown by negative enrichment of the P53, apoptosis, and NFKB-driven tumor necrosis factor signaling pathways (Fig. 3F–H). Additionally, the TAECs1 subpopulation showed negative enrichment of epithelial-mesenchymal transition, interferon signaling, and specific immune response pathways (Fig. 3F), consistent with a terminal differentiation phase and immune evasion. The TAECs2 subpopulation, which resembles basal cells, showed enhanced epithelial-mesenchymal transition activity compared to matched normal cells (Fig. 3G), suggesting possible stromal infiltration, although this was not histologically verified. The TAECs3 subpopulation exhibited enhanced secretory capabilities (Fig. 3H).

3.4 Cell communication analysis

TAECs are the major senders of MIF signals (Fig. 4A–C). Compared with matched normal epithelial cells at the same differentiation stage, they show higher expression of EGFR, ERBB2, and interferon receptors (Fig. 4D–F). In addition, TAECs express VEGF signaling at an earlier stage of differentiation than normal epithelial cells and might undergo a partial shift in expression from VEGFA to VEGFB signaling (Fig. 4G).

Fig. 4

A Cell-to-cell communication mediated by the MIF signaling pathway. B Expression of all signaling genes associated with the MIF signaling pathway. C Heatmap visualization of the calculated centrality score to identify the main signaling role of the cell population in the MIF pathway. D Cell-to-cell communication mediated by EREG_EGFR_ERBB2 signaling pathways. E Expression of all signaling genes associated with the EGFR signaling pathway. F Expression of all signaling genes associated with the interferon signaling pathway. G Expression of all signaling genes associated with the VEGF signaling pathway. H Transcription factor activity heat map of tumor subsets and their paired normal epithelial cells at the same TSNE neighbor plot location based on anchor integration.

Tumor cells abnormally expressed cell adhesion-related signals (SEMA4A, OCLN, NECTIN, MPZL1, JAG, ALCAM, CD46, CD99) (Supplementary Figure 3A), which are closely related to tumor survival and invasion, angiogenesis, and tumor immunity. Compared to matched normal epithelial cells, the expression of MHC molecules related to antigen presentation was higher in the TAECs2 and TAECs3 subpopulations (Supplementary Figure 3A).

Compared to matched normal epithelial cells, the TAECs2 and TAECs3 subpopulations expressed higher levels of SEMA3C, CXCL16 (Supplementary Figure 4A-C), and WNT5B signal (Supplementary Figure 4D); the TAECs1 and TAECs3 subpopulations showed higher levels of BAG signal (Supplementary Figure 4E, F).

The TAECs2 subpopulation showed higher expression of LAMB3 and LAMA5 molecules compared to matched normal epithelial cells (Supplementary Figure 3B). Together with the TAECs3 subpopulation, it also exhibited higher expression of matrix-related receptors (SDC1, SDC4, ITG) (Supplementary Figure 3C), indicating a closer association with the extracellular matrix.

Compared to matched normal epithelial cells, the TAECs1 subpopulation had higher expression of FZD5 and LRP6 (Supplementary Figure 5A, B), which could serve as receptors for WNT10A signal in plasma cells; the TAECs2 subpopulation specifically expressed ADM signal, acting on endothelial cells (Supplementary Figure 5C); the TAECs3 subpopulation showed higher expression of TNFRSF15, acting on MAIT cells and ILC cells (Supplementary Figure 5D).

3.5 Transcription factor regulatory network analysis

Using SCENIC to infer the transcription factor regulatory network in epithelial cells (Supplementary Table 1), we calculated transcription factor activity scores (Supplementary Table 2) for the three TAECs subpopulations with reference to matched normal epithelial cells at the same stage of differentiation. We extracted the top 20 transcription factor regulatory units with high and low transcriptional activity to draw the transcription factor activity heatmap (Supplementary Table 3) (Fig. 4H). In all three TAECs subpopulations, the transcriptional activity of the transcription factors YY1, NKX3-1, and EHF was higher than in the matched normal epithelium at the same differentiation stage (the YY1 regulatory unit ranked in the top 10) (Supplementary Table 3).

As previously noted, the MIF signaling pathway is overexpressed in various TAECs subpopulations (Fig. 4A–C). The SCENIC-inferred transcription factor regulatory network suggests that YY1 acts as a reliable upstream transcriptional regulator of MIF (Supplementary Table 1), with its target genes containing diverse genetic activity elements, underscoring its important function in tumor growth and progression. Because the functional enrichment of the three TAECs subpopulations was weakened in apoptosis-related pathways (Fig. 3F–H), we further intersected the differential genes of the three subpopulations enriched in the apoptosis and P53 pathways (Supplementary Table 4) and found that ATF3 was the upstream regulator of most of these genes (Supplementary Table 4). The transcriptional activity of ATF3 was higher in normal epithelial cells than in their paired TAECs (Supplementary Table 3), suggesting that TAECs might inhibit ATF3 by some mechanism to escape apoptosis.

3.6 Correlations of prostate cancer immune infiltration with YY1, EHF, NKX3-1 and ATF3 expressions

Correlation analysis showed that the expression of YY1 was significantly positively correlated with the immune infiltration of CD4 resting memory T cells, resting dendritic cells, and M1 macrophages (P<0.001, Cor = 0.30000817, 0.22245352, 0.18262690) (Supplementary Figure 6A).

The expression of NKX3-1 was significantly positively correlated with the immune infiltration of resting mast cells and M1 macrophages (P<0.001, Cor= 0.163012839, 0.154186876). The expression of NKX3-1 was significantly negatively correlated with the immune infiltration of suppressor T cells (Tregs) and large chemical cells (P<0.001, Cor= 0.246266240, 0.218359890) (Supplementary Figure 6B).

The expression of EHF was significantly positively correlated with the immune infiltration of CD4 resting memory T cells and M1 macrophages (P<0.001, Cor = 0.298198031, 0.154686150). It was negatively correlated with the immune infiltration of regulatory T cells (Tregs), CD8 T cells, and activated natural killer cells (P<0.001, Cor = 0.303334670, 0.210507210, 0.155223607) (Supplementary Figure 6C).

The expression of ATF3 was significantly positively correlated with the immune infiltration of large chemical cells, neutrophils, naïve B cells and follicular helper T cells (P<0.001, Cor= 0.491918316, 0.241133868, 0.218357002, 0.177083614). It was significantly negatively correlated with the immune infiltration of resting mast cells and memory B cells (P<0.001, Cor= 0.391263842, 0.154324034) (Supplementary Figure 6D).

3.7 Pan-cancer analysis revealing the expression and prognosis of YY1, NKX3-1, EHF and ATF3

To further explore the prognostic relevance of the transcription factors with aberrant transcriptional activity in prostate cancer, we performed a pan-cancer analysis (Supplementary Figure 7). The results showed that YY1 and NKX3-1 were significantly overexpressed and ATF3 was significantly underexpressed in prostate cancer tumor tissues relative to adjacent tissues, which was consistent with the results of the single-cell analysis. The hazard ratio (HR) of YY1 for overall survival in prostate cancer was 11.9 (95% confidence interval (CI) 1.76−81.2, P = 0.0112), and the hazard ratios of ATF3 in prostate cancer were 0.791 (95% CI 0.631−0.991, P = 0.0411) and 0.88 (95% CI 0.778−0.995, P = 0.0413). In addition, these four transcription factors were associated with an increased or decreased risk of poor prognosis in a variety of cancers. YY1, EHF, and NKX3-1 all showed a consistent, significant increase in the risk of adverse prognosis in pancreatic cancer (YY1: DFI HR 8.11, 95% CI 2.18−30.1, P = 0.00177; OS HR 3.17, 95% CI 1.68−5.97, P = 0.000357; PFI HR 2.63, 95% CI 1.41−4.89, P = 0.00232; EHF: DFI HR 1.57, 95% CI 1.07−2.32, P = 0.0224; OS HR 1.26, 95% CI 1.05−1.5, P = 0.0106; PFI HR 1.24, 95% CI 1.06−1.46, P = 0.00875; NKX3-1: DFI HR 2.46, 95% CI 1.41−4.28, P = 0.00145; OS HR 1.3, 95% CI 1.05−1.61, P = 0.0152; PFI HR 1.61, 95% CI 1.3−1.99, P = 0.106×10^-6). High expression of YY1 also showed a consistent increase in the risk of poor prognosis in renal papillary cell carcinoma (DFI HR 3.71, 95% CI 1.27−10.9, P = 0.0166; OS HR 9.1, 95% CI 3.78−21.9, P = 0.832×10^-8; PFI HR 4.19, 95% CI 2−8.78, P = 0.000149) (Supplementary Figure 8).

4 Discussion

Cell annotation is an essential process in single-cell analysis. Once cell types have been initially identified in a dataset, the lack of reliable feature genes complicates further annotation, especially when a finer categorization of specific cell types is needed. Automatic annotation methods based on comparison with reference datasets often lead to less accurate results [24, 25]. For the subsequent annotation of the preliminarily identified immune cells, we compared them with the PBMC demonstration dataset on Azimuth and obtained a more detailed cell annotation. The expression of CD4, CD8, and HLA molecules linked to antigen presentation was consistent with Azimuth's automated annotation.

M1 macrophages are generally considered tumor-killing macrophages, mainly involved in anti-tumor activity and immune promotion, while M2 macrophages are immunosuppressive, promoting tissue repair and tumor initiation and development [26]. However, in actual disease, M1 and M2 are not necessarily clearly defined, and M1 and M2 features are not mutually exclusive but often coexist [27, 28]. Furthermore, the effects of M1 or M2 stimuli are dynamic and complex, and the interplay among numerous signals requires additional in vivo and in vitro research [26]. Our analysis revealed no notable polarization in the expression of polarization markers, possibly because the cancer samples were at an early stage of the disease. Consequently, we did not label the polarization state of macrophages further, to ensure the reliability of the annotation.

In single-cell analysis, two prevalent clustering techniques are NMF clustering and graph-based clustering. NMF (non-negative matrix factorization) clustering uses matrix factorization for clustering. It is well suited to non-negative single-cell data such as gene expression and offers pattern-oriented clustering results: each cluster is represented as a linear combination of gene expression patterns, which enhances interpretability [17]. Consequently, NMF clustering is typically chosen in tumor heterogeneity studies to distinguish different cell groups and gene expression trends, aiding a deeper understanding of tumor heterogeneity. The graph-based clustering implemented in Seurat is rooted in graph theory and represents clustering results visually by charting the cellular similarity network [15]. In contrast to NMF clustering, Seurat's graph-based clustering is more sensitive to the choice of parameters, and different parameter configurations can yield different clustering results. Noise and batch effects can also affect graph-based clustering, so pre-processing is needed to mitigate them [29, 30]. In this study, we used the NMF algorithm to group the diverse TAECs and identified three distinct TAECs subpopulations. The resulting cluster annotations were then mapped onto Seurat's visualization, revealing clear distinctions among these subpopulations in the t-SNE plot. This indicates significant heterogeneity among the three subpopulations and shows that the clustering results are reliable and consistent. It also suggests that NMF and graph-based clustering can yield comparable results, provided that the batch effect [29] is efficiently managed and quality control is consistent across samples.

Both the heterogeneity within TAECs and the differences exhibited by epithelial cells along the normal differentiation trajectory can be observed in single-cell clustering. Although the two concepts overlap, it is necessary to distinguish between them in order to identify the tumor-specific features of each heterogeneous tumor subtype; otherwise, certain characteristics might be overlooked or exaggerated.

Enrichment analysis was first performed on the three tumor subpopulations identified through clustering, with each subpopulation compared against the total normal epithelial cells. This analysis showed that the TAECs1 subpopulation was enriched in the androgen receptor pathway (Fig. 2E); the TAECs2 subpopulation was enriched in the epithelial-mesenchymal transition and extracellular matrix pathways (Fig. 2F); and the TAECs3 subpopulation was enriched in the interferon pathway, various antigen processing and presentation pathways, and vesicle transport (Fig. 2G). However, we suspected that these enriched traits might also be present in epithelial cells at various differentiation stages and therefore do not necessarily represent tumor characteristics. Consequently, using anchor-based integration, we identified normal epithelial cells matching the differentiation state of the tumor cells, which served as a benchmark for comparing tumor subpopulations at the same differentiation stage. The subsequent paired enrichment analysis yielded results that differed from those of the initial analysis.

Although AR is highly expressed in highly differentiated luminal epithelial cells, the paired enrichment analysis revealed that TAECs showed increased activity in the androgen receptor pathway at the onset of differentiation. Indeed, prior research has reported irregular expression of AR [31] on the basal membrane of prostate cancer cells, as well as a fundamental role for the AR signaling pathway in various phases of prostate cancer development. Despite the development of endocrine resistance in many advanced cancers, tumors still maintain their responsiveness to hormones [32, 33]. The pathogenesis of these tumors is often associated with overactive downstream signaling, the presence of various AR mutations, and the activity and expression of AR co-regulatory factors [32, 33]. Our analysis of intercellular communication revealed unusually elevated levels of the EGFR and ERBB2 receptors in TAECs. This finding is consistent with previous reports of growth factor-related receptors in dysfunctional luminal epithelial cells and malignant cells of prostate cancer [34]. We also identified a TAECs2 subpopulation at the initial differentiation phase of basal cells that exhibited greater epithelial-mesenchymal transition activity than normal basal cells and closer extracellular matrix ligand-receptor interactions. This result aligns with earlier studies that noted increased expression of type IV collagenase, which breaks down the basement membrane, relative to normal prostate epithelial tissue [34], thereby boosting the cells' invasive capacity.
Furthermore, the paired enrichment analysis showed that every TAECs subset exhibited down-regulation of the P53 signaling pathway, the apoptosis process, and the NFKB-driven tumor necrosis factor signaling pathway relative to its corresponding normal epithelial cells. This suggests that all three TAECs subsets have an enhanced ability to resist apoptosis. Subsequent analysis of the transcription factor regulatory network identified ATF3 as a crucial regulatory element. By suppressing ATF3, tumor cells might evade apoptosis, and YY1 could be involved in controlling MIF signal expression, helping to reshape the tumor microenvironment. In summary, paired analysis across the different stages of differentiation is both justified and necessary.

Other studies have noted tumor subpopulations enriched for antigen presentation and immune/inflammatory responses, which are believed to correlate with improved prognosis [35]. Conversely, some research indicates that these subpopulations might be the focal points of inflammatory responses close to cancerous tissues [36]. In our study, analysis of cellular communication showed a significant increase in the expression of HLA-related molecules in epithelial cells in the early stages of differentiation (N2, N3, TAECs2, TAECs3), especially in the N3 and TAECs3 subsets. Within cancerous tissues (TAECs1, TAECs2), the elevated expression was even more pronounced, yet our paired enrichment analysis revealed no significant enrichment of the TAECs3 subset in the antigen presentation pathway compared with the N3 subset of normal epithelial cells. Numerous studies have reported the expression of HLA-related antigens in various tissues, including tumor cells [37]. Still, no definitive studies have established their specific roles and significance in re-epithelial tissues, although some researchers have suggested that HLA-G might be involved in the induction of immune tolerance [38]. We therefore hypothesize that epithelial cells in the early stage of differentiation from the basal cell state briefly express HLA molecules and perform antigen presentation, which helps immune cells carry out immune surveillance and thus prevents cancer during subsequent growth. In a poorly differentiated state, prostate cancer cells appear to have an invasive advantage because of their closer connection with the stroma; on the other hand, epithelial cells in the early differentiation stage show higher HLA molecule expression and are enriched in the antigen presentation pathway, which might be related to a better prognosis. These two observations appear contradictory.
We hypothesize that, as the disease progresses, tumor cells experience increasing dysregulation in two key respects: positive regulation of epithelial-mesenchymal transition and negative regulation of antigen presentation, which may further exacerbate the disease. The NMF clustering results indicate that heterogeneity among samples within the TAECs3 subtype is significantly higher than in the other two subtypes. The TAECs3 subtype may play a crucial role in future tumor progression. Before this can be established, we believe that in-depth research on the existence and function of this subtype in normal tissues is essential.

In a further pan-cancer analysis, we examined several transcription factors that exhibited aberrant transcriptional activity in prostate cancer to assess their expression and prognostic value in different cancer types. YY1 and NKX3-1 were significantly overexpressed in prostate cancer tumor tissues, whereas ATF3 expression was significantly lower than in adjacent tissues, consistent with the results of the preceding single-cell analyses. In particular, the overall survival hazard ratio of YY1 in prostate cancer was 11.9 (95% confidence interval 1.76−81.2, P = 0.0112), indicating that high expression of YY1 in prostate cancer patients is associated with poorer overall survival. In addition, the hazard ratio of ATF3 for disease-free survival in prostate cancer was 0.791 (95% confidence interval 0.631−0.991, P = 0.0411), and the hazard ratio for progression-free survival was 0.88 (95% confidence interval 0.778−0.995, P = 0.0413). This suggests that high expression of ATF3 in prostate cancer patients might be associated with better survival.
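
The reported P values can be checked approximately against the hazard ratios and confidence intervals. The Python sketch below assumes (this is not stated in the text) that each interval is a symmetric 95% Wald interval on the log hazard ratio scale, as is typical of Cox regression output; the function name wald_p_from_ci is illustrative.

```python
import math


def wald_p_from_ci(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover the z statistic, standard error, and two-sided Wald P value
    from a hazard ratio and its 95% confidence interval.

    Assumes the interval is exp(log(HR) +/- z_crit * SE), i.e. symmetric
    on the log scale.
    """
    log_hr = math.log(hr)
    # The CI width on the log scale equals 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, se, p


# Values quoted in the text for prostate cancer.
for label, hr, lo, hi in [
    ("YY1 overall survival", 11.9, 1.76, 81.2),
    ("ATF3 disease-free survival", 0.791, 0.631, 0.991),
    ("ATF3 progression-free survival", 0.88, 0.778, 0.995),
]:
    z, se, p = wald_p_from_ci(hr, lo, hi)
    print(f"{label}: z = {z:.2f}, SE(log HR) = {se:.3f}, P = {p:.4f}")
```

Under this assumption the recomputed P values fall close to the reported ones, with small differences expected from rounding of the published estimates. The check also makes the width of the YY1 interval explicit: a standard error near 1 on the log scale means the estimate of 11.9 is imprecise even though it is statistically significant.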

We further compared the expression and prognostic value of these transcription factors in a variety of cancers. Interestingly, YY1, EHF, and NKX3-1 were consistently associated with poor prognosis in pancreatic cancer, with high expression of YY1 associated with poorer outcomes for the disease-free interval (DFI), overall survival (OS), and the progression-free interval (PFI) (DFI HR 8.11, 95% CI 2.18−30.1, P = 0.00177; OS HR 3.17, 95% CI 1.68−5.97, P = 0.000357; PFI HR 2.63, 95% CI 1.41−4.89, P = 0.00232).

For renal papillary cell carcinoma, higher expression of YY1 was significantly associated with an increased risk of poor prognosis (DFI HR 3.71, 95% confidence interval 1.27−10.9, P = 0.0166; OS HR 9.1, 95% confidence interval 3.78−21.9, P = 8.32 × 10^-8; PFI HR 4.19, 95% confidence interval 2−8.78, P = 0.000149). These pan-cancer analyses further reinforce the importance of YY1 and ATF3 in prostate cancer and suggest that they may play roles in multiple cancer types. However, these results are observational only, and further validation and in-depth functional studies are needed to confirm their exact mechanisms of action in cancer. In addition, more comprehensive studies are needed to understand the biological significance of the different roles of these transcription factors in other cancer types. Future studies could use laboratory models and clinical samples to validate these findings and explore potential therapeutic targets. Such research efforts would provide new insights into cancer prognosis assessment and treatment.

5 Conclusions

Through a detailed analysis of prostate TAECs subsets, each matched to the differentiation status of normal epithelial cells, we identified tumor characteristics that are unique to or shared among them. Enrichment analysis in epithelial cells at different stages of differentiation revealed key pathways associated with the three TAECs subsets, providing valuable clues to the development of prostate cancer. This study expands our understanding of the prostate cancer microenvironment at the cellular level and lays the groundwork for the discovery of new targets for prostate cancer treatment. The results of the immune infiltration and pan-cancer analyses highlight the critical role of YY1 and ATF3 in prostate cancer and correlate their aberrant expression with patient survival. Analysis of the expression and prognostic value of these transcription factors in other cancers revealed their potential roles in different cancer types, providing new perspectives for cross-cancer research. Taken together, our study provides insights into the abnormal expression characteristics of TAECs in prostate cancer across different differentiation backgrounds.

6 Limitations

  1. Limited sample source: The small number of samples may not fully represent the tumor heterogeneity of the entire patient population. The results may therefore not generalize to a broader patient population, which limits the clinical application value of the findings.

  2. Insufficient validation: The conclusions of the study have not been sufficiently validated through pathological sections or in vivo and in vitro experiments. Further research is needed to strengthen the credibility of the conclusions and provide a more solid scientific basis for the diagnosis, treatment, and prognosis assessment of prostate cancer.