Introduction

Breast cancer (BC) is the most common malignancy in women and the second leading cause of cancer-related deaths in developed and industrialized countries. In 2020, there were 2.3 million new cases and 685,000 deaths worldwide1. BC is a process characterized by uncontrolled cell proliferation in the cells of the mammary glands, in which mutations are progressively acquired, generating genomic instability2.

The two most common types of BC by origin are ductal and lobular3. However, great morphological and biological heterogeneity generates Intrinsic Subtypes of Breast Cancer (ISBC) such as Luminal A and Luminal B, Basal-like, low in claudin, over-expressed with HER2, triple-negative, and normal type4. Intrinsic surrogate subtypes are also classified into triple-negative, over-expressed with HER2, Luminal B as HER2+, Luminal B as HER2−, and Luminal A4. Intrinsic surrogate subtypes are also classified into triple-negative, over-expressed with HER2, Luminal B as HER2+, Luminal B as HER2−, and Luminal A. This heterogeneity, in turn, generates differences in clinical behavior, pathological characteristics, and treatment responses, among others5,6,7.

An essential issue for treatment is the selection of the correct therapeutic modality, which largely depends on the Subtype of Breast Cancer (SBC)8. This is crucial because when BC is diagnosed early and managed with a multidisciplinary approach, it becomes a potentially curable disease9. To date, the estrogen receptor (ER), the progesterone receptor (PR), and the human epidermal growth factor receptor 2 (HER2) are utilized as biomarkers to streamline clinical decision-making, providing prognostic information, and predict responses to targeted therapies10. Currently, BRCA1, BRCA2, and TP53 genes; STAT3, ESR1, and MUC-1 proteins; miRNAs (26b-5p, 124-3p, and 201-5p); and lncRNAs (HOTAIR, Linc-ROR, TROJAN), have been described as possible targets for treatment or diagnosis for BC11,12,13,14,15,16,17.

In this regard, several works aimed at identifying groups of genes, proteins, or consensus regions using different bioinformatics tools in diseases such as asthma, kidney, and colorectal cancer have been described in BC18,19,20. Therefore, identifying molecular targets can lead to a more accurate diagnosis of the type of BC, precisely targeted treatments, and potentially treatments that are less toxic than the current ones21. In this regard, it is possible to highlight some proteins as potential molecular targets, like Transmembrane Protein 170B (TMEM170B)22, Lipocalin 2 (LCN2)23, Macrophage Migration Inhibitory Factor (MIF)24, Signal Transducer and Activator of Transcription 3 (STAT3)25, Tumor Protein P5326, Vascular Endothelial Growth Factor A (VEGFA)27, Estrogen Receptor 1(ESR1)28, BCL2 Associated X (BAX)29, Ribonucleotide Reductase Regulatory Subunit M2 (RRM2)30, Matrix Metallopeptidase 1 (MMP1)31, and Maternal Embryonic Leucine Zipper Kinase (MELK), among others32. However, some of these molecules are biased to certain types of BC for a specific signal or were identified from a small biological sample33.

Nevertheless, it is essential to identify biomarkers according to each SBC to detect with high precision intra-tumor and inter-tumor heterogeneity in patients34, prognostic biomarkers35,36 that allow for understanding the degree of disease progression, as well as identifying possible specific therapeutic targets for each cancer subtype.

The amount of information on platforms such as The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC), or cBioPortal for Cancer Genomics makes information on gene and protein expression from cell lines, as well as tissue samples from different diseases, including breast cancer, increasingly accessible. Therefore, computational approaches are required to analyze the large amount of data37, such as the Weighted Gene Coexpression Network Analysis (WGCNA). This algorithm allows the identification of genes with similar coexpression profiles from different experiments or samples. Indeed, this algorithm has been implemented to identify potential biomarkers for diagnosis, prognosis, and therapeutic targets in diseases such as asthma, coronary artery disease, and different cancer types, like BC18,38,39,40,41.

Therefore, understanding the heterogeneity of BC from the gene expression profiles of cell lines and being able to identify genes expressed specifically for each of the main SBC would allow us to identify potential biomarkers that would improve our ability to diagnose this disease with greater precision, to understand its degree of progression, and even to identify specific therapeutic targets for each SBC that would enable us to determine the most appropriate treatments for patients42.

In this work, we identified 5 modules of genes with similar co-expression patterns and hub genes significantly associated with each of the Basal A, Basal B, Luminal A, Luminal B, and HER2 ampl SBC. By using the transcriptomic profiles of 49 breast cancer cell lines, through WGCNA and other bioinformatic methods. Our results provide novel possible biomarkers for specific SBC, which need to be explored for their potential.

Materials and methods

Datasets

The workflow chart of data preparation, processing, and analysis is illustrated in Fig. 1. The dataset used in this study (CCLE_expression) was downloaded from the DepMap, database version 21Q1 (https://depmap.org). This database contains gene expression data (RNAseq) for 1,376 cell lines (RSEM, gene) of 57,829 genes43, of which 49 cell lines related to BC were selected (Supplementary Table S1). Then, the data were normalized by using log2(TPM + 1) and filtered using the varFilter function in R, to exclude genes that exhibit less than 50% variation among samples, leaving a total of 28,143 expressed genes with greater variability for the subsequent steps.

Figure 1
figure 1

Workflow diagram outlining the proposed method for identifying modules and key genes associated with breast cancer subtypes using gene co-expression networks. (a) Initially, gene expression data from breast cancer cell lines were retrieved from the DepMap database version 21Q1. (b) A comprehensive analysis was then conducted on 49 breast cancer cell lines to obtain specific details about gene expression in these cell lines. (c) Next, a gene co-expression network was reconstructed based on the processed expression profiles using the WGCNA package in the R programming environment, resulting in 65 modules. (d) Subsequently, the module with the highest correlation for each breast cancer subtype (pink, turquoise, yellowgreen, skyblue, and navajowhite2) was selected to construct a correlation network consisting of 50 highly correlated genes from each chosen module. The software used for this process was Cytoscape. (e) Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)48,49,50 were then utilized to perform functional analyses aiming to understand the biological functions and metabolic pathways associated with the genes in the module. (f) The top five genes with the highest intramodule connectivity were identified. (g) Finally, the hub genes within each module were identified for each subtype of breast cancer.

Cell lines were classified according to the following features: by origin, 23 primary and 26 metastatic; and by subtypes: 15 Basal A, 8 Basal B, 12 Luminal A, 1 Luminal B, and 13 HER2 ampl (Supplementary Table S1)4,44.

Weighted analysis of gene co-expression network

WGCNA45 was performed to evaluate the expression profile of the 49 cell lines and 28,143 genes of BC. To this end, a cluster analysis was conducted to identify atypical samples using the flashclust package in the R environment (Fig. 2a). Then, the biological network's scale-free topology property was incorporated by calculating the β parameter using SoftThreshold with a value of 14. An adjacency matrix was calculated using signal correlation networks and the Pearson correlation coefficient among all the genes. This adjacency matrix was transformed into a Topological Overlap Matrix (TOM), where a higher TOM value facilitates the identification of gene modules for each gene pair with strong interconnectivity. Afterwards, gene modules with similar expression patterns were identified using the average linkage hierarchical clustering algorithm, with default parameters (deepSplit set to 3 and a minimum module size set to 30). Additionally, modules with highly correlated eigengenes were merged based on a minimum height of 0.25 (mergeCutHeight function)46. Each module was identified with a different color, with the gray color reserved for uncorrelated genes47.

Figure 2
figure 2

Weighted gene co-expression network analysis. (a) Clustering analysis to remove outliers. Sample dendrogram and trait indicator. Each color circle represents an SBC, five intrinsic subtypes, and six surrogate intrinsic subtype. (b) Network analysis of gene expression in BC identifying distinct modules of co-expression genes. (c) Correlations between module eigengenes and different SBCs with only the most significant module. The most correlated modules are shown for each SBC.

Enrichment analysis of all the modules that make up the network was performed to identify and interpret the biological functions based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database48,49,50 and Gene Ontology (GO), using the Annotation Visualization and Integrated Discovery (DAVID). All analyses were performed under the criteria of an adjusted p value of < 0.05.

To determine the association between each SBC and the modules, the module-trait relationship (MTR) was employed46,47. This approach was used to correlate the gene traits with the sample traits in each SBC, focusing on the most significant correlations.

Identification of hub genes

In this study, we selected the top 50 genes with highly correlated values across different thresholds from each SBC in the modules, as detailed in Supplementary Table S2. This selection was made through the module of high module membership (MM), high genetic significance (GS), and high intramodular connectivity (IC). Afterward, we identified the five hub genes from each subtype, characterized by the highest intramodule connectivity, which is calculated as the sum of the degree of correlations between that gene and other genes within the module.

Functional annotation

To evaluate the significance of the biological function of genes in each SBC, the Database for Annotation, Visualization, and Integrated Discovery (DAVID; version 6.8; https://david.ncifcrf.gov) was used. For this purpose, an enrichment analysis of Gene Ontology terms and KEGG pathways was implemented and displayed using the ggplot2 package in R, including molecular functions (MF), biological processes (BP), and cellular components (CC) with a statistical significance at an adjusted p value of < 0.05. Protein–protein interaction (PPI) networks were built using STRING software (version 10.5; http://string-db.org), with the interaction parameter set to confidence maximum of 0.90. Finally, to visualize and analyze the PPI network and the correlation network, Cytoscape software (version 3.6.1; http://www.cytoscape.org) was used.

Validation of hub genes in an external dataset

To confirm hub shared genes in each SBC, the mRNA expression dataset of breast cancer was searched using the keywords: “breast cancer samples”, “TNBC”, “Luminal A”, “Luminal B”, “Her2”, “Homo sapiens”, “expression profiling by array” against the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo). After a review, the GSE65194 (Affymetrix Human Genome U133 Plus 2.0 Array), GSE96860, GSE52194, and GSE134359 profiles, were selected. We conducted differentially expressed genes (DEG) analysis on the dataset using the R package “limma”. An adjusted p value < 0.05 was considered as the cutoff value, where our main hub genes and long non-coding RNAs were found for each SBC.

Results and discussion

Weighted analysis of gene co-expression network

To evaluate the gene expression of BC, RNA seq data from 49 cell lines comprising 28,143 genes were selected (Fig. 2a; Supplementary Table S3). These samples were used to construct a Gene Co-expression Network using the WGCNA software, with a Soft Threshold power β of 14 as the scale-free topology criterion, a signed network that allows for the identifying modules with more significant enrichment of functional groups, and Pearson correlation (Supplementary Fig. 1a,b)45. After, the network was processed by hierarchical clustering, identifying 65 modules (Fig. 2b); that is, a set of genes with similar expression patterns. Then, the eigengene modules were clustered to improve the reliability of the module divisions (Supplementary Fig. S2). The clustering algorithm was used to identify modules of highly correlated genes and was refined to improve the reliability of the module divisions, where each module was assigned a color, with the smallest containing 34 genes (yellow4) and the largest one containing 2297 genes (turquoise).

The modules were found to include genes previously associated with BC, such as TP53, which plays a key role in controlling cell division and cell death; KI67, related to cellular proliferation; BRCA1, BRCA2, which are hallmarks for hereditary BC32; and specific targets that coincide with the different types of BC, such as BCL2L14, which is apoptosis facilitator51, the tumor necrosis factor receptor superfamily member 1A (TNFRSF1A)25, and a potassium inwardly-rectifying channel subfamily J member 3 (KCNJ3)52, among others. Therefore, the role of these genes has already been demonstrated in BC, which supports the representativeness of the genes in our sample.

Significant modules associated to SBCs

A Pearson correlation analysis was conducted to identify the most significant modules associated with each SBC. Below, we describe the most significant modules (Fig. 3, Supplementary Fig. S3). For Basal A, the pink module, with 879 genes (r = 0.77, p value = 7e−11) was found to be as significant. For Basal B, the turquoise module was identified with 2297 genes (r = 0.66, p = 2e−7); in the Luminal A subtype, the yellowgreen module, consisting of 131 genes (r = 0.59, p = 8e−6), was noted. For Luminal B, the skyblue module includes 169 genes with (r = 0.39 and p = 0.009). Finally, the navajowhite2 module representing HER2 ampl with 70 genes, was identified, showing r = 0.63 and p = 1e−5.

Figure 3
figure 3

Basal A breast cancer subtype correlation network. The circles in the network represent different biological functions, each color-coded for easy identification. Dark blue circles represent catalytic activity, pink circles represent ATP-dependent activity, and brown circles represent calcium-binding proteins. Purple circles indicate binding functions, while yellow circles are involved in antigen processing to generate class I binding peptides. Red circles represent defense or immunity proteins, light blue circles represent signaling functions, and green circles represent transported activity. Lastly, orange circles represent the regulation of cell proliferation and apoptosis. The diamond shapes in the network represent hub genes, which are genes that have a high degree of connectivity in the network and play a crucial role in the biological processes. The size of each node (circle or diamond) corresponds to the number of connections or the degree of involvement in biological processes. Larger nodes have more connections, indicating a higher degree of involvement in various biological processes.

To associate the biological processes with the pink, turquoise, yellowgreen, skyblue, and navajowhite2 modules, which exhibit high correlation values, a Functional Enrichment of Gene Ontology Analysis with DAVID was achieved. In this regard, for Basal A, we identified 171 genes mainly involved in defense response (GO:0006952) with an adjusted p value of 2.60e−33 (Supplementary Table S3), regulating the body’s immune response, and in the body’s ability to contend against cancer cells which play a critical role throughout the course of breast carcinogenesis53,54. In breast cancer, specifically in the Basal A subtype, it has been shown that the immune system plays a dual role in tumor initiation and progression. This system can both inhibit and promote tumor expansion. Cytokinesas TGFβ, INFγ, and TNFα have been involved in both functions, by inducing different expression profiles in tumor cells55. Additionally, immune-related proteins that are overexpressed in this subtype of breast cancer have been associated with invasion and metastasis, as is the case with CD5856. Furthermore, the differences in the mutation profiles among subtypes underscore the complexity of this disease and the need for specific therapeutic approaches. In the Basal A subtype of breast cancer, a predominance of mutations in the TP53 gene has been observed in 80% of cases. This high frequency is associated with increased invasion, migration, and cellular resistance, thereby contributing to tumor aggressiveness and a poor prognosis. This pattern suggests that alterations in TP53 play a fundamental role in the pathogenesis of this subtype57. For Basal B, we found 370 genes involved in the movement of cell or subcellular components (GO:0006928) with an adjusted p value of 1.70e−24 (Supplementary Table S3). This movement is essential for many body functions, including tissue growth and repair, migration of immune cells to sites of inflammation, and cancer cells' invasion into surrounding tissues58. This is also consistent with work that has documented that patients with the Basal B subtype of breast cancer have a poor prognosis and earlier relapses with metastases. Moreover, metastases occur mainly in the brain, followed by the lungs and distal lymph nodes, suggesting that there are metastatic signatures for each organ59. Furthermore, significant overexpression of genes related to migration and invasion has been previously observed, such as EGFR, MMP13, SOX4, and IGFBP260,61. Similarly, in the Basal B subtype, the prominence of mutations in TP53 is maintained, followed by alterations in PIK3CA, NF1, and FBXW7 in primary tumors. The mutations extend to ESR1, BRCA2, RB1, ERBB2, and AKT1 in metastatic cases, revealing a genetic complexity that affects treatment efficacy. This mutational diversity indicates the intrinsic heterogeneity of breast cancer tumors and the importance of identifying specific biomarkers to improve therapeutic approaches62.

For Luminal A, 16 genes were involved in the plasma membrane (GO:0044459) with an adjusted p value of 0.19 (Supplementary Table S4), regulating how cells attach to surrounding tissues and other blood vessels. These genes also control the migration and invasion of cancer cells, essential for tumor growth and cancer spread. They regulate how cells respond to cancer treatments, affecting the effectiveness of therapy63. Furthermore, the Luminal A subtype, as reported by Padua et al.64 observed that in the Luminal subtype, there was an effect on the expression of genes associated with the plasma membrane and the composition of the extracellular region, as well as the expression of the ERα receptor. This alteration may explain the characteristics of more aggressive cell growth. ERα is a crucial molecule in tumor development and progression due to its ability to activate various signaling pathways, whether cytoplasmic or nuclear, which are present, among other places, in the plasma membrane65. Additionally, this subtype shows low expression of genes related to cell proliferation66. On the other hand, in the Luminal A subtype, PIK3CA emerges as the most frequently mutated gene, followed by GATA3, MAP3K1, and TP53, reflecting a distinctive mutational profile that could influence therapeutic decisions67. This pattern suggests thatESR1, GATA3, KMT2C, and PTEN also play significant roles in cancer progression and treatment response68,69.

For Luminal B, 19 genes were associated with the extracellular space (GO:0005615) with an adjusted p value of 0.14 (Supplementary Table S4), which are important in cancer because they regulate how cancer cells interact with their immediate environment. These genes can control the production of proteins that modulate the adhesion of cancer cells to surrounding tissues and other blood vessels, essential for tumor growth and spread. In addition, these genes can also control the production of proteins that promote angiogenesis, and regulate how cells respond to cancer treatments, affecting the effectiveness of therapy70. The composition and three-dimensional structure of the ECM vary significantly, affecting processes such as apoptosis and cell proliferation. Furthermore, changes in the ECM have been linked to resistance to endocrine treatments and cancer recurrence, demonstrating the complexity of the interaction between tumor cells and their microenvironment71. Additionally, in the Luminal B subtype, a relationship has been established between the heterogeneity of breast cancer tumors and the components of the ECM, whose composition and three-dimensional structure undergo constant remodeling through closely regulated mechanisms during tumorigenesis. It has been evidenced that the ECM of different SBCs, such as Luminal and TNBC, exhibit disparities in the transcriptomic profile, especially in genes that influence cell behavior, such as apoptosis, the cell cycle, DNA replication, and the estrogen signaling pathway, including E-cadherin, vimentin, and ERα. Similarly, a correlation has been established between changes in ECM composition, endocrine resistance, and relapses72.

These subtypes also present a high expression rate of genes associated with proliferation, such as MKI67 and AURKA73. The Luminal B subtype is characterized by a high frequency of mutations in PIK3CA, GATA3, and TP53, with a marker difference in the prevalence of P53 mutations compared to Luminal A62. This subtype also illustrates how the breast cancer genome can evolve during therapy, with 20% of patients acquiring mutations in ESR1, leading to resistance to endocrine therapy. This phenomenon underscores the importance of understanding mutational dynamics to address treatment resistance74. Finally, for HER2 ampl, four genes with coenzyme binding function were found (GO: 0050662) with an adjusted p value of 0.36 (Supplementary Table S5). These genes can control the production of enzymes involved in metabolic pathways, such as the pentose phosphate pathway, essential for the biosynthesis of molecules needed for cell growth and division75. In addition, Moasser et al.76 highlights the overexpression of the HER2 gene and its implication in an aggressive tumor phenotype through the activation of signaling pathways such as RAS/MAPK and PI3K/AKT. This subtype shows a clear association between the enzymatic function of involved proteins and a higher propensity for cell proliferation and survival of the HER2-positive subtype.

Approximately 20% of cases exhibit amplification or overexpression of HER2, associated with an aggressive tumor phenotype and poor survival. HER2 encodes for a transmembrane tyrosine kinase receptor, playing a crucial role in cell–cell and cell–stroma communication through signal transduction. Many proteins and intermediaries are involved in these pathways exhibiting enzymatic activity. The main signaling pathways involved, RAS/MAPK, PI3K/AKT, and JAK/STAT, among others, affect cell proliferation, survival, mobility, and adhesion77. In this subtype, gene enrichment is primarily related to coenzyme-binding function76. Furthermore, this subtype has the second-highest prevalence of P53 mutations (70%), making it a significant biomarker and therapeutic target78. Additionally, three mutations (Q429R, Q429H, and T798M) in HER2, identified in the MCF7 cell line, have been shown to diminish the effectiveness of trastuzumab in this cell line, and are also postulated as potential biomarkers for predicting resistance to therapy79. These results demonstrated that the SBCs, Basal A, Basal B, Luminal A, Luminal B, and HER2 ampl, exhibit different co-expression patterns, thereby implicating divergent biological processes. This distinction could facilitate the identification of potential biomarkers unique to each subtype.

In Basal A, for metabolic pathways analysis, 21 genes are involved in the phagosome pathway with an adjusted p value of 2.4e−5, and 23 genes are involved in Epstein–Barr virus infection with an adjusted p value of 8.0e−5. A meta-analysis indicates a strong statistical relationship between Epstein–Barr virus infection (EBV) and BC risk, suggesting a potential role for EBV infection in the development of BC80. Gupta et al.81 reported that the presence of EBV in triple-negative breast cancer poses a high risk for patients. Additionally, there is a potential association between serous epithelial ovarian cancer (EOC) and EBV infection, which could be subtype-specific82. For the Basal B subtype, 52 genes participate in proteoglycans in cancer with an adjusted p value of 3.1e−7 (Supplementary Table S6). Proteoglycans contribute to the development and progression of the disease and its response to treatment by potentially affecting tumor growth, invasion, and metastasis through interactions with other extracellular matrix molecules and the regulation of signaling pathways83. These results illustrate that the genes in the pink module, highly correlated with the Basal A subtype, differ from those in the turquoise module, highly correlated with the Basal B subtype, involving them in distinct biological processes and molecular functions.

Correlation network and identification of hub genes

Hub genes are characterized by being genes whose degree value is higher than the average of the threshold network degree value84. Many studies have shown that the relationship between connectivity and node significance carries important biological information47. In this respect, to identify hub genes for each SBC, we obtained the top 50 highly degree genes for pink, turquoise, yellowgreen, skyblue, and navajowhite2 modules, and their networks were displayed (Supplementary Tables S7, S8). The results show that the Basal A network is composed of 50 nodes and 75 edges with a threshold of 0.05. The gene with the highest degree, corresponding to the main hub gene, is IFIT3 (Interferon-induced protein with tetratricopeptide repeats 3) with 16 degrees, followed by PSMB9 (Proteasome 20S Subunit Beta 9) with 12 degrees, anxa8l2 (annexin A8-like 2), ANXA8 (annexin A8), and ANXA8L1 (annexin A8-like 1) with 7 degrees each one (Fig. 3, Supplementary Table S9). In this regard, IFIT3 encodes for a transcription factor involved in the negative regulation of apoptotic processes and the negative regulation of cell proliferation85,86, indicating its potential role in cancer development and progression.

It inhibits proliferation in human BC cells (treated with T-47D curcumin)87, which is consistent with the biological process identified for the pink module. Moreover, ifit3 is considered a predictive biomarker for chemotherapy and radiation in some human cancers, such as breast, lung, prostate, neck, head cancer, and glioma86,87, indicating its potential utility across multiple cancer types. However, this is the first time that it has been identified as a potential biomarker associated with the Basal A subtype. Recently, Lamsal et al.88 demonstrated that IFIT3 mRNA and its protein are highly expressed in human metastatic breast cancer cell line MDA-MB-231, while its expression is low in non-metastatic MDA-MB-453 cell line. The next hub gene, PSMB9, encodes a modified enzyme with peptidase activity involved in the proteasome pathway, which helps cells deal with oxidative and proteotoxic stress. The overexpression of proteasomes is present in a wide variety of cancer types89, further highlighting the importance of these genes in cancer progression and treatment response. Proteasomes play a key role in the stress response because they degrade damaged proteins90. PSMB9 is also a potential diagnostic biomarker and tumor suppressor for human uterine leiomyosarcoma91. ANXA8L2 Protein 2, ANXA8 protein (annexin A8), and ANXA8L1 protein 1 are members of the annexin family of evolutionarily conserved Ca2+ and phospholipid-binding proteins; their overexpression has been associated with acute myelocytic leukemia32. In addition, ANXA8L1 participates in biological regulation, cellular process, localization, metabolic process, calcium-binding protein, response to wounding, and endomembrane system organization92,93. Interestingly, ANXA8 expression is directly related to a subgroup of basal-type breast cancer patients with poor prognosis94, which correlates with the association identified between these three hub genes and the Basal A subtype in this work.

In the case of the Basal B subtype, the network consists of 50 nodes and 57 edges with a threshold of 0.0670. The main hub gene is the transcription factor ets1 (23 degrees). The remaining four hub genes are ADAMTS6 with 6 degrees, MIR100HG with 6 degrees, PTGFR with 6 degrees, and CAV1 with 5 degrees (Fig. 4, and Supplementary Table S10). ETS1 belongs to the ETS family of transcription factors and is overexpressed in some types of cancer cells, such as prostate cancer and leukemia95. It participates in cancer pathways like angiogenesis, PDGF, VEGF, and RAS signaling, cellular senescence, human T-cell leukemia virus infection, and renal cell carcinoma. The biological processes of ETS1 include extracellular structure organization, response to wounding, blood vessel development, negative regulation of cell proliferation, regulation of cell death, DNA-binding, and transcription regulator activity92,93.

Figure 4
figure 4

Basal B breast cancer subtype correlation network. The circles in the network symbolize different biological functions, each distinguished by a unique color. Dark blue circles represent catalytic activity, pink circles represent transcriptional regulator activity, and brown circles symbolize molecular transduction activity. Light blue circles indicate binding functions, while yellow circles are associated with molecular adaptor activity. Red circles represent gene-specific transcriptional regulators, purple circles indicate transporter activity, green circles represent modifying enzymes, and light green circles indicate regulators of cell proliferation. The diamond shapes in the network represent hub genes. These are genes that have a high degree of connectivity within the network and play a significant role in the biological processes. The size of each node, whether a circle or diamond, corresponds to the number of connection degrees involved in biological processes. Larger nodes have more connections, indicating a higher degree of involvement in various biological processes.

It was detected as a potential prognostic biomarker in gastric cancer96. Recently, ETS1 was also described as a specific prognostic biomarker for TNBC patients, but not for patients who are not triple-negative or other patients with breast cancer. In addition, a very close association with invasiveness, metastasis, and a poor prognosis was determined in TNBC patients, which is directly related to the overexpression it induces of the genes that code for the metalloproteinases MMP1, MMP-3, and MMP9, as well as VEGF among others97. This result agrees with the one obtained in this work where the hub ETS1 gene is directly associated with the Basal B subtype. ADAMTS6 has catalytic activity and is involved in cellular processes, protein-modifying enzymes, extracellular structure organization, and blood vessel development. It contains a disintegrin and metalloprotease domains with thrombospondins motifs, members of this family can be secreted by cancer and stromal cells and can contribute to modifying the tumor microenvironment by multiple mechanisms, such as cell adhesion, migration, proliferation, and angiogenesis98. Additionally, ADAMTS6 suppresses tumor progression via the ERK signaling pathway and is a prognostic marker in human BC99.

The next hub gen is MIR100HG, a long non-coding RNA (lncRNAs). It acts as a regulator of cell proliferation and has been reported as a prognostic biomarker for gastric cancer100. LncRNAs are endogenous in various cancers, including BC101, and are involved in different biological processes like migration, metastasis, and proliferation17. PTGFR is a tumor suppressor gene involved in a tumor promotion process, perhaps via interaction with its specific ligand (PGF2α). It has been described as a potential biomarker to predict prostate cancer progression, particularly between stage II and subsequent stages of the disease102. CAV1 is a membrane protein with a crucial role in the advancement of BC, including cell proliferation, apoptosis, autophagy, invasion, migration, and metastasis103. CAV1 has been identified as a biomarker of colorectal, lung, penile, prostate, and non-small cell lung cancer104,105,106,107,108. Interestingly Al et al. found genes identified from samples of patients with BC of basal subtype to IFI44L (Basal A), PLAU, and IL6 (Basal B), which are in the top 50 that we identified of these subtypes109.

For subtype Luminal A, we identified a network with 50 nodes and 264 edges with a threshold of 0.0445. The gene with the highest degree and the main hub gene is the transcriptional regulator ENSG00000259723 with 32 degrees, followed by ENSG00000266934 with 30 degrees, ENSG00000236055 with 29 degrees, CA4 with 25 degrees, and AL354813.1 with 24 degrees (Fig. 5, Supplementary Table S11). ENSG00000259723 is a lncRNA expressed in seven cancers, such as bladder cancer, colon adenocarcinoma, head and neck squamous cell carcinoma, lung squamous cell carcinoma, ovarian cancer, prostate adenocarcinoma, and endometrial carcinoma of the uterine body110,111. ENSG00000266934 is present in 31 species, such as Gorilla gorilla gorilla, Carlito syrichta, Pteropus vampyrus, Dasypus novemcinctus112; and ENSG00000236055, located on chromosome 17, and 1 respectively. CA4 is a carbonic anhydrase implicated in cell proliferation, and inhibits cell proliferation, invasion, and metastasis32. CA4 has been reported as a clinicopathological biomarker of immune infiltration and outcomes in kidney carcinoma, lower-grade glioma, pulmonary adenocarcinoma, and uveal melanoma113, and identified as a biomarker for gastric cancer114. AL354813.1 is a lncRNA located on chromosome 20 and has one transcript with 2437 bp110. Recent research has shown that the amount of IncRNA in cancer differs in normal tissues, and there are many different prognostic-related expressions. In other types of cancer, lncRNAs have been identified as prognostic markers115,116. It was found that lncROPM plays a crucial role in sustaining and has functioned as a chemo-resistance biomarker in BC cancer stem cells117. Although many lncRNAs have been identified in BC, their functional mechanism remains unknown. Interestingly, we found not only hub genes that encode transcription factors but also lncRNAs, as has also been reported118,119.

Figure 5
figure 5

Representation of a correlation network for the Luminal A subtype of breast cancer. Visualising the relationships between different biological functions and hub genes. The circles in the network represent different biological functions, each distinguished by a unique color. Dark blue circles denote catalytic activity, pink circles represent transcriptional regulator activity, and brown circles symbolize molecular transduction activity. Light blue circles indicate binding functions, while orange circles are associated with metabolite interconversion enzyme activity. Red circles represent transporter activity, purple circles indicate transmembrane signal receptor activity, yellow circles indicate cytoskeletal protein functions, and gray circles represent protein-binding activity modulators. The diamond shapes in the network represent hub genes, which are genes with a high degree of connectivity within the network, indicating their important role in the biological processes. The size of each node, whether a circle or diamond, corresponds to the number of connection degrees involved in biological processes. Larger nodes have more connections, indicating a higher degree of involvement in various biological processes.

The network for the subtype luminal B is composed of 50 nodes and 265 edges, with a threshold of 0.043. The gene with the highest degree is AL033519.3 with 36 degrees, MS4A15 with 32 degrees with signal transduction activity, WASIR1 with 27 degrees, and PAMR1 with 25 degrees, and TTR with 25 degrees (Fig. 6, Supplementary Table S12). AL033519.3 is a hypothetical gene located on chromosome 6 and comprises 771 nucleotides. Currently, there are s no studies linking these genes to tumorigenesis or its progression. MS4A15 is involved in cellular processes, response to stimulus, and signaling32. It regulates the cellular metabolism involved in the malignant transformation of numerous tumors, including ovarian cancer120. Its overexpression correlates with poor survival in patients with ovarian cancer, making it a promising therapeutic target. MS4A15 has also been suggested as a possible biomarker for colon and gastric cancer121. WASIR1 is a lncRNA that appears to interact with LINC00511, which is overexpressed in BC and promotes its progression122. PAMR1 is predicted to be involved in proteolysis32. It is inactivated by promoter hypermethylation in BC, therefore, it has been considered a tumor suppressor123 and a potential biomarker that has a negative correlation to the spread of cervical cancer124. TTR plays a role in stimulating tumor growth mediated by activation of mitogenic and oncogenic molecules, as well as immune and endothelial cells, specifically in lung cancer. It has a significant effect on the range of functions of endothelial cells, managing both tumor and immune cell migration and infiltration. When endothelial cells are treated with TTR, there’s a suppression of T cell proliferation125 linking it to disease pathologies and marking it as Oxidative Stress Biomarker126. Additionally, the inhibition of TTR, associated with conditions such as malnutrition and inflammation, is recognized as a biomarker for various human morbidities127.

Figure 6
figure 6

Correlation network for the Luminal B subtype of breast cancer. The circles in the network symbolize different biological functions, each distinguished by a unique color. Dark blue circles denote catalytic activity, pink circles represent proliferation inhibitor activity, and brown circles symbolize disease-susceptibility for panbronchiolitis. Light blue circles indicate binding functions, while orange circles are associated with molecular function regulator activity. Red circles represent molecular transducer activity, green circles denote transporter activity, and gray circles represent signal transduction functions. The diamond shapes in the network represent hub genes. These are genes that have a high degree of connectivity within the network, indicating their significant role in the biological processes. The size of each node, whether a circle or diamond, corresponds to the number of connection degrees involved in biological processes. Larger nodes have more connections, indicating a higher degree of involvement in various biological processes.

Finally, the network for the subtype HER2 ampl consists of 50 nodes and 162 edges with a threshold of 0.0211. The gene with the highest grade and considered as the main hub is TMEM86A with 32 degrees, linked to ichthyosis disease; followed by four important genes in this network which are CAPN9 with 25 degrees, ERVE-1 with 21 degrees, SCGB2A2 with 14 degrees, and RIPOR3 with 13 degrees (Fig. 7, Supplementary Table S13). TMEM86A is a lysoplasmalogenase that may be important for regulating keratinocyte membrane properties in terminal differentiation cells128 proposed as a candidate for tumor surface antigen in HER2 ampl BC, which coincides with our results129. CAPN9, the calpains are intracellular cysteine proteases that function in various important cellular processes, including signaling, motility, apoptosis, and cellular survival130. Down-regulated CAPN9 is related to poorer patient clinical outcomes following endocrine therapy in BC131. ERVE-1 is an endogenous retroviral sequence from the family ERVs (endogenous retroviral sequence) detected in several cancers, while they remain silent in healthy tissues132. However, in BC, the expression of HERV-K (human endogenous retrovirus type K) is frequent, and HERV-K (HML-2) antibodies and mRNA tended to be higher in BC patients with a primary tumor, who later will develop the metastatic disease than in patients who did not develop cancer metastasis133. It is important to note that most of the HER2 ampl cell lines that used for this analysis are metastatic. SCGB2A2 encodes a glycoprotein indicated as a potential diagnostic and prognostic marker for BC134. It is involved in biological regulation, cellular processes, response to stimulus, and signaling. A genetic rearrangement has been detected in breast tumors that overexpress SCGB2A2, suggesting changes in transcriptional regulation as a cause of overexpression135. Its expression does not seem to be affected by steroid hormones135,136. SCGB2A2 is a biomarker for the early detection of bone marrow micrometastasis134. A downregulation of RIPOR3 in tongue cancer is associated with a worse prognosis. Its expression is related to the modulation of immune pathway function and hypermethylation, making it a promising prognostic biomarker and is linked to the immune cell infiltration of cell carcinoma of the mobile tongue137. The functions and roles of the hub genes described above for each SBC underscore their significance in breast cancer or other cancer types. Our results demonstrate that each SBC evaluated (Basal A, Basal B, Luminal A, Luminal B, and Her2 ampl) presents a different gene expression pattern, identifying highly correlated hub genes for each subtype. Moreover, some of these genes are involved in similar biological and cellular processes including proliferation, invasion, metastasis, and immune response, among others. Therefore, these hub genes may be potential diagnostic biomarkers for each SBC.

Figure 7
figure 7

Visualisation of a correlation network for the HER2 ampl subtype of breast cancer. The circles in the network represent different biological functions, each distinguished by a unique color. Dark blue circles denote catalytic activity, pink circles represent ATP-dependent activity, and brown circles symbolize biological regulation. Purple circles indicate binding functions, while yellow circles are associated with multicellular organismal processes. Red circles represent functions linked to autosomal recessive congenital ichthyosis, and light blue circles denote signaling functions. The diamond shapes in the network represent hub genes, which have a high degree of connectivity within the network, indicating their important role in biological processes. The size of each node, whether a circle or diamond, corresponds to the number of connection degrees involved in biological processes. Larger nodes have more connections, indicating a higher degree of involvement in various biological processes. This image is a valuable tool for visualizing complex biological interactions and understanding the role of different functions and genes in HER2 ampl.

Protein–protein interaction (PPI)-networks for each SBC

The interactions between the proteins encoded by the hub genes of each module can provide insights into the pathways or mechanisms activated in the different SBCs. For this reason, we analyzed the PPIs in the STRING database. The results showed that the PPI-network for Basal A consists of 33 nodes and 93 edges (Supplementary Fig. 4a). The protein most closely connected is ISG15 ubiquitin-like protein, which functions as a cytokine to modulate immune responses such as viral infections (14 degrees). Interestingly, its expression has been relevant in the pathogenesis of BC due to its role in motility, invasion, metastasis, and a poor response to chemotherapy and radiotherapy138,139. Additionally, XAF1 counteracts the inhibitory effect of apoptosis (13 degrees)140, IFIT2 involve positive regulation of the apoptotic process and response to the virus (13 degrees)141, RSAD2 is an interferon-inducible antiviral protein (13 degrees), and OAS2 involved in the innate immune response to viral infection (13 degrees)142. The Basal B network consists of 40 nodes and 28 edges. The most connected protein is CAV1, recognized as a tumor suppressor (10 degrees) with a dual role in BC progression103; CD44, which participates in cell adhesion and tumor metastasis, among others (9 degrees)143 PLAU related to apoptotic pathways in synovial fibroblast (4 degrees)144; PTRF, plays a significant role in caveolae formation, ribosomal transcriptional activity in response to metabolic challenges in the adipocytes, among others (3 degrees); CAV2 involve in signal transduction, lipid metabolism, cellular growth control, and apoptosis (3 degrees)32 (Supplementary Fig. 4b). The PPI network for Luminal A consists of 17 genes and 0 edges (Supplementary Fig. 4c). The Luminal B network includes 29 genes and four edges, with the most connected protein being ANG, which induces angiogenesis (2 degrees)145; and TRH releasing mature thyrotropin, a hypothalamic regulatory hormone (2 degrees)32; CAPN6 may play a role in tumor formation by inhibiting apoptosis and promoting angiogenesis (1 degree)146; TTR transports thyroid hormones and participates in proteolysis, nerve regeneration, autophagy, and glucose homeostasis (1 degree)32,147; ITM2A is a type II membrane protein that may be involved in osteo- and chondrogenic differentiation (1 degree)148 (Supplementary Fig. 4d). The PPI-network for HER2ampl consists of 34 genes and nine edges. The protein which the highest connection is SCGB2A2, involved in the androgen receptor signaling pathway (3 degrees)149; SCGB1D2 a chemotherapeutic agent widely used for prostate cancer (2 degrees)32; BPIFB1 involved in the innate immune response to bacterial exposure in the mouth, nasal cavities, and lungs (2 degrees); SCGB2A1 bind androgens and other steroids, a chemotherapeutic agent used for prostate cancer, under the transcriptional regulation of steroid hormones (2 degrees); CEACAM5 plays a role in cell adhesion, intracellular signaling, and tumor progression(2 degrees)32,150 (Supplementary Fig. 4e).

Hub gene expression in BC patients

We conducted a DEG on the GSE65194, GSE96860, and GSE52194 datasets, revealing that our main hub genes, IFIT3, ETS1, CA4, and TMEM86A, were overexpressed in each SBC with their respective adjusted p values, while MS4A15 only showed significant overexpression in dataset GSE52194 (Supplementary Table S14). Interestingly, IFIT3 has been recognized as a possible biomarker for predicting the response to chemotherapy and radiation in specific human cancers (breast, lung, prostate, head and neck cancer, and glioma)86,151. Additionally, the expression of IFIT3 mRNA could function as a predictive marker for the efficiency of immunostimulant treatments in breast cancer cells152. Recently, ETS1 was identified as a predictive biomarker for TNBC, notably enhancing the progression of TNBC by activating the YAP signaling pathway97, aligning with our findings. CA4 has been associated with the prognosis of lung adenocarcinoma153. Finally, TMEM86A has been suggested as a candidate for a tumor surface antigen in HER2 ampl BC129, and another for differential expression in various cancer types, including liver, colon, and gastric154. A differential gene expression analysis on the GSE134359 dataset, which includes long non-coding RNA (lncRNA) from SBCs, identified ENSG00000259723 and AL033519.3 as the main hub genes for Luminal A and Luminal B, with an adjusted p values of 3.43E−01 and 2.46E−05, respectively. In breast cancer, lncRNA functions as both a promoter and inhibitor of metastasis115. It is important to recognize that research into lncRNAs in breast cancer is still evolving, with more comprehensive studies required to fully grasp their functions and mechanisms. Nevertheless, progress in this area has already demonstrated that lncRNAs significantly impact BC biology. They offer promising potential as therapeutic targets, highlighting the importance of continuing exploration in this domain21.

Ultimately, the search for molecules capable of identifying and distinguishing SBCs, as well as serving as therapeutic targets, remains a critical endeavor in cancer treatment. Correct diagnosis enables the administration of the most effective treatments. Moreover, it is crucial to monitor treatment efficacy, prevent cancer cell resurgence, and avert the development of drug resistance, all to eliminate breast cancer cells. Our research underscores the identification of pivotal genes whose overexpression is specifically correlated with certain SBC. While further validation in diverse patient cohorts of various SBCs is necessary, our findings align with those of Charafe-Jauffret et al.155, who utilized 31 breast cancer cell lines and DNA microarrays to identify, among others, three key genes we also recognized: ANXA8 (Basal A), ETS1 and CAV1 (Basal B), associating them with the basal subtype of breast cancer. Calderon-Gonzalez et al.156 and Stein et al.157 have also shown that ANXA8 is overexpressed in basal subtype cell lines of both human and mouse origin. Regarding ETS1, its association with local metastasis in TNBC patient samples was noted97. Xie et al.99 reported adamts6 (Basal B) as a potential prognostic marker in breast cancer, supported by a multivariate analysis of patient tumor samples and cell lines (Supplementary Table 15).

In the case of the Luminal A subtype, the ca4 gene hub identified in our study has also been detected and proposed as a prognostic biomarker from samples of patients with invasive cancer158. For the Luminal B subtype, none of the gene hubs identified in this work have been proposed as potential biomarkers in other studies. However, Wu et al.159 suggest the capn6 gene as a prognostic marker for BC. While it is not the main gene hub correlating with this SBC in our analysis, it is present in the skyblue module. Similarly, the presence of PPP2R2B, HPD, ADGRG7, ADCY4, and ITM2A supports the correlation of the expression of these genes with the Luminal B subtype (Supplementary Table 15), underscoring the complexity and the nuanced understanding required to navigate the molecular landscape of SBCs. Additionally, these insights highlight the potential of specific genes to serve as biomarkers for BC, emphasizing the need for further research to validate these associations and understand their implications for diagnosis and treatment strategies.

Finally, regarding the gene hubs associated with the HER2 ampl subtype134, SCGB2A2 is predominantly overexpressed in BC patients and shows a high correlation with increased expression and the presence of micrometastases134. The similarities and differences in identifying these gene hubs may stem from the origin of the samples, whether they are cell lines or patient tumor samples. While these sources have been shown to possess great similarities160, they also exhibit certain differences, particularly when compared with samples from advanced-stage tumors161,162. These disparities may also relate to the number of samples used, the types of algorithms applied, and the specific correlation or approach sought in the research. Additionally, this variability underscores the well-documented heterogeneity in breast cancer, highlighting the complexities of diagnosing and treating this multifactorial disease.

Conclusions

Identifying biomarkers that enable the precise diagnosis of specific SBC is crucial for understanding the type of tumor, its prognosis or stage, and selecting the most effective treatment. This study identified a specific module with the highest correlation for each SBC, pinpointing at least one hub gene with the highest correlation for each subtype: Basal A: IFIT3; Basal B: ETS1; Luminal A: ENSG00000259723; Luminal B: AL033519.3; HER2ampl: TMEM86A. These genes are involved in defense response, cell process regulation, and transport for each respective SBC.

Validation analysis of the highly connected hub genes expressed in patients with SBCs, using the external dataset GSE65194, revealed significant differential expression of IFIT3, ETS1, CA4, and TMEM86A. Our findings could provide a theoretical basis for enhancing prognosis and diagnosis through potential SBC biomarkers. Nevertheless, the mechanisms underlying these associations need to be explored in future studies.

This research proposes further analysis of these targets as potential biomarkers. It represents the first study to use a large sample of breast cancer cell lines and WGCNA to identify potential biomarkers associated with SBCs. The development of co-expression networks in each SBC may elucidate possible biomarkers for risk assessment, prognostic determination, screening, differential diagnosis, treatment response prediction, and disease progression monitoring. These hubs could be promising targets in drug development for monitoring or treating SBCs. A significant strength of this study is its focus on SBCs rather than BC in general, offering a more nuanced understanding of subtype-specific biomarkers.