Identification of key genes controlling breast cancer stem cell characteristics via stemness indices analysis
With the gradual unveiling of tumour heterogeneity, cancer stem cells (CSCs) are now being considered the initial component of tumour initiation. However, the mechanisms of the growth and maintenance of breast cancer (BRCA) stem cells are still unknown.
To explore the crucial genes modulating BRCA stemness characteristics, we combined the gene expression value and mRNA expression-based stemness index (mRNAsi) of samples from The Cancer Genome Atlas (TCGA), and the mRNAsi was corrected using the tumour purity (corrected mRNAsi). mRNAsi and corrected mRNAsi were analysed and showed a close relationship with BRCA clinical characteristics, including tumour depth, pathological staging and survival status. Next, weighted gene co-expression network analysis (WGCNA) was applied to distinguish crucial gene modules and key genes. A series of functional analyses and expression validation of key genes were conducted using multiple databases, including Oncomine, Gene Expression Omnibus (GEO) and Gene Expression Profiling Integrative Analysis (GEPIA).
This study found that mRNAsi and corrected mRNAsi scores were higher in BRCA tissues than that in normal tissues, and both of them increased with tumour stage. Higher corrected mRNAsi scores showed worse overall survival outcomes. We screened 3 modules and 32 key genes, and those key genes were found to be strongly correlated with each other. Functional analysis revealed that the key genes were related to cell fate decision events such as the cell cycle, cellular senescence, chromosome segregation and mitotic nuclear division. Among 32 key genes, we identified 12 genes that strongly correlated with BRCA survival.
Thirty-two genes were found to be closely related to BRCA stem cell characteristics; among them, 12 genes showed prognosis-oriented effects in BRCA patients. The most significant signalling pathway related to stemness in BRCA was the cell cycle pathway, which may support new ideas for screening therapeutic targets to inhibit BRCA stem characteristics. These findings may highlight some therapeutic targets for inhibiting BRCA stem cells.
Keywords
Breast cancer; Cancer cell stemness; mRNAsi; WGCNA
Abbreviations
CSCs: Cancer stem cells
mRNAsi: mRNA expression-based stemness index
EREG-mRNAsi: Epigenetically regulated mRNAsi
TCGA: The Cancer Genome Atlas
WGCNA: Weighted gene co-expression network analysis
GEO: Gene Expression Omnibus
GEPIA: Gene Expression Profiling Integrative Analysis
BCSCs: Breast cancer stem cells
IGF: Insulin growth factor
OCLR: One-class logistic regression
TNBC: Triple negative breast cancer
DEGs: Differentially expressed genes
FDR: False discovery rate
TOM: Topological overlap matrix
PCA: Principal component analysis
KEGG: Kyoto Encyclopedia of Genes and Genomes
STRING: Search Tool for the Retrieval of Interacting Genes
Breast cancer is one of the most common and lethal cancers in women. According to the latest cancer statistics, the estimated numbers of new cases and deaths from breast cancer were 268,600 and 41,760, respectively, and breast cancer accounted for nearly 30% of new cases and 15% of deaths among all cancers in females worldwide. The incidence of breast cancer increased slightly from 2006 to 2015, a change attributed to the rising prevalence of obesity and the decrease in parity in women. The most crucial problem in the clinical treatment of breast cancer is that many patients are already at an advanced stage when first diagnosed, because sensitive markers and effective therapies are lacking. Breast cancer development is a complex process involving multiple cellular activities and signalling pathways. Hence, it is critical to understand precisely the molecular mechanisms underlying this complicated malignancy, which could help to discover valuable biomarkers for diagnosis or for predicting clinical outcomes.
As a result of the development of single-cell DNA and RNA sequencing technology, tumour heterogeneity is now broadly understood: the same tumour tissue contains different cell populations, one of which is cancer stem cells (CSCs) [3, 4]. CSCs show a high degree of plasticity, which leads to distinct cellular phenotypes, functions and metabolic features. One reason for this plasticity is that CSCs can switch between quiescent and proliferative states when stimulated under certain conditions. Breast cancer stem cells (BCSCs) were first reported in 2003, and a growing number of studies have revealed that BCSCs are closely related to breast carcinogenesis. In addition, the presence of BCSCs was reported to correlate with survival, metastasis and therapy resistance. Moreover, a growing number of genes have been described to play a role in BCSC regulation in breast tissues, such as metalloproteases (MMPs) and insulin growth factor (IGF), and these genes were upregulated in conventional breast tumour cells. Hence, scientists suspected that cancer cells might arise from a cell population with self-renewal ability, which was thought to be tumour stem cells. Although studies on BCSCs have been conducted continuously worldwide, the role of BCSCs in BRCA pathogenesis and progression is still unclear; identification of the key factors or vital pathways that drive BCSCs from a quiescent state to a malignant state is urgently needed. To address these issues, some researchers have used artificial intelligence and deep learning methods to summarize and analyse stem cell features. Malta et al. [9] used a one-class logistic regression (OCLR) machine learning algorithm to extract transcriptomic and epigenetic feature sets from normal tissue-derived pluripotent stem cells, including embryonic stem cells, induced pluripotent stem cells, and their differentiated progeny, which have different degrees of stemness; in this way, they identified stem cell signatures and quantified stemness with a combined analysis of transcriptomes and methylomes. Two stemness indices, mRNAsi and mDNAsi, were proposed in that study: the former reflects gene expression, and the latter reflects epigenetic features. To verify the two stemness indices, the researchers further annotated and analysed cancer stemness in nearly 12,000 samples of 33 tumour types. Based on that work, the stemness indices of each BRCA sample in the TCGA database are available.
In the present study, we aimed to identify key genes and pathways correlated with BRCA stemness by combining mRNAsi with BRCA expression data from TCGA via bioinformatic analysis. A WGCNA model was constructed, and gene modules closely related to mRNAsi were identified. We identified three key gene modules and further selected key genes from one of them. Gene and module functional analyses were conducted to show their significance in BRCA. In summary, our study used a novel approach to identify stemness-related genes, which allowed us to identify CSC-related genes and predict their roles in cancer.
Data collection and pre-processing
The RNA sequencing (RNA-Seq) expression data of 1222 samples, including 113 normal samples and 1109 breast cancer samples, and the corresponding clinical information of 1097 cases were downloaded from the TCGA database in September 2019 (https://portal.gdc.cancer.gov). The mRNAsi indices and tumour purity of breast cancer cases in TCGA were obtained from previous studies [9, 13]. We used the Perl language (http://www.perl.org/) to combine the RNA-Seq results of each sample and the Ensembl database (http://asia.ensembl.org/index.html) to convert gene IDs to gene symbols in a matrix profile. After filtering for usable information, we took 1097 cases and the corresponding clinical information forward to the next analysis.
Clinical characteristic correlation analysis
The prognostic value of mRNAsi or corrected mRNAsi was investigated using the survival package in R. The correlation between mRNAsi or corrected mRNAsi and tumour stages or tumour grades was analysed with the beeswarm package in R.
Screening of differentially expressed genes (DEGs)
Raw expression data from the TCGA were log2-transformed, and differentially expressed genes (DEGs) were identified using the limma package in R. The cut-off criteria for DEG selection were as follows: |log2-fold change| > 1, p < 0.01, and false discovery rate (FDR) < 0.05. Volcano plots and heatmaps were drawn using the limma and pheatmap packages in R.
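The three cut-offs above can be sketched in a few lines of Python. This is a minimal illustration, not the limma workflow used in the study: it assumes that per-gene log2 fold changes and raw p values are already available, and it assumes Benjamini-Hochberg adjustment for the FDR, which the text does not name. The function names and toy values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values (assumed FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, lfc_cut=1.0, p_cut=0.01, fdr_cut=0.05):
    """Keep genes with |log2FC| > lfc_cut, p < p_cut and FDR < fdr_cut."""
    fdr = bh_adjust(pvalues)
    return [
        g for g, lfc, p, q in zip(genes, log2fc, pvalues, fdr)
        if abs(lfc) > lfc_cut and p < p_cut and q < fdr_cut
    ]


# Toy input (made-up numbers, for illustration only).
genes = ["geneA", "geneB", "geneC", "geneD"]
log2fc = [2.3, -1.8, 0.4, 1.2]
pvalues = [0.0001, 0.004, 0.0002, 0.2]
degs = select_degs(genes, log2fc, pvalues)
```

In this toy input, geneC fails the fold-change cut-off and geneD fails the p value cut-off, so only geneA and geneB are retained.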
The WGCNA package in R was used to build a co-expression network from the DEGs. The average linkage method and Pearson’s correlation matrices were applied to all gene pairs, and the co-expression similarity matrix was built using the absolute values of the correlations between transcription data. A weighted adjacency matrix was established with the function $A_{mn} = |C_{mn}|^{\beta}$, where $C_{mn}$ is Pearson’s correlation between gene m and gene n and $A_{mn}$ is the adjacency between gene m and gene n. β is the soft-thresholding power, which emphasizes strong correlations between genes and penalizes weak ones. We first selected a β value to build the co-expression network and then converted the adjacency into a topological overlap matrix (TOM) to measure the network connectivity of genes; the TOM summarizes how strongly two genes share network neighbours, and the corresponding dissimilarity (1 − TOM) was calculated. We used average linkage hierarchical clustering based on the TOM dissimilarity to group genes with similar expression profiles into gene modules. The minimum module size for the gene dendrogram was 30.
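The adjacency and topological overlap steps can be illustrated with a short Python sketch. It is not the WGCNA package itself: it assumes an unsigned network and the standard unsigned topological overlap formula, TOM_ij = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), and the value β = 6 and the toy expression vectors are purely illustrative. Hierarchical clustering is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency(expr, beta):
    """Unsigned weighted adjacency A_mn = |C_mn| ** beta (rows are genes)."""
    g = len(expr)
    return [
        [1.0 if m == n else abs(pearson(expr[m], expr[n])) ** beta
         for n in range(g)]
        for m in range(g)
    ]


def tom_similarity(adj):
    """Topological overlap between every pair of genes."""
    g = len(adj)
    # Connectivity k_i: summed adjacency of gene i to all other genes.
    k = [sum(adj[i][u] for u in range(g) if u != i) for i in range(g)]
    tom = [[1.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(g):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(g) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom


# Toy expression matrix: 3 genes x 4 samples (made-up values).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 3.0, 5.0, 3.0],
]
adj = adjacency(expr, beta=6)
tom = tom_similarity(adj)
# Dissimilarity used as the clustering distance.
diss = [[1.0 - t for t in row] for row in tom]
```

Raising the absolute correlation to the power β shrinks weak correlations towards zero much faster than strong ones, which is why the first two (perfectly correlated) toy genes keep a high overlap while the third is nearly disconnected.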
Identifying key modules and genes
We chose mRNAsi and epigenetically regulated mRNAsi (EREG-mRNAsi), which is a stemness index generated using a set of stemness-related epigenetically regulated genes, as the sample traits to find CSC-related modules and genes. We selected modules related to the mRNAsi, and genes in these modules were considered co-expressed CSC-related genes. First, we calculated the correlation between gene expression levels and sample traits, which was defined as the gene significance (GS). The module eigengene (ME) was taken as the first principal component from the principal component analysis (PCA) of each gene module, so that the expression patterns of all genes in a module can be summarized by a single characteristic expression profile. GS was calculated as the log10 transformation of the p value (GS = lg p) from a linear regression between gene expression and mRNAsi or EREG-mRNAsi. Module significance (MS) was the average GS in a specific module, which represented the correlation between the module and the sample traits. Highly similar modules were merged using a cut-off of < 0.25, and the modules with the largest MS were considered the modules most related to the sample traits. After finding the modules of interest, we calculated GS and module membership (MM, the correlation between the expression profile of a gene and the eigengene of its module) for each gene. We defined the thresholds for the selection of key genes in a module as cor.gene MM > 0.8 and cor.gene GS > 0.5.
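The key gene thresholds can be sketched as a simple filter. This Python example is only an illustration under stated assumptions: it treats both MM and GS as Pearson correlations, in line with the "cor.gene" thresholds, and it takes the module eigengene as a given vector rather than computing it by PCA. The function names and all values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_key_genes(expr_by_gene, eigengene, trait, mm_cut=0.8, gs_cut=0.5):
    """Return genes in one module with MM > mm_cut and GS > gs_cut."""
    key_genes = []
    for gene, values in expr_by_gene.items():
        mm = pearson(values, eigengene)  # module membership (assumed: cor with ME)
        gs = pearson(values, trait)      # gene significance (assumed: cor with trait)
        if mm > mm_cut and gs > gs_cut:
            key_genes.append(gene)
    return key_genes


# Toy module: 3 genes x 5 samples (made-up values).
module_expr = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [5.0, 4.0, 3.0, 2.0, 1.0],
    "gene3": [1.0, 3.0, 2.0, 3.0, 1.0],
}
eigengene = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical module eigengene
mrnasi = [0.1, 0.2, 0.3, 0.5, 0.4]       # hypothetical sample trait
key_genes = select_key_genes(module_expr, eigengene, mrnasi)
```

Only gene1 tracks both the eigengene and the trait in this toy module, so it is the only gene returned.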
Functional annotation and pathway enrichment analysis
The clusterProfiler package in R was used to perform gene ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses to investigate and visualize the biological functions of the key genes. A p-value < 0.05 and an FDR < 0.05 were considered statistically significant.
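Over-representation analysis of this kind is commonly based on a hypergeometric test. The Python sketch below shows that calculation for a single term; it is not the clusterProfiler implementation, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric draw.

    n_background: genes in the background set
    n_annotated:  background genes annotated to the term
    n_selected:   genes in the list being tested (e.g. key genes)
    n_overlap:    tested genes that carry the annotation
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 background genes, 5 annotated to a term,
# 4 genes tested, 3 of which are annotated (made-up counts).
p_term = enrichment_pvalue(20, 5, 4, 3)
```

In a full analysis, one such p value is obtained per GO term or KEGG pathway, and the resulting set is then adjusted for multiple testing before the FDR threshold is applied.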
Co-expression analysis of key genes and protein–protein interaction (PPI) network analysis
To determine the co-expression relationship between key genes, we chose the corrplot package in R to calculate the Pearson correlations at the transcription level. We used an online tool, Search Tool for the Retrieval of Interacting Genes (STRING), to evaluate the protein–protein interaction (PPI) among key genes.
Oncomine (http://www.oncomine.org) and GEPIA (http://gepia.cancer-pku.cn/) were used to verify the mRNA expression of key genes between tumour and normal tissues in BRCA. The thresholds for Oncomine screening were as follows: p-value, 1E−4; fold change, 2; gene level, top 10%. We selected three datasets, GSE29431, GSE10797, and GSE65194, from the Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo/). The Kaplan-Meier Plotter online database (http://www.kmplot.com/) was used to examine the prognostic value of key genes.
All of the cut-offs used in this paper, including those for mRNAsi, corrected mRNAsi and key gene expression levels, were the median level of each item. The Wilcoxon test (wilcox.test function in R) was applied to evaluate the difference in mRNAsi scores or corrected mRNAsi scores between the normal group and the tumour group. A two-sided log-rank test in the survival package in R was employed to assess the survival difference between the two groups. The Kruskal-Wallis test (kruskal.test function in R) was used to test the association between mRNAsi scores or corrected mRNAsi scores and clinical characteristics. A p value < 0.05 was considered statistically significant.
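The median cut-off can be illustrated with a short Python sketch. Note the assumptions: the text states only that mRNAsi was corrected using tumour purity, so dividing mRNAsi by purity is an assumed form of the correction, and assigning values equal to the median to the "low" group is likewise an arbitrary choice made here. All names and values are hypothetical.

```python
from statistics import median


def corrected_mrnasi(mrnasi, purity):
    """ASSUMPTION: corrected mRNAsi = mRNAsi / tumour purity.

    The source text does not give the formula; this is one plausible form.
    """
    return [s / p for s, p in zip(mrnasi, purity)]


def median_split(values):
    """Label each sample 'high' or 'low' relative to the median."""
    cut = median(values)
    # Values equal to the median are assigned to 'low' (arbitrary choice).
    return ["high" if v > cut else "low" for v in values]


# Toy samples (made-up values).
mrnasi = [0.30, 0.45, 0.60, 0.50]
purity = [0.60, 0.90, 0.80, 0.50]
corrected = corrected_mrnasi(mrnasi, purity)
groups = median_split(corrected)
```

The resulting "high" and "low" labels are the two groups that a log-rank test would then compare.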
mRNAsi and corrected mRNAsi according to clinical characteristics of BRCA
DEGs between BRCA tissues and normal tissues
WGCNA: identification of mRNAsi-related modules and genes
Gene function annotation and pathway analysis
Correlation between key genes at transcription and protein levels
Validation and analysis of key gene expression
Although many studies have focused on BRCA diagnosis and treatment in recent years, therapeutic strategies to prevent and treat this malignancy are still inadequate and ineffective. With the emergence of the CSC hypothesis, cancer cells are now considered likely to originate from a cell population called stem cells, which has self-renewal capacity; in addition, CSCs have been reported to be involved in tumour progression, therapeutic resistance and recurrence. Thus, it is important and urgent to identify the key genes driving the crucial cellular processes involved in the transformation from quiescent stem cells to non-renewing cancer cells with unlimited proliferative potential. In the present study, we first analysed the correlation between mRNAsi scores and clinical characteristics in BRCA samples and found that tumour tissues had higher stemness than normal tissues, which was consistent with previous findings. Tumour tissues are composed of complex cell types, including cancer-related cells and normal microenvironment cells, and mRNAsi is a stemness index for all cells in a given sample; therefore, to eliminate the bias in mRNAsi caused by non-cancer cells in tumour samples, we calculated corrected mRNAsi scores using the tumour purity. The corrected mRNAsi was evidently higher in BRCA tissues than in normal tissues, and the corrected mRNAsi scores increased with the tumour pathological stage, with T4 stage tumours showing the highest stemness. The mRNAsi scores did not show a significant correlation with patient survival unless they were corrected by tumour purity. As previously reported, CSCs mediate tumour metastasis and treatment resistance, which ultimately predict poor patient survival.
WGCNA is a tool for analysing gene expression patterns across multiple samples; it groups genes with similar expression patterns into clusters and then analyses the correlations between these gene clusters and characteristics of interest. We used WGCNA to classify the DEGs into different gene clusters based on a weighted connection analysis of the DEG expression profile between BRCA and normal tissues. Highly co-expressed genes thus constituted gene modules, which were used to evaluate the strength of the correlation between each module and the clinical features of interest. In this way, we discovered more than one gene module with strong connections to mRNAsi. Gene function annotation and signalling pathway analysis revealed that these gene modules may exert distinct functions in BRCA; for instance, genes in the green module were most enriched in focal adhesion and ECM-receptor interaction pathways; genes in the brown module were mainly enriched in the PI3K-AKT signalling pathway, MAPK pathway and Ras signalling pathway; and genes in the turquoise module were primarily enriched in the cell cycle and cellular senescence pathways. Given that the cell cycle and cellular senescence determine cell fate and self-renewal, we selected the turquoise module for further analysis. Based on GS and MM, we selected 32 key genes from the turquoise module. These key genes were all upregulated in BRCA tissues, and their functional enrichment was concentrated on the cell cycle pathway. Some cell cycle regulators have been reported to be involved not only in breast cancer progression but also in the stem-like cell activity of breast cancer cells; for example, inhibition of cyclin D1 or CDK4/6 increases or decreases the migration capacity of stem cells in breast cancer. Additionally, BCSCs are considered to exist in a slow cycling state or a quiescent state, which is the direct consequence of cell cycle dysregulation.
The validation of the stemness-related key genes in multiple cancer tissues revealed that most of the key genes were overexpressed in various cancer tissues. In BRCA, the expression levels of these key genes were verified using several GEO datasets, including GSE29431, GSE65194 and GSE10797. As expected, the key genes were all overexpressed in BRCA tissues; most importantly, the expression of key genes in TNBC tissues was considerably higher than that in normal tissues. TNBC has a poorer prognosis than other types of breast cancer because of its highly malignant phenotypes, which are similar to those of cancer stem cells. We discovered differences in the expression levels of key genes in different subtypes of breast cancer, and 28 of 31 key genes were also upregulated in luminal B tissues compared with luminal A tissues. The expression differences of key genes between TNBC and normal tissues or between luminal B and luminal A tissues demonstrated their significance in the regulation of breast cancer stemness characteristics. Only 8 key genes showed expression differences between stromal cells and epithelial cells in breast cancer tissues. We speculate that the difference in metastatic competence between stromal cells and epithelial cells in the same cancer tissue may not be enough to trigger long-distance metastasis; thus, these two cell types may not show evident differences in stemness characteristics. Among all key genes, 12 genes (TPX2, EXO1, CCNB2, CENPA, SGO1, RAD54L, SKA1, FOXM1, PLK1, CDC20, KIF4A and SGO1) were correlated with the survival of BRCA patients.
FOXM1, PLK1, and CENPA composed a cell cycle kinetics regulation pathway in a previous study, which reported that FOXM1 regulated the expression of CENPA and PLK1 to promote mitosis, further regulating the proliferation of pancreatic β cells. The investigation of pancreatic β cells mainly focused on the transition from a quiescent state to a normal cell cycle state, which is quite similar to the characteristics of cancer stem cells. Furthermore, inhibition of PLK1 blocked the growth of CD44high/CD24−/low tumour-initiating cells in TNBC. CENPA is a critical component of the cell cycle signalling pathway and a necessary regulator of the mitotic spindle; it was found to be expressed in cardiac progenitor cells and to promote their proliferation. These key genes were mainly concentrated in the cell cycle signalling pathway, and a previous study reported that cell cycle genes involved in DNA replication and G2 phase progression showed an intrinsic propensity towards the pluripotent state, which suggested that control of the pluripotent state is hardwired to the cell cycle pathway.
In conclusion, 32 genes were found to be closely related to BRCA stem cell characteristics; among them, 12 genes showed prognosis-oriented effects in BRCA patients. The most significant signalling pathway related to stemness in BRCA was the cell cycle pathway, which may support new ideas in therapeutic target screening for inhibitors of BRCA stem characteristics. Conclusions derived from bioinformatic analysis of retrospective data need to be validated by further biological studies, which will be the focus of our future work.
JP and YL designed this research. JP carried out the data analysis. YL wrote the manuscript. YW revised the manuscript. All authors read and approved the final manuscript.
There is no funding for this research.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
9. Malta TM, Sokolov A, Gentles AJ, Burzykowski T, Poisson L, Weinstein JN, Kaminska B, Huelsken J, Omberg L, Gevaert O, Colaprico A, Czerwinska P, Mazurek S, Mishra L, Heyn H, Krasnitz A, Godwin AK, Lazar AJ, Stuart JM, Hoadley KA, Laird PW, Noushmehr H, Wiznerowicz M. Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell. 2018;173(2):338.e15–354.e15.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.