Highlights

  1. 1.

    LR model screened 10 important feature DEGs.

  2. 2.

    ABCA3, CCL22, FOXJ1, and IL1RN were good prognostic factors.

  3. 3.

    KCNIP3 and MRPL13 were poor prognostic factors.

Background

Breast cancer is one of the most frequent malignancies in both developed and developing countries, with an estimated 1.5 million new cases per year [1]. Although mammography screening and biomarker testing can improve early diagnosis, tumor recurrence, local, or distant metastases, following conventional therapies is a leading cause of morbidity and mortality in breast cancer patients [2, 3]. As a result, it is critical to identify biomarkers that could predict the recurrence or disease-free survival (DFS) of breast cancer.

Traditionally, the most widely used prognostic factors for the recurrence of breast cancer included tumor size, histologic grade, and the number of axillary lymph nodes with metastasis [4, 5]. These prognostic factors can supply independent prognostic information for patients with breast cancer, whereas they are not suitable for optimal patient management, especially as we move towards the era of personalized treatment [6]. A recent study has suggested that a variety of gene expression changes have occurred in early or precancerous breast cancer, which often precede the appearance of clinical symptoms and can serve as molecular biomarkers of early breast cancer [7]. Thus, a great deal of researches has been devoted to the development and validation of molecular biomarkers that cannot only provide prognostic information but also predict the response to therapy [8, 9]. Currently, many prognostic biomarkers for recurrence of breast cancer have been established, such as 21-gene Oncotype DX assay panel, 70-gene MammaPrit panel, 36-gene signature, PAM50-based Prosigna risk of recurrence (ROR) (NanoString), Breast Cancer Index (BCI) (bioTheranostics), and EndoPredict (EPclin) (Myriad Genetics) [10,11,12]. Although the findings above, our understanding of the molecular mechanisms of breast cancer recurrence is far from clear because of the molecular heterogeneity of breast cancer.

In this study, we aimed to further determine important genes related to the prognosis and recurrence of breast cancer by analyzing the breast cancer gene expression profile in TCGA database based on Logit regression (LR) model analysis and survival analysis. The results may help to provide more powerful biomarkers for the prognosis and recurrence of breast cancer.

Materials and methods

Data sources

Illumina HiSeq 2000 gene expression test data of breast cancer patients were downloaded from TCGA database (https://gdc-portal.nci.nih.gov/), involving a total of 1217 samples. After corresponding to the provided clinical information of the samples, the prognostic grouping was conducted according to the following rules [13]: breast cancer samples with recurrence and death were defined as poor DFS group, while samples without recurrence and survival beyond 5 years were defined as better DFS group. Finally, there were respectively 52 samples and 181 samples in poor and better DFS groups. Besides, after removing the samples without clinical information on breast cancer subtypes (Supplementary materials-table 1), 188 samples were left. Based on the subtypes of breast cancer, these samples in TCGA database were divided into four groups, including Basal (45 samples), Her2 (13 samples), LumA (94 samples), and LumB (36 samples). In addition, another gene expression profile dataset (GSE45725 [14]) of breast cancer was downloaded, which included 340 breast cancer tumor samples. The detection platform was GPL6883 Illumina humanref-8 v3.0 expression beadchip. After removing the samples that had not been followed-up for 3 years, the remaining samples were divided into DFS group (107 samples) and poor DFS group (20 samples), based on the 3-year survival associated with the clinical information. This dataset was used as the validation data. The histopathological data from TCGA and GSE45725 were shown in Supplementary materials-table 1 and Supplementary materials-table 2, respectively.

Data preprocessing and differential expression analysis

After downloading the original expression level data, the Z-score transformation method [15, 16] was used to normalize the original data. Then, according to the grouping, and the R3.4.1 limma package version 3.34.7 [17] (https://bioconductor.org/packages/release/bioc/html/limma.html) was adopted for differentially expressed gene (DEG) screening for the better DFS and poor DFS group samples. False discovery rate (FDR) < 0.05 and |log2 fold change (FC)| > 1 were selected as the threshold for screening the DEGs. To verify whether DEGs can be used to distinguish samples with different prognostic conditions, bidirectional hierarchical clustering was conducted by the R3.4.1 pheatmap package version 1.0 [18]. (https://cran.r-project.org/web/packages/pheatmap/index.html) based on the Pearson correlation algorithm [19].

Function enrichment analysis

Gene Ontology (GO) function (biological process (BP), molecular function (MF), and cellular component (CC)) and Kyoto Encyclopedia of Genes and Genomes (KEGG) [20] pathway annotation for DEGs was performed using DAVID version 6.8 [21, 22] (https://david.ncifcrf.gov/). P value less than 0.05 was selected as the threshold of enrichment significance. The DEGs that were enriched in the GO function and KEGG signaling pathway were selected for further analysis.

Independent prognostic DEG screening

Based on the breast cancer tumor samples and the clinical prognostic information in TCGA dataset, the DEGs enriched in the GO function and KEGG signaling pathway were subjected to screening of significant prognostic correlation using the univariate cox regression analysis in R3.4.1 survival pack version 2.41-1 [23] (http://bioconductor.org/packages/survivalr/). Then, multivariate cox regression analysis was used to further screen independent prognostic DEGs, and log-rank p value less than 0.05 was selected as the threshold of significant correlation.

LR model analysis

Based on the obtained independent prognostic DEGs, we used the glm function in R3.4.1 language to conduct LR model to screen important feature DEGs and classify the two groups of patients with different prognosis. All the genes with p < 0.05 were considered as important feature genes, and then, the accuracy was calculated based on the significant feature genes. Based on the expression characteristics of feature DEGs, all samples were divided into better and poor DFS groups in TCGA training dataset and GSE45725 validation dataset, respectively.

Correlation analysis between LR model classification and survival prognosis

Based on the classification result of LR classification model in the training set and validation set, the Kaplan-Meier (KM) curve method in R3.4.1 survival package version 2.41-1 [15] was used to evaluate the correlation between the grouping conditions (better and poor DFS) and survival prognostic information. Then, the expression levels of important feature DEGs in TCGA training dataset and GSE45725 validation dataset were displayed. In addition, receiver operating characteristic (ROC) curve was drawn to compare the sensitivity and specificity. The area under the curve (AUC) was calculated from the ROC curve. The genes based on the LR models were also verified in the different subtypes of breast cancer (Basal, Her2, LumA, and LumB types).

Results

DEG screening

According to the DEG screening threshold (FDR < 0.05 and |log2FC| > 1), a total of 540 DEGs (better DFS vs. poor DFS) were screened, including 177 significantly downregulated and 363 significantly upregulated DEGs (Fig. 1a). The bidirectional hierarchical clustering heatmap based on DEG expression level is shown in Fig. 1b. It can be clearly seen from the figure that samples with similar gene expression patterns were stratified and clustered into the same group, indicating that the selected DEGs can well distinguish samples of different prognostic types.

Fig. 1
figure 1

a The volcano plot of differentially expressed genes (DEGs). Blue circle represents DEGs, black horizontal line represents FDR < 0.05, and two black vertical lines represent |log2FC| > 1. b Bidirectional hierarchical clustering heatmap based on DEG expression level. White and black bars represent poor and better disease-free survival (DFS) breast cancer tumor samples, respectively

Function enrichment analysis

The downregulated and upregulated DEGs were respectively enriched into 26 and 89 GO terms as well as 6 and 7 KEGG signaling pathways, as shown in Table 1. The results showed that the significantly upregulated DEGs in the better DFS group were significantly enriched in BP functions such as immune response and defense response, CC terms associated with plasma membrane part, and MF terms of cytokine binding, and serine-type peptidase activity. Additionally, they were significantly involved in KEGG signaling pathways such as cell adhesion molecules, antigen processing and presentation, and chemokine signaling pathway. The downregulated DEGs were significantly related to the development of primary sexual characteristics, negative regulation of signal transduction, synapse part, and calcium-dependent protein binding and were involved in KEGG signaling pathways such as cell adhesion molecules, mTOR signaling pathway, and RNA degradation. A total of 283 DEGs were involved in all GO functions and KEGG signaling pathways.

Table 1 GO function node list (top 3) significantly enriched by upregulated and downregulated DEGs

Independent prognostic DEG screening

Based on the 283 DEGs involved in all GO functions and KEGG pathways, a total of 186 prognostic DEGs were identified after univariate cox regression analysis by a combination of the 233 breast cancer tumor samples. Further multivariate cox regression analysis of the 186 prognostic DEGs screened 42 independent prognostic DEGs.

LR model analysis

For the 42 DEGs that were significantly correlated with independent prognosis, LR model was used to screen important feature DEGs, and a total of 10 important feature DEGs were screened, as shown in Table 2. According to the median value of each DEG expression level, the training set samples were divided into high expression (expression level higher than the median value) and low expression (expression level lower than the median value), and the correlation between the samples in different expression level groups and survival prognosis was evaluated by KM curve method. As shown in Fig. 2, the p values of TP binding cassette subfamily a member 3 (ABCA3), C-C motif chemokine ligand 22 (CCL22), forkhead box J1 (FOXJ1), interleukin 1 receptor antagonist (IL1RN), mitogen-activated protein kinase kinase 6 (MAP2K6), ubiquitin conjugating enzyme E2 L6 (UBE2L6), APC regulator of Wnt signaling pathway 2 (APC2), potassium voltage-gated channel interacting protein 3 (KCNIP3), mitochondrial ribosomal protein L13 (MRPL13), transient receptor potential cation channel subfamily M member 2 (TRPM2) between high expression, and low expression were respectively 2.619E−02, 2.048E−03, 2.477E−02, 3.175E−03, 4.929E−02, 5.546E−04, 5.066E−04, 1.128E−02, 2.629E−02, and 1.157E−02, which indicated that these genes can be well distinguished. Based on the hazard ratio (HR) of these genes, ABCA3 (0.536), CCL22 (0.406), FOXJ1 (0.530), IL1RN (0.426), MAP2K6 (0.587), and UBE2L6 (0.362) were good prognostic factors, which were significantly upregulated in the better DFS group. In contrast, APC2 (2.743), KCNIP3 (2.049), MRPL13 (1.885), and TRPM2 (2.050) were poor prognostic factors that were significantly upregulated in the poor DFS group. Based on the DEG expression level of 10 important feature genes, the LR model was used to classify the prognostic types of TCGA training set and GSE45725 validation data samples, respectively. In the GSE45725 validation dataset, 127 breast cancer samples were selected, of which 20 were breast cancer samples with recurrence within 3 years, and were defined as poor DFS group; 107 were samples with no recurrence and survival within 3 years and were defined as better DFS group. The fuzzy matrix of classification results is shown in Table 3.

Table 2 The important prognostic feature DEGs screened by Logit model
Fig. 2
figure 2

The KM curves of 10 important characteristic DEGs. Red represents the high-expression sample group, and blue represents the low-expression sample group

Table 3 The classification result in TCGA dataset and GSE45725 microarray dataset

Correlation analysis between LR model classification and survival prognosis

There was a significant correlation between the grouping of samples after classification and the actual survival prognosis based on 10 important LR classification models (Fig. 3a, b). In addition, the AUC of the ROC curve in the training set and validation set were 0.903 and 0.839, respectively. These results demonstrated that the selected feature genes based on LR model can be well used to predict the prognosis and recurrence of breast cancer.

Fig. 3
figure 3

The KM curves of the correlation between the classification groups in TCGA training set (a) and GSE45725 validation set (b) and the actual survival prognosis based on the LR classification model. The blue and red curves represent the groups of good and poor prognosis samples predicted by the LR classification model, respectively. c Area under the curve (AUC) was calculated from the receiver operating characteristic (ROC) curve. Black and red curves represent the TCGA training set and GSE45725 validation set

In the different subtypes of breast cancer, including basal type (Fig. 4a), Her2 type (Fig. 4b), LumA type (Fig. 4c), and LumB type (Fig. 4d), the prognosis of breast cancer predicted by the LR classification models was similar with the actual survival prognosis. Additionally, the AUCs of ROC curve in the basal type (Fig. 4a), Her2 type (Fig. 4b), LumA type (Fig. 4c), and LumB type (Fig. 4d) were 0.892, 0.818, 0.916, and 0.838, respectively. These results implied that the 10 important genes screened based on LR model can also be well utilized to predict the prognosis of different subtypes of breast cancer.

Fig. 4
figure 4

The KM curves and AUC were analyzed in the basal type (a), Her2 type (b), LumA type (c), and LumaB type (d) of breast cancer using TCGA dataset. The blue and red curves represent the groups of better and poor DFS samples predicted by the LR classification model, respectively

The expression levels of 10 important DEGs in TCGA training set and GSE45725 validation data samples were displayed in Fig. 5. As shown in the figure, the expression level of each important DEG in the two data sets was consistent, and the 7 DEGs, ABCA3 (p = 3.56E−04 in TCGA and p = 1.01E−02 in GSE45725), CCL22 (p = 1.14E−04 in TCGA and p = 4.72E−02 in GSE45725), FOXJ1 (p = 3.45E−05 in TCGA and p = 4.88E−02 in GSE45725), IL1RN (p = 5.55E−05 in TCGA and p = 2.19E−02 in GSE45725), KCNIP3 (p = 2.68E−04 in TCGA and p = 2.92E−02 in GSE45725), MAP2K6 (p = 5.87E−05 in TCGA and p = 3.28E−02 in GSE45725), and MRPL13 (p = 8.27E−05 in TCGA and p = 9.74E−04 in GSE45725), were significantly expressed in both groups in the two data sets.

Fig. 5
figure 5

The column diagram of the expression levels of the 10 important characteristic DEGs in the TCGA training set (a) and the GSE45725 validation set (b) in the better and poor DFS groups, with white and black columns representing the better and poor DFS sample groups, respectively

Discussion

The mRNA expression studies have been widely used to predict the prognosis of breast cancer patients [24,25,26]. In this study, based on the breast cancer gene expression profile data in TCGA, 540 DEGs were screened between better DFS and poor DFS groups, including 177 downregulated and 363 upregulated DEGs. Through LR model screening, 10 important feature DEGs were identified and then validated in the GSE45725, among which, ABCA3, CCL22, FOXJ1, IL1RN, KCNIP3, MAP2K6, and MRPL13, were significantly expressed in both groups in the two data sets. Additionally, based on the K-M curve, it was found that ABCA3, CCL22, FOXJ1, IL1RN, and MAP2K6 were factors associated with good prognostic outcome, while KCNIP3 and MRPL13 were risk factors associated with poor prognostic outcome. Our findings will improve our understanding of breast cancer and provide novel biomarkers for the prognosis and recurrence of breast cancer.

Among the five factors associated with good prognostic outcome, CCL22, FOXJ1, and IL1RN were found to be significantly involved in function associated with immune response. Defense against tumors is one of the functions of the immune system, and the host immune response plays a key role in progression and response to therapy of breast cancer [27]. A study has revealed that the risk of breast cancer is associated with impaired immune responses [28]. CCL22 has an effect on repressing immune responses to tumor cells through its ability of recruiting Treg and Th2-cells, thereby enhancing tumor development [29]. Li et al. [30] has reported that CCL22 is an independent prognostic predictor of breast cancer patients. In addition, recent studies have suggested that FOXJ1 may be a tumor suppressor, which suppresses cell migration and invasion in ovarian cancer [31]. A study by Wang et al. [32] have demonstrated that downregulated FOXJ1 is an independent prognostic predictor for gastric cancer, and it is found to be hypermethylated in breast tumorigenesis [33]. Our study showed that higher expression of FOXJ1 was associated with better DFS in breast cancer, and FOXJ1 may be a good prognostic factor for the prognosis of breast cancer. For IL1RN, its low expression is an early driver of carcinogenesis of urothelial carcinoma of the urinary bladder [34]. Moreover, genetic polymorphisms of IL1RN are found to be associated with individual susceptibility for breast cancer development in Korean women [35]. Combined with our results, it was speculated that CCL22, FOXJ1, and IL1RN may play an important role in ameliorating the prognosis of breast cancer patients through involving in immune response.

ABCA3, also a prognostic factor of a good prognostic outcome, has been reported to be involved in lipid transport and lipid secretion, and is expressed in some human epithelial cells [36]. A recent study has found that ABCA3 has strong expression in normal mammary gland tissue and was exclusively expressed in the epithelial cell layer. The loss of ABCA3 protein expression was related to a more aggressive phenotype in breast cancer patients. The upregulation of ABCA3 was associated with a better prognosis of patients with breast cancer [37]. In accordance with the findings above, our study also suggested that upregulated ABCA3 may be related to the good prognosis of breast cancer and served as a protective factor in breast cancer.

In addition, MAP2K6 is an upstream kinase of the p38/MAPK signaling pathway [38]. Recent studies have found that MAP2K6 may be associated with the progression of cancers [39]. MAP2K6 expression is found to be significantly upregulated in gastric cancer, colon cancer, and esophageal cancer compared with the control [39]. Overexpression of MAP2K6 predicts a worse prognosis of patients with nasopharyngeal carcinoma [40]. However, Wang et al. [41] reported that MAP2K6 gene had a low expression in breast cancer compared with control. The possible reason is that the expression of MAP2K6 in different cancers is different, and its function is complicated. In our study, we found that MAP2K6 had a higher expression in the better DFS group compared with the poor DFS group, which indicated that high expression of MAP2K6 may be associated with good prognosis breast cancer. Further studies are needed to investigate the specific mechanisms of MAP2K6 in the prognosis of breast cancer.

However, MRPL13 and KCNIP3 were predicted to be risk factors associated with poor prognostic outcome in breast cancer. MRPL13 was associated with the function of mitochondrial ribosome. It has been suggested that mitochondrial ribosomes are linked to tumorigenesis. The expression of genes encoding for mitochondrial ribosomal proteins is modified in numerous cancers [42, 43]. Recently, a study reported that overexpression of mitochondrial ribosomal protein S18-2 provides a permanent stimulus for cell division, which suggested its involvement in carcinogenesis [44]. The role of MRPL13 in breast cancer has not been reported to our knowledge. For KCNIP3, it has been identified as a potential biomarker for the early detection of basal-like breast cancer [45]. In alignment with our results, we speculated that MRPL13 and KCNIP3 may be served as poor prognostic biomarkers, and their downregulation may have a better prognosis of breast cancer.

Conclusion

In conclusion, ABCA3, CCL22, FOXJ1, MAP2K6, and IL1RN may serve as good prognostic factors, while KCNIP3 and MRPL13 may be poor prognostic biomarkers to predict the recurrence and prognosis of breast cancer. These good and poor prognostic biomarkers are required to be confirmed using larger cohorts, different validation data or different subgroups. The results will provide more powerful biomarkers for the prognosis and recurrence of breast cancer.