Introduction

Precision medicine has become an important emerging approach to the diagnosis, treatment, and prevention of disease, especially cancer; it takes into account the individual variability of each person in terms of genes, environment, and lifestyle. Breast cancer is the most common malignancy in women [1, 2]. Owing to tumor heterogeneity caused by cell phenotype diversity, approaches to treatment and prognosis have been shown to correlate strongly with the intrinsic subtypes of breast cancer [3]. Triple-negative breast cancer [TNBC, ER(−), PR(−), HER2(−)], which accounts for about 15% of breast cancers worldwide, is characterized by aggressive tumor behavior and strong resistance to anti-hormone treatment, chemotherapy, and targeted therapy [4–6].

Previously, various TNBC molecular subtypes have been identified using genome-wide analysis, including gene expression (GE) profiling. For example, six specific subtypes, namely, basal-like 1 (BL1), basal-like 2 (BL2), mesenchymal (M), mesenchymal stem-like (MSL), immunomodulatory (IM), and luminal androgen receptor positive (LAR), were first described by Lehmann et al. [7]. Since then, more investigations have targeted TNBC tumor heterogeneity using gene ontology [8–10], therapeutic targets [11, 12], and mRNA or long noncoding RNAs (lncRNAs) as diagnostic criteria [13]. Although the six-subtype classification has been refined recently [14, 15], the variation in molecular classification of TNBC across different populations remains to be elucidated.

Accumulating evidence has shown that socioeconomic, epidemiological, and genetic factors all play roles in tumor behavior, cancer subtype, and the prognosis of patients among different racial/ethnic groups [16–18]. For example, women of African heritage, compared to women of Caucasian heritage, have a higher rate of TNBC and a lower rate of receptor (+)/HER2(−) breast cancers after the age of 35 years [19]. Furthermore, a higher prevalence and poorer clinical outcomes have been observed among African-American women with TNBC than among women of European descent [20, 21]. There is consensus that genome-wide studies, such as gene expression profile analysis, provide multi-gene signatures that are closely linked to TNBC carcinogenesis [22, 23]. Previous studies have demonstrated a significant association between the PTEN mutation, a high Ki67 index, and the CD44+/CD24− phenotype among African-American women with TNBC [24]. In addition, it has been noted that variations in EGFR-activating mutations are frequently found in TNBCs among East Asian patients, which is not the case for European patients [25]. In the context of these findings, controversy exists regarding the amount of variation that occurs in genomic profiles between different ethnic populations [26]. Therefore, the aim of the present study was to compare the molecular subtypes of triple-negative breast cancers (TNBCs) between Taiwanese female patients and non-Asian female patients.

Methods

Subjects

With the approval of the Institutional Review Board (# 201310020BC) of Taipei Veterans General Hospital, Taiwan, ROC, a total of 57 patients with TNBC [ER(−), PR(−), HER2(−)] were identified between June 2013 and September 2015 by immunohistochemical analysis of their pathological specimens. Total RNA was extracted from these TNBC tissue samples, and the RNA samples were used for oligonucleotide microarray analysis, which was conducted by the Genome Research Center, National Yang-Ming University [27].

Data set collection and TNBC identification by bimodal filtering

GE profiles from fourteen publicly available breast cancer microarray datasets, including twelve non-Asian and two Taiwanese datasets (Sun Yat-Sen Cancer Center and Cathay Hospital) (GEO, http://www.ncbi.nlm.nih.gov/gds; ArrayExpress, http://www.ebi.ac.uk/microarray-as/ae/), were compiled and added to our dataset (GSE95700) (Supplementary Reference 1). In total, 1915 human breast cancer samples were included, and among these samples a total of 617 TNBCs were identified (Table 1). The raw GE values for each dataset were normalized independently using the RMA procedure. The Affymetrix probes used for ER, PR, and HER2 were 205225_at, 208305_at, and 216836_s_at, respectively. A two-component Gaussian mixture distribution model was used to analyze the empirical expression distributions of ER, PR, and HER2, and the default parameters were estimated by maximum likelihood optimization using R statistical software (https://www.r-project.org/). After the posterior probability of a negative expression state for ER, PR, and HER2 had been estimated, a sample was defined as having negative expression if the posterior probability was less than 0.5. This process was followed by bimodal filtering to remove all ER/PR/HER2-positive tumors. The remaining TNBC tumors were then normalized along with positive controls for ER, PR, and HER2. Only samples that displayed a marked reduction in expression compared to the positive controls, based on the above criteria, were classified as TNBC (n = 617).
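The bimodal filtering step can be illustrated with a minimal sketch, which is not the authors' R code. It fits a one-dimensional two-component Gaussian mixture by expectation-maximization (one way of obtaining a maximum likelihood fit) and then classifies a sample from its posterior probability. The function names, the initialization, and the reading of the 0.5 cutoff (a sample is called negative when the posterior probability of the high-expression component is below 0.5) are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(values, n_iter=200):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns (weights, means, sds); index 0 is the low-expression component.
    Requires at least two values.
    """
    xs = sorted(values)
    half = len(xs) // 2
    # Crude initialisation (assumption): lower and upper halves of the data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    spread = max((xs[-1] - xs[0]) / 4.0, 1e-3)
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the low component for each value.
        resp_low = []
        for x in values:
            p_low = weights[0] * normal_pdf(x, means[0], sds[0])
            p_high = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p_low + p_high
            resp_low.append(p_low / total if total > 0.0 else 0.5)
        # M-step: update weight, mean, and SD of each component.
        for k in (0, 1):
            resp = resp_low if k == 0 else [1.0 - r for r in resp_low]
            n_k = sum(resp)
            if n_k < 1e-9:
                continue
            means[k] = sum(r * x for r, x in zip(resp, values)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, values)) / n_k
            sds[k] = max(math.sqrt(var), 1e-3)
            weights[k] = n_k / len(values)
    if means[0] > means[1]:
        # Keep the low-expression component at index 0.
        means.reverse()
        sds.reverse()
        weights.reverse()
    return weights, means, sds


def posterior_high(x, weights, means, sds):
    """Posterior probability that x belongs to the high-expression component."""
    p_low = weights[0] * normal_pdf(x, means[0], sds[0])
    p_high = weights[1] * normal_pdf(x, means[1], sds[1])
    total = p_low + p_high
    return p_high / total if total > 0.0 else 0.5


def is_negative(x, model, cutoff=0.5):
    """Call a sample negative for a marker (assumed reading of the 0.5 cutoff)."""
    return posterior_high(x, *model) < cutoff
```

In this reading, a sample would be retained as TNBC only if `is_negative` holds for the ER, PR, and HER2 probes, each with its own fitted model.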

Table 1 Triple-negative breast cancer (TNBC) distribution in publicly available data sets

Identification of TNBC subtypes

Previously, six distinct TNBC molecular subtypes were proposed by Lehmann et al. [7]: basal-like 1 (BL1), basal-like 2 (BL2), mesenchymal (M), mesenchymal stem-like (MSL), immunomodulatory (IM), and luminal androgen receptor positive (LAR). Accordingly, using the published six-subtype gene lists, we clustered our compiled complete dataset and replotted the six-subtype heat map. In addition to background correction, the MAS5 procedure was applied to the Taiwanese data, and then consensus clustering and k-means clustering were used to determine the optimal number of stable TNBC subtypes. Cluster robustness was assessed by consensus clustering using agglomerative k-means clustering with average linkage for the 123 TNBC profiles, based on the most differentially expressed genes (SD > 0.9; n = 5463 genes). The optimal number of clusters was determined from the Consensus Cumulative Distribution Function (CDF), which plots the corresponding empirical cumulative distribution over the range [0, 1], and was based on the proportional increase in the area under the CDF curve. The number of clusters was fixed at the point where any further increase in cluster number (k) did not lead to a corresponding marked increase in the CDF area. Principal component analysis (PCA) and heat maps were generated using GeneSpring software (GeneSpring GX 11.5; Agilent Technologies, Inc., Santa Clara, CA, USA), and further pathway analysis was carried out using Ingenuity Pathway Analysis software [27] (IPA; Qiagen, Redwood City, CA, USA).
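The CDF-area criterion described above can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: `consensus_cdf_area` takes a consensus matrix (pairwise co-clustering frequencies in [0, 1]) and approximates the area under its empirical CDF, and `pick_k` returns the last k before the proportional gain in area falls below a threshold. The function names, the bin count, and the 5% threshold are hypothetical choices; the source only states that k was chosen where the CDF area stopped increasing markedly.

```python
def consensus_cdf_area(consensus, n_bins=100):
    """Approximate area under the empirical CDF of off-diagonal consensus values."""
    n = len(consensus)
    vals = sorted(consensus[i][j] for i in range(n) for j in range(i + 1, n))
    area = 0.0
    idx = 0
    for b in range(1, n_bins + 1):
        c = b / n_bins
        # Advance past all consensus values <= c (vals is sorted).
        while idx < len(vals) and vals[idx] <= c:
            idx += 1
        # Right Riemann sum of CDF(c) over a bin of width 1 / n_bins.
        area += (idx / len(vals)) / n_bins
    return area


def pick_k(areas, min_gain=0.05):
    """Return the last k whose successor adds less than min_gain relative area.

    `areas` maps each candidate k to its CDF area; min_gain is an assumed cutoff.
    """
    ks = sorted(areas)
    for prev, nxt in zip(ks, ks[1:]):
        gain = (areas[nxt] - areas[prev]) / areas[prev]
        if gain < min_gain:
            return prev
    return ks[-1]
```

For example, with made-up areas `{2: 0.30, 3: 0.45, 4: 0.55, 5: 0.62, 6: 0.63}`, the relative gain from k = 5 to k = 6 is under 2%, so `pick_k` returns 5.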

Gene selection specific to each TNBC subtype

After consensus clustering and k-means clustering of the Taiwanese data, the TNBC subtypes were determined. The genes specific to each TNBC subtype were defined as follows: (1) the strongest probe had a fold change (ratio) of >1.75 (upregulation) or <0.5 (downregulation) compared with the other subtypes; (2) the percentage of samples with a GE difference >0 (sample GE − mean GE of other subtypes) was >80%; and (3) the p value was <10⁻⁴ (t test: specific subtype versus other subtypes).
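The three selection criteria can be expressed as a small filter. The sketch below is illustrative only and rests on several assumptions: expression values are on a linear scale so that a ratio of means is meaningful, the p value is computed elsewhere (for example by a t test in a statistics package) and passed in, and for downregulated genes the 80% rule is mirrored (more than 80% of samples below the mean of the other subtypes), which the text does not state explicitly. The function name and parameter names are hypothetical.

```python
from statistics import mean


def is_subtype_specific(subtype_values, other_values, p_value,
                        up_ratio=1.75, down_ratio=0.5,
                        min_fraction=0.80, max_p=1e-4):
    """Apply the fold-change, sample-fraction, and p-value criteria to one probe."""
    other_mean = mean(other_values)
    ratio = mean(subtype_values) / other_mean
    # Fraction of subtype samples whose GE exceeds the mean GE of other subtypes.
    above = sum(1 for v in subtype_values if v - other_mean > 0) / len(subtype_values)
    if ratio > up_ratio:
        consistent = above > min_fraction
    elif ratio < down_ratio:
        # Assumption: the 80% rule is mirrored for downregulated genes.
        consistent = (1.0 - above) > min_fraction
    else:
        return False
    return consistent and p_value < max_p
```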

Cell line and reagents

With the approval of the Institutional Review Board (# 201606012BC) of Taipei Veterans General Hospital, Taiwan, ROC, the human triple-negative breast cancer cell lines MDA-MB-468 (BL1), MDA-MB-231 (MSL), BT-549 (M), MDA-MB-453 (LAR), and DU4475 (IM) were obtained from the American Type Culture Collection (ATCC, Manassas, VA, USA). These were maintained in specific culture media, namely F12 MEM (No. 12400024, Gibco, NY, USA) or RPMI, as appropriate; the media were supplemented with 10% FBS, 2 mM L-glutamine, and penicillin/streptomycin, and the cells were cultured at 37 °C in a humidified atmosphere containing 5% CO2. Cells between passages three and ten were used.

Total RNA extraction and reverse transcription PCR

Total RNA was isolated using the modified single-step guanidinium thiocyanate method [28] (TRI REAGENT, T9424, Sigma Chem. Co., St. Louis, MO, USA). After the cells of the five different subtypes, namely, MDA-MB-468 (BL1), MDA-MB-231 (MSL), BT-549 (M), MDA-MB-453 (LAR), and DU4475 (IM), had been grown and their total RNA extracted, complementary DNA (cDNA) was synthesized using a First Strand cDNA Synthesis Kit (Invitrogen, CA, USA). TaqMan® Gene Expression Assays were used to validate the differential expression at the mRNA level of the various gene sets that had been selected from the consensus clustering results (Table 2). The TaqMan system is supported by a well-established primer database that significantly reduces experimental failure due to inappropriate primer design.

Table 2 Gene list for validation of Taiwanese TNBC subtype

Any possible contamination of the various PCR components was excluded by performing a PCR reaction with these components in the absence of the RT product for each set of experiments (no-template control, NTC). For the statistical comparisons, the relative expression level of the mRNA of each specific gene was normalized against the amount of GAPD mRNA in the same RNA extract. All samples were analyzed in triplicate.
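The text does not state which quantification formula was used. One common way to normalize a target gene against a housekeeping gene and express it relative to a reference cell line is the comparative Ct (2^−ΔΔCt) calculation, sketched below as an assumption rather than as the authors' method. It presumes roughly 100% amplification efficiency; the function and argument names are hypothetical, and each argument is a list of triplicate Ct values.

```python
from statistics import mean


def relative_expression(ct_target, ct_housekeeping,
                        ct_target_ref, ct_housekeeping_ref):
    """Fold change of a target gene versus a reference line by 2^-ddCt.

    Each argument is a list of replicate Ct values (e.g. triplicates).
    """
    # dCt: target normalized to the housekeeping gene in the same extract.
    delta_ct = mean(ct_target) - mean(ct_housekeeping)
    delta_ct_ref = mean(ct_target_ref) - mean(ct_housekeeping_ref)
    # ddCt: sample of interest relative to the reference cell line.
    return 2.0 ** -(delta_ct - delta_ct_ref)
```

For instance, if a target gene has a ΔCt of 6 cycles in one line and 9 cycles in the reference line, the sketch reports an eightfold higher relative expression.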

Statistical analysis

Data are expressed as mean ± SEM. Differences among multiple groups were identified by repeated-measures one-way ANOVA, followed by Dunnett’s post hoc test. Differences between two groups were identified by the Mann–Whitney U test for nonparametric analysis or by Student’s t test. A p value of <0.05 was considered statistically significant.
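As a brief illustration of the summary statistic used here, the standard error of the mean (SEM) is the sample standard deviation divided by the square root of the number of replicates. The helper name below is hypothetical.

```python
import math
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, SEM) for a list of at least two replicate measurements."""
    return mean(values), stdev(values) / math.sqrt(len(values))
```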

Results

Dataset collection and TNBC identification by bimodal filtering

From June 2013 to September 2015, 57 patients whose tumor samples were screened as TNBC by immunohistochemistry (ER < 1%, PR < 1%, HER2 not amplified) were identified at Taipei Veterans General Hospital. These tumor samples were sent for microarray analysis. Next, two Taiwanese (n = 408) and twelve non-Asian datasets (n = 1450) were downloaded from the public domain. Thus, a total of 1915 human breast cancer samples, including ours (n = 57), were available for expression analysis. The gene expression data generated from the Affymetrix microarrays were then normalized independently using the RMA procedure (Fig. 1a and Supplementary Reference 1).

Fig. 1

Protocol for the acquisition and analysis of the gene expression datasets. GEO datasets for non-Asian (12 groups, n = 1450) and Taiwanese (3 groups, n = 465) female breast cancer samples, including 617 triple-negative breast cancer (TNBC) samples, were acquired, normalized, and cluster analyzed (a). TNBC was identified by bimodal filtering (b), and the result is shown in (c)

The gene expression distributions of ER, PR, and HER2 for the TNBC samples were validated using a two-component Gaussian mixture distribution, and the cutoff point was estimated by maximum likelihood optimization using the optimize function (R statistical software) (Fig. 1b). This resulted in a heat map showing the TNBC tumors normalized along with positive controls for ER, PR, and HER2 (Fig. 1c). Finally, the tumors confirmed as true TNBCs (n = 617) were included in the compiled dataset.

Clustering of TNBC samples from non-Asian and Taiwanese women using the published six-subtype gene lists

Since TNBC subtyping has been suggested as a useful approach, we acquired the published gene lists for the six TNBC subtypes and used these for clustering of our compiled dataset, which included non-Asian (Fig. 2, left panel) and Taiwanese (Fig. 2, right panel) women. The results showed that the percentages of the TNBC subtypes BL1, BL2, IM, M, MSL, and LAR in non-Asian women were 13.56, 8.91, 16.80, 20.45, 8.30, and 11.13%, respectively, while those in Taiwanese women were 14.63, 4.07, 17.89, 16.26, 17.89, and 20.33%, respectively.

Fig. 2

Heat maps of the clustered triple-negative breast cancer (TNBC) subtypes for non-Asian and Taiwanese women. The published gene lists of the six subtypes of TNBC were imported and used for the clustering of our compiled dataset, which consisted of a non-Asian TNBC group (left panel) and a Taiwanese TNBC group (right panel)

When the two groups of women were compared, some discrepancies in TNBC subtypes were found between non-Asian and Taiwanese women. To address this, background correction of the Taiwanese data was performed, and consensus clustering and k-means clustering were used to determine the optimal number of TNBC subtypes for Taiwanese women (Fig. 3). The results showed that five stable subtypes were obtained based on the Taiwanese TNBC data (Fig. 4a). These were IM (13.82%), MSL (30.89%), M (22.76%), LAR (23.58%), and BL (8.94%). The genes specific to each subtype were 227458_at (CD274 or PDL1) for IM, 205225_at for MSL, 200091_s_at for M, 226192_at (androgen receptor) for LAR, and 229538_s_at (IQGAP3) for BL (Fig. 4b). The genes identified as specific to each TNBC subtype (Supplementary Reference 2) were correlated with the Lehmann et al. genes (Table 3) and analyzed using Ingenuity Pathway Analysis (IPA); their top canonical pathways, upstream regulators, top diseases, and biofunctions were also analyzed. The results are summarized in Tables 4 and 5.

Fig. 3

Cluster analysis of the triple-negative breast cancer (TNBC) subtype for Taiwanese women. After background correction of the Taiwanese data, consensus clustering and k-means clustering were used to determine the optimal number of TNBC subtypes. The optimal number of clusters was determined from the Consensus Cumulative Distribution Function (CDF)

Fig. 4

The triple-negative breast cancer (TNBC) subtypes identified in Taiwanese women. The heat map shows five stable TNBC subtypes (a). The genes specific to each subtype are 227458_at (CD274 or PDL1) for IM, 205225_at for MSL, 200091_s_at for M, 226192_at (androgen receptor) for LAR, and 229538_s_at (IQGAP3) for BL (b)

Table 3 Correlation of subtype-specific genes between the Taiwanese and Lehmann et al. gene lists
Table 4 Ingenuity pathway analysis for up-regulated genes in TNBC subtypes
Table 5 Ingenuity pathway analysis for down-regulated genes in TNBC subtypes

Model identification using representative genes in human TNBC cell lines

Using the gene lists selected from the clustering results (Supplementary Reference 2), which were identified as specific to each subtype, real-time PCR was carried out targeting a 47-gene signature (Table 2) using a customized chip. This analysis was carried out on five human TNBC cell lines, namely, MDA-MB-468 (BL1), MDA-MB-231 (MSL), BT-549 (M), MDA-MB-453 (LAR), and DU4475 (IM).

Using DU4475 (IM) as the reference line, significant downregulation of THSD4, ECT2, RAB27B, and ITGB8 was found (Fig. 5a), together with significant upregulation of PDCD1 (PD1), CD274 (PDL1) (except versus MDA-MB-231), and PDCD1LG2 (PDL2) (Fig. 5b), in DU4475 compared to the other cell lines MDA-MB-468 (BL1), MDA-MB-231 (MSL), BT-549 (M), and MDA-MB-453 (LAR). Using MDA-MB-231 (MSL) as the reference line, significant upregulation of DUSP4, together with significant downregulation of CCDC18 and GRTP1 (Fig. 5c), was found in MDA-MB-231 compared to the other cell lines. Using BT-549 (M) as the reference line, significant upregulation of CDCA3 and MATP was found in BT-549 compared to the other cell lines (Fig. 5d). However, in addition to these findings for BT-549, it should be noted that there was significant upregulation of DUSP4 in MDA-MB-231 (MSL) and of AR in MDA-MB-453 (LAR) compared to BT-549 (M) (Fig. 5d). Using MDA-MB-453 (LAR) as the reference line, significant upregulation of AR, ABCA12, IQGAP3, and KLRG2 was found in MDA-MB-453 (Fig. 5e). Finally, using MDA-MB-468 (BL1) as the reference line, significant upregulation of ITGB8, PABPC1, and WNT5A was found in MDA-MB-468 (Fig. 5f).

Fig. 5

Model identification using representative genes in human triple-negative breast cancer (TNBC) cell lines. Using DU4475 (IM) as the reference line, there was significant downregulation of THSD4, ECT2, RAB27B, and ITGB8 (a) together with significant upregulation of PDCD1 (PD1), CD274 (PDL1) (except versus MDA-MB-231), and PDCD1LG2 (PDL2) (b) compared to the other cell lines. Using MDA-MB-231 (MSL) as the reference line (c), there was significant upregulation of DUSP4 together with significant downregulation of CCDC18 and GRTP1 compared to the other cell lines. Using BT-549 (M) as the reference line (d), there was significant upregulation of CDCA3 and MATP in this line compared to the other cell lines, and there was significant upregulation of DUSP4 in MDA-MB-231 (MSL) and of AR in MDA-MB-453 (LAR) compared to the BT-549 (M) line. Using MDA-MB-453 (LAR) as the reference line (e), there was significant upregulation of AR, ABCA12, IQGAP3, and KLRG2 in this line compared to the other cell lines. Using MDA-MB-468 (BL1) as the reference line (f), there was significant upregulation of ITGB8, PABPC1, and WNT5A in this line compared to the other cell lines

Discussion

Breast cancer is an important health problem worldwide. Even with the many therapies available for the various subtypes of breast cancer, treatment of triple-negative breast cancer (TNBC) remains challenging. The heterogeneity of TNBC tumors contributes to their poor response to chemotherapy, and this has led to the development of TNBC subtyping. In this study, we compiled GE profiles from publicly available breast cancer microarray datasets that included both non-Asian and Taiwanese populations. These were then cluster analyzed, which was followed by model identification using representative genes in TNBC cell lines.

There is consensus that significant preprocessing, including background adjustment, normalization, and summarization, is required before a specific gene can be accurately assessed using a compiled dataset [29]. Based on the published gene lists of the six subtypes of TNBC proposed by Lehmann et al. [7], and using our compiled dataset, we found a clearly distinct subtype presentation among the non-Asian samples (Fig. 2, left panel), but this subtyping was not the same for the Taiwanese population (Fig. 2, right panel). Based on these findings, we renormalized the Taiwanese data using the MAS5 procedure and carried out clustering; this resulted in five rather than six clear subtypes being present in the Taiwanese population. Previous studies have suggested that the GCRMA approach might introduce artifacts into the data analysis, which can lead to a systematic overestimate of pairwise correlations within the data. In this context, it has been suggested that the MAS5 approach provides the most faithful cellular network reconstruction [30, 31].

Although three to six TNBC subtypes have been proposed by various authors using gene ontologies [10, 32], therapeutic targets [11, 12], or mRNA profiles as the diagnostic criteria [13], the exact number of TNBC subtypes that occur in women remains an open question [14]. Our findings identified five subtypes, namely the IM, MSL, M, LAR, and BL subtypes. Interestingly, the BL1 and BL2 subtypes of the six-subtype classification of Lehmann et al. were clustered as a single BL subtype in our Taiwanese dataset. We attribute this discrepancy to the smaller sample size, as the number of subtypes tends to increase with sample size.

Several lines of evidence suggest that the interactions of cancer cells with their microenvironment are a critical feature of tumor progression. The cell types involved in such interactions are not limited to stromal cells [33], but also include macrophages [34], endothelial cells [35], and T cells [36]. Interestingly, we found significant upregulation of PDCD1 (PD1), CD274 (PDL1), and PDCD1LG2 (PDL2) expression in the IM subtype compared to the MSL subtype in our compiled dataset. However, when using DU4475 (IM) as the reference line, there was significant upregulation of PDCD1 (PD1) and PDCD1LG2 (PDL2), but not of CD274 (PDL1), compared to MDA-MB-231 (MSL) (Supplementary Reference 3). We attribute this discrepancy to the study samples used, namely, cell lines versus tumor tissue. In the former, only cancer cells were investigated, while in the latter, cancer cells and other cells participating in the tumor microenvironment were investigated as a pool. It should be noted that the IM and MSL subtypes in our dataset share many canonical pathways, such as the iCOS-iCOSL signaling pathway (Table 4), which suggests significant similarity between these two subtypes. This seems to be supported by previous findings, which indicated that some transcripts present in the IM and MSL subtypes are contributed by the tumor microenvironment [14].

The expression of the androgen receptor (AR) plays different prognostic roles depending on the breast cancer subtype; for example, AR expression levels are around 67–88% in ER-positive breast cancers [37, 38] and 12–50% in ER-negative breast cancers [39]. Importantly in this context, the prevalence of AR expression in TNBC has been found to range from 0 to 53% [40].

In our compiled dataset, the percentages of the LAR subtype among non-Asian and Taiwanese women with TNBC were found to be 11.13 and 23.58%, respectively. There is evidence suggesting that AR is expressed in about 60% of early breast cancers and is more frequently expressed in ER-positive than in ER-negative breast cancers [41]. We speculate that ethnic differences might explain the variation in the percentage of the LAR subtype between these populations. However, further validation of this speculation is needed. Regarding cell line-specific gene expression, although the AR gene in BT-549 (M) is upregulated compared to DU4475 (IM), MDA-MB-468 (BL1), and MDA-MB-231 (MSL), the AR gene transcript in MDA-MB-453 (LAR) is ninefold higher than in BT-549 (M), which suggests that this change in AR gene expression is specific to the LAR subtype. Discrepancies concerning the role of AR have recently been noted in various TNBC basic and clinical studies, and both AR agonist and AR antagonist clinical trials have been designed for the treatment of TNBC and ER+ breast cancers [41–43]. Thus, the therapeutic role of AR remains an open question.

In summary, our findings suggest that TNBC subtypes present differently in non-Asian and Taiwanese female populations. The apparent correlation between the IM and MSL subtypes suggests the involvement of the tumor microenvironment in TNBC subtype classification, which might provide important information when selecting therapeutic targets or designing clinical trials for TNBC patients.