Introduction

The mammalian breast is a dynamic organ, with major morphological changes occurring during organogenesis, puberty, pregnancy, lactation, and involution [1]. Underlying these mammary gland changes is a complex cell hierarchy that supports these processes [2–4]. The simplest model places the multipotent mammary stem cell (MaSC) at the base of this hierarchy, having extensive, self-regenerative potential [5]. During mammary development, the MaSC has been proposed to divide asymmetrically to produce basal/myoepithelial cells as well as luminal progenitors (LumProg), which have more restricted proliferative and differentiation capabilities [5]. LumProg cells are capable of further differentiation into mature luminal (MatureLum) cells, such as estrogen receptor (ER)-positive ductal epithelium, which have an even more limited proliferative potential and some of which are terminally differentiated [5].

Breast tumors may originate from several, if not all, of the cell types within this complex mammary hierarchy. These various cellular foundations for tumor initiation may help explain the heterogeneous nature of human breast tumors [6], which consist of multiple histological and genomic subtypes; these genomic groups, which are defined by their gene expression profiles, have become known as the intrinsic subtypes of breast cancer and are referred to as basal-like, claudin-low, HER2-enriched, luminal A, and luminal B [7–10]. A simple etiological explanation for these different subtypes involves a one-to-one relationship between each intrinsic subtype and a distinct cell type of origin that largely maintains its phenotypic identity after oncogenic transformation; however, both normal and neoplastic non-stem cells can acquire stem-like properties, suggesting that the normal cell hierarchy model could also include an element of reversibility [11]. This also raises the possibility that molecular features defining tumor subtypes may be acquired during tumorigenesis [12].

Genetically engineered mouse models (GEMMs) of breast carcinoma develop heterogeneous tumors [13, 14], but the extent to which they represent human disease is an area of active investigation. We previously showed that murine mammary tumors comprise at least 17 distinct intrinsic subtypes/classes, with eight classes being identified as strong human subtype counterparts by gene expression similarity [14]. As with human breast cancer, the degree to which murine models reflect normal mammary epithelial subpopulations requires further analysis. Characterization of the cellular features of these murine classes is also needed to better determine their preclinical utility, to shed light on trans-species associations [14], and to help interpret preclinical study observations [15–18].

Several studies have independently profiled fluorescence-activated cell sorted (FACS) purified normal mammary cell types from both human [19–21] and murine [22, 23] mammary tissues. Here, we use a meta-analysis approach to compare the transcriptomic profiles from FACS-enriched mammary cell populations with each other and with primary tumors. These data not only identify a number of clinically relevant biomarkers that may be useful for predicting chemotherapy benefit, but also suggest a cell type of origin for many tumor subtypes.

Methods

Detailed methods can be found in Supplemental File 1.

Mammary cell subpopulation gene signatures

Gene expression measurements from FACS-enriched mammary subpopulations were obtained from three human and two murine published studies: GSE16997 [19], GSE19446 [22], GSE27027 [23], GSE35399 [20], and GSE50470 [21]. Using a meta-analysis approach, a consensus ‘enriched’ gene signature was produced for each mammary subpopulation. ‘Enriched’ signatures comprised genes that were identified as being uniquely and highly expressed (false discovery rate (FDR) < 5 %) within a given subpopulation as determined using a two-class (subpopulation X versus all others) significance analysis of microarrays (SAM) analysis [9, 24]. Each ‘enriched’ signature was further refined by supervised clustering using the human UNC308 breast tumor dataset [9] to identify subpopulation ‘features’, which were defined as having at least ten genes with a Pearson correlation greater than 0.5 across all tumors [15, 25]. Expression scores for gene signatures were determined by calculating the mean expression of the signature within each tumor; all gene signature lists are provided in Supplemental Table 1.
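The expression score described above can be sketched in a few lines. This is a minimal illustration only: the gene names and values are placeholders rather than entries from Supplemental Table 1, and skipping genes that are absent from a tumor's profile is an assumption, not a documented step of the analysis.

```python
from statistics import mean


def signature_score(expression, signature):
    """Mean expression of a gene signature within one tumor.

    expression: dict mapping gene symbol -> expression value (e.g. log2 ratio)
    signature:  iterable of gene symbols that make up the signature
    Genes missing from the tumor profile are skipped (an assumption here).
    """
    values = [expression[gene] for gene in signature if gene in expression]
    if not values:
        raise ValueError("no signature genes found in the expression profile")
    return mean(values)


# Placeholder tumor profile and signature (hypothetical names and values).
tumor = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -3.0}
toy_signature = ["GENE_A", "GENE_B", "GENE_NOT_ON_ARRAY"]
score = signature_score(tumor, toy_signature)
```

Scoring every tumor in a dataset with the same function yields one value per tumor per signature, which is the quantity later compared across intrinsic subtypes.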

Mammary cell subpopulation centroids

Mammary cell subpopulation centroids were created using the union of the ‘enriched’ epithelial gene signatures. Distance weighted discrimination (DWD) single sample predictor [26] was used to calculate the shortest Euclidean distance between each tumor and each epithelial cell-enriched centroid. Samples with a positive silhouette width were considered to have a strong association with a given subpopulation [27].
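A simplified sketch of the nearest-centroid assignment and silhouette filter follows. It assumes plain Euclidean distances and the standard Rousseeuw silhouette width computed among the classified samples; the DWD single sample predictor itself is not reproduced, and the function names, centroids, and tumor vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_centroid(sample, centroids):
    """Label of the centroid with the shortest distance to the sample."""
    return min(centroids, key=lambda label: euclidean(sample, centroids[label]))


def silhouette_width(i, samples, labels):
    """Silhouette width of sample i: (b - a) / max(a, b)."""
    distances = {}
    for j, label in enumerate(labels):
        if j != i:
            distances.setdefault(label, []).append(euclidean(samples[i], samples[j]))
    own = labels[i]
    if own not in distances or len(distances) < 2:
        return 0.0  # singleton group or single group: width set to 0
    a = sum(distances[own]) / len(distances[own])  # mean distance within own group
    b = min(sum(d) / len(d) for lab, d in distances.items() if lab != own)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def classify(samples, centroids):
    """Assign each sample to a subpopulation, or 'unclassified' if width <= 0."""
    labels = [nearest_centroid(s, centroids) for s in samples]
    return [lab if silhouette_width(i, samples, labels) > 0 else "unclassified"
            for i, lab in enumerate(labels)]


# Hypothetical two-gene centroids and tumors, for illustration only.
centroids = {"LumProg": [1.0, 0.0], "MatureLum": [0.0, 1.0]}
tumors = [[0.9, 0.1], [1.1, -0.1], [0.1, 0.9], [-0.1, 1.1], [0.45, 0.55]]
calls = classify(tumors, centroids)
```

A tumor that sits closer, on average, to the members of another group than to its own receives a non-positive width and is reported as 'unclassified'.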

Chemotherapy response

A combined breast cancer gene expression dataset of patients treated with neoadjuvant anthracycline and taxane chemotherapy regimens was created from three public datasets: GSE25066 [28], GSE32646 [29], and GSE41998 [30]. Univariate (UVA) and multivariate (MVA) logistic regression analyses were used to determine if gene signatures derived from normal cell populations were capable of predicting pathological complete response (pCR).
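To make the univariate step concrete, the sketch below fits a one-predictor logistic model by Newton-Raphson and reports the odds ratio per unit of signature score. It is an illustration under stated assumptions: the scores and pCR labels are invented, the function name is hypothetical, and the multivariate model with clinical covariates is not shown.

```python
import math


def fit_univariate_logistic(x, y, iterations=25):
    """Fit logit(P(y = 1)) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            break
        # Newton step: beta <- beta + (X'WX)^-1 * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented signature scores and pCR outcomes (1 = pCR, 0 = residual disease).
scores = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -0.8, 0.8, 0.2]
pcr = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
intercept, slope = fit_univariate_logistic(scores, pcr)
odds_ratio = math.exp(slope)  # odds ratio for a one-unit increase in score
```

An odds ratio above 1 indicates that higher signature expression is associated with a higher likelihood of pCR; in practice a statistics package would also supply the confidence interval and p value.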

Results

Comparison of human mammary subpopulation transcriptomic datasets

Several groups have independently obtained transcriptomic profiles of normal human breast cells and compared the genomic biology of these different cell types with human tumors [19–21]. In these studies, normal mammary tissues obtained from female donors were FACS sorted using cell surface markers to enrich for specific mammary subpopulations before microarray analysis (Table 1; Fig. 1). While these initial studies were important, the datasets themselves were relatively small (n = 12 for Lim et al. [19], n = 72 for Shehata et al. [20], n = 18 for Prat et al. [21]), and few if any comparisons across studies were performed. Importantly, FACS-based cell fractionation can only enrich for specific subpopulations. Therefore, transcriptomic profiles reflect features of other contaminating cell types to varying degrees. As such, study-specific biases may be present in any single dataset; therefore, we used consensus information from all three FACS-enriched human transcriptomic datasets to reduce technical and study-specific biases.

Table 1 Human FACS-enriched normal mammary cell subpopulation studies
Fig. 1

Flowchart of analysis. Normal mammary tissue biopsies were taken from female patients (a) and FACS-enriched into distinct mammary cell subpopulations (b). Transcriptome profiling was performed on each subpopulation using gene expression microarrays by three different studies (c). Within each study, genes highly expressed within each subpopulation were determined using a two-class SAM (d). Genes commonly and specifically enriched within each subpopulation across studies were determined to identify ‘enriched’ gene signatures (e). Each ‘enriched’ signature was refined by supervised hierarchical clustering to identify gene ‘features’ highly correlated across a diverse set of human breast tumors (f). These gene signatures were then used for clinical testing (g)

Following DWD normalization [26], unsupervised hierarchical clustering of the most variably expressed genes was performed using Gene Cluster v3.0 by selecting all genes with an absolute log2 expression value greater than three in at least four samples (212 genes) (Fig. 2a). In general, the four major array dendrogram nodes correspond to the four FACS-enriched mammary subpopulations, indicating that the most highly and variably expressed genes are similarly expressed across the different studies. Even when using all genes in the dataset, there is a high Pearson correlation within a given subpopulation across studies and low correlations to other subpopulations (Fig. 2b).
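The gene filter used before clustering amounts to a simple count per gene. The sketch below assumes a dictionary of log2 expression values per gene; the function name and the toy matrix are hypothetical, and the clustering itself (performed in Gene Cluster v3.0) is not reproduced.

```python
def select_variable_genes(matrix, threshold=3.0, min_samples=4):
    """Keep genes whose |log2 expression| exceeds threshold in >= min_samples samples.

    matrix: dict mapping gene symbol -> list of log2 expression values,
            one value per sample.
    """
    selected = []
    for gene, values in matrix.items():
        n_extreme = sum(1 for v in values if abs(v) > threshold)
        if n_extreme >= min_samples:
            selected.append(gene)
    return selected


# Toy matrix with five samples per gene (placeholder names and values).
toy_matrix = {
    "GENE_A": [3.5, -3.2, 4.1, 3.3, 0.2],   # four samples beyond the threshold
    "GENE_B": [3.5, 3.1, 0.4, -0.2, 0.1],   # only two samples beyond it
    "GENE_C": [-4.0, -3.6, -3.1, -5.2, -3.3],
}
kept = select_variable_genes(toy_matrix)
```

Because the filter uses absolute values, strongly under-expressed genes are retained alongside strongly over-expressed ones.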

Fig. 2

Comparison of mammary subpopulations across studies. a Unsupervised hierarchical clustering was performed with the normal human mammary subpopulation dataset using any gene that had a log2 absolute expression value greater than three in at least four samples. b Pearson correlations were determined between the average expressions of each study’s subpopulations using all genes. c The first three principal components were determined across the human mammary subpopulation dataset

On a per-sample basis, the first principal component separated the stroma and adult mammary stem cell (aMaSC) samples from the LumProg and MatureLum samples (Fig. 2c). The second principal component separated the stroma and aMaSC samples into distinct groups, while the third principal component separated the LumProg and MatureLum samples into distinct groups. The aMaSC subpopulation displayed the highest level of variation, which is likely attributable to varying degrees of contamination by other cell types.

Human mammary cell subpopulation enriched gene signatures

As shown in Fig. 2, there is a natural degree of variation between samples of a given subpopulation. We therefore developed gene signatures for each human mammary subpopulation by integrating consensus information across all three datasets (Table 1) to identify the highest confidence subpopulation-specific genes. First, genes highly expressed (FDR < 5 %) within each mammary subpopulation were found using a two-class (subpopulation X versus all others) SAM analysis [24] within each dataset [19–21]. Second, the overlap of genes highly expressed within a particular subpopulation across studies was determined. Lastly, as it is possible in the above analysis to have the same gene in the signature of more than one subpopulation, genes that were identified to be significantly associated with more than one subpopulation were also removed. This resulted in a single, consensus Homo sapiens-enriched (HsEnriched) signature per subpopulation (Fig. 3a). The average Euclidean distance was determined using a 10-fold cross validation for each normal mammary subpopulation sample to centroids created using either the HsEnriched-derived gene signatures or to centroids created using the gene signatures derived separately from each human study (Supplemental Fig. 1). The HsEnriched centroids had a significantly reduced Euclidean distance (~70 %) to each mammary subpopulation (t test p < 0.0001), indicating greater specificity for the consensus HsEnriched signatures when compared with any individual dataset’s subpopulation signature.
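The second and third steps of this procedure reduce to set operations once each study's SAM-significant gene lists are in hand. The sketch below is illustrative: the SAM step is not reproduced, the gene and study contents are placeholders, and it assumes that a gene counts as shared when it survives the cross-study overlap for more than one subpopulation.

```python
from collections import Counter


def consensus_signatures(per_study):
    """Build one consensus 'enriched' signature per subpopulation.

    per_study: list of dicts, one per study, each mapping
               subpopulation name -> set of highly expressed genes.
    """
    # Only subpopulations profiled in every study are considered.
    subpops = set.intersection(*(set(study) for study in per_study))
    # Overlap of highly expressed genes across studies.
    overlap = {s: set.intersection(*(study[s] for study in per_study))
               for s in subpops}
    # Remove genes associated with more than one subpopulation.
    counts = Counter(gene for genes in overlap.values() for gene in genes)
    return {s: {g for g in genes if counts[g] == 1}
            for s, genes in overlap.items()}


# Placeholder gene lists for two hypothetical studies.
study_1 = {"LumProg": {"G1", "G2", "G3"}, "MatureLum": {"G3", "G4"}}
study_2 = {"LumProg": {"G1", "G3", "G5"}, "MatureLum": {"G3", "G4", "G6"}}
signatures = consensus_signatures([study_1, study_2])
```

In this toy case G3 passes the overlap for both subpopulations and is therefore dropped from both, leaving one gene specific to each signature.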

Fig. 3

Homo sapiens-enriched gene signatures. a HsEnriched gene signatures were identified for each mammary subpopulation. First, the overlap of genes highly expressed within each subpopulation across studies was determined. This overlapping gene set was further filtered to remove genes also identified as enriched in another subpopulation to limit the signature to genes specific to an individual subpopulation. The remaining genes comprised the HsEnriched gene signature for that subpopulation, as indicated by the shaded box. b The standardized average expression of the four HsEnriched gene signatures was calculated across three human datasets and displayed by intrinsic tumor subtype. c A nearest centroid predictor using the HsEnriched gene signatures was used to determine which epithelial features each tumor most represented. To reduce spurious findings, any tumor with a negative silhouette width was considered to have a weak association and was labeled as ‘unclassified’

We next evaluated the utility of these signatures for distinguishing human tumor subtypes. Figure 3b displays the standardized average expression of each HsEnriched signature across the human intrinsic breast tumor subtypes [7, 9] using over 3,000 tumors [9, 31, 32]. The aStr-HsEnriched signature was highest in claudin-low and normal-like tumors. Interestingly, claudin-low tumors also highly express the aMaSC-HsEnriched signature. High expression of the aMaSC-HsEnriched signature in claudin-low tumors is unlikely to be an artifact of stromal cells in these tumors since the Pearson correlation between the aStr-HsEnriched and aMaSC-HsEnriched signatures was −0.19 across the normal human mammary samples. The LumProg and MatureLum-HsEnriched signatures were most highly expressed in basal-like and luminal subtype tumors, respectively (Fig. 3b).

We noted a considerable degree of signature variation within a subtype, indicating that it is not necessarily the case that all tumors of a given subtype share features with the same normal cell type. A nearest centroid predictor with a 10-fold cross validation error rate of 4.8 % was created to individually determine which normal mammary epithelial subpopulation is most similar to each tumor. Samples with positive silhouette widths [27] were considered to have a strong association with their particular subpopulation, with all other tumors being categorized as ‘unclassified’ [33] (Fig. 3c). Specifically, 94 % of basal-like tumors had LumProg expression profiles. The claudin-low subtype had the highest percentage of tumors classified as aMaSC (18 %), although most claudin-low tumors were classified as having LumProg features (59 %). The HER2-enriched subtype was predominantly classified as having LumProg expression features. The luminal A and B subtypes were most similar to the MatureLum subpopulation.

Murine mammary cell subpopulation enriched gene signatures

Several groups have also profiled normal murine mammary cell subpopulation expression features using FACS [22, 23] (Table 2). In addition to highlighting conserved expression features across species [22], murine studies are uniquely positioned to enable comparisons with developmental states not easily accessed in humans, including early fetal development [23]. We were particularly interested in fetal mammary stem cells (fMaSC) [23], which is a distinct cell population not captured in any human study performed thus far (Table 3). Using the same approach that we used to derive the HsEnriched signatures, we created Mus musculus-enriched (MmEnriched) signatures for each murine mammary subpopulation (Fig. 4a) [22, 23].

Table 2 Murine FACS-enriched normal mammary cell subpopulation studies
Table 3 Gene set analysis of human and murine cell subpopulations
Fig. 4

Mus musculus-enriched gene signatures. a MmEnriched gene signatures were identified for each mammary subpopulation. First, the overlap of genes highly expressed within each subpopulation across studies was determined. This overlapping gene set was further filtered to remove genes also identified as enriched in another subpopulation to limit the signature to genes specific to an individual subpopulation. The remaining genes comprised the MmEnriched gene signature for that subpopulation, as indicated by the shaded box. b The standardized average expression of the five MmEnriched gene signatures was calculated across a murine dataset and displayed by intrinsic tumor class. c A nearest centroid predictor using the MmEnriched gene signatures was used to determine which epithelial features each tumor most represented. To reduce spurious findings, any tumor with a negative silhouette width was considered to have a weak association and was labeled as ‘unclassified’

We calculated the standardized average expression of each MmEnriched signature across the murine intrinsic subtypes/classes (Fig. 4b) [14]. As in human tumors, the Str-MmEnriched signature was most highly expressed in Normal-likeEx and Claudin-lowEx; this common feature was anticipated given the high similarity of these two classes to their human subtype counterparts and their known enrichment for stroma-associated genes [14, 23]. The aMaSC-MmEnriched signature was most highly expressed in Class14Ex and to a slightly lesser extent in Wnt1-LateEx, Wnt1-EarlyEx, p53null-BasalEx, and Squamous-likeEx. The fMaSC-MmEnriched signature was most highly expressed in WapINT3Ex, which is consistent with the finding that Int3 (Notch4) inhibits mammary cell differentiation [34, 35]. The LumProg-MmEnriched signature was highest in PyMTEx and NeuEx. This finding was unexpected given that these two mouse classes have been shown to resemble luminal human tumors [13, 14]. Lastly, the MatureLum-MmEnriched signature was most highly expressed in Stat1Ex and Class14Ex. Both the Stat1 −/− and Pik3ca-H1047R mouse models, which define these two classes respectively, are often ER positive [36, 37], and these data suggest that they have MatureLum features. Class14Ex also exhibited significant expression of the aMaSC-MmEnriched signature, indicating that these tumors contain a mixture or share features of multiple cell types.

Consistent with Fig. 4b, 91 % of WapINT3Ex tumors were classified as having fMaSC features in a nearest centroid predictor analysis. Mouse luminal classes of breast carcinoma (Erbb2-likeEx, MycEx, PyMTEx, and NeuEx) were most similar to LumProg cells, which again was unexpected but consistent with previous findings [22, 38]. Wnt1-EarlyEx, p53null-BasalEx, and Squamous-likeEx tumors had primarily aMaSC features. Interestingly, Claudin-lowEx and to a lesser extent C3-TagEx tumors also had aMaSC features. All Stat1Ex tumors had MatureLum features, consistent with being ER positive [36].

LumProg and fMaSC features predict neoadjuvant chemotherapy response

Breast tumors respond heterogeneously to neoadjuvant chemotherapy treatment [15]. We hypothesized that cellular features of normal mammary subpopulations may identify tumors most likely to respond to neoadjuvant chemotherapy. To test this, we compiled a dataset of 702 neoadjuvant anthracycline and taxane chemotherapy-treated patients (Supplemental Table 2).

Although genes within each ‘enriched signature’ are highly correlated within their respective normal cell subpopulation, it does not necessarily follow that all genes within a given normal cell signature would be as coordinately regulated in tumors. Therefore, we subdivided each signature into smaller features (feature1, feature2, etc.) that are coordinately expressed in tumors, reasoning that such refined ‘features’ may be more clinically robust. All ‘enriched’ and refined ‘features’ were tested for their ability to predict pCR to neoadjuvant chemotherapy in a UVA (Supplemental Table 3). UVA significant signatures (p < 0.05) were then considered in a MVA with age, ER status, PR status, HER2 status, tumor stage, PAM50 subtype [39], and PAM50 proliferation score [39] to determine if any mammary subpopulation ‘features’ added novel information for predicting pCR (Supplemental Table 4).

Six normal mammary gene signatures were UVA and MVA significant (Supplemental Tables 3 and 4), with the UVA odds ratios and 95 % confidence intervals of these six signatures and all other ‘enriched signatures’ displayed in Fig. 5a. Interestingly, the LumProg-HsEnriched and LumProg-HsEnriched-feature1 signatures, which were highly correlated with each other (Fig. 5b), were significant in both the UVA and MVA, indicating that tumors with LumProg features are more likely to respond to neoadjuvant treatment. Importantly, this response was independent of proliferation, as highlighted by their low correlation to the PAM50-Proliferation gene signature (Fig. 5b).

Fig. 5

fMaSC-enriched gene signatures. a The univariate logistic regression odds ratio predicting pathologic complete response to neoadjuvant anthracycline and taxane chemotherapy was determined using a 702 patient dataset, with the 95 % confidence interval shown as a forest plot. A single ‘*’ indicates that the signature was univariate significant, while ‘***’ indicates that the signature was both univariate and multivariate significant (p < 0.05). b Pearson correlations of multivariate significant gene signatures and proliferation were determined. c The standardized average expression of the fMaSC-MmEnriched signature and its two refined signatures were calculated across three human datasets and displayed by intrinsic tumor subtype. d Genes in the fMaSC-MmEnriched-refined1 signature. e Genes in the fMaSC-MmEnriched-refined2 signature

Interestingly, the fMaSC-MmEnriched signature refined into two distinctly opposite, highly significant signatures in both the UVA and MVA (Supplemental Tables 3 and 4; Fig. 5b, c). While the fMaSC-MmEnriched signature was highest in basal-like tumors, the refined signatures varied, with fMaSC-MmEnriched-feature1 (Fig. 5d) being highest in basal-like tumors and fMaSC-MmEnriched-feature2 (Fig. 5e) being highest in luminal tumors. Tumors with fMaSC-MmEnriched-feature1 expression were more likely to respond to neoadjuvant chemotherapy, while those tumors with fMaSC-MmEnriched-feature2 were more resistant. The fMaSC-MmEnriched-feature1 signature was very highly correlated with the LumProg-HsEnriched signatures (Fig. 5b), sharing four genes in common (Fig. 5d). These results support the hypothesis that subsets of genes within the larger ‘enriched signature’ are likely regulated by different biological mechanisms.

Discussion

Normal mammary gland physiology is supported by an underlying, complex cell hierarchy [2–5]. The simplest model treats differentiation from mammary stem cells to progenitor cells to mature cells as unidirectional, but recent observations indicate that bidirectional processes are also possible for normal and neoplastic cells [11]. This differentiation plasticity may allow tumors to acquire cell features foreign to the initial cell-of-origin or to lose native features through the accumulation of specific genetic aberrations [40].

Regardless of how different cellular traits are acquired, it is critical to identify the ‘current’ normal cellular features within a tumor, and therefore, we first analyzed the expression profiles of normal human and mouse mammary epithelial cell subpopulations [19–23]. We chose to use nomenclature that maintains continuity with the literature. However, these terms should be considered provisional as the complete biological profiles of these FACS fractions are investigated [4]. Recent work by Prater et al. [41] found that mouse ‘LumProg’ cells (CD49f+, EpCAM+) have complete mammary gland repopulating potential, indicating that ‘LumProg’ may be a misnomer. Importantly, even if our understanding and naming of these cell subpopulations change, only the retrospective interpretation of the data presented here will be affected, not the data themselves.

Using a meta-analysis approach, FACS-purified mammary epithelial cell subpopulation ‘enriched’ gene signatures were derived and a nearest centroid predictor was developed to identify which normal mammary subpopulation each human and mouse tumor most represented using over three thousand human patients and 27 mouse models of mammary carcinoma [14]. While these analyses imply a cell-of-origin for a given tumor, additional experiments (e.g., lineage tracing) will be required to unequivocally determine this. Nevertheless, these associations at the very least identify which normal mammary subpopulation a given tumor most represents in its current state.

With this in mind, several associations between both the human and mouse intrinsic subtypes and specific normal cell subpopulations were observed. First, human basal-like tumors have been referred to as ‘undifferentiated’, which is consistent with their exhibiting LumProg [19] and fetal MaSC features [23]. Three mouse classes have been identified to be human basal-like counterparts: MycEx, p53null-BasalEx, and C3-TagEx [14]. MycEx tumors were the most similar to the LumProg cell profile. By contrast, both p53null-BasalEx and C3-TagEx tumors had adult MaSC features. These results indicate that MycEx tumors share similar cell features with their human basal-like counterpart, making the Myc model an attractive one for studying basal-like tumors with aberrant Myc signaling [10, 42]. Interestingly, neither p53null-BasalEx nor C3-TagEx tumors had strong LumProg features, indicating that their association with human basal-like tumors is more likely driven by their underlying genetics [10].

Human claudin-low tumors had heterogeneous normal cell features. While most were similar to LumProg cells, the claudin-low subtype also had the largest percentage of tumors classified as adult MaSC. Given that claudin-low tumors are enriched with epithelial-to-mesenchymal transition features [9, 43, 44], our results suggest that these tumors may originate from the LumProg population prior to acquiring adult MaSC and/or mesenchymal features. Similarly, mouse Claudin-lowEx tumors were also strongly associated with the adult MaSC population, indicating that such tumors may be the closest analogs of the subset of human claudin-low tumors with adult MaSC features.

Human HER2-enriched tumors were the most similar to the LumProg subpopulation. This is a novel finding and may explain why both human basal-like and HER2-enriched subtype tumors show high TP53 mutation frequencies (>70 %) and widespread chromosomal instability [10]. These data could suggest that the normal LumProg cell is somehow extremely dependent on TP53 function. The murine Erbb2-likeEx class has been identified as a mouse counterpart for human HER2-enriched tumors [14] and was shown here to also have LumProg features.

When analyzing the human luminal A and B subtypes, a clear association with normal MatureLum cells was observed. The murine NeuEx class is a proposed counterpart for human luminal A tumors [14], yet these mouse tumors were most similar to normal mouse LumProg cells. The MycEx class was also identified to resemble human luminal B tumors [14]. As discussed, MycEx tumors have LumProg features; therefore, most mouse luminal A/B tumor models do not share the same normal cell features as their human tumor counterparts. These differences may reflect limitations of model system design, as tumors within these mouse classes are primarily driven by either the WAP or MMTV promoter. These differences in cell features, however, indicate that the trans-species associations observed previously [14] are possibly driven by the genetics of each mouse model. Nevertheless, broad molecular features are conserved between these human–murine counterparts [14]. Therefore, we propose that these mouse models retain significant preclinical utility provided that shared versus distinct molecular features are taken into account.

Neoadjuvant chemotherapy is a common approach for treating breast tumors, but only a relatively low percentage of patients have a pCR (~20 % overall). We tested the clinical significance of normal cellular features for predicting pCR using a combination of UVA and MVA logistic regression analyses. Human LumProg and mouse fetal MaSC expression features were identified as predictive of pCR across all breast cancer patients. More specifically, LumProg-HsEnriched-feature1 and fMaSC-MmEnriched-feature1 were highly expressed in basal-like tumors. Because higher expression of these normal cell signatures was associated with a higher likelihood of pCR, this is consistent with the clinical observation that basal-like tumors have better neoadjuvant chemotherapy response rates. Distinct from these signatures, tumors with high expression of fMaSC-MmEnriched-feature2 were more resistant to neoadjuvant chemotherapy. Not surprisingly, this signature was most highly expressed in luminal A and B tumors, consistent with the clinical observation that these subtypes have lower chemotherapy response rates. Importantly, these signatures remained significant even after controlling for intrinsic subtype, proliferation, and clinical variables in the MVA; thus these normal cell signatures add information even when tumor subtype and clinical features are known. It is presently unknown whether tumors with these features arise from a LumProg or fetal MaSC cell-of-origin or acquire these features during tumorigenesis. Whether these features are acquired or inherent, the ‘current’ cellular traits of a tumor are likely most important as these appear to be a major determinant of chemotherapy sensitivity.

The biological explanation for why LumProg and fetal MaSC expression features predict tumor responsiveness to neoadjuvant chemotherapy will need to be explored further, but it is likely linked to the common genetic features of TP53 loss [45], RB-pathway loss [46], and high proliferation status [47], as well as other inherent characteristics of these cellular states. This work highlights the value of studying the normal mammary gland cell hierarchy and development to provide insights into human tumor therapy responsiveness.