Introduction

Intrinsic subtyping by PAM50 profiling identifies distinct categories of breast cancer that differ in their tumor characteristics and behavior, while relying on gene expression in a bulk tissue sample. However, individual cancers vary in their adherence to a single prototype, and some might show evidence of admixture with an alternate subtype due to intratumor heterogeneity. Such admixture could affect prognosis and treatment response.

In a previous study, we used expression data for genes included in the PAM50 panel to develop a new metric, Distance Ratio Criteria (DRC), based on the ratio of Mahalanobis distance of a Luminal A (LumA) case from its assigned centroid to the nearest alternate subtype centroid [1, 2]. We showed that this metric could subdivide LumA cases according to purity of the LumA signature and thus identify distinct clinicopathological, molecular, and survival features based on the degree of subtype admixture. We focused on LumA cancers because admixture of this most favorable subtype with any other subtype could be presumed to worsen prognosis.

Here, we significantly extend previous work using semi-supervised non-negative factorization (ssNMF) on whole transcriptome data from a merged METABRIC/TCGA cohort of LumA cases to compute the degree of resemblance of an individual case to each of the four major breast cancer subtypes. Apart from gaining resolution due to the analysis of many more genes, the expanded cohort provides greater statistical power and allows us to explore the attributes of LumA cancers according to their most likely alternate subtype.

Methods

Study populations

We merged two publicly available breast cancer cohorts—Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort and The Cancer Genome Atlas (TCGA) BRCA provisional cohort—downloaded from cBioportal on August 7, 2019 [3]. Data were available from 3,061 total cases, with gene expression measurements for 15,747 overlapping genes, including 1,178 cases assigned to LumA (n = 674 METABRIC and n = 505 TCGA), covering 11,379 genes. We applied data normalization procedures to merge the cohorts; key features of the cohorts and pre-processing steps are presented in Table S1.

Previously reported algorithms for intrinsic subtype calls were used to assign each case to one of five PAM50 subtypes (Luminal A, Luminal B, HER2, Basal, and Normal) [4]. Re-computed PAM50 classifications were identical to those recorded in the source datasets. We excluded Normal subtype cases from both cohorts and Claudin-low subtype from METABRIC.

Semi-supervised non-negative matrix factorization

Non-negative matrix factorization (NMF) is a strong candidate among mathematical techniques to model transcriptomic data as an admixture of underlying metagenes [5]. NMF results can be interpreted as proportions (which cannot be negative) of the underlying components, unlike other popular techniques, such as singular value decomposition. We extended basic NMF by solving an additional optimization problem of linear classification of a patient’s data into one of four PAM50 subtypes simultaneously with the original optimization to minimize the overall data reconstruction error. This dual optimization is called semi-supervised NMF (ssNMF) [6], with the following objective function:

$$\mathop {\min }\limits_{A, B,S} \left| {\left| {X - AS} \right|} \right|_{F}^{2} + \lambda \left| {\left| {L - BS} \right|} \right|_{F}^{2}\, s.t.\, A > 0, B > 0, S > 0,$$

where \(X \in {\mathbb{R}}^{m \times n}\) represents the matrix containing expression of \(m\) genes for \(n\) patients, while the metagenes and their mixing coefficients for factorization rank \(k\) are given by \(A \in {\mathbb{R}}^{m \times k}\) and \(S \in {\mathbb{R}}^{k \times n}\), respectively, for the data reconstruction error (first) term. Additionally, the one hot-encoded labels for the four PAM50 classes are denoted by matrix \(L \in {\mathbb{R}}^{4 \times n}\). The basis matrix for the PAM50 label reconstruction term is given by \(B \in {\mathbb{R}}^{4 \times k}\). \(F\) represents the Frobenius norm.

In the above-stated optimization problem, two hyperparameters need to be set—tradeoff between the two optimization objectives λ > 0, and the number of metagenes k. Optimal hyperparameter values for the combined cohort (and for TCGA and METABRIC separately) were obtained based on the accuracy of PAM50 classification using fivefold cross-validation.

We interpreted the label reconstruction estimate \(BS \in {\mathbb{R}}^{4 \times n}\) for the four subtypes as the proportion estimates for subtype admixture. We normalized each column such that all four components (designated pLumA, pLumB, pHER2, and pBasal) sum to one. Since our primary focus was to quantify subtype admixture in PAM50-assigned LumA cases, analyses were performed only on LumA cases. Thus, the proportion of LumA subtype was our primary purity metric for PAM50-assigned Luminal A cases. In the combined cohort, the range of proportions for each subtype was divided into 100 equally spaced intervals to plot histograms. Within the top quartile of each subtype, an exclusive (eQ4) subset was identified, comprising cases not in the top quartile for any other subtype.

Clinical feature, molecular characteristics, and survival analysis

To test the hypothesis that admixed LumA cases had more adverse characteristics than pure ones, we compared clinical and molecular features across quartiles by proportion of LumA transcriptome (pLumA) using two-tailed t tests or exact chi-square tests. Clinical variables included mean age at diagnosis, percentage with nodal involvement, tumor size > 20 mm, stage > I, and ER, PR, or HER2 positivity (by immunohistochemistry and/or FISH). Molecular variables evaluated included the PAM50 11-gene Proliferation Score [7], PAM50-based risk of recurrence score [8], Oncotype DX score [9], percent of cases high-risk by MammaPrint [10], and the prevalence of selected somatic mutations. PAM50, Oncotype DX, and MammaPrint scores were all computed from normalized gene expression data using published formulas.

We analyzed the same hypotheses in the individual cohorts, comparing tertiles instead of quartiles for added statistical power. For METABRIC, additional analyses were possible on tumor grade, HER2 copy number gain determined by HER2 SNP6 DNA microarray, proliferation status determined by AURKA expression, and IntClust assignments [11]. For TCGA, additional analyses were done on the number of clonal populations estimated by PyClone, accounting for variations in copy number and benign cell contamination [12].

Overall survival for the combined cohort was analyzed for PAM50-assigned LumA cases stratified by quartiles of pLumA using Kaplan–Meier curves as well as hazard ratios from unadjusted and adjusted Cox-proportional hazards models. Factors for adjustment included age, tumor stage, and size for combined cohort and TCGA. Similar analyses for METABRIC also adjusted for tumor grade. Associations of tumor characteristics and survival with DRC and Shannon entropy were computed to provide comparison to the results from ssNMF. Reported results are consistent with the REMARK guidelines for prognostic tumor marker studies [13].

Alternate subtype analysis

Additional analyses compared cases that were exclusively in the highest quartile for pLumA versus those exclusively in the highest quartile for one of the three alternate subtypes. The set eQ4-LumA, representing relatively pure LumA cancers, included 275 cases; eQ4-LumB, eQ4-HER2, and eQ4-Basal comprised 182, 157, and 219 cases, respectively.

Results

The four proportional distributions, one for each PAM50 class, for the combined TCGA/METABRIC cohort of 1,178 Luminal A cases are shown in Fig. 1. Most of the cases had high Luminal A composition with non-zero proportions for other subtypes. The highest quartile for each subtype proportion is highlighted, as is the portion of that quartile comprising cases not in Q4 for any other subtype and thus exclusive Q4.

Fig. 1
figure 1

Frequency distributions for proportion of each intrinsic subtype, for all Luminal A breast cancers in combined cohort (TCGA and METABRIC), based on ssNMF analysis of the whole transcriptome. Light plus dark blue-shaded area represents the highest quartile; dark blue areas represent the subset of cases that are exclusively in the highest quartile for each alternate subtype.

Clinical and molecular features in the combined cohort for pLumA

Table 1 compares clinical-pathological and molecular features of the assigned Luminal A cases according to quartile for pLumA. Compared to relatively “pure” Q4 cases, Q1 cases were on average two years older (P trend = 0.027), less likely to be PR positive, and more likely to be HER2 positive by IHC or FISH. In contrast, purity was not associated with an increase in ER positivity. However, more Q4 cases adhered to a surrogate definition of Luminal A subtype (ER + or PR + and HER2 − by IHC) compared to Q1 cases. Triple-negative cases (n = 108) were too sparse to permit meaningful conclusions. Q1 status was significantly associated with higher prevalence of lymph node involvement, higher stage, and larger tumor size. These most admixed cases had substantially higher PAM50 proliferation and recurrence scores and substantially worse scores for both Oncotype DX and MammaPrint gene panels. As pLumA increased, the prevalence of TP53 mutation decreased almost threefold, and mutation of PIK3CA and CBFB—two gene alterations associated with LumA subtype—increased significantly.

Table 1 Characteristics of Luminal A breast cancers in the combined cohort (TCGA, METABRIC), stratified by quartile of pLumA subtype purity based on transcriptome

Survival analysis in the combined cohort for pLumA

Figure 2 shows Kaplan–Meier plots of overall survival for Luminal A cases in the combined cohort stratified into quartiles by Luminal A proportions. There was a statistically significant difference (P < 2 × 10–6) between Q1 and Q4 cases identified by pLumA. Median overall survival times for Q4 versus Q1 were 232 and 139 months, respectively, and Q1 cases had an estimated 10-year survival probability of only 0.58 (95% CI 0.51–0.65), compared to 0.76 (95% CI 0.69–0.82) for Q4. Table 2 shows the hazard ratios for overall mortality in the combined cohort stratified into quartiles by Luminal A purity. In unadjusted models, the mortality risk for Q1 cases was more than double the risk for Q4 cases. In models adjusted for age, tumor stage, and size, hazard ratios were statistically significant but generally lower, as expected due to adjustment for some but not all mediating risk factors. Adjustment for cohort (TCGA versus METABRIC) did not change the hazard estimates.

Fig. 2
figure 2

Overall survival of Luminal A breast cancer cases in combined TCGA and METABRIC cohorts, stratified by quartile of transcriptome-based purity measured as pLumA. Q1 = red, Q2 = pink, Q3 = purple, and Q4 = blue

Table 2 Hazard ratios for overall survival from Cox-proportional hazards modeling, according to quartiles for Luminal A purity; Luminal A cases in the combined TCGA-METABRIC cohort

Shannon entropy and pLumA were essentially equivalent as measures of subtype purity, based on associations with clinical/molecular characteristics and survival (Table S2, Figure S1). However, as an overall metric of similarity in ssNMF proportions across subtypes, entropy cannot be used to discern which specific subtypes are indicated in the admixture. DRC had weaker associations with tumor characteristics and survival (Table S3, Figure S1).

Comparison of pure Luminal A versus cases with a specific alternate subtype

Table 3 presents the differences in clinical and molecular characteristics between pure LumA cases and those with a specific alternate subtype. Compared to pure LumA cases, those with predominant LumB admixture were on average 2.6 years older and less likely to be PR positive, with no significant difference in ER or HER2 status. Although these admixed cases showed only small, non-significant increases in node positivity and stage, they were significantly more likely to have tumor size greater than 20 mm and had higher scores for proliferation, recurrence, MammaPrint, and Oncotype DX gene expression. Table 3 also shows that LumB-admixed cases had a slightly higher prevalence of TP53 mutation and lower prevalence of mutated PIK3CA and CBFB.

Table 3 Comparison of purest Luminal A breast cancers in the combined cohort (TCGA, METABRIC) to Luminal A cases exclusively in the highest quartile for admixture with an alternate subtype

Cases with predominant HER2 admixture, compared to pure cases, were on average older by three years, less likely to be ER or PR positive, and nearly threefold more likely to be HER2 positive (Table 3). HER2 admixture, rather than Basal admixture as might be expected, accounted for nearly all triple-negative cases in the combined cohort. HER2-admixed cases were associated with higher stage and tumor size and higher scores for proliferation, recurrence, MammaPrint, and Oncotype DX gene expression. Finally, these cases were threefold more likely than pure ones to have TP53 mutations, while fewer had mutated PIK3CA.

In contrast to the other alternate subtypes, cases with predominant Basal admixture were on average 2.7 years younger than pure LumA, were more likely to be HER2 positive and showed no significant differences in ER or PR status as determined by immunohistochemistry (Table 3). Notably, fewer cases with Basal admixture adhered to a surrogate clinical definition of LumA subtype (ER+ or PR+ and HER2−) than pure Luminal A cases. Triple-negative cases were too rare to allow comparison. Basal-admixed cases were more likely to have positive nodes, higher stage, and larger tumor size. Basal-admixed cases had a lower mean proliferation score but higher mean PAM50 recurrence score. Basal admixture was associated with substantially higher Oncotype DX risk scores and higher likelihood of being in the high-risk category by both MammaPrint and Oncotype DX risk stratification. Finally, Basal-admixed cases had no significant difference in TP53 mutations, but substantially lower prevalence of PIK3CA and CBFB mutations. Figure 3 shows mean EGFR expression, a canonical marker for the basal phenotype, cross-classified by pBasal and pLumA quartiles. EGFR expression increased within each pBasal quartile regardless of pLumA level and was highest in the cases that were most admixed, that is, those where both pBasal and pLumA were high.

Fig. 3
figure 3

Mean expression of EGFR, a basal-associated gene, increases within each pLumA quartile as pBasal increases

Figure 4 shows Kaplan–Meier plots of overall survival for pure LumA, and cases with predominant LumB, HER2, and Basal admixture. There was a significant difference between pure LumA and LumB-admixed cases (P = 0.030) and a more pronounced difference when the admixture was with HER2 subtype (P < 0.001). However, there was essentially no difference in survival between pure LumA versus Basal-admixed cases (P = 0.515); notably, survival for the latter cases was actually equivalent or slightly better than pure LumA before crossing over to worse after 10 years. The median survival times for pure LumA, LumB-admixed, HER2-admixed, and Basal-admixed cases were 228, 169, 190, and 161 months, respectively. The corresponding 10-year survival probabilities were 0.72 (95% CI 0.65–0.79), 0.65 (95% CI 0.56–0.74), 0.66 (95% CI 0.58–0.82), and 0.64 (95% CI 0.53–0.74), respectively.

Fig. 4
figure 4

Comparison of overall survival of purest Luminal A breast cancer cases in combined TCGA and METABRIC cohorts, to Luminal A cases in the exclusive highest quartile for admixture with an alternate subtype

Table 4 shows the results of Cox-proportional hazards models for overall mortality of pure LumA cases versus those admixed with other subtypes. The hazard ratios for LumB-admixed, HER2-admixed, and Basal-admixed cases relative to pure LumA were 1.43, (P = 0.025), 1.68 (P = 0.001), and 1.19 (P = 0.424), respectively. With adjustment for age, tumor stage, and size, the hazard ratios for respective categories decreased to 1.13 (P = 0.491), 1.27 (P = 0.180), and 0.89 (P = 0.639).

Table 4 Hazard ratios for overall survival from Cox-proportional hazards modeling, comparing Luminal A cases exclusively in the highest quartile for purity (referent) versus Luminal A cases exclusively within the highest quartile for admixture with an alternate subtype; combined TCGA-METABRIC cohort

We repeated analyses in TCGA and METABRIC separately to check for consistency and evaluate variables unique to each cohort. The associations between pNMF and clinical characteristics were generally similar for the two cohorts, although trends were weaker in TCGA (Tables S4 and S5). Data available only in METABRIC showed that cases with low pLumA were substantially more likely to be high-grade and have HER2 copy number gain. In both cohorts, low pLumA was associated with significantly increased risk of mortality compared to more pure cases (Figure S2), with hazard ratios of 1.65 (95% CI 1.28–2.14, P < 0.001) and 1.91 (95% CI 1.08–3.63, P = 0.002) for METABRIC and TCGA, respectively.

In TCGA we observed no association between pLumA and the number of subclone populations estimated by PyClone, and no significant differences between pure and admixed cases when compared by closest alternate subtype (Figure S3A and B). In contrast, we observed predicted relationships between admixture and IntClust grouping in the METABRIC cohort; 97% of the purest LumA cases were classified within IntClust groups 3, 4, 7, and 8, which were previously associated with LumA tumors, whereas the more admixed cases had more diverse distribution (Fig. 5A), indicating that the transcriptomically admixed tumors had some molecular characteristics linked to alternate subtypes. Figure 5B shows the proportions within each Integrative Cluster, of METABRIC Luminal A cases belonging exclusively to the highest quartile for alternate subtypes. Cases with predominant HER2 admixture were enriched in groups 5 and 8, which are typically associated with the HER2 subtype, but no admixed Basal cases were assigned to IntClust 10, which is associated with triple-negative breast cancer.

Fig. 5
figure 5

A The distribution of METABRIC Luminal A cases across IntClust groups, comparing cases in the highest tertile for pLumA with those in the lowest tertile. B The proportions within each Integrative Cluster of METABRIC Luminal A cases belonging exclusively to the highest (purest) quartile for pLumA, pLumB, pBasal, and pHER2. Counts are shown in each vertical bar

Discussion

In this analysis, we demonstrate that semi-supervised non-negative matrix factorization enabled us to measure the degree of adherence of an individual breast cancer case to each of the four major intrinsic subtypes based on its whole transcriptomic profile, thus providing a novel way to evaluate the relationship of subtype purity to tumor characteristics and behavior. We found that cases assigned by PAM50 as Luminal A exhibited a wide range of adherence to LumA purity and that lower purity was strongly associated with numerous clinical and molecular features linked to worse prognosis. Indeed, cases within the lowest quartile of adherence to the LumA subtype had double the mortality rate of cases within the highest quartile. We also found that LumA cases whose closest alternative subtype was either LumB or HER2 had tumor features consistent with those subtypes and survival that was significantly lower than pure LumA cases. We note that survival for pure LumA versus the admixed cases did not diverge until at least three years of follow-up, suggesting that subtype admixture could explain the established observation that while hormone-positive breast cancer patients have better initial survival, a subset is more susceptible to late recurrence [8, 14].

Unexpectedly, LumA cancers with the highest resemblance to the Basal subtype did not consistently display features typically associated with Basal-like (or triple negative) breast cancers. These patients, as predicted, were younger than those with pure LumA and were more likely to have larger tumors and nodal metastasis. However, cases with high Basal admixture had the same likelihood of ER and PR positivity, suggesting that coexisting basal components have a luminal phenotype. Additionally, basal-admixed cases had lower proliferation compared to pure LumA, and overall survival was statistically indistinguishable. Paradoxically, the highest expression of EGFR, a marker classically associated with basal phenotype and poor prognosis, was observed in cases with the highest level of both Basal and Luminal A transcriptomic profile. These data suggest that cases that are predominantly Luminal A but have Basal-like admixture are quite distinct from typical Luminal A or Basal cases, based on both molecular and clinical characteristics. Furthermore, the discordance between various gene expression-based scores for risk of recurrence and patient survival, which was observed only for Basal-admixed cases, could indicate that the risk estimation is less accurate in this subgroup. Due to such counterintuitive findings, we speculated that these cases might represent admixture with the Luminal Androgen Receptor Subtype (LARS) of triple-negative breast cancer, which displays significant ER and PR expression, but we found no associations between LARS signature gene panels and degree of basal admixture, and only rare occurrence of triple-negative status [15, 16].

These results support the conclusion that bulk sampling of tumors for genomic analysis can provide an opportunity to expose intratumoral heterogeneity, such as intrinsic subtype admixture [17]. While our approach expands the notion of how substantial genomic diversity within LumA cases actually is, the scale at which this diversity manifests itself is not immediately clear. We can envision three possibilities: first that all cells within a tumor express the same admixed profile; second, that subtype adherence varies from cell to neighboring cell; or third, that multiclonality leads to larger clusters of cells expressing divergent profiles. The robustness of the PAM50 classifier for predicting clinical outcomes implies that each subtype represents a favorable genomic profile or pathway for subclonal expansion, thus favoring the third hypothesis. The true nature of this subtype admixture could be elucidated by emerging but relatively costly methods, such as single-cell RNAseq or high-dimensional spatial profiling [18].

Intratumor heterogeneity involving breast cancer subtypes has been indirectly implicated to explain differences in outcomes when PAM50 and IHC subtype classifications are discordant in a primary tumor [19], when subtypes are discordant between a synchronous primary tumor and metastases [20], or when comparing treatment response for HER2 cases with or without ER positivity [21]. Moreover, the ASCO/CAP criteria for subtype classification only require ER expression greater than 1% of cells or HER2 overexpression in greater than 10%; and for in situ hybridization-based assays, a count in 20 cells is sufficient with no percentage threshold having been established [22, 23]. Compared to tumors with higher levels of ER expression, tumors with low levels of ER positivity (1–10%) are more likely to be classified as basal-like and less likely to be responsive to endocrine therapy [24]. Similarly, HER2-positive cases with a smaller proportion of amplified cells are less responsive to HER2-targeted therapy [21]. Examples of more direct characterization of intratumor heterogeneity include the discovery of lumino-basal cells in ER-positive tumors, [25], discordant multiregional DNA sequencing [26], discordant IHC staining across tissue microarray cores or whole slides [27], single-cell HER2 FISH analysis, [28], and early results from high-dimensional spatial profiling [18, 29].

We previously reported that a simpler subtype admixture metric, based only on PAM50 genes, was also associated with divergent tumor characteristics and behavior among LumA cases [1]. Camp and coworkers used principal components analysis to derive quantitative metrics based on PAM50 gene expression and found that it could also uncover subgroups with survival and treatment response that was independent of assigned subtype [30]. However, the current method used the entire transcriptome to compute a continuous rather than categorical metric that demonstrates stronger associations and provides construct validity by showing that admixed cases have features resembling their closest alternate subtype. Other strengths of this study include the large number of genes with expression levels available for analysis, the large size of the combined cohorts, and the similarity of the results between the two cohorts. However, there is a need for further validation of our findings in additional independent populations.

In summary, we have developed a metric based on whole transcriptome data that can stratify LumA cancers based on subtype purity and thus provide information that is potentially predictive with respect to prognosis and treatment response, as exemplified by the recently reported results from a trial of trastuzumab deruxtecan in patients with “HER2-Low” and predominantly ER-positive breast cancer [31]. Extensions of this work could include examination of the metagenes resulting from NMF to discover pathways that are up- or down-regulated by subtype admixture, and identification of smaller gene sets for enhanced clinical prediction modeling. In addition, our method can be used to test the association of admixture with treatment response, for any assigned subtype.