Genetic similarity among tumors and metastases
Previously, we examined the genome-wide gene expression profiles of five primary breast tumor/matched metastatic pairs and noted an overall high degree of similarity within a pair. To further examine the degree of relatedness of breast tumors and their metastases, we performed correlation analysis using thousands of genes and hundreds of pre-defined gene expression signatures/modules, incorporating a large set of tumors and paired metastases. Intra-class correlation (ICC) values were determined between pairs of samples using multiple classification/grouping methods: (1) different pieces of the same primary tumor (“intrinsic pairs”), (2) tumors and their matched metastases [all metastases, or further separated into either lymph node (LN) or distant], (3) tumors and their matched metachronous metastases, (4) sets of synchronous metastases from the same patient, (5) tumors from different patients grouped by intrinsic subtype, and (6) metastases from different patients (Fig. 1a). On average, when using all expressed genes, there was high concordance between two pieces of the same primary tumor (ICC = 0.9 [0.89–0.91]), while pairs of tumors and their metastases exhibited lower concordance (0.82 [0.8–0.83]). As shown by the metachronously paired tumor–metastasis samples, gene expression did not change substantially over time. The autopsy patient data (0.72 [0.68–0.75]) suggest that normal organ RNA may be the variable most responsible for the decreased similarity between tumor and metastasis pairs. This hypothesis was supported by the increased ICC values of 20 matched pairs of laser-captured tumors and LN metastases (0.9 [0.85–0.94]).
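The ICC comparison described above can be illustrated with a minimal sketch. The text does not state which ICC formulation was used, so the one-way random-effects form, ICC(1,1), is assumed here: each gene is treated as a "subject" measured twice, once in each sample of a pair. The function name icc_oneway and the toy profiles are hypothetical and are not taken from the study.

```python
from statistics import mean


def icc_oneway(x, y):
    """One-way random-effects ICC(1,1) for two paired expression profiles.

    x, y: equal-length lists of expression values (one entry per gene)
    for the two samples of a pair. Each gene is treated as a "subject"
    measured k = 2 times. This ICC form is an assumption, not the
    documented method of the study.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length profiles with >= 2 genes")
    n, k = len(x), 2
    gene_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand = mean(gene_means)
    # Between-gene mean square
    msb = k * sum((m - grand) ** 2 for m in gene_means) / (n - 1)
    # Within-gene mean square (disagreement between the two samples)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, gene_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Toy example: a primary tumor and a slightly perturbed "metastasis"
tumor = [1.0, 2.0, 3.0, 4.0]
metastasis = [1.5, 1.5, 3.5, 3.5]
icc = icc_oneway(tumor, metastasis)
```

Under this formulation, identical profiles give an ICC of 1, and any within-pair disagreement (for example, contaminating normal organ RNA) lowers the value.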
Individual gene measurements can be fraught with “noise.” Thus, to further test the relationship between tumors and metastases, ICC values were calculated using a compendium of 298 different gene expression signatures/modules, where each module is a summary measure of tens to hundreds of genes. The overall ICC values were higher than those obtained with individual genes (thus showing greater robustness for gene signatures), and the breast tumor–metastasis pairs showed high conservation of pathways (Fig. 1b). The signatures with the most variability between tumors and matched metastases were associated with extracellular matrix (ECM) proteins. Expression of these genes may be microenvironment-induced or may reflect different amounts of fibroblasts in tumors as compared to metastases (Supplemental Table 1).
Association of subtypes and sites of metastasis
Since the majority of genes maintain their RNA expression levels when growing as either primary tumors in the breast or as metastases, we sought to determine if the different intrinsic subtypes showed a predilection for metastasis to specific organs using genomic data arising from primary tumors only. Therefore, we combined four public microarray datasets with Distance Weighted Discrimination, providing 855 tumors with documented first site of relapse (Supplemental Table 2) [15–18]. Principal components analysis found that the overall variation in gene expression was due to the biology of the tumors, and not to cohort/source or microarray platform (Supplemental Fig. 1). Status for ER, progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) was recorded for 852, 537, and 499 tumors, respectively, and of the 482 tumors with defined status for all three markers, 110 were triple negative (TN); Kaplan–Meier analyses for site of relapse with these markers are shown in Supplemental Fig. 2. For all sites of relapse, ER/PR negativity was associated with increased metastases, except for bone, in which both ER+ and ER− tumors recurred. Clinical HER2+ and TN status were associated with liver and brain/lung relapse, respectively.
Next, each tumor’s intrinsic subtype was calculated for this combined data set using the PAM50 and the claudin-low subtype predictors (Supplemental Table 3). Of the 855 tumors, 76 were identified as normal breast-like, and since this tumor classification is reflective of mostly normal breast tissue, these tumors/samples were excluded from further analyses, leaving a dataset of 779 tumors. Based on the site of first relapse data for liver, lung, brain, and bone, Kaplan–Meier plots were generated, and we determined that intrinsic subtype was correlated with site of relapse (Fig. 2, Supplemental Fig. 3). Compared to luminal A, basal-like and HER2-enriched tumors showed the highest hazard ratio (HR) of relapse to any site (basal-like vs. luminal A HR 2.1, P < 0.0001; HER2-enriched vs. luminal A HR 2.0, P < 0.0001), followed by luminal B (HR 1.69, P < 0.001) and claudin-low (HR 1.47, P = 0.051) tumors. Important findings included: (1) bone metastasis was the most common, regardless of subtype (Table 1), (2) brain relapse occurred most frequently in non-luminal samples, (3) liver relapse was associated with HER2-enriched tumors, and (4) lung relapse occurred often within the claudin-low and basal-like subtypes. In all analyses, luminal B tumors were more metastatic than luminal A tumors, thus providing a useful stratification within ER+ tumors.
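For readers unfamiliar with the Kaplan–Meier plots referred to above, the product-limit estimator behind them can be sketched as follows. This is a minimal standard-library illustration, not the software used in the study; the function name kaplan_meier and the toy follow-up data are hypothetical. An "event" here stands for first relapse to the site of interest, and all other patients are treated as censored.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up time per patient.
    events: 1 if relapse to the site of interest was observed at that
            time, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        relapsed = leaving = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            relapsed += data[i][1]
            leaving += 1
            i += 1
        if relapsed > 0:
            surv *= 1 - relapsed / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


# Toy example: four patients, one censored at time 2
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In a subtype comparison, one such curve would be computed per intrinsic subtype and per relapse site.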
Undifferentiated tumors and brain metastases
In 2009, Bos et al. utilized two human breast cancer cell lines, CN34 and variants of MDA-MB-231 (a claudin-low cell line), along with gene expression data from human breast tumors, to identify 17 genes whose expression correlated with brain relapse (the brain metastasis signature, BrMS). Given the clear associations observed for the intrinsic subtypes and sites of metastases, we hypothesized that the BrMS would correlate with basal-like and/or claudin-low subtypes. ANOVA from two different datasets supported this hypothesis (Fig. 3a, b). A lung metastasis signature (LMS) is also associated with intrinsic subtype (Fig. 3c, d).
Recently, a genomic method to quantify breast epithelial cell differentiation status, known as the Differentiation Score (DS) predictor, was developed. This predictor is based on the genomic signatures of FACS-purified populations of mammary stem cells, luminal progenitors, and mature luminal cells of the normal human breast. The scoring of the DS predictor is based on the premise that mammary stem cells are the least differentiated cells in the breast and that they give rise to luminal progenitors, which then produce mature luminal cells; for the DS, higher scores represent greater differentiation along this axis, which starts with the mammary stem cell signature and culminates in mature ER+ luminal cells. In this spectrum, claudin-low tumors are the least differentiated, followed by basal-like, HER2-enriched, and finally luminal B and A tumors. Since claudin-low and basal-like tumors were associated with brain relapse, we postulated that the more undifferentiated a tumor is on this axis, the more likely it would be to metastasize to the brain. To test this hypothesis, gene expression data from parental and organ-tropic (brain, lung, and bone) MDA-MB-231 cell lines were obtained from the Gene Expression Omnibus, and their DS were calculated and plotted on the DS axis (Fig. 4a). Shown on the same scale are the 779-tumor breast dataset (Fig. 4b), cancer cell lines of various tissue origins (NCI60) (Fig. 4c), and the MDA-MB-231 series [16, 20, 22] (Fig. 4d). Overall, claudin-low and luminal breast cancer cell lines show the same relative differences in differentiation status as are seen in primary tumors. Importantly, the MDA-MB-231 cells from the NCI60 and Massagué studies showed nearly identical DS, and the brain-tropic MDA-MB-231 cells were significantly less differentiated than the parental cell line.
To identify other features shared between low DS tumors and brain metastasis, we analyzed the NCI60 cell line series. Interestingly, DS were similar among claudin-low breast cancer, central nervous system (CNS), and melanoma cell lines; melanoma is a tumor type known to spread aggressively to the brain (Fig. 4c). To identify genes that mediate cerebral colonization, significance analysis of microarrays (SAM) was performed on the NCI60 data set by comparing these three cancer cell line types versus the rest. Two hundred sixty-five genes were identified as being highly expressed (FDR = 0%) in claudin-low, CNS, and melanoma cell lines; Ingenuity Systems Pathway Analysis found that “cellular movement” was the top biological function associated with these genes (Supplemental Fig. 4).
The triple-negative SUM149PT breast tumor-derived cell line contains two distinct populations of breast cancer cells, which can be separated by FACS to yield one population with basal-like features and another with claudin-low-like features and a lower DS. To test if lower DS correlates with increased migration, we sorted the SUM149PT cell line by FACS into CD49f+/Epcam−/low and CD49f+/high/Epcam+ subpopulations, performed Boyden chamber migration assays, and determined that the less differentiated (i.e., lower DS) SUM149PT CD49f+/Epcam−/low cells were significantly (P < 0.001) more migratory than the more differentiated Epcam+ population (Supplemental Fig. 5).
Differentiation Scores and metastasis
We next sought to better understand the information that DS provides for predicting site of metastasis. Since there is a range of differentiation within each intrinsic subtype (Fig. 4b), we tested if the least differentiated basal-like/claudin-low tumors were more metastatic than the more differentiated basal-like/claudin-low tumors. Kaplan–Meier analysis and log-rank tests determined that the least differentiated half of these tumor subtypes was associated with significantly more relapse to brain (P = 2E−03, log-rank test) and lung (P = 2.4E−02). The same approach applied within luminal and HER2-enriched tumors found no association of DS with bone or liver relapse; thus, this association appears specific for brain and lung relapses, although it should be noted that the least differentiated luminal and HER2-enriched tumors do not have low overall DS.
To visualize the information that DS and intrinsic subtypes provide for predicting site of metastasis, we plotted the DS of the 779 tumors versus the HR for each site of metastasis (Fig. 5a). The tumors were then ordered based on DS, and all genes (11,068) were hierarchically clustered (Fig. 5b). Interestingly, tumors with the lowest DS have a much higher HR for brain and lung metastases, and this risk drops off quickly as differentiation increases. Importantly, this analysis identified a subset of tumors within the largely ER− claudin-low and basal-like tumors that aggressively metastasize.
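The comparison of the least and most differentiated halves can be sketched with a two-group log-rank test. The code below is an illustrative standard-library implementation under stated assumptions: tumors are split at the midpoint of the DS ranking, the test has one degree of freedom, and the function names and toy data are hypothetical rather than taken from the study.

```python
import math


def split_least_differentiated_half(scores, times, events):
    """Split patients into the lower-DS half and the remaining patients."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(order) // 2

    def pick(idx):
        return [times[i] for i in idx], [events[i] for i in idx]

    return pick(order[:half]), pick(order[half:])


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p) with 1 degree of freedom."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy example: low-DS tumors relapse earlier than high-DS tumors
ds = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
time_to_relapse = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
relapsed = [1] * 10
(low_t, low_e), (high_t, high_e) = split_least_differentiated_half(
    ds, time_to_relapse, relapsed)
chi2, p = logrank_test(low_t, low_e, high_t, high_e)
```

The same routine would be run separately for each relapse site, with relapse to other sites treated as censoring.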
Stem cell signatures correlate with brain and lung metastases
Several studies have shown an association of stem cell characteristics and metastatic proclivity [25–27]. Therefore, the 855 tumor dataset was used to test if several previously published stem cell signatures contained within our set of 298 gene modules were associated with site of relapse. Univariate Cox proportional hazards models identified that many of the signatures with the strongest associations for brain (and lung) relapse were either expressed in normal brain and/or have been identified as essential components of embryonic stem cells and tumor initiating cells [26, 27] (Supplemental Table 4). Of the 13 embryonic stem cell signatures analyzed in Ben-Porath et al., all were significantly associated with relapse to brain/lung, 11 with LN metastasis, 10 with liver, and 5 with bone. Nearly all the signatures that predicted brain relapse correlated with low DS, and those not strongly correlated with DS were correlated with proliferation. Some of these signatures further identified subsets of basal-like and claudin-low tumors most likely to metastasize to the brain (log-rank test: PRC2_targets, P = 0.0090; MM_WapINT3, P = 0.0001). Thus, ES cell signatures, DS, and proliferation appear to be strong predictors of CNS and lung metastases, and in general, the signatures most predominant for brain/lung relapse were weakly expressed in tumors that spread to the bone.
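A univariate Cox proportional hazards fit of the kind referred to above can be sketched for a single signature score. This is a minimal Newton–Raphson implementation that assumes Breslow handling of tied event times; it is not the software used in the study, and the function name cox_univariate and the toy data are hypothetical.

```python
import math


def cox_univariate(times, events, x, n_iter=25, tol=1e-9):
    """Fit a one-covariate Cox model by Newton-Raphson (Breslow ties).

    times:  follow-up time per patient.
    events: 1 if relapse was observed, 0 if censored.
    x:      signature score (or 0/1 indicator) per patient.
    Returns (beta, hazard_ratio, standard_error_of_beta).
    """
    beta, hess = 0.0, 0.0
    for _ in range(n_iter):
        grad = hess = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue
            # Risk set: everyone still under follow-up at t_i
            risk = [(math.exp(beta * xj), xj)
                    for tj, xj in zip(times, x) if tj >= t_i]
            s0 = sum(w for w, _ in risk)
            s1 = sum(w * xj for w, xj in risk)
            s2 = sum(w * xj * xj for w, xj in risk)
            grad += x_i - s1 / s0                 # score contribution
            hess += s2 / s0 - (s1 / s0) ** 2      # information contribution
        if hess == 0:
            break  # covariate is constant in every risk set
        step = grad / hess
        beta += step
        if abs(step) < tol:
            break
    se = 1 / math.sqrt(hess) if hess > 0 else float("inf")
    return beta, math.exp(beta), se


# Toy example: signature-high tumors (x = 1) tend to relapse earlier
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 0, 0]
signature_high = [1, 0, 1, 1, 0, 1, 0, 0]
beta, hazard_ratio, se = cox_univariate(times, events, signature_high)
```

A hazard ratio above 1 indicates that higher signature expression is associated with earlier relapse. The sketch does not guard against perfectly separated data, for which the estimate does not converge.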
Univariate and multivariable survival analyses
The ability to predict the presence and/or location of a tumor recurrence could influence the location and frequency of radiographic surveillance for patients with a history of breast cancer. Therefore, we sought to identify the most informative signature, or combination of signatures, that predicts metastasis to specific sites. First, we performed univariate survival analyses for multiple signatures, including the many described above and our previously published VEGF/hypoxia signature. As shown in Table 2A, all signatures tested were highly prognostic overall and, interestingly, both BrMS and LMS signatures predicted lung and brain relapse, providing evidence that metastases to these two organs utilize similar genetic mechanisms. Second, we performed multivariable analysis (MVA) using the backward stepwise procedure and observed that subtype information (i.e., subtype calls or risk of relapse categories based on subtype [ROR-S]) was selected in each evaluation (Table 2B). For liver relapse specifically, knowing the subtype call instead of the ROR-S risk category was found to be particularly informative; indeed, the risk of liver relapse for the HER2-enriched subtype was 4.0 times higher than for the luminal A subtype, even though HER2 status (as determined by gene expression) was also included in the model. In addition to intrinsic subtype information, other signatures were found to be statistically significant in the various final MVA models, such as the upregulated genes of the BrMS in brain relapse, or the VEGF/hypoxia signature and the downregulated genes of the LMS in lung relapse. Interestingly, the BrMS and VEGF/hypoxia signature were found to be highly correlated with DS (Pearson = −0.68), and correspondingly, the BrMS, DS, and VEGF/hypoxia signature identify a subset of basal-like/claudin-low tumors that spread to the brain (P < 0.05).
Thus, when each metastatic site is individually examined, a unique combination of signatures is chosen that includes intrinsic subtype (individual subtype or ROR-S) as well as another signature or two, ultimately resulting in the optimal set of variables for predicting relapse to that organ.