There is growing evidence that interaction of stromal and immune cells with normal or malignant epithelial cells is pivotal for the development and progression of cancer. Several reports indicate that tumor-infiltrating leucocytes may represent an essential pathophysiological factor in the development and progression of breast cancer [1–3]. Their prognostic impact, however, remains unclear. Lymphocyte infiltration (LI) is often seen in breast cancer and has been suggested as a marker of host antitumor immune response, but its importance in terms of pathophysiology and prognosis or treatment prediction remains controversial. The presence of B cells is already seen in premalignant breast tumors [4], while T-cell infiltration is associated only with high-grade ductal carcinoma in situ and invasive carcinoma [5] and has been reported to range from 1% to 45% of the total cellular mass [6].

In rapidly proliferating tumors LI has been shown to be a good prognostic indicator, correlating with lymph-node negativity, smaller tumor size, and lower grade [7]. Similarly, Ménard and colleagues have shown that lymphocyte infiltration of breast cancer had a strong positive prognostic value in patients younger than 40 years; no association was seen among patients 40 years or older, however, suggesting a correlation with estrogen receptor (ER) status or specific breast cancer subtypes [8]. A positive correlation of human epidermal growth factor receptor 2 (HER2) amplification/overexpression, LI and expression of lymphocyte-associated genes has been described that was associated with a more favorable outcome [9]. Only a small fraction of tumor-associated lymphocytes display activation markers, however, and there is no definitive proof of cytotoxic activity of these cells against the tumor in vivo [10]. In this context the expression of specific oncoproteins such as HER2 or p53 is supposed to be immunogenic [11].

The search for prognostic or predictive signatures using microarray analysis in bulk breast cancer specimens reveals several genes that are associated with immune cells; for example, interferon-regulated genes [12, 13], B-lymphocyte marker [12], as well as T-lymphocyte-associated genes [13]. In this context, whether these observations are due to an imbalance of host-associated markers and tumor tissue or due to a real biological phenomenon remains unclear. Data from gene expression profiling of breast cancer cell lines showed that a considerable number of immune-response-related genes exhibit significant variable expression across the basal cell subtype [14, 15], suggesting that immune response genes might play a crucial role even in the absence of host cells.

Most recently, Finak and colleagues identified a good-outcome cluster from gene expression profiles of tumor stroma that was isolated by laser-capture microdissection. This cluster contained 22 different genes 'enriched for elements of the T helper type 1 (TH1) immune response', of which the authors verified selected markers by immunohistochemistry [16].

Overall, the impact of monocytes, B lymphocytes and T lymphocytes on prognosis is still a matter of debate. The purpose of our study was therefore to accurately identify different clusters of immune-cell-associated genes in bulk breast cancer samples by a large-scale analysis of microarray datasets, and to precisely analyze the correlation between the resulting metagenes and specific breast cancer subtypes. Finally, we evaluated the prognostic impact of these metagenes in defined breast cancer subgroups.

Materials and methods

Microarray data

A database of 1,781 primary invasive breast cancers including all samples from 12 Affymetrix HG-U133 microarray datasets was established: Frankfurt [17, 18] (Additional data file 1), Uppsala [19], Oxford – Untreated [20], Stockholm [21], New York [22], London [23], Rotterdam [24, 25], Oxford – Tamoxifen and Villejuif [26], Expression Project for Oncology [27], Frankfurt-2 [28] (Additional data file 1), and MDA133 [29]. Characteristics of the individual datasets are presented in Additional data file 1. Follow-up information was available for 1,263 patients. The median follow-up time was 79 months. Seventy-two percent of all samples and 74% of those samples with follow-up were ER-positive.

Only Affymetrix HG-U133A microarrays were included for full comparability of all the probes on the arrays. Data were downloaded from the Gene Expression Omnibus website [30]. Affymetrix expression data for different immunological cell types and tissues were obtained from Su and colleagues [GEO:GSE1133] [31]. Affymetrix expression data were analyzed using the MAS5.0 algorithm [32] of the affy package [33] from the Bioconductor software project [34]. Expression data were log2-transformed and were normalized across each individual array by a scaling factor S so that the magnitude (sum of the squares of the values) equals one.
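The per-array normalization described above can be illustrated with a minimal sketch. It assumes strictly positive MAS5.0 signal values and uses only the Python standard library; the function name `normalize_array` and the toy signals are hypothetical and are not taken from the original analysis, which was carried out in R/Bioconductor.

```python
import math

def normalize_array(signals):
    """Log2-transform one array's signals and scale them to unit magnitude.

    Assumes all signals are > 0 (hypothetical helper, not the authors' code).
    """
    logged = [math.log2(v) for v in signals]
    # Scaling factor S: chosen so that the sum of squares of the scaled values is 1.
    s = 1.0 / math.sqrt(sum(x * x for x in logged))
    return [x * s for x in logged]

# Toy array with three ProbeSet signals (illustrative values only).
example = normalize_array([128.0, 512.0, 2048.0])
magnitude = sum(x * x for x in example)
```

Because every value on an array is multiplied by the same factor, the relative ordering of ProbeSets within an array is unchanged; only the overall scale is made comparable between arrays.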

Metagenes for feature reduction

A high feature-to-sample ratio is one of the most important problems in microarray research leading to an inflation of α values [35]. Unsupervised clustering was applied for feature reduction based on the assumption that the expression of a large number of genes is highly interdependent. This can be attributed to the expression of sets of genes in different cell types in the sample and to differentiation programs/pathways associated with specific expression profiles. Genes that did not show a correlation with other genes above a certain threshold (0.7) were suspected to represent noise, and were discarded from further analysis. To identify metagenes for the principal vectors, we selected those clusters that contained at least 10 elements and a minimal average correlation of 0.7 – resulting in 199 total ProbeSets. Metagene expression values were determined by calculating the mean of the normalized expression values of all ProbeSets in the respective cluster.
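The selection criteria and the metagene calculation can be sketched as follows. This is a simplified illustration, assuming that cluster membership has already been obtained from the hierarchical clustering and that "average correlation" means the mean pairwise Pearson correlation; the helper names and toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_pairwise_correlation(cluster):
    """Average Pearson correlation over all ProbeSet pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def metagene(cluster, min_size=10, min_corr=0.7):
    """Per-sample mean of a cluster, or None if the cluster fails the criteria."""
    if len(cluster) < min_size or mean_pairwise_correlation(cluster) < min_corr:
        return None
    n_samples = len(cluster[0])
    return [sum(p[i] for p in cluster) / len(cluster) for i in range(n_samples)]

# Toy cluster: 3 ProbeSets x 4 samples (min_size lowered for illustration only).
toy_cluster = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.5, 2.5, 3.5, 4.5],
]
toy_metagene = metagene(toy_cluster, min_size=3)
rejected = metagene(toy_cluster)  # fewer than 10 ProbeSets, so None
```

Averaging many highly correlated ProbeSets reduces the influence of noise in any single ProbeSet, which is the rationale for using the metagene rather than individual genes.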

Assessment of ER, HER2, proliferative status and tumors with stem-cell-like characteristics of the samples

To allow comparative analysis between different datasets and since standard pathology for ER and HER2 was not available for all samples, the receptor status was determined based on Affymetrix expression data as previously described [36]. A stem-cell-like (SCL) metagene was used as described previously [37, 38]. This metagene was derived from 159 highly correlated Affymetrix ProbeSets and contains 35 out of 37 (95%) previously reported markers of SCL breast cancers, undifferentiated breast cancers and basal-like breast cancers [39–43].

Immunohistochemistry

To validate the presence of lymphocytes in those samples that show a high expression of the respective metagenes, we performed immunohistochemistry using specific antibodies. CD3 (clone F7.2.38; Dianova, Hamburg, Germany) and CD20 (clone B-Ly1; Dianova) were used as markers for T lymphocytes and B lymphocytes, respectively.

Tissue samples of primary invasive breast cancer cases from the University of Frankfurt were obtained with informed consent and approval of the institutional review board of the University of Frankfurt. Briefly, paraffin sections (2 μm) were mounted on Superfrost Plus slides, dewaxed in xylene and rehydrated through graduated ethanol to water. Antigens were retrieved by microwaving sections in 10 mM citrate buffer (pH 6.0) for 20 minutes at 800 W. Blocking was performed using antibody dilution buffer (DCS Diagnostics, Hamburg, Germany) at room temperature for 15 minutes. Antibodies were subsequently diluted 1:100 individually in this buffer. Sections were incubated with antibodies for 1 hour at room temperature.

For negative controls, the primary antibodies were replaced with PBS. For secondary antibody incubations and detection, the Dako REAL Detection System Alkaline Phosphatase/RED (Dako, Glostrup, Denmark) was used following the protocol of the supplier and sections were counterstained with Mayer's hematoxylin. Samples from the Frankfurt dataset were ranked according to visual inspection of the amount of stained lymphocytes with the respective antibody in a blinded analysis. The rank order was subsequently compared with that based on the metagene expression using Spearman rank correlation.
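The comparison of the two rank orders can be illustrated with a small sketch of the Spearman rank correlation, computed as the Pearson correlation of average ranks. The data below are invented for illustration and do not correspond to the Frankfurt samples; the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Invented example: visual staining rank versus metagene expression.
ihc_rank = [1, 2, 3, 4, 5]
metagene_expr = [0.1, 0.3, 0.2, 0.8, 0.9]
rho = spearman_rho(ihc_rank, metagene_expr)
```

Because only ranks enter the calculation, the result does not depend on the scale of the visual score or of the metagene expression values.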

Statistical analyses

All reported P values are two sided and P < 0.05 was considered to indicate a significant result. Subjects with missing values were excluded from the analyses. Fisher's exact test was applied for associations between categorical parameters. Spearman rank correlation was used to compare metagene expression and results from immunohistochemistry. The Kruskal–Wallis H test was used to analyze the relationship of the expression of immune metagenes and pathological lymphocyte infiltration scores from the independent validation dataset from London.

Survival intervals were measured from the time of surgery to the time of death from disease or of the first clinical or radiographic evidence of disease recurrence. Data for women in whom the envisaged end point was not reached were censored as of the last follow-up date or at 120 months. We constructed Kaplan–Meier curves and used the log-rank test to determine the univariate significance of the variables. Hazard ratios were determined by Cox regression.
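As an illustration of the censoring rule and the Kaplan–Meier estimate, consider the following minimal sketch. The follow-up times and event indicators are invented, and the function names are hypothetical; the actual analyses were performed in SPSS and R.

```python
def censor_at(times, events, horizon=120.0):
    """Censor follow-up beyond `horizon` months (event indicator set to 0)."""
    out_t, out_e = [], []
    for t, e in zip(times, events):
        if t > horizon:
            out_t.append(horizon)
            out_e.append(0)
        else:
            out_t.append(t)
            out_e.append(e)
    return out_t, out_e

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve = []
    surv = 1.0
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        died = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - died / at_risk
        curve.append((t, surv))
    return curve

# Invented follow-up in months; 1 = recurrence or death from disease, 0 = censored.
months = [10, 30, 30, 60, 130, 150]
status = [1, 1, 0, 1, 1, 0]
c_months, c_status = censor_at(months, status)
curve = kaplan_meier(c_months, c_status)
```

In this toy example the event observed at 130 months is not counted, because follow-up is truncated at 120 months before the estimate is computed.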

To examine simultaneously the effects of multiple standard parameters and lymphocyte-specific kinase (LCK) metagene expression on survival, a Cox proportional-hazards regression model was applied among ER-negative samples. The effect of each variable was assessed with the use of the Wald test and described by the hazard ratio with a 95% confidence interval. The model included binary variables for lymph node status (lymph node-negative or N1), histological grading (G1 or G2 vs. G3), age (≤ 50 years vs. > 50 years), tumor size (≤ 2 cm vs. > 2 cm), and HER2 status (by microarray [36]). All analyses were performed using SPSS 15.0 (SPSS Inc., Chicago, IL, USA) and R 2.6.2 software [44].

Results

Unsupervised hierarchical clustering of genes in individual datasets as well as combined datasets revealed a large cluster of genes with functions in immune cells. This cluster of approximately 600 Affymetrix ProbeSets was consistently obtained in all datasets with overall correlations of 0.2 to 0.3. We hypothesized that the observed coordinated expression of subsets of these genes might represent surrogate markers for the amounts of different types of immune cells in the analyzed samples. In addition, coordinated expression might result from the induction of signaling pathways and specific differentiation programs in the tumor cells themselves and/or accompanying stromal tissue.

The expression of 569 Affymetrix ProbeSets from the immune-related cluster was analyzed in a combined cohort of 1,230 samples to tease out relationships of these genes (see Additional data files 2, 3 and 4). To identify metagenes for the principal expression vectors we selected those clusters that contained at least 10 elements and a minimal average correlation of 0.7, resulting in 199 total ProbeSets as shown in Figure 1a. Seven metagenes were derived as mean values of all ProbeSets in the respective clusters (Figure 1b). The functional annotation of the immune-system-related metagene clusters is presented in Table 1 (a detailed list of all 199 ProbeSets is given in Additional data file 5).

Table 1 Functional annotation of the immune-system-related metagene clusters
Figure 1

Identification of immune-system-related metagenes. (a) To identify metagenes for the principal expression vectors we selected those gene clusters that encompassed at least 10 elements and displayed a minimal average correlation of 0.7 from the larger data matrix of 569 ProbeSets (see Additional data file 3). Expression of these selected 199 ProbeSets among the 1,230 breast cancer samples is shown. HCK, hemopoietic cell kinase; LCK, lymphocyte-specific kinase; MHC, major histocompatibility complex; STAT1, signal transducer and activator of transcription 1. (b) Seven metagenes were derived as mean values of all 199 ProbeSets from the seven clusters.

Expression of the metagene clusters in different immunological cell types

To check the biological plausibility of the identified metagenes as markers for cell types and/or an immunological state, we analyzed their expression in different types of immune system tissues and cell types (Figure 2). As expected, the IgG metagene cluster seemed to be specific for B cells and those tissues containing high amounts of these cells (tonsils, lymph nodes, bone marrow). The hemopoietic cell kinase metagene cluster displayed highest expression in peripheral blood CD14 monocytes and bone-marrow-derived CD33 myeloid cells, in line with the well-known function of the hck gene in this lineage. In contrast it is important to note that T cells of both the CD4 and CD8 types are devoid of the expression of this metagene (while some lower levels of expression are detected in the B-cell lineage). Inversely to hemopoietic cell kinase, the LCK metagene is expressed only in T cells but no expression is observed in monocytes and the myeloid lineage. The MHC-II metagene is only expressed by antigen-presenting cells but not in T cells, while high expression of the MHC-I metagene is observed in all cell types as expected. The differences in the interferon and signal transducer and activator of transcription 1 (STAT1) metagenes are smaller than those observed between different tumor samples, which might suggest considerable expression of those interferon-induced genes by the carcinoma and/or stromal cells of the tumor.

Figure 2

Expression of the metagene clusters in immunological cell types. (a) The 199 ProbeSets from Figure 1a were used to cluster 44 samples of isolated cells and tissues with immune-system-related functions that were profiled on Affymetrix U133A arrays by Su and colleagues [GEO:GSE1133] [31]. In each case, two samples for the following cell/tissue types are presented from left to right: fetal liver (1,2), K-562 (3,4), whole blood (5,6), CD33 myeloid (7,8), CD14 monocytes (9,10), CD34 (11,12), B lymphoblasts (13,14), CD56 natural killer cells (15,16), CD4 T cells (17,18), CD8 T cells (19,20), MOLT-4 (21,22), Raji (23,24), HL-60 (25,26), Daudi (27,28), CD105 (29,30), CD71 (31,32), BDCA4 dendritic cells (33,34), CD19 B cells (35,36), thymus (37,38), tonsil (39,40), lymph node (41,42), bone marrow (43,44). Details about the respective samples are given in Additional data file 10. HCK, hemopoietic cell kinase; LCK, lymphocyte-specific kinase; MHC, major histocompatibility complex; STAT1, signal transducer and activator of transcription 1. (b) Representation of the seven metagenes that were derived from the 199 ProbeSets as in Figure 1b.

Relationship of expression of immune-system-related metagenes with ER status, HER2 status and presence of stem-cell-like markers and lymphocyte infiltration

To analyze the relationship of the immune system metagenes and standard parameters, unsupervised clustering analysis of all 1,781 samples using the immune-related metagenes, ER, HER2 as well as a metagene of SCL markers was performed (Additional data file 6, Supplemental figure S2a). The results suggested that considerable amounts of immune cells are present among all different subtypes of tumors. Similar results were obtained in the analysis of scatter plots comparing five metagenes representing the major clusters (LCK, IgG, MHC-II, interferon, STAT1) and ER and HER2 status (Additional data file 6, Supplemental figure S2b). The scatter of LCK and IgG (as well as MHC-II) metagenes showed a correlation (R² = 0.52 and R² = 0.62, respectively) that was not observed between the interferon and IgG metagenes (R² = 0.07). This could suggest a parallel infiltration by both T cells and B cells into those tumors that are characterized by high expression of both metagenes. On the other hand, the interferon and STAT1 metagenes are also correlated (R² = 0.52), which might represent an interferon response of tumor cells or other cell types in the respective samples. In general, no clear relationship with ER and HER2 status was seen in these scatter plots. ER-negative tumors, however, display a somewhat higher expression of the IgG and STAT1 metagenes.

To verify the actual presence of lymphocytes in those samples that show a high expression of the respective metagenes we performed immunohistochemistry using specific antibodies. CD3 and CD20 were used as markers for T lymphocytes and B lymphocytes, respectively. Using 10 samples from our own dataset we observed Spearman rank correlations of 0.79 (P = 0.006) between CD3 and the LCK metagene and of 0.64 (P = 0.048) between CD20 and the IgG metagene, respectively. Figure 3a presents a sample with high expression of both the LCK and IgG metagenes that was characterized by high numbers of T cells and B cells in the sample. For an independent validation of the results we used a dataset from London (Desmedt and colleagues' dataset GUYU [26]). For 35 samples of this dataset, pathological information on lymphocytic infiltration was available. As shown in Figure 3b, a higher expression of all metagenes was detected in those samples with higher scoring for lymphocytic infiltration.

Figure 3

Verification of microarray results by histological examination. (a) Example of the verification of lymphocytic infiltration by immunohistochemistry (Frankfurt dataset). Consecutive sections of a tumor sample with high expression of both IgG and lymphocyte-specific kinase (LCK) metagenes stained with antibodies against either CD20 or CD3 to detect B lymphocytes and T lymphocytes, respectively. (b) Validation of the correlation of immune-system-related metagenes and lymphocytic infiltration in independent data. Expression of different metagenes compared with pathological information on lymphocytic infiltration (LI score) from the London dataset (Desmedt and colleagues [26], n = 35). P values determined using the Kruskal–Wallis H test. STAT1, signal transducer and activator of transcription 1.

Prognostic value of immune-system-related metagenes in subgroups of breast cancer patients

There are somewhat differing data in the literature on the frequency of lymphocyte infiltration as indicated by pathological analysis. While earlier studies reported frequencies of 20% (n = 382) [45] up to 45% (n = 78) [46], more recent studies have reported proportions of 16% to 17% (n = 1,919) [8] and 24% (n = 675) [47]. Specific detection of B-lymphocyte infiltration has been reported for 20% of invasive breast carcinomas [4, 48]. Bearing these data in mind we used the upper quartile (25%) of the samples with highest expression of the respective metagenes to define a cutoff point for sample stratification. In addition, verification of the robustness by applying simple median splits of the cohorts led to similar results (data not shown).
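The upper-quartile stratification can be sketched as follows. This is an illustration only: the way ties and non-divisible sample counts are handled here (rounding down, arbitrary order among ties) is an assumption, the expression values are invented, and the function name is hypothetical.

```python
def upper_quartile_split(values):
    """Flag the 25% of samples with the highest metagene expression.

    Assumption: the group size is len(values) // 4 (rounded down).
    """
    n_high = len(values) // 4
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    high = set(order[:n_high])
    return [i in high for i in range(len(values))]

# Invented metagene expression values for eight samples.
expression = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3, 0.5, 0.6]
is_high = upper_quartile_split(expression)
```

The resulting flags define the two groups ("high" versus all other samples) that are then compared in the survival analyses; a median split would use `len(values) // 2` instead.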

Follow-up information was available for 1,263 out of the 1,781 samples; 929 of these samples were ER-positive and 334 samples were ER-negative. We used multivariate Cox regression among these 1,263 patients to analyze whether the seven metagenes provide prognostic information independently of one another (Additional data file 7). Only for the LCK metagene was a significant result obtained in this analysis (hazard ratio = 0.6, 95% confidence interval = 0.39 to 0.89; P = 0.013), while merely a trend was observed for MHC-I (P = 0.11) and STAT1 (P = 0.13) metagenes. To identify those samples where expression of the LCK metagene has the largest impact on prognosis we performed Kaplan–Meier analyses of disease-free survival in different tumor subgroups stratified by ER, HER2, and SCL status. As shown in Figure 4a, the LCK metagene had the highest prognostic value among the 334 ER-negative samples with a univariate hazard ratio of 1.81 (95% confidence interval = 1.22 to 2.71, P = 0.003). This high prognostic value was observed in all ER-negative samples independently of their expression of SCL markers (Figure 4b) or their HER2 status (Figure 4c). In addition, a high prognostic value of LCK metagene expression was also found among those 86 ER-positive samples that were also HER2-positive (ER+/HER2+, P = 0.038) (Figure 4d).

Figure 4

Prognostic value of the lymphocyte-specific kinase metagene in subgroups of breast cancer patients. Samples of the combined dataset were stratified according to the highest quartile of expression of the lymphocyte-specific kinase (LCK) metagene. Kaplan–Meier analyses of disease-free survival were performed in different tumor subgroups according to estrogen receptor (ER), human epidermal growth factor receptor 2 (HER2), and stem-cell like (SCL) status. (a) The LCK metagene had a highly significant prognostic value among ER-negative samples. This high prognostic value was observed in all ER-negative samples independently of (b) their expression of SCL markers or (c) their HER2 status. (d) In addition, a high prognostic value of LCK metagene expression was also found in ER-positive HER2-positive samples.

To demonstrate that the LCK metagene was an independent prognostic factor and not a surrogate marker for other factors, we performed a multivariate Cox regression analysis including all clinical variables. These parameters included lymph node status, age, pathohistological grading, tumor size, HER2 status, as well as expression of the LCK metagene. For 124 out of the 334 ER-negative samples and for 37 out of 86 of the ER+/HER2+ samples, respectively, all of the parameters were available. As presented in Tables 2 and 3 regarding the ER-negative and the ER+/HER2+ samples, respectively, only the LCK metagene remained a significant factor for disease-free survival in this analysis, with a hazard ratio of 2.16 (95% confidence interval = 1.15 to 4.03, P = 0.017) and 4.17 (95% confidence interval = 1.38 to 12.6, P = 0.011).

Table 2 Multivariate Cox regression of lymphocyte-specific kinase metagene and standard parameters among estrogen receptor-negative tumors
Table 3 Multivariate Cox regression of lymphocyte-specific kinase metagene and standard parameters among estrogen receptor-positive HER2-positive tumors

Discussion

The impact of host factors such as immune cells, stromal environment and chemokines on the development and maintenance of breast cancer has frequently been hypothesized, but still remains a matter of debate (reviewed by Dranoff [49]). In the present study we identified seven clusters of immune-system-related metagenes by large-scale microarray analysis and showed an association with different immunological cell types. The redundant information from several highly correlated ProbeSets allows the construction of robust metagenes that can be used as surrogate markers for the amount of different immune cell types in the breast cancer samples. The relationship of these immunological metagenes with other parameters of the tumors in the combined datasets seems to be complex since no simple associations were found.

High expression of the LCK metagene predicted for better disease-free survival among all subgroups of ER-negative tumors and outperformed all standard parameters in multivariate analysis. Moreover, a positive prognostic value of LCK metagene expression was also observed for those ER-positive tumors with HER2 overexpression. Our results are supported by several other recent studies. Ménard and colleagues have shown that lymphocyte infiltration of breast cancer had a strong positive prognostic value in patients younger than 40 years; however, no association in patients 40 years or older was shown [8]. Although the ER status was not analyzed in their study it is well known that younger age is associated with higher numbers of ER-negative tumors. Alexe and colleagues drew a similar conclusion analyzing only one dataset (Rotterdam dataset, n = 286) [9]. They applied a variety of different clustering procedures to this dataset and proposed to analyze HER2 samples separately. They identified 651 genes among HER2-positive samples by principal component analysis that stratify these samples into two groups. One of the groups was characterized by immune-system-associated genes and improved survival.

Teschendorff and colleagues used three microarray datasets from different platforms and applied a recently developed bioinformatical method to identify subgroups among 186 ER-negative breast cancers. They identified a cluster of ER-negative tumors that display higher expression of six immune-system-related genes and were associated with a better prognosis [50]. Most recently, Finak and colleagues used laser-capture microdissection to analyze the stromal compartment of 53 breast tumors [16]. They identified a good-outcome cluster that was enriched in immune-system-related genes and predicted improved survival in four datasets from different platforms (n = 1,021 total). This cluster contained 22 different genes, 16 of which were also present in our complete immune response cluster. Eight of these 16 genes are included in our LCK metagene and two genes in our MHC-I metagene.

Calabrò and colleagues, in a computational screening approach to dissect the effect of LI on published ER gene signatures, recently showed that LI is associated with longer survival in ER-negative patients but shorter survival in ER-positive patients [51]. Moreover, Schmidt and colleagues identified B-cell and T-cell metagenes (corresponding to the IgG and LCK metagenes in our study) by hierarchical clustering of 200 untreated breast cancer samples [52]. In contrast to our results with no prognostic value of the B-cell metagene in either ER-positive or ER-negative subgroups (Additional data file 8), these authors identified the B-cell metagene as the most important prognostic factor outperforming the T-cell metagene. Several reasons might account for these discrepancies. Schmidt and colleagues used three patient cohorts all containing only node-negative patients without any adjuvant therapy. Despite these very homogeneous cohorts, in one of them the prognostic value of the B-cell metagene was restricted to the subset of highly proliferating tumors (the discovery cohort Mainz). This specific cohort was characterized by a lower proportion of ER-negative tumors (22%). The study of Schmidt and coworkers clearly demonstrates a prognostic value of lymphocyte metagenes (B-cell and T-cell metagenes). In contrast to the cohorts in the study by Schmidt and colleagues, our sample collective was rather heterogeneous – containing node-positive samples and many patients treated with adjuvant therapy (see Additional data file 2). The difference in our results might therefore be related to different cohorts and treatments as well as a potential predictive value of LI for the response to adjuvant therapy. This possibility could be important since the response rates to neoadjuvant chemotherapy are generally higher for ER-negative tumors [53], the very subgroup in which we observed the prognostic value of the LCK metagene.

One hundred and ninety-eight of the samples from our combined datasets were pretherapeutic biopsies from patients treated with neoadjuvant chemotherapy (MDA133 and Frankfurt-2 datasets) [28, 29]. In an exploratory analysis, six out of eight samples (75%) with high expression of both IgG and LCK metagenes achieved a pathological complete response – in contrast to only 45 out of all 198 samples (22.7%, P = 0.002; Additional data file 9). The ER status, however, might have a confounding effect in this analysis since all six samples were ER-negative. When the samples were further stratified according to the ER status, only a trend to significance (P = 0.057) was observed in the ER-negative subgroup. Still, these data suggest that the beneficial effect of the expression of the LCK metagenes in our analyses might at least in part be related to a predictive role in chemotherapeutic treatment.

Despite the observed prognostic and predictive value of LI in our analyses, the molecular mechanism behind this phenomenon is not fully clear. Casares and colleagues have reported that tumor cells dying in response to anthracyclines can induce an antitumor immune response that depends on cytotoxic T cells and dendritic cells [54–56]. These results are in line with the better response to anthracycline-containing neoadjuvant chemotherapy we have observed. Although lymphocytes may secrete cytokines resulting in an antitumor response [57], they might also shift the balance of the cytokine milieu toward angiogenic factors [58] and inflammatory cytokines that seem to promote tumor progression [59, 60]. On the one hand, the tumor-associated lymphocytes might be a marker of an immune response against the tumor. On the other hand, these lymphocytes could be attracted by the tumor cells and generate a functional niche by interaction with the undifferentiated cancer cells. Moreover, it remains to be clarified whether modulation of the immune response alters the clinical course of breast cancer patients, and whether the efficacy of specific anticancer treatments depends on defined tumor host factors, which would then be predictive. Immune-system-related markers are clearly part of many prognostic and predictive signatures, even though a specific biological role cannot be assigned to date.

Conclusions

Many prognostic gene signatures have been reported to date that seem to have high rates of concordance in their outcome predictions [61]. In a very recent study, however, Wirapati and coworkers demonstrated that all of these signatures uniformly identify the same group of low-proliferating ER-positive tumors as having a good prognosis [62]. In contrast, all ER-negative tumors are assigned to the poor prognosis group together with high-proliferating ER-positive tumors by all prognostic signatures. An important result of the present work is therefore that the LCK metagene may actually separate the ER-negative group into those tumors with better or worse prognosis.