Introduction

Breast cancer is a complex heterogeneous disease which has traditionally been subclassified depending, amongst other factors, on the expression of different receptor proteins, such as estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) [1]. These 'biomarkers' allow us to tailor the level of clinical intervention. While ER-positive the second positive should be deleted tumours receive hormone therapies [2] and HER2-positive cancers receive targeted therapies such as trastuzumab and lapatinib [3], 'triple negative' cancers lacking these markers currently have no targeted therapies and cause a disproportionate number of breast cancer deaths [4]. In addition to the traditional classifications using these biomarkers, in recent years, whole genome DNA microarrays have been utilised to further classify this disease, initially into five molecular subtypes based on gene expression profiles, namely luminal A and luminal B (ER-positive tumours), HER2 (HER2-positive tumours), basal and normal-like tumours [5, 6] and subsequently into at least ten further molecular subtypes using both copy number and gene expression data [7].

It is crucially important to identify which breast cancer patients are at risk of developing a more aggressive phenotype so as to tailor the level of clinical intervention. Prognostic biomarkers, such as ER and HER2, can be used to assess the inherent likelihood of a patient exhibiting a particular outcome. However, within the subtypes defined by these classical markers, there is a wide spectrum of survival requiring the identification of additional novel prognostic markers. Also, the triple negative subtype has no such prognostic biomarkers currently in clinical use.

There is a great deal of transcriptomics data currently available to facilitate the identification of novel molecular biomarkers associated with breast cancer and its subtypes. Huge studies such as the 2,000 breast tumour profiles by Curtis et al. [7] greatly aid in our understanding of breast cancer and facilitate the identification of novel intrinsic subtypes. The diverse nature of these datasets and the variability of the different microarray platforms themselves can affect the statistical power of such studies. Moreover, it is necessary to test the prognostic ability of markers in diverse datasets to avoid dataset-specific affects.

It is clear that the selection of markers could benefit greatly from the integration of datasets from multiple studies to increase confidence in the selected markers. To this end, we have developed an easy-to-use interface for our algorithm which allows identification of subsets of genes that are associated with disease progression in breast cancer or its subtypes, that is, a set of putative prognostic markers. This algorithm integrates gene expression data from DNA microarray studies and corresponding clinical data (hormone status, survival time, tumour grade, patient age and so on). In particular, it allows investigation of prognostic markers in the context of disease-free survival (DFS), distant disease-free survival (DDFS) and overall survival (OS).

Over the last decade, our understanding of the function that small non-coding RNAs known as microRNAs (miRNAs) play in an array of fundamental biological processes in both plants and animals has increased dramatically [8]. These short endogenous non-coding RNAs act primarily by negatively regulating the expression of target mRNAs through translational inhibition and/or mRNA degradation [8]. The complexity of post-transcriptional control of gene expression by miRNAs remains a significant challenge. Indeed, miRNAs have the potential to alter entire pathways due to their ability to target multiple genes simultaneously [9]. The association of miRNAs with breast cancer has been well established [10, 11]. In fact, miRNAs have been identified as prognostic markers in breast cancer [12] and associated with breast tumours defined by their HER2 or ER/PR status [13].

Approximately 50% of known human miRNAs are intronic (miRBase release 18, November 2011). Of these, 341 or roughly one third of human miRNA host genes are hybridized by probes on the U133plus2 Affymetrix gene chip. A number of studies have reported that many intronic miRNAs show significantly correlated expression profiles with their host genes [14, 15]. Estimates of the number of miRNAs whose expression profiles are significantly correlated with their host gene are as high as 70% [16]. The expression of these miRNAs can, in some instances, be inferred from the expression of their host genes and can, therefore, be evaluated as putative prognostic markers in breast cancer and its subtypes using gene expression data.

We evaluated our approach using two commercially available gene expression-based prognostic tests in breast cancer, namely oncotype DX and MammaPrint. We also applied the algorithm to examine the expression of 58 receptor tyrosine kinases (RTKs) in the basal-like subtype of breast cancer. Using the 21 genes from oncotype DX and the 70-gene MammaPrint signature, we demonstrated the robustness of our approach and confirmed the prognostic value of these signatures. In the case of oncotype DX, we showed that the predictive strength of this test is centred on the five proliferation genes within the 21 gene set. We also identified six RTKs associated with poor prognosis in the basal breast cancer subtype. The feasibility of using miRNA host gene expression as a surrogate for miRNA levels was tested using known miRNA prognostic markers, mir-93 and mir-205. Although these markers were only identified in small patient cohorts, BreastMark was able to confirm the robustness of these prognostic markers across a far larger and diverse patient dataset. A web application for using this algorithm is currently available [17].

Methods

Gene expression data

Gene expression data sets were downloaded from the Gene Expression Omnibus [18] or authors' websites in the form of raw data files, where possible. Only breast cancer datasets with survival information and at least 48 patients were included. Large datasets were chosen for this analysis so as to avoid the sampling effects associated with small datasets. A cut-off of 48 was chosen as all smaller breast cancer datasets either lacked detailed clinical data or had too few samples (approximately 30 samples or less). In total, 4,738 samples across 26 datasets incorporating 12 different microarray platforms were utilised to develop the BreastMark system (Table 1). Table 2 contains a breakdown of the clinical information available with each dataset. Where raw data were not available, the normalised data as published by the original authors were used. In the case of the raw data for the Affymetrix datasets (.cel files), gene expression values were called using the GeneChip (GC) robust multichip average method [19] and data were quantile normalised using the Bioconductor package, affy [20]. For the dual-channel platforms, data were loess normalised [21] using the Bioconductor package limma. Hybridisation probes were mapped to Entrez gene IDs to gene centre the data [22]. The Entrez gene IDs corresponding to the array probes were obtained using Biomart [23, 24] and the Bioconductor annotation libraries. Probes that hit multiple genes were filtered out. If there were multiple probes for the same gene, the probe values were averaged for that gene. This resulted in expression data for a total of 20,017 Entrez gene IDs across 4,738 samples.

Table 1 Datasets used in this analysis
Table 2 Clinical data summary

microRNA expression data

miRNAs are frequently located within the introns of protein coding genes and in exons of non-coding transcripts. miRNA expression can be detected using conventional microarrays through host gene expression for intragenic miRNAs or by direct probe matching for intergenic miRNAs. A total of 1,987 samples were processed on U133A Affymetrix arrays, while 973 were processed on U133plus2 Affymetrix arrays (2,960 in total). U133A and U133plus2 microarrays have 22,277 probe sets in common. Using this information, it is possible to infer the expression of 341 miRNAs across 2,960 samples [25] (based on miRBase version 13.0, Ensembl version 54_36p). As with the gene centred data, this information was also combined with the available clinical data for survival analysis.

Breast cancer subtypes

The R package genefu [26] was used to classify the 4,739 breast cancer samples into the luminal A, luminal B, HER2, normal-like and basal molecular subtypes using the ssp2003 [5], ssp2006 [27], and pam50 [28], classifiers.

Survival analysis

We have combined detailed clinical data from each of the studies used here, including one or more of DFS, DDFS and OS. The software allows for each of these three survival end points to be analysed separately. Median expression was used to dichotomise the data, allowing stratification into high and low groups within each of the 26 individual datasets. Once a sample was assigned to a particular group, the 26 datasets were combined and a global pooled survival analysis was performed in real-time. It is important to treat each dataset separately when determining which group a sample belongs to, as the expression of these genes will vary greatly across the different experiments/platforms. In essence, each dataset is split into high and low in singularity to negate study-specific effects. Survival curves are based on Kaplan-Meier estimates and the log-rank P-value is shown for difference in survival. Cox regression analysis was used to calculate hazard ratios. The R package 'survival' was used to calculate and plot the Kaplan-Meier survival curve. All calculations were carried out in the R statistical environment [29].

Software parameters

The software incorporates all the clinical data made available by the original authors. This allows the data to be analysed based on one or more common clinical parameters including patient age, tumour size, lymph node status, tamoxifen treatment, chemotherapy treatment, ER status, HER2 status, PR status and tumour grade. The software also allows the upper or lower quartiles of the expression of the gene of interest to be used to determine high and low groups within each of the 26 individual datasets.

Web server

The interface is available on a publicly accessible web server [17] and is updated quarterly. The software uses CGI to link the web server with the R/perl based algorithm. All calculations are carried out in real-time.

Validation of BreastMark using the OncotypeDX gene signature

The 21 gene signature used by oncotype DX in predicting patient prognosis was downloaded from the original paper [30]. This panel of prospectively selected genes comprises 16 prognostic genes normalised relative to the expression of five reference genes. The 16 prognostic genes are broken down into five categories: proliferation, invasion, HER2, estrogen and 'other'. The likelihood of breast cancer relapse in patients was based on a recurrence score (RS) algorithm constructed and tested on a cohort of 668 patient samples. The higher the RS, the poorer the patient outcome observed. This algorithm weights each of the five categories based on the influence they have on disease recurrence. For example, the proliferation group is weighted most highly and, therefore, the expression of these genes influences the RS the most. Each of the 16 oncogenes were queried in our dataset to test the effect each gene has on survival using the three above-mentioned survival end-points for prognosis, namely DFS, DDFS and OS. It is expected that the genes with the greatest influence on the RS would have the highest hazard ratios and the lowest P-values. Sample numbers will vary depending on the number of platforms with expression information available for a particular gene.

Validation of BreastMark using the MammaPrint gene signature

The 70-gene prognostic signature was downloaded from the original paper along with their correlation with prognosis [31]. It was possible to obtain unique Entrez gene IDs for 61 of these genes (there is more than one copy of PEC1, IGFBP5 and DIAPH3 (three) in the 70 gene list and five others have no Entrez gene ID). As with the oncotype DX signature, each gene was analysed separately within our datasets using the three survival endpoints, DFS, DDFS and OS. Although looking at these genes individually does not represent the full power of this prognostic signature, this dataset should still be enriched for prognostic markers. Additionally, the positive and negative correlation coefficients published by the original authors should be consistent with our observed hazard ratios of less than or greater than 1, respectively. Sample numbers will vary depending on the number of platforms with expression information for a particular Entrez Gene ID.

Receptor tyrosine kinases

We compiled a list of 58 RTKs from the literature. Using our algorithm, we identified which of the RTKs were associated with survival within the basal molecular subtype using the ssp2003, ssp2006 and PAM50 molecular classifiers (see above). DFS, DDFS and OS were used as the survival endpoints. A P-value of < 0.05 in a minimum of two out of three classifiers was considered significant. The data were dichotomised using three cut-offs, median expression, greater than the 75th percentile referred to as the 'high' cut-off and less than the 25th percentile referred to as the 'low' cut-off.

Results

In order to test our gene-centred survival meta-analysis, we looked at the genes used to predict breast cancer prognosis by two commercially available tests, oncotype DX [32] and MammaPrint [33]. Although the genes in these tests are not used in isolation to predict disease outcome, it is reasonable to assume that the genes chosen within these tests would include a number of prognostic markers whose expression in our meta-analysis would correlate with good and poor outcome. As there is currently no large-scale robust signature for miRNAs in breast cancer, we tested our approach on known individual miRNAs which have previously been shown to be prognostic markers. All calculations were carried out using the BreastMark web application [17].

The robustness of BreastMark is tested using the 21 genes from OncotypeDX

Oncotype DX is a 21-gene signature (16 oncogenes and five controls) selected using prior knowledge from the literature, which in combination with the developer's algorithm, predicts patient outcome in lymph node-negative (LNN), ER-positive breast cancer [32]. It uses a RS calibrated against approximately 670 patients with known clinical outcome to predict patient survival. Patients with a low score do well, and those with a high score do poorly. The 16 genes are classified into five groups: proliferation, invasion, HER2, ER and other. The algorithm takes gene expression data from 16 oncogenes, normalises the expression against the five controls and weights the 16 oncogenes depending on the effect they have on the RS. The genes are weighted as follows 1.04 × proliferation group score + 0.47 × HER2 group score - 0.34 × ER group score + 0.1 × invasion group score + 0.05 × CD68 score - 0.08 × GSTM1 score - 0.07 × BAG1 score. Genes from the proliferation group, such as Ki67 and Survivin, have the highest weighting and, therefore, the greatest effect on the RS.

Each of the 16 oncogenes were analysed separately within BreastMark using median expression as a cut-off, selecting LNN, ER-positive patients only and using DFS survival as the survival end point to ensure comparability. This information is summarised in Table 3, along with the effect they have on the RS. The 16 genes were also analysed using DDFS and OS as the survival end points [see Additional file 1 Tables S1 and S2] and are consistent with our observations for DFS survival. A hazard ratio (HR) of greater than 1 indicates a negative effect on survival and a HR of less than one has a positive effect. The higher the HR the greater the effect the gene has on survival. As can be seen from Table 3, our results are largely consistent with the weightings calibrated for oncotype DX. The proliferation markers which have the highest weightings, and therefore the largest effect on the RS, have the highest HRs and are highly statistically significant. In contrast, those genes which have only a marginal effect on the RS (CD68, GSTM1 and BAG1) are not significant and have HRs close to one.

Table 3 BreastMark results for the Oncotype DX 21-gene signature for LNN, ER-positive patients using DFS as the survival end point

Combining the markers (grouping samples where both markers have greater than median expression) identifies patients who will do particularly poorly. The Kaplan-Meier plot for Ki67 in shown in Figure 1(a) (n = 902, HR = 1.68, P = 4.44e-05). The Kaplan-Meier plot for Ki67 and Survivin combined, that is, comparing the survival of patients with greater than median expression of both Ki67 and Survivin against the rest is shown in Figure 1(b). These patients have a worse prognosis than Ki67 alone, that is, they have a higher HR (a HR of 1.99 versus a HR of 1.68). The same occurs when you also combine MYBL2 with Ki67 and Survivin (Figure 1(c)). These patients have an even worse prognosis with an even greater HR (n = 902, HR = 2.00, P = 2.01e-07). However, the same is not true when you combine other markers with the proliferation markers. Figure 1(d) shows Ki67, Survivin and PGR combined (n = 902, HR = 1.537, P = 9.2e-03). The HR is lower and the difference in survival is less significant. In fact, when you combine most of the other oncogenes from the signature, no improvement in prognostic power or decrease in the significance of the HR is observed (data not shown). This suggests that not only are all of the genes from this prognostic signature not necessary, but that potentially our algorithm provides a useful reductionist approach to these more complex prognostic signatures, allowing us to eliminate superfluous markers and highlight those genes that are of the greatest relevance.

Figure 1
figure 1

Prognostic role of the Ki67, Survivn MYBL2 and MMP11 in breast cancer. These figures were generated using BreastMark at http://glados.ucd.ie/BreastMark/index.html. (a) Kaplan-Meier estimates of survival, demonstrating high expression of Ki67 is associated with poor prognosis in breast cancer (n = 902, HR = 1.68, P = 4.44e-05). (b) Kaplan-Meier estimate of survival, demonstrating that high expression of Ki67 and Survivin in combination have a greater effect on survival (n = 902, HR = 1.99, P = 5.60e-08). (c) Kaplan-Meier estimate of survival, demonstrating that high expression of Ki67, Survivin and MYBL2 in combination have an even greater effect on survival (n = 902, HR = 2.00, P = 2.01e-07). (d) Kaplan-Meier estimate of survival, demonstrating how the invasion marker MMP11 does not improve the prognostic ability of Ki67 and Survivin (n = 902, HR = 1.54, P = 9.20e-03). HR, hazard ratio.

BreastMark is consistent with the MammaPrint gene signature

Similar to the oncotype DX assay, MammaPrint [33] is a commercially available test for breast cancer recurrence. In contrast, it was developed via a hypothesis-free method from a gene expression profiling study rather than from a prospectively chosen list of known oncogenes. The study used 78 LNN patients specifically to identify a prognostic signature in their gene expression profiles using a supervised classification method. Each of the approximately 25,000 probesets present on those microarrays were correlated with disease outcome and only those genes that were significantly associated with disease outcome were retained to create an optimised list of prognostic markers. Each of the 70 genes had a positive or negative correlation coefficient depending on their association with good or poor prognosis, respectively.

Again, as with oncotype DX, even though the genes from the 70-gene signature are not predicted to act independently, the 70 genes when analysed independently, should correlate with good and poor prognosis based on the correlation coefficients identified in the original MammaPrint study. Genes with positive and negative correlation coefficients should have HRs less than and greater than one, respectively. As we expect, this is what we see with these genes in LNN samples, using a median cut-off and DFS survival as the survival endpoint (DDFS and OS show similar results in Additional file 1 Tables S3 and S4, respectively). Of the 61 genes from the MammaPrint signature for which we had Entrez gene IDs, 53 had HRs consistent with the correlation coefficients from the original study (Table 4). Of the other eight genes, four had HRs close to 1, and were not statistically significant, and the other four were not present in the dataset or present in too few samples. Although not all of the 53 consistent genes were statistically significant, 33 are significantly associated with survival when analysed independently with BreastMark.

Table 4 BreastMark results for the MammaPrint gene signature for LNN patients using DFS as the survival end point

miRNAs associated with prognosis in breast cancer

Decreased expression of miR-205 has previously been associated with poor prognosis in breast cancer, and miR-93 is highly expressed in high-grade tumours, that is, in tumours of patients who do poorly [10, 34]; however, these studies were relatively small in scope (20 and 93 patients, respectively) [10, 34]. To confirm these observations in a larger dataset and to test our approach, we examined the association of the host genes of these miRNAs with prognosis. The results for miR-205 and miR-93 can be seen in Figures 2(a) and 2(b), respectively. Following BreastMark analysis, high expression of the host gene of miR-205 is indeed associated with good prognosis (HR = 0.768, P-value = 0.02, n = 581) and high expression of host gene of miR-93 is associated with poor prognosis (HR = 1.34, P-value = 1.48e-04, n = 1,563). This confirms that miR-205 and miR-93 are robust markers of good and poor prognosis, respectively.

Figure 2
figure 2

miR-205 and miR-93 are associated with prognosis in breast cancer. These figures were generated using BreastMark at http://glados.ucd.ie/BreastMark/index.html. (a) High expression of miR-205 is associated with good prognosis in breast cancer (HR = 0.768, P-value = 0.02, n = 581). (b) Low miR-93 expression is a marker of poor prognosis in breast cancer (HR = 1.36, P-value = 1.48e-4, n = 1563).

Receptor tyrosine kinases associated with poor survival in the basal molecular subtype

RTKs are a large family of proteins involved in cell signalling with particular roles in growth, differentiation, adhesion, motility and death of cells [35]. A total of 58 kinases have been classified as receptor type and are listed in Additional file 2. Each of these kinases was assessed in the basal molecular subtype based on the three classifiers (ssp2003, ssp2006 and PAM50). Six of the kinases were significantly associated with poor prognosis in the basal subtype (EPHA5, FGFR1, FGFR3, VEGFR1, PDGFRβ and TIE1). The results are summarised on Table 5. As expected, the RTKs as a group have the potential to act as prognostic markers in this difficult-to-treat subtype of breast cancer. In particular, PDGFRβ would appear to be a strong marker of poor prognosis as it is significant across all three of the survival endpoints. This is not entirely unexpected as elevated levels of PDGFRβ have previously been associated with enhanced cell migration and invasion in breast cancer [36].

Table 5 Receptor tyrosine kinases associated with poor survival in the basal molecular subtype

Discussion

BreastMark provides a user-friendly tool for examining putative prognostic markers in breast cancer. The value of the approach used here is based on its simplicity of operation and the statistical power gained through the combination of a large cohort of patients when compared to single microarray experiments. While it is not the first application which combines multiple public breast cancer datasets and performs a cross-dataset survival analysis [3739], it is the first application which allows users to combine multiple prognostic markers across multiple microarray platforms without requiring complex adjustments for batch effects across different experiments/platforms. We are, therefore, not reliant on the suitability of the data transformation method chosen. Also, as the database is gene-centred, rather than probe-centred, we are not limited to the gene coverage of a particular platform. However, we are unable to examine the effects that splice variants may have on survival. While the analysis of splice variants is possible with some of the platforms used in this analysis, it is limited as most of these platforms predate the publication of the complete human genome. In summary, BreastMark allows the analysis of approximately 20,000 unique Entrez gene IDs in up to 4,739 samples. While some compromises were made in making the data gene centred, which negated the continuous nature of the gene expression information, our comparison with MammaPrint and oncotype DX shows our approach to be robust.

In the case of oncotype DX, our results suggest that some of the 16 oncogenes in the signature may not be necessary. It would appear that the five proliferation markers are sufficient for determining patient outcome, as these are the only genes with high HRs and are highly significant. This is consistent with previous findings [4043]. In fact, combining the proliferation markers within BreastMark allows us to identify patients who will do even more poorly. However, when we combine the proliferation markers with most of the other 11 non-proliferation genes, the HR decreases and the Kaplan-Meier plots become less significant. This suggests that not only are all of the genes from this prognostic signature not required, but that our algorithm provides a useful reductionist approach to these complex prognostic signatures. This facilitates the elimination of superfluous markers and highlights those genes that are of the greatest relevance. Although MammaPrint uses a different approach to identify patients who will have a poor outcome, the use of our approach could substantially reduce the number of genes required in this prognostic signature, thus reducing the cost and the complexity of this signature.

After confirming the robustness of our algorithm we used it to examine the potential for inferring the prognostic ability of miRNAs from the gene expression data and to look at RTKs in the basal sub-type of breast cancer. The attraction of miRNA biology to cancer researchers arises from the potential of miRNAs to alter an entire pathway or, indeed, pathways. miRNAs have been heavily studied in breast cancer; however, their role as prognostic markers is not well characterised. There are only a few large-scale studies which incorporate miRNA profiling and detailed clinical data [10, 44]. Despite the huge efforts required to compile these studies, their sample numbers are only in the hundreds and, therefore, not only do they have limited statistical power, they are also restricted in their ability to assess the rarer breast cancer subtypes. However, there is a wealth of gene expression data available with detailed clinical information which can be exploited by inferring miRNA activity from host gene expression.

Again, our approach gene centres the data and allows us to examine miRNAs as prognostic markers in breast cancer as a whole and within the molecular subtypes. We were able to confirm the results of smaller studies [10, 45], which demonstrated that reduced expression of miR-205 (n = 20) and increased expression of miR-93 (n = 93) are associated with poor prognosis in breast cancer. As both of these studies were relatively small, their findings in isolation would be considered preliminary evidence. It should be noted, however, that not all miRNAs and host genes are co-expressed [14] and care needs to be taken when interpreting the results from BreastMark. This issue cannot be resolved until such time as there is a clearer picture of which miRNAs are co-expressed with their host genes (current estimates put it at approximately 70% [16]) and if those that are not significantly co-expressed do so in a disease/tissue specific manner or whether the miRNAs themselves are subject to some level of post-transcriptional regulation.

Tyrosine kinases are a large family of proteins involved in cell signalling with respect to growth, differentiation, adhesion, motility and death [35]. Of the 90 tyrosine kinases identified, 58 have been classified as receptor type. These 58 receptors can be further sub-divided into 20 families [46]. A number of families of RTKs have been implicated in the development of many cancers, including HER and IGFR families and so on through over-expression, amplification and/or aberrant signalling of the RTKs [47]. Using BreastMark, we were able to identify six RTKs that can be associated with poor prognosis in the basal subtype of breast cancer. These RTKs are putative markers of poor prognosis and are potential drug targets in this difficult-to-treat subtype of breast cancer. For example, increased expression of PDGFRβ has been associated with enhanced cell migration and invasion in breast cancer [31]; BreastMark identifies PDGFRβ as a marker of poor prognosis and this RTK has been shown to be inhibited by imatinib in phase I clinical trials [48]. In addition, imatinib has been investigated in advanced breast cancers expressing PDGFRβ [49]. Also, BreastMark identifies FGFR1 as a marker in the basal subtype of breast cancer, which has been previously shown as a marker of poor prognosis in the luminal subtypes [50].

Conclusions

In this study, we have developed a simple user-friendly tool for examining putative gene/miRNA prognostic markers in breast cancer. The value of this tool is both in the simplicity of its design and the robustness of its approach. It is designed with non-bioinformatic research groups in mind and will be of great value in the preliminary assessment of putative biomarkers in breast cancer as a whole and within its molecular subtypes.