Introduction

Breast tumours are routinely subclassified according to microscopic morphology, immunohistochemical staining, and stage. On the basis of this clinical information and patient age, an estimate of prognosis may be derived [1, 2]. Most clinicians make recommendations regarding the need for adjuvant chemotherapy on the basis of this estimate.

However, breast cancer is a heterogeneous disease, and differences in prognosis and response in distinct molecular subgroups need to be taken into account. Improvement in the accuracy of prediction of prognosis without systemic treatment or with endocrine treatment alone would allow avoidance of non-beneficial chemotherapy in a significant proportion of women [3]. Additionally, experience with neoadjuvant chemotherapy has demonstrated resistance in a significant proportion of primary breast tumours [4]. These patients derive no downstaging benefits from neoadjuvant chemotherapy. Furthermore, chemosensitivity in the neoadjuvant setting is associated with superior long-term survival (relative to chemoresistance) [5] and therefore may represent a marker of survival benefit from chemotherapy. Identification of a chemoresistant profile would allow further tailoring of treatment by enabling selection of tumours unlikely to respond and therefore unlikely to derive a survival benefit.

Several studies have demonstrated that gene expression microarray profiling may be useful in improving prediction of prognosis [6–9] and treatment response [10–16]. These studies employed non-dissected surgical [7–9], core-cut biopsy [12], or fine needle aspiration (FNA) samples [10]. However, breast tumours are not homogeneous. They include inflammatory and vascular elements but most significantly (by proportion) connective tissue components [17]. The proportions of these components vary according to tumour type and sample type and also across a single tumour [17]. In studies involving surgical samples, those used for profiling can be selected as those with the highest proportional malignant cell content. In studies involving biopsies, this is not possible and the researcher is required to set an arbitrary minimum percentage tumour limit.

The impact on expression profile of variation in the proportion of tumour cells and in the nature of the non-tumour components has been largely unexplored. In this paper, we examine the effect of percentage tumour content on expression profile within a study designed to derive an expression profile predictive of response to adriamycin and cyclophosphamide (AC) neoadjuvant chemotherapy. We also consider methods for improvement of molecular profile-based prediction of response to primary chemotherapy by classification of samples according to cellular makeup or by the incorporation of sample tumour content information into the predictor.

Materials and methods

Patients and samples

Patients were recruited and treated at the Royal Marsden Hospital (RMH), London, UK. Eligible patients were those undergoing neoadjuvant AC chemotherapy treatment at doses of 60 and 600 mg/m², respectively, three times a week, for a clinically measurable breast tumour. The study was approved by the RMH Clinical Research and Ethics Committees (study number 1947), and written consent was obtained in all cases. Patients had been allocated neoadjuvant treatment for one of several standard indications, including locally advanced or inflammatory breast cancer, high tumour-to-breast size ratio, and tumours located close to the nipple.

Diagnosis was confirmed histologically by core-cut biopsy. All patients on hormone replacement therapy at diagnosis were advised to discontinue this treatment. Patients who demonstrated at least a partial clinical response received six cycles of treatment prior to local treatment. Patients in whom there was no or only marginal response after three or four cycles proceeded directly to local treatment or were commenced on alternative systemic treatment (docetaxel).

Clinical size of tumour (largest diameter and a diameter perpendicular to this) was recorded prior to commencement and at completion of treatment. Clinical response was categorised as follows: no palpable abnormality after treatment, complete clinical response (cCR); more than 50% reduction in the product of the bidimensional measurements, partial response (PR); less than 50% reduction in the product of the bidimensional measurements, no change (NC); and residual ill-defined thickening after a good response, minimal residual disease (MRD). Those cases in which there was no residual invasive carcinoma at surgery were classified as a complete pathological response (pCR). Good responders were defined as pCR, cCR, or MRD; poor responders were defined as PR or NC. These categories were chosen on the basis of our previous study, which showed that patients with 'good' response had superior overall survival relative to those with 'poor' response [18]. A proportion of patients undergoing a complete clinical and radiological (on ultrasound) response received radiation only as local treatment. Therefore, some of the cCRs may represent undocumented pCRs.
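The measurement-based part of these response categories can be illustrated with a short Python sketch. This is an illustration only, not code used in the study: the function names are hypothetical, the handling of a reduction of exactly 50% (not specified above) is an assumption, and the cCR, MRD, and pCR categories depend on clinical or pathological judgement that a calculation cannot replace.

```python
# Categories grouped as defined in the text above.
GOOD = {"pCR", "cCR", "MRD"}
POOR = {"PR", "NC"}


def product_reduction(pre, post):
    """Percentage reduction in the product of the two perpendicular diameters.

    `pre` and `post` are (largest diameter, perpendicular diameter) pairs
    in the same units.
    """
    pre_product = pre[0] * pre[1]
    post_product = post[0] * post[1]
    return 100.0 * (pre_product - post_product) / pre_product


def measured_category(pre, post):
    """Return 'PR' or 'NC' from the bidimensional measurements alone.

    Assumption: a reduction of exactly 50% is treated as NC, because the
    text only defines 'more than' and 'less than' 50%.
    """
    return "PR" if product_reduction(pre, post) > 50.0 else "NC"


def response_group(category):
    """Map a response category to the 'good' or 'poor' responder group."""
    if category in GOOD:
        return "good"
    if category in POOR:
        return "poor"
    raise ValueError(f"unknown response category: {category!r}")
```

For example, a tumour shrinking from 40 × 30 mm to 20 × 20 mm has a product reduction of about 67% and would be recorded as PR, which falls in the 'poor' responder group under the definitions above.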

Research 14-gauge core biopsies were collected prior to commencing treatment and snap-frozen in liquid nitrogen. When consented to, a repeat sample was taken at 21 days after the first cycle of chemotherapy. All samples were thereafter coded using a study number as an identifier. Frozen cores were embedded in OCT (optimum cutting temperature embedding compound) and sectioned at -20°C in a cryostat. Sections (5 μm in thickness) were taken for haematoxylin and eosin staining to assess histological character superficially from the core as soon as 'full-face' was reached. The percentage of cells comprising invasive malignant disease and non-malignant components (that is, in situ disease, inflammatory infiltrate, non-malignant ductal/lobular structures, and fibroblastic involvement) were recorded by consensus between two breast pathologists. For patients in whom multiple biopsies were available, the biopsy with the highest invasive content was used for microarray analysis. Biopsies with less than 20% invasive cancer content were excluded from the study.

RNA extraction and amplification

Cores were extracted from OCT as described by Ellis and colleagues [19] and pulverised using a pestle and mortar on a bed of dry ice and subsequently in 1 ml of Trizol reagent (Invitrogen, Carlsbad, CA, USA) with a 'Polytron' homogenizer. Standard 'Trizol' RNA extraction was carried out without use of a carrier according to the manufacturer's instructions. Samples not giving distinct 18S and 28S peaks on an Agilent Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA) trace were excluded from the study. Multiple cores from the same patient were handled separately for RNA extraction. A single round of T7 linear RNA amplification was carried out using the RiboAmp kit (Arcturus, Sunnyvale, CA, USA) with a starting amount of 1 μg when available or 50% of the available RNA.

Reference RNA was generated from a pool of RNAs extracted from 20 independent breast cancer surgical samples.

cDNA array hybridisation

Microarray analyses used in-house (Breakthrough Breast Cancer Research Centre, London, UK) human arrays spotted with DNA derived from 4,600 IMAGE cDNA clones in duplicate. This set of genes represents a subset of the 5,808 Cancer Research UK (London, UK) gene set that was designed to include a high proportion of genes documented as being involved in carcinogenesis or tumour biology. To improve 'coverage' of genes involved in breast cancer, a list of discriminatory genes cited in important microarray studies on clinical breast cancer samples at time of production was compiled [7, 20–22], and the array was supplemented with these. One half to two micrograms of sample RNA and a matched amount of reference RNA were labelled using the Powerscript Labeling kit (Clontech, Mountain View, CA, USA) in combination with Amersham Cy dyes (GE Healthcare, Little Chalfont, Buckinghamshire, UK). A single dye swap experiment was performed for each clinical sample. Slides were scanned using a GenePix 4000B (Axon Instruments Inc., Union City, CA, USA) scanner and GenePix version 4.0 software.

Data analysis

Most of the data analyses were carried out with the S-plus statistical software package (Insightful, Seattle, WA, USA) and purpose-written scripts (T Dexter, Breakthrough, UK). Raw expression values were transformed to log2 ratios (sample/reference). The loess function [23] was used to remove biases due to the spot position and spot intensity. Flagged spots were treated as missing values. Log ratio values from duplicate spots and hybridisations were averaged. Genes with consistently low intensity and those that exhibited little variation across samples were removed from the analysis. After the above pre-processing, 1,286 genes remained for the prediction analysis. Samples were clustered both by complete linkage and flexible beta (beta = -0.5) agglomerative algorithms with (1 - correlation) as a distance measure [24, 25]. The correlations were estimated using Spearman's rank method.
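The log ratio transformation and the (1 - correlation) distance can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the S-plus scripts used in the study: the function names are hypothetical, loess normalisation and the clustering step itself are omitted, and the distance function assumes two complete profiles with no missing values.

```python
import math


def log2_ratio(sample, reference):
    """log2(sample/reference) for one spot.

    Returns None (a missing value) when either intensity is absent or
    non-positive, mirroring the treatment of flagged spots as missing.
    """
    if sample is None or reference is None or sample <= 0 or reference <= 0:
        return None
    return math.log2(sample / reference)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_distance(x, y):
    """(1 - Spearman rank correlation) between two expression profiles."""
    return 1.0 - pearson(ranks(x), ranks(y))
```

With this measure, two samples whose genes are ordered identically by expression have a distance of 0, and two samples with exactly reversed orderings have a distance of 2.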

The nearest neighbour class prediction algorithm (Euclidean distance) was used for all classifications because of its simplicity and good performance on microarray data as reported by Dudoit and Fridlyand [26]. We elected to use seven nearest neighbours throughout to give more stable error estimates and greater robustness than would result with smaller neighbourhoods. The weighted Kolmogorov-Smirnov statistic was used to rank genes for discriminatory information [27]. Each predictor was built starting with the highest ranked two genes from the training set, and then genes were added to the predictor in decreasing rank order until the error rates no longer decreased.
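A nearest neighbour classifier of this kind can be sketched in a few lines of Python. The sketch below is a minimal illustration, assuming each sample is a list of log ratios for the selected genes and that a simple majority vote among the k = 7 neighbours decides the class; the function names are hypothetical, and the gene-ranking step (the weighted Kolmogorov-Smirnov statistic) is not shown.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_profiles, train_labels, query, k=7):
    """Predict the class of `query` by majority vote of its k nearest neighbours."""
    neighbours = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

With two classes and an odd number of neighbours, the vote cannot be tied, which is one practical reason for choosing an odd k such as seven.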

Owing to the limited number of samples, the misclassification error was estimated by leave-one-out cross-validation (LOOCV). In this approach, the class (that is, response) of each sample was predicted in turn, using the other samples as the training set. To avoid selection bias, the genes that were used as predictors were re-selected for each of these leave-one-out classifications [28]. To estimate the probability of the misclassification error arising by chance, a permutation p value was determined as suggested by Radmacher and colleagues [28]. In this procedure, the LOOCV estimate was determined 1,000 times for permuted class labels; the fraction of these that gave an equal or lower error estimate than that with true labels is taken as the p value. We refer to the latter as a 'label permutation'.
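The cross-validation and label permutation procedure can be sketched as below. This is a simplified illustration and not the analysis code used in the study. The gene ranking here is a stand-in (absolute difference in class means) rather than the weighted Kolmogorov-Smirnov statistic, the number of genes is fixed instead of being grown until the error stops decreasing, and all function and parameter names are hypothetical. The point it illustrates is that genes are re-selected inside every leave-one-out fold, so the left-out sample never influences gene selection.

```python
import math
import random
from collections import Counter


def rank_genes(profiles, labels, n_genes):
    """Stand-in ranking (assumption): largest absolute difference in class means."""
    classes = sorted(set(labels))
    scored = []
    for g in range(len(profiles[0])):
        means = []
        for c in classes:
            values = [p[g] for p, lab in zip(profiles, labels) if lab == c]
            means.append(sum(values) / len(values))
        scored.append((max(means) - min(means), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:n_genes]]


def knn_predict(train, labels, query, k):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    order = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


def loocv_error(profiles, labels, n_genes=3, k=7):
    """Leave-one-out error with gene selection repeated in every fold."""
    errors = 0
    for i in range(len(profiles)):
        train = profiles[:i] + profiles[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = rank_genes(train, train_labels, n_genes)  # left-out sample unseen
        reduced = [[p[g] for g in genes] for p in train]
        query = [profiles[i][g] for g in genes]
        if knn_predict(reduced, train_labels, query, k) != labels[i]:
            errors += 1
    return errors / len(profiles)


def label_permutation_p(profiles, labels, n_perm=1000, seed=0, n_genes=3, k=7):
    """Fraction of label permutations with LOOCV error <= the observed error."""
    rng = random.Random(seed)
    observed = loocv_error(profiles, labels, n_genes, k)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if loocv_error(profiles, shuffled, n_genes, k) <= observed:
            hits += 1
    return observed, hits / n_perm
```

Shuffling the labels preserves the class proportions while breaking any link between expression and response, so the returned fraction estimates how often an error as low as the observed one would arise by chance.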

To assess the significance of lower misclassification error estimates for selected subsets of the samples, we needed to control for the smaller sample sizes. To this end, we defined a 'subset permutation' p value as the fraction of random subsets (1,000 matched for size and class proportions) of the full set of samples that gave rise to LOOCV error estimates equal to or lower than the selected subset. The correct class labels were used for subset permutation. The minimum error for each permutation across the top 2 to 10 ranked genes was used to calculate the p values for both of the above types of permutation to avoid bias. Although this is an arbitrary range, the minimum error rates in all the permutations had increased by a cutoff of 10. Significance analysis of microarrays (SAM) analyses were performed using standard software [29]. Pre-treatment samples only were used for prediction analyses.
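The 'subset permutation' can be sketched in the same spirit. The Python below is an illustrative sketch only: `error_fn` is a hypothetical callable that takes a list of sample indices and returns the LOOCV error estimate for that subset (for example, the minimum error across the top 2 to 10 ranked genes), and the remaining names are likewise assumptions. Random subsets are drawn from the full sample set so that they match the selected subset in size and in the number of samples per class, with the true class labels retained.

```python
import random


def subset_permutation_p(labels, subset_idx, error_fn, n_perm=1000, seed=0):
    """Fraction of size- and class-matched random subsets whose error is
    equal to or lower than that of the selected subset."""
    rng = random.Random(seed)
    observed = error_fn(sorted(subset_idx))

    # Indices of all samples, grouped by their true class label.
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)

    # Number of samples of each class in the selected subset.
    quota = {}
    for i in subset_idx:
        quota[labels[i]] = quota.get(labels[i], 0) + 1

    hits = 0
    for _ in range(n_perm):
        draw = []
        for lab, n in quota.items():
            draw.extend(rng.sample(by_class[lab], n))  # without replacement
        if error_fn(sorted(draw)) <= observed:
            hits += 1
    return observed, hits / n_perm
```

A small p value from this procedure indicates that the low error of the selected subset is unlikely to be explained by its smaller sample size alone.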

Results

Patient and tumour characteristics

RNA of adequate amount and quality was available from 43 tumours before treatment. Of these, 23 (53%) demonstrated a 'good' and 20 (47%) a 'poor' clinical response. For 16 of these tumours, a paired 21-day sample was also available. These 'on treatment' biopsies are included in the analysis of reproducibility shown below to increase the number of paired samples but are excluded from the analysis of response prediction. The good responses comprised 16 (37%) that underwent a cCR, three (7%) that exhibited ill-defined thickening (MRD) at the end of treatment, and four (9%) that underwent a pCR (all cCR or MRD). The patient and tumour characteristics are shown in Table 1. The only feature differing between good and poor responders was tumour size, with pre-treatment size greater in poor compared with good responders (Mann-Whitney, p = 0.03). Pre-treatment size did not relate to expression profile (data not shown).

Table 1 Patient, tumour, and pre-treatment biopsy characteristics (43 samples used in prediction analysis)

Biopsy characteristics

A total of 147 cores were sectioned in the course of this study, including 104 that were sectioned at three levels (levels approximately 50 μm apart). Of these 104 cores, only 16 cores showed more than 10% absolute variation in invasive tumour content across all three sections and only one core showed more than 20% variation. Only four cores were found to have tumour at some levels and none at others; all four contained not more than 15% invasive tumour at the lowest level and were therefore excluded from the study. This suggests that the histological composition did not vary widely over the width of the core. The histological result from the lowest section for a given biopsy was taken as that most representative of the remaining biopsy used for profiling and hence the level on which percentage invasive tumour was assessed. The distribution of percentage invasive content of core (in cases of multiple cores, that core used for prediction analysis) did not relate to response as assessed by Wilcoxon rank sum test (p = 0.3) or t test (p = 0.27).

Percentage invasive content in core biopsies included in the prediction study varied from 20% to 95% (median 50%). For most core biopsies, the majority of the non-malignant tissue consisted of connective tissue. Inflammatory infiltrate was also scored on the lowest section as nil, mild, moderate, or severe. Of all 147 core biopsies, only six (4%) were scored as having a 'severe' inflammatory infiltrate at one or more levels.

Basic validation of array data

To assess consistency of the expression profile between repeated biopsies, the 43 pre-treatment and 16 paired post-treatment samples were clustered (Additional file 1). The post-treatment samples were included in this validation study only to enhance numbers and were not used in subsequent predictive analyses. All but one pair (b223A and B) of duplicate biopsies taken at the same time point relative to treatment (in total, five pre-treatment and five post-treatment pairs) clustered as nearest neighbours despite some variation in percentage tumour between the pairs (Additional file 3). Of 16 pre/post-treatment pairs, 14 clustered with samples from the same tumour (Additional file 1). To validate our class prediction methodology, a supervised analysis was undertaken to obtain a gene list for prediction of oestrogen receptor (ER) status in pre-treatment samples (ER-α was excluded from the analysis). ER status was correctly assigned in 40 of 42 cases on LOOCV. Discriminatory genes are listed in the supplementary information (Additional file 4).

Effect of percentage tumour on expression profile

Before performing prediction analysis, an exploratory analysis was undertaken to assess the impact of variation in the histological content of biopsies on the expression profile. A SAM regression analysis was performed to establish whether it was possible to identify genes that correlated with the proportion of tumour cells in the 43 pre-treatment core biopsies. One hundred forty-four genes were significant at a false discovery rate of not more than 5%. Table 2 shows the most correlated genes from this analysis. Positive values for 'score' (bold) indicate positive correlation with high percentage tumours, and negative values (underlined) indicate negative correlation with proportion of tumour cells in the samples. More genes correlated positively with the stromal content (n = 128) than with the tumour content (n = 16), possibly reflecting the greater molecular heterogeneity of tumour types across the samples than that of their associated stromas. This is also reflected in the higher false discovery rates for 'tumour-associated genes'.

Table 2 SAM (significance analysis of microarrays) analysis of genes correlated to percentage tumour content.

Response prediction using pre-treatment biopsies

Prediction analysis was initially undertaken on the full set of 43 biopsies. The optimum misclassification error estimate (LOOCV) for the whole sample set was 28% using a three-gene predictor (Figure 1). To explore the effect of biopsy tumour content (and its associated influence on the expression profile) on error of response prediction, we selected three overlapping subsets of tumour samples which were more homogeneous in terms of percentage tumour content: (a) ≥ 50% (50%–95%) invasive tumour (25 samples), (b) ≤ 50% (20%–50%) invasive tumour (24 samples), and (c) 35%–60% invasive tumour (24 samples).

Figure 1 Leave-one-out cross-validation (LOOCV) error rates for response prediction. Variation in the LOOCV error estimate with the number of genes used in the predictor. Misclassification rates are plotted for the whole data set (n = 43) and subsets according to percentage tumour content.

The subset consisting of 35%–60% represented a group centered around the median sample in terms of percentage tumour content and of a size similar to or the same as the other two groups.

The minimum error of classification for each of the subsets (8%–13%) was lower than that for the superset of all samples (28%) (Figure 1), suggesting that homogeneity of tumour content rather than tumour content per se might be an important factor for response prediction. However, the comparison of the subset error estimates with that for all samples is not controlled for the different sample sizes involved. To address this, we determined a 'subset permutation' p value for each of the subsets (see Materials and methods). We also estimated the probability that the subset errors arose by chance (a 'label permutation' as discussed in Materials and methods). Error estimates and corresponding p values for the 'subset permutation' and the 'label permutation' support the hypothesis that homogeneity of biopsy tumour content improves response prediction with the nearest neighbour algorithm (Table 3). The identity of the genes used in the prediction for each of these subgroups is presented in Table 4. The mean differential expression between good and poor responders for these genes is given in the supplementary information (Additional file 5).

Table 3 Misclassification error estimates (leave-one-out cross-validation) for response prediction.
Table 4 Response prediction gene lists.

We found, in common with other studies [20, 22, 30–32], that ER-positive and ER-negative tumours were very distinct in molecular terms (Additional file 4), which may confound response prediction. Therefore, response prediction was performed on ER-positive samples alone (16 good, 13 poor responders). Error rates, however, were high (31% with 8 genes, 31% with 25 genes). Within this subset, the effect of heterogeneity of tumour content on prediction appeared to be more pronounced (error rates: 31% (all samples, n = 29) versus 4.5% (≤ 60% tumour, n = 22)).

Incorporation of histological information into the predictor

To explore further the effect of biopsy composition on prediction error, we attempted to create a single predictor for all samples by adding information about tumour content to the predictor as an extra dimension or 'histology gene'. The impact of variation in biopsy tumour content on prediction error was thought to be due partly to the fact that genes that discriminate between good and poor responders in high-tumour-content biopsies are often poor discriminators in low-tumour-content biopsies (and vice versa) as is evident from Table 4. To overcome this complication, 10 genes were selected that showed 100% support in the subset permutations (Table 4). By adding the proportion of tumour content to these 10 genes as an 11th dimension, we aimed to move the samples apart in 'prediction space' such that differences in tumour content also contributed to the distances between samples. Thus, samples with similar expression profiles become nearest neighbours of similar proportion of tumour content as well. In the nearest neighbour predictor, like is matched with like in terms of proportion of tumour content as well as expression profile.

It was necessary to rescale the histology gene's contribution to the distance between samples because the standard deviation of most genes lay between 0.5 and 1.5 (log2 ratio scale). Therefore, the percentage tumour figures were standardised by subtracting the average and dividing by the standard deviation. The 'histology gene' or 11th dimension was then rescaled from 0 to 4 standard deviations, and the effect on cross-validation error determined (Figure 2). After an initial drop in error, as the scale approached 1.0, the error rose beyond the initial error as the influence of the histology gene became too dominant. To test whether the effect of the histology gene was specific to the particular set of 10 genes used above, we tested the effect of adding the histology gene to 1,000 predictors, each containing a random permutation of between 8 and 16 genes drawn from the top 20 genes ranked by a combined score (Additional file 6 a-c). In 91% of these, the addition of the histology gene resulted in a lower error; in 49%, the error drop was greater than 10%.
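The construction of the 'histology gene' can be illustrated with the Python sketch below. It is a minimal sketch under stated assumptions rather than the code used in the study: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and 'rescaling' is interpreted as multiplying the standardised percentage tumour values by a scale factor so that the extra dimension has a standard deviation equal to that factor.

```python
import math


def standardise(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def add_histology_gene(profiles, percent_tumour, scale):
    """Append scaled, standardised percentage tumour as an extra dimension.

    With scale = 0 the extra dimension contributes nothing to distances;
    larger values give tumour content more weight relative to the genes.
    """
    z = standardise(percent_tumour)
    return [list(p) + [scale * zi] for p, zi in zip(profiles, z)]


def euclidean(a, b):
    """Euclidean distance in the augmented 'prediction space'."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For instance, three samples with identical expression profiles and tumour contents of 20%, 50%, and 80% have standardised values of -1, 0, and 1, so at a scale of 2 the first and last samples lie 4 units apart purely because of their tumour content. Repeating the cross-validation over a range of scale values corresponds to the sweep from 0 to 4 described above.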

Figure 2 Variation in error rate according to scale of 'histology gene'. An 11th ('histology') gene was added to a 10-gene predictor as an extra dimension. This gene was rescaled from 0 to 4 standard deviations, and the effect on cross-validation error calculated.

Discussion

In this study, we set out to explore the possibility that the transcriptional profile(s) of breast tumours relates to sensitivity to neoadjuvant chemotherapy. Such a profile might be useful in understanding the molecular mechanisms determining response or resistance and would provide the basis for a predictor of chemotherapy response.

However, the biopsy material used in this study had a complex cellular composition. In planning this study, consideration had been given to two approaches for ensuring that the percentage invasive tumour within biopsies was 'sufficient' and homogeneous. Cell selection, either by gross dissection or laser capture microdissection [33], allows enrichment of the invasive tumour component. We chose, in common with others [6, 7, 11, 13, 14, 31], to set a minimum threshold for 'percentage malignant cells' within a given biopsy. The median percentage invasive tumour for the core biopsy samples was 50%, and the range of figures that were included in the study was 20% to 95%. In many biopsies, the dominant non-tumour component, connective tissue, was admixed with epithelial components, making enrichment of the malignant compartment difficult using gross dissection alone.

We hypothesised that this 'contamination' of biopsies by significant and variable amounts of non-tumour components might confound tumour classification. Indeed, at least 10% of the genes (144 genes) remaining after preprocessing of the data were found to correlate with cellular composition using a SAM regression analysis. Genes relatively overexpressed in low-percentage tumours included established stromal-related genes (Table 2) (for example, Collagen (type XV, alpha-1) and Cadherin 5 (type 2, VE-cadherin)). The exact source (histological compartment) of production of a given RNA could be further confirmed by FISH (fluorescent in situ hybridisation) or by comparing gene expression in microdissected stromal and tumour compartments.

However, paired biopsies, despite differences in proportional non-tumour content (Additional file 3), co-aggregated on cluster analysis (Additional file 1), suggesting a dominant 'tumour profile' despite variation in the proportion of stroma. Furthermore, prediction of ER status was not confounded by marked variation in percentage tumour content. This is in keeping with a number of studies that have shown that strong differential expression of a relatively high proportion of genes correlates with ER status [6, 7, 20, 30–32], which would be expected to result in domination of the expression profile despite variation in the contribution by the non-tumour component. Furthermore, it has not been demonstrated that the ER expression signature is derived entirely from tumour cells. However, variation in the proportion of stroma may be sufficient to mask more subtle aspects of the tumour expression signature.

We found that the error rate for response prediction for the whole sample set was poor (28%) but was improved by increasing the homogeneity of cellular composition by subsetting on the basis of histological composition (error rate, 8%–13%) (Table 3). The misclassification rates for the subsets were determined with LOOCV and therefore represent high variance estimates, and the sample numbers are modest. We have used permutation analysis as support for the error estimates; ultimately, however, validation with an independent data set would address these issues.

We did not find any evidence to suggest that highly stromal biopsies result in higher prediction error. However, stromal-tumour content appears to affect the selection of genes that are used in the predictor for each histological subset, resulting in different but overlapping lists of predictive genes (Table 4). Some genes discriminated response in the 'high percentage' and not the 'low percentage' samples (for example, PBEF1). This may simply be a dose effect whereby discriminatory tumour-associated genes are no longer differential in 'low percentage samples' due to low signal. Alternatively, discriminatory genes that are expressed in both tumour and non-tumour compartments may lose discriminatory potential in tumours with a significant stromal contribution to the molecular signature. Genes that are discriminatory in low but not high 'percentage samples' (for example, SOD1) could be expressed only at the tumour-stroma interface in stromal and/or tumour cells. Certainly, breast tumour-induced changes in stromal expression have been previously documented [34]. Furthermore, it has been shown that tumour gene upregulation can occur specifically at the tumour-stroma interface [35, 36]. Finally, it is also likely that the volume and configuration of stromal tissue within a tumour are a reflection of the tumour molecular subtype.

Thus, this analysis resulted in three distinct response-prediction gene lists that partially overlapped. PBEF1, reported in one study to act as an inhibitor of apoptosis in neutrophils [37], appeared in the predictor for response for 'high percentage' tumours and has also been reported as a component of a three-gene predictor of AC sensitivity by another group [13]. In both studies, expression of this gene was higher in the resistant tumours than in sensitive tumours. SOD1 was found to be relatively overexpressed in resistant tumours ('low' and 'mid percentage') in keeping with a proposed role in the neutralisation of free radicals, one means by which anthracyclines are thought to inflict cellular damage [38]. NDRG1, upregulated in poor responders ('high percentage'), is induced by hypoxia and may reduce p53 expression [39]. Hypoxia may be a marker of poor vascularisation of tumours and therefore possible limitation of drug access.

Although here we have defined histological subsets, we have shown that it might be feasible to build a predictor that extracts information about the cellular composition of the biopsy from expression data. We found that adding standardised percentage tumour values as a 'histology gene' (Figure 2) to multigene predictors (Table 4) reduced the error rate significantly, supporting the idea that it may be possible to devise a predictor that operates regardless of biopsy composition. This would avoid the need for microdissection to enrich for malignant cells. Furthermore, if stromal or stromal-interface gene expression does carry discriminatory information, then microdissection would result in loss of predictive information.

Several groups have attempted to define a multigene predictor of chemoresponsiveness [10–15], using either clinical or pathological definitions of response. The reported error rates in these studies, as assessed on an independent data set or by LOOCV, range from 5% to 30%. Comparison of gene lists and error rates across these studies and with ours is hampered by the fact that the treatment regimens, response definitions, and microarray platforms differ and that the histological composition of the samples in most studies was not presented.

A number of samples used in our study have also been profiled using an Affymetrix platform (Santa Clara, CA, USA) [40]. In this independent analysis, 12 samples from tumours displaying either cCR or pCR and six samples from tumours with residual tumour greater than 70% were used to define a classifier (samples with less than 40% tumour were excluded from the study). The error rate of prediction on LOOCV was 33% (p = 0.4 on permutation).

Therefore, to date, the reported error rates associated with expression array prediction of response, certainly for anthracycline combination chemotherapy, remain too high for clinical utility. This may be due in part to the fact that breast cancer is an extremely complex and heterogeneous disease that operates multiple mechanisms of chemotherapeutic response and resistance which are not consistent across different subtypes, particularly in the case of combination regimens incorporating agents that act by multiple mechanisms. Such complexities mandate that study designs be optimised in terms of biopsy quality and sample size.

Conclusion

The response prediction using all pre-treatment biopsies was modestly effective. However, the percentage of invasive cancer cells within a sample influenced the expression profile. Response prediction on subsets of samples more homogeneous in terms of cellular composition was associated with lower error rates. We believe that it is essential that consideration be given to biopsy composition in planning future studies of this type using methods such as those discussed above. Larger studies are required to establish whether optimal accuracy of response prediction may be achieved by the development of profiles specific to immunohistochemical breast cancer subtypes.