Background

The phenotype of a cell is determined by its transcriptional repertoire, a result of combinations of transcriptional programs partly set during lineage determination and partly activated in response to intrinsic and extrinsic stimuli. Microarray hybridization experiments permit a quantitative analysis of this transcriptional repertoire in response to defined experimental conditions. A particularly interesting case of study is given by the transcriptional repertoire of human tumors. Here, the objective is usually the search for cancer subtypes for individualized prognosis and/or therapy. The questions most frequently asked are whether samples can be automatically grouped, in the absence of additional information, into biologically relevant phenotypes; and whether transcriptional programs can be unveiled that can explain such phenotypes. It must be noted that this situation (sample clustering and relevant gene extraction) is difficult mainly due to three reasons [1]: the sparsity of the data (samples), the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant (low signal-to-noise ratio). It has been pointed out that, due to the low signal-to-noise ratio, the quality and reliability of clustering may degrade when using standard hierarchical clustering algorithms or similar approximations [2]. Similarly, model-based clustering methods encounter problems due to the sparsity of the set and its high dimensionality, leading to overfitting during the density estimation process [3]. Additional difficulties are encountered during the selection of features (genes) relevant to the sample cluster structure, since most clustering methods produce non-overlapping gene clusters. This behaviour may distort the extraction of biologically relevant genes in cases where expression patterns overlap several classes of samples or experimental conditions, a reflection of the dependence of the expression of most genes on multiple signals and their participation in more than one regulatory network.

Three main strategies have been taken in sample-based clustering: unsupervised gene selection, interrelated clustering and biclustering [1]. The first views gene selection and sample clustering as basically independent processes, the second dynamically uses the relationship between gene and sample spaces to iteratively apply a clustering and selection engine, while the third tries to cluster both genes and samples at the same time in a reduced space. For the first one, principal components analysis (PCA)[4] has been proposed. PCA, a well known dimensionality reduction technique, has been criticized because the sample projection in the low-dimensional space is not guaranteed to yield optimal sample partitions, particularly when the fraction of relevant genes specific to each cluster is small. As for the second approach, several novel methods have been proposed recently based on various greedy filtering techniques (for a review see [1]), but it has been suggested that they may group the data based on local decisions [1]. Finally, different biclustering methods have also been applied to this situation [58], but a difficulty with most biclustering tools is that they generate non-overlapping partitions.

Here we apply Factor Analysis (FA) [9], a multivariate tool related to PCA, coupled to clustering algorithms in sample space, t-test scores in gene space and data mining procedures. Q-mode (i.e. in sample space) FA is a latent variable modelling tool [9] that assumes that the observed gene expression levels are the result of a linear combination of an unknown number of independent underlying global transcriptional programs, called latent variables or factors (Figure 1). The contribution of each factor to the expression levels of the genes in each sample is given by the elements of the loadings matrix (arrows in Figure 1). Each sample contains, in addition, a given amount of expression that cannot be modelled by the latent variables, for example due to the presence of noise. FA models the covariance of a data matrix, as opposed to PCA, which attempts to summarize the total variance. Covariance in the mRNA expression levels has been shown to occur in proteins involved in related pathways and functions, as well as in proteins co-locating to the same organuli in the cell, and may be indicative of common regulatory mechanisms at the expression level[10]. By contrast, the specific variance in the expression of a given gene, not associated with the rest of the genes in the sample, is most likely related to artefacts in the chip or in data handling. We couple FA dimensionality reduction to clustering algorithms [11] to obtain clusters in sample space. For gene extraction, a multiple-testing corrected t-test (the so-called q-value) is employed. Finally, the genes assigned to each cluster are used for the detection of pathways predominantly activated in that cluster by finding statistically significant the GO [12] or GenMAPP terms associated to each cluster.

Figure 1
figure 1

Graphical overview of Q-mode Factor Analysis (FA) [9]. Each sample is described by a vector x i , containing the expression levels for all genes in the chip. The complete expression for all samples is contained in the matrix X = {x i }. The expression levels of each sample are assumed to be generated by a linear combination of a small number of underlying transcriptional programs, the latent (non-observable) variables, contained in the set of vectors {F i }, forming matrix F. The relative contribution of each program is given by the thickness of the arrows connecting factors and samples, stored in variables lij, altogether forming the loading matrix L. Each lij element can be understood as the correlation coefficient between the expression levels of the sample and the corresponding latent variable. Residuals are kept in vectors {ε i }, giving rise to matrix E. Note that small loadings connecting a given sample (i.e., X 4 with the factor model implies large residuals.

We first tested the approach by using a published dataset of microarray experiments in yeast [13], and then applied it to the analysis of human prostate cancer samples [14]. The yeast dataset is particularly relevant because the biochemistry of S. cereviseae is relatively well understood in comparison with other eukaryots, and the data set has been previously analyzed with other clustering techniques. From the application to the prostate cancer dataset, a number of significant gene outcomes highlighted by the algorithm have been corroborated experimentally a posteriori by expression analysis on an independent set of samples. The biological interpretation of the results lead us to propose that two major pathways are predominantly activated in organ-confined, non-metastatic prostate cancer: those regulated by androgen receptor and by receptor tyrosine kinases. We close this paper by discussing the implications of these findings.

Results and Discussion

Testing FADA with the yeast expression dataset

Our procedure is coded in a software package, FADA. We first tested FADA by analyzing the dataset of Gasch et al [13], who studied the transcriptional responses of S. cerevisiae to a variety of stress stimuli. The main results are in Table 1, and the genes most significantly associated to the clusters are in Table 1 of the Supporting Information. We discuss in what follows only the most salient features of the analysis for this set, corresponding to clusters 1, 2, 6, 8 and 16, from a biological viewpoint.

Table 1 Results of the analysis of the yeast dataset [13]. The different clusters found by FADA are shown, together with the significant GO terms associated to them. The samples belonging to each one of the clusters are also shown. The first column shows the cluster number; the second shows the conditions associated to that cluster; columns 3 to 5 show the Z-score of the GO terms associated to the cluster (see Methods) at the Cellular Component (CC), Biological Process (BP) and Molecular Function (MF) levels; columns 6 to 8 show the corresponding GO terms.

Cluster 1 encompasses responses to heat shock, DTT (late), sorbitol (early response), stationary culture (late), and overexpression of Msn2p and Msn4p. The significant GO terms [12] automatically detected by FADA indicate that this grouping is related to a common environmental stress response (Table 1) (ESR or CER in the case of S. cerevisae, CESR in the case of S. pombe [13, 15, 16]). Inspection of the top selected genes (Table 1, supplementary material) confirms this assignment, as most of them are known to be transcriptionally regulated through the stress response element (STRE), recognized by Msn2p and Msn4p [17, 18]. Thus, Cluster 1 corresponds largely to a "core" ESR, induced by a variety of stimuli, including "early" time points of osmotic stress and "late" time points of DTT treatment and stationary culture. A relatively late induction of ESR by DTT has been noted previously, with suggestions that ESR could be a secondary response to the exposure of this reducing agent [13]. Conversely, hyperosmotic shock is known to induce a rapid and strong expression of ESR [13, 15].

Clusters 2 and 16 correspond to responses to three oxidizing agents: hydrogen peroxide, which generates peroxides and hydroxyl radicals; menadione, a generator of superoxide; and diamide, a thiol reducing agent. FADA groups together responses to H2O2 and diamide (Cluster 2), while defining a distinct group for responses to menadione (Cluster 16). It is well known that several organisms use distinct sensing and response systems to discriminate among different degrees of oxidative injury. In S. pombe, reponses to low concentrations of H2O2 and to diamide depend on the b-Zip transcription factor pap1, while reponses to higher concentrations of H2O2 utilize a different transcription factor, atf1 [19]. The S. cerevisiae homologue of S. pombe pap1 is Yap1p, a transcription factor regulated by oxidation, formation of intramolecular disulfide bonds at its carboxy terminus and nuclear translocation upon exposure to H2O2 and diamide [2022]. On the other hand, of the b-ZIP proteins in S. cerevisiae, Yap3p is most similar to atf1 in its carboxy terminus, suggesting that both atf1 and Yap3p could be subject to a similar redox regulation. Interestingly, YAP1 is upregulated in Cluster 2 (H2O2 and diamide), but not in Cluster 8 (menadione), while Yap3 is upregulated in the latter Cluster (Table 1 of Supplementary Data). Moreover, several of the genes most relevant to Cluster 2 are known to respond to mild oxidative stress, and are controlled by Yap1p [23]. The statistically significant GO-terms selected are related to "oxidoreductase" and "peptidase" activities. This includes genes regulating the thioredoxin and glutathione biosynthesis, genes for heat shock proteins, and a large number of genes involved in proteasome function and ubiquitin-dependent protein degradation (Table 1 of Supplementary Material).

Cluster 6 includes cultures at late times of nitrogen starvation. Many of the relevant genes in this group code for enzymes for the utilization and enhanced transport of poor nitrogen sources, such as allantoin or urea (Table 1 of Supplementary Material). Other upregulated genes include those required for different stages of meiosis (chromosome pairing, recombination and segregation; anaphase; or nucleokinesis), sporulation, autophagy, or genes that regulate vesicle and peroxisome structure and dynamics. Among these genes are also transcriptional regulators with major roles in the control of several of these processes, such as UGA3, DAL81 (allantoin metabolism), or IME1, RIM101 and SPO1 (meiosis and sporulation). This is consistent with the development of a classical response to nitrogen starvation in the absence of fermentable carbon sources, which leads to meiosis and sporulation [2427]. FADA also suggests that this response to nitrogen starvation becomes most prominent at relatively late times, when it can be distinguished from the early, relatively non-specific response to nutrient deprivation [25, 26]. In fact, FADA finds "transcription" and "sporulation" as significant GO-terms (Table 1).

Cluster 8 aggregates samples from early stages of both early response to aminoacid and nitrogen starvation. FADA finds a significant overrepresentation of genes for amino acid biosynthetic pathways (Table 1), consistent with the fact that deprivation of nutrients, including nitrogen and carbon sources, is recognized by several sensing systems regulating rapamycin-sensitive TOR kinase [28]. This lipid-dependent kinase derepresses translation of the GATA transcription factor Gcn4p [29, 30], which controls expression of many genes, including enzymes involved in amino acid biosynthesis [31]. Thus, the selection of genes in Cluster 8 is consistent with known Gcn4p-dependent responses to nutrient and nitrogen starvation [31].

Altogether, these results indicate that the automatic analysis provided by FADA yields results consistent with the known biochemistry of yeast.

Application of FADA to the prostate cancer dataset

We next applied FADA to the dataset published by Welsh et al. [14], for the analysis of transcripts associated with prostate cancer. Samples were classified into two major branches: samples from cultured cells, and samples from tissues, which in turn could be further bifurcated into two well-supported branches, one corresponding to samples enriched for carcinomatous cells and one for non-neoplastic prostate cells (Figure 2). The first-level grouping into cultured vs. non-cultured samples most likely reflects the profound impact of culturing procedures on the transcriptional profiles of the different cell types. Within the cultured cells subgroup, samples were generally clustered according to cell type, with haematopoietic cell lines forming well-clustered groups and epithelial and fibroblastic prostate-derived cells clustering together with endothelial cells. A separate cluster was formed by the androgen-sensitive epithelial cell line LNCaP, the prostate cancer cell lines included in the study. The genes most significantly contributing to each sample cluster were analyzed for their participation in the pathways contained both in GenMAPP [32], and GO (Tables 2 and 3). Since pathway categorization is a difficult problem, as partition of the global interaction network in "parts" inevitably introduces artefacts, we also proceeded to a detailed, gene-by-gene inspection of the most discriminative genes based on inspection of literature data.

Figure 2
figure 2

Dendrogram for the Welsh dataset [14]. The dashed line indicates the thresholding used to define the clusters.

Table 2 Results of the analysis of the Welsh dataset for up-regulated genes. The different sample clusters found by FADA are shown, together with the significant GO and GenMAPP terms associated to them. The first column shows the cluster number; the second shows the samples associated to that cluster; columns 3 and 4 show the z-score of the GenMAPP and GO terms associated to the cluster (see Methods); columns 5 to 8 show the corresponding GenMAPP and GO terms selected.
Table 3 Results of the analysis of the Welsh dataset for down-regulated genes. The different sample clusters found by FADA are shown, together with the significant GO and GenMAPP terms associated to them. The first column shows the cluster number; the second shows the samples associated to that cluster; columns 3 and 4 show the Z-score of the GenMAPP and GO terms associated to the cluster (see Methods); columns 5 to 8 show the corresponding GenMAPP and GO terms selected.

Cluster 1 corresponds to the LNCaP cluster. It is placed in a branch distinctly separated from the rest of the cultured prostatic cells. LNCaP cells were originally derived from metastatic prostate cells, presumably of epithelial origin [33] and respond to androgens through its cognate receptor [34]. FADA found significant overrepresentation of upregulated genes coding for proteins that participate in electron transport and ATP generation, both when using GenMAPP and GO annotations (Figure 3, 4 and Table 2). Other sets of genes likely relevant to the LNCaP cluster, but not highlighted in the pathway mapping protocol, are those for proteins in steroid metabolism and signalling, such UDP glycosyltransferases B15 (Table 2 of the Supporting Information). Cluster 2 includes mesenchimal, epithelial, and endothelial cells. This cluster shows a bias for genes and pathways involved in ubiquitin and proteasome-dependent protein degradation, cell cycle regulation, inflammatory responses and cell-matrix interaction. Cluster 3 (hematopoietic cells) showed a significant bias in genes and pathways involved in cell cycle regulation and RNA processing. The selected genes included known markers of differentiation of B cell, T cell or myelomonocytic lineages. Examples are genes for immunoglobulin, histocompatibility antigens, haematopoietic-specific cytokines and their receptors, and regulatory proteins known to play significant roles in such lineages in processes such as signal transduction or cytoskeletal dynamics.

Figure 3
figure 3

Expression levels for the 20 most relevant genes selected in each cluster for the Welsh dataset. Gene descriptions can be found in Table 2 of the Supporting Information. A) (See Figure 3) Up-regulated; B) Down-regulated. (See Figure 4)

Regarding Cluster 4 (prostate tumor tissue), GenMAPP mapping finds significant overexpression of enzymes related to fatty acid metabolism (Table 2). Other genes and KEGG pathways with a significantly biased association with cluster 4 are those for ribosomal function and fatty acid synthesis (Table 2 of the Supporting Information). The upregulation of these two functions in prostate cancer has been noted previously [14, 35]. In addition, GO mapping finds overrepresentation of genes for proteins directly involved in steroid receptor recognition, including androgen receptor and estrogen receptor β. This is confirmed by a survey of the list of selected genes, where one can find a number of proteins involved in steroid signalling, including the coactivators GRIP1 and NRIP1, and genes that have been described as transcriptional targets of these pathways [36], such as the secreted proteases KLK2 and KLK3, and protein IQGAP, involved in cytoskeletal dynamics [37], or the enzymes fatty acid CoA-ligase or androgen-regulated short chain dehydrogenase (Table 2 of the Supporting Information). A second group of genes significantly contributing to this cluster are those for cell surface polypeptide growth factor receptors, associated signalling molecules and regulators, and known transcriptional targets for these pathways. These include the receptor tyrosine kinase partner ERBB3 (HER3), the calmodulin-dependent kinase activator CAMKK2, or the signalling modulators RAPGA1 and PDE3B (Table 2 of the Supporting Information).

Finally, the highest ranking genes for samples from normal prostate tissue (Cluster 5) correspond, according to GO, to proteins involved in the control of cytoskeletal architecture and dynamics in muscle cells (Table 2). GenMAPP finds a significant overrepresentation of muscle-associated functions. The implication is that, in these experiments, normal prostate tissue samples possibly are strongly enriched for muscle cells. This strong overrepresentation of genes corresponding to a smooth muscle phenotype suggests that the non-neoplastic tissues used correspond to areas of prostate hyperplasia or adenoma derived from the transition zone, in which smooth muscle cells are often major contributors [38]. In practical terms, this suggests that these experiments may be used with caution in the comparison of tumor epithelial cells with corresponding normal epithelial counterparts.

In recent years, several transcriptional profiling studies have been performed in prostate cancer, aimed at the identification of novel tumor markers [14, 3941] or prognostic signatures [4244]. So far, only one study has systematically searched for overrepresented biochemical pathways in a meta-analysis of four previously published prostate cancer transcriptional profiling studies [45]. This study used KEGG as reference pathway database, which is biased towards metabolic pathways [46]. Our study, however, focuses on GenMapp and GO terms, and therefore on the identification of signalling pathways.

Signalling pathways in prostate cancer and their experimental validation

In order to validate the pathways found to be overrepresented in prostate tumor samples, we used real-time RT-PCR. We chose for our analysis the genes for hepsin, KLK3 (PSA), ERBB3 (HER3), IQGAP2, and POR/ARFAPTIN2. Hepsin was found to be overexpressed in most tumor samples, and validated by immunohistochemical analysis [14]. This gene has been shown to be overexpressed in prostate cancer by several other groups. KLK3 (PSA) is the marker par excellence of prostate epithelial activity and cellular bulk, and detection of its serum protein levels is the best available marker for monitoring prostate cancer [47]. HER3 is a receptor for the paracrine growth factor neregulin-1, and a transmembrane protein that tethers the ligand to its dimerization partners, the receptor tyrosine kinases HER2 and HER4 [48], and known to play important roles in the development and progression of the malignant phenotype in breast cancer [49]. The abnormal expression and activity of HER2 has been studied extensively in the context of prostate cancer [50], being found overexpressed in advanced tumors, either metastatic or homone-independent, but infrequently in primary, organ-confined tumors. More controversial is the information available on the role of HER3, with reports of its overexpression in prostate cancer together with HER2, HER4, or both [51, 52], but also of its overexpression only in metastatic tumors, in particular of a truncated form corresponding to the extracellular domains of HER3 [53]. Furthermore, several transcriptional profiling analyses have found overexpression of this gene in prostate cancer. IQGAP2 is a calmodulin-binding protein that participates in cell signalling and modulation of cytoskeletal dynamics [37], and its activity has been reported to be positively [54] and negatively associated with neoplastic phenotype. POR1/ARFAPTIN2, a Rac1-interacting protein [55], is a regulator of cytoskeletal dynamics that so far has not been associated with any particular type of neoplasia.

The results of semiquantitative real-time RT-PCR on our samples indicate that hepsin is significantly overexpressed in 14 out of 14 cases, IQGAP2 in 8 of 14, and HER3 in 10 of 14 cases (Figure 5). Other genes analyzed, such as KLK3 (PSA), HER2 or the steroid receptors androgen receptor, estrogen receptor α or estrogen receptor β are less frequently overexpressed in these tumors. Levels of desmin transcripts were determined as an index of the contribution of stromal cells, suggesting that the overexpression of the analyzed genes are detected in tumor samples even in the presence of substantial stromal contamination (Figure 5). Of particular interest is the observed upregulation of HER3 in prostate tumor tissues relative to normal tissues. The HER3/ErbB3 protein has impaired intrinsic kinase activity [56], and it appears to function in signal transduction by tethering the ligand to other members of the HER family of receptors, with preference for HER2/ErbB2 [57]. Increased levels of expression of HER3 are seen in many tumors that express HER2 [58], and it is widely assumed that the signalling and/or oncogenic functions reside in the corresponding heterodimer, rather than in either individual receptor [59, 60]. Recent experimental evidence further highlights the importance of HER3 in conferring a malignant phenotype and a hormone-refractory state to prostate epithelial cells [61]. Thus, whenever HER3 is expressed it is reasonable to expect co-expression of at least one other member of the HER family. Therefore, we determined by real-time RT-PCR the relative expression in our prostate tissue samples of the genes for all four members of the HER family of receptor tyrosine kinases. Our results show that HER4 is expressed at increased levels in 10 of 14 prostate tumor samples (Fig. 5A, B), whereas HER2/ErbB2 and EGFR are overexpressed in 3 of the 14 samples analyzed. Seven samples simultaneously overexpressed HER3 and HER4, of which 2 overexpressed all four members of the HER family (Fig. 5A, B). None of the samples overexpressed the pairs HER3 and HER2, or HER3 and EGFR, without overexpressing at the same time one of the other members of the family (Fig. 5A, B).

figure 4

Figure 4

Figure 5
figure 5

Validation of genes selected by FADA from the Welsh et al. dataset [14] as overexpressed in prostate cancer. (A) RT-PCR was applied to 14 paired prostate tumor – normal prostate samples to determine the expression levels of a selection of genes shown by FADA as significantly overrepresented in prostate cancer (HPN, KLK3, IQGAP2, POR1 and HER3), and additional genes relevant to this tumor (genes for the receptor tyrosine kinases EGFR, HER2, HER4, and genes for the steroid hormone receptors AR, ERα and ERβ). The expression values for each gene, previously normalized with respect to the S14r expression level in each sample, are shown as ratios of the normalized values in prostate cancer vs. values in the matching normal prostate tissue. Quantitation of desmin expression levels was used to assess the degree of contribution of stromal components in the samples analyzed. Values equal to or above 100-fold are shown as 100. (B) Heatmap representation of the same data (color scale as shown below). (C) Real-time PCR analysis for HER3 transcript levels of laser microdissected tumor and normal samples, compared with relative transcript levels in enriched (non-microdissected) tissues from the same cases. (D) Immunohistochemical analysis of HER3 on paraffin-embedded prostate tissue sections arranged in tissue microarrays (see Methods). Left, low magnification image (×100) of one case, with weak staining for HER3 in normal glands (n), and a strong staining in tumor epithelial cells (t). Right, higher magnification (×400) of a second case.

As mentioned in the Methods section, both tumor and normal tissues were carefully chosen to have similar representation of epithelial compartment. However, to further ensure that the observed expression of HER3 was not due to a dilution effect of normal epithelial cells by stroma, we performed real-time PCR analysis of laser microdissected samples. For this, we selected four samples that had shown overexpression of HER3 in the enriched tumor samples described above, and two that had levels that did not differ significantly from non-tumor containing (normal) matched tissues. Of the four samples in which the enriched tumor tissue had shown increased levels of HER3 transcript, three microdissected samples overexpressed HER3 (Fig. 5C). In two of the microdissected samples, HER3 transcript levels were equal in normal and tumor microdissected epithelia, and this also corresponded to samples in which HER3 levels did not differ significantly between enriched tumor and normal prostate tissues (Fig. 5C). This analysis showed that overexpression of HER3 in prostate tumor tissues is not due to simple enrichment of epithelial cells in comparison with non-tumor tissues. To further confirm the cell type expressing HER3 in prostate tissues, immunohistochemical analysis with a monoclonal antibody to HER3 was performed on 16 prostate samples, arranged in duplicate 1-mm diameter cores in tissue microarrays, in which both tumor and normal glands were present. HER3 protein was found clearly overexpressed in tumor epithelia in 13 of the 16 cases (81.2%), showing juxtamembrane and finely granular cytoplasmic patterns (Fig. 5D). In all cases, normal epithelia showed weak reactivities for HER3 (Fig. 5D).

In summary, our transcriptome re-analysis, validated by real-time RT-PCR of non-microdissected and microdissected samples and by immunohistochemical analysis, significantly reinforces previous immunohistochemical studies that reported high levels of expression of HER3 and HER4 in primary prostate cancer [51, 52].

Conclusion

We have shown that the method presented here for the analysis of expression microarray data permits the classification of samples into meaningful categories and, simultaneously, to identify a subset of genes and their assignment to pathways most significantly contributing to the corresponding phenotypes, while allowing for a given gene to participate as significant in more than one cluster of samples. The analysis of the yeast dataset validates the approach. Our results are consistent with biochemical pathways known to be activated in the different stress conditions analyzed, and the clustering of samples reflects the underlying similarity of the biochemical responses. In the application to the prostate cancer dataset, we have found that two pathways, one modulated by androgen receptor and a second one by signals that originate from cell surface growth factor receptors, are prominently active in the organ-confined, non-metastatic prostate cancer samples analyzed. The latter pathway has been reported to be spuriously active in at least a subset of prostate tumors that have progressed to invasive and hormone-independent states [62]. Our results suggest that such altered activation may already be present in primary tumors. Although a prevailing model for prostate tumor progression is that acquisition of the capacity for metastatic and hormone independent growth proceeds through selection of rare populations of cells concealed among primary tumor cells, there is also evidence that a transcriptional program for metastasis may already be present in the bulk of primary tumors at the time of diagnosis [63, 64]. Our analysis would be more consistent with the latter model.

Finally, we have unveiled and validated several markers highlighted by the analysis of the prostate cancer dataset. While several of these genes were identified in the original analysis of the data [14], others are revealed here, notably HER3, IQGAP2 and POR1, the biologically most relevant being HER3. With an external dataset, we have found that prostate cancer samples frequently co-overexpress HER3 and HER4, accompanied less frequently by increased expression of EGFR or HER2. Overexpression of HER2 and consequent increased signalling have been associated with advanced prostate cancer, development of hormone independent state and poor prognosis [65, 66], but is infrequently observed in primary tumors [67, 68]. On the other hand, our results suggest that, in primary prostate cancer, HER3, together or not with HER4, rather than receptor complexes involving HER2, could play important roles in the biology of these tumors.

Materials and methods

Datasets

The S. cereviseae dataset consists of transcriptional responses of the yeast S. cerevisiae to environmental stress [13]. It originally consists of spotted array measurements of 6152 genes in 173 experimental conditions that include temperature shocks, hyper and hypoosmotic shocks, exposure to various agents such as peroxide, menadione, diamide, dithiothreitol, amino acid starvation, nitrogen source depletion and progression into stationary phase. Log-ratios were preprocessed following several steps: first data from genes with missing values were filtered out, and their missing values estimated with LSimpute [69] using the 'Adaptive' method. Next, ratios were computed from the log-ratios and quantile-normalized (experiment-wise) using the normalizeQuantile function from the R package [70], so that all experiments had the same average sample distribution. Finally, ratios were log transformed again.

The prostate cancer dataset chosen is described in [14]. It was originally obtained by hybridizations on Affymetrix U95A oligonuleotide arrays with probes for a total of 55 samples. Intensity values were preprocessed following several steps: first intensity data were thresholded, with intensities below 10 fixed at 10 and values above 16000 fixed at 16000. The thresholded values were log-transformed and then centered by the median of all experiments. Finally, genes were subjected to z-transformation (per gene basis).

Determination of genotypically coherent groups of samples

Q-mode Factor Analysis (FA) [9] seeks to find an underlying orthogonal factor model of an original X-matrix nxm (where n are the number of samples and m the number of mRNA levels measured) of the form:

X = LF + E

L is the loadings matrix of size nxk, where k is the number of factors, and F the scores matrix of size kxm, while E is the residual matrix, which contains both the specific variance of the individual genes and the errors in the model (see Figure 1). We used the so-called principal factor solution to solve this factor model. Specifically, in a first step, and based on the correlation matrix R derived from X, communalities (i.e. the proportion of the variance explained by common factors) were computed from the multiple squared correlation coefficient between the ith variable and the rest. These communalities replaced the diagonal entries of the correlation matrix, which was subjected to diagonalization. New communalities were computed from the loadings at the chosen dimensionality, obtained by scaling the eigenvector matrix (P), as follows:

L = P Λ 1/2

The new communalities again replaced the diagonal entries, and the process was iterated until convergence. Finally, we proceeded to rotate the factor loadings by means of a varimax rotation [9]. The effect of this rotation is to maximally align each of the samples with one factor in order to simplify the factor model and make it more readily interpretable. Phyletic trees were derived by clustering samples in loadings space at the optimal dimensionality using average linkage [4, 11]. When needed, bootstrap values were computed by selecting random subsets of 90% of the genes [71]. Distribution of trees and frequency of each branch in the original tree were recorded using CONSENSE, program included in the PHYLIP package [72].

Selecting genes associated to each cluster

Once sample clusters are defined, these are used to identify groups of genes contributing heavily to the specific character of different groups. Each gene on the list is subjected to a Student's t-test that measures the differential expression of the gene in the cluster as compared with the rest of the samples. t-test scores were transformed to q-values, which include multiple testing correction. The q-value is similar to the well known P-value, except that it is a measure of significance in terms of the false discovery rate, rather than the false positive rate [73]. Genes with a q-value < 10-4 were taken as differentially expressed for that particular cluster.

Assigning pathways to gene clusters

The association between selected genes and biological functions was established by determining the hypergeometric distribution of genes on the annotation databases GO [12] or GenMapp [32]. With this distribution we computed the probability that at least x genes annotated within a given biological function according to GO (or GenMapp) in a cluster of size n (the total number of genes per cluster selected in the previous step) can be obtained by chance, given a population of N genes under consideration and given A, the total number of genes within N with that particular annotation. These P-values are obtained according to:

An aggregated score for each cluster from the significant P-values (i.e., those below 10-2) is computed as follows:

s0 = ∑-ln p(x; N, A, n)

The significance of this score is established by simulation. We randomly selected 100 samples of size n genes each (the number of genes per cluster selected according to the q-value) and computed a new s-score (s r ) for each one. The Z-score is finally computed as:

z = (s o - <s r >)/σ r

Z-scores > 2.0 are taken as indicative of significant association between the samples in the cluster and the set of pathways uncovered.

We should emphasize that in spite of the apparent intricacy of the computational procedure, the computational complexity is similar to other biclustering methods, and operates within a highly constrained parameter space: in the factor analysis part of the program only the percentage of variance employed should be set, yielding a reduced number of dimensions or latent variables, usually below 5; the number of clusters is automatically determined in this space from the c-index, and has no free parameters, and the selection of genes relevant for each cluster only depends on the cutoff employed in the q-value.

Real-time RT-PCR

We used RT-PCR with either TaqMan probes or by SYBRGreen incorporation to determine the expression levels of selected genes on samples unrelated to the original study by Welsh et al. In each instance, the tumor sample and its matching normal counterpart were obtained from the same case, upon removal by radical prostatectomy. Serial sections from all normal counterparts to the tumor tissues were stained and analyzed to confirm that normal prostate glands and epithelial cells were present in near-normal patterns, and that they contained less than 1% of cells or structures with carcinomatous appearance. In addition, samples were chosen such that the tumor and normal counterparts in each case had approximately equal representations of the epithelial compartment, as assessed microscopically. RNA was isolated from corresponding frozen serial sections, and controlled for quality on a 2100 BioAnalyzer instrument (Agilent, Palo Alto, CA). For each sample, 0.5 μg of total RNA was reverse transcribed by priming with random hexamers at 42°C for 50 minutes, followed by treatment with RNase at 37°C for 20 min. The resulting cDNAs were used as templates in PCR reactions with gene-specific primers. Real-time PCR was performed on ABI PRISM 7700 (Applied Biosystems, Foster City, CA) or DNA Engine Opticon (MJ Research, Waltham, MA) instruments. TaqMan probes and their corresponding primer sets were obtained from Applied Biosystems. Thermal cycler conditions were 95°C for 10 min and 40 cycles of 95°C for 15 sec and 60°C for 1 minute for TaqMan assays. In the case of SYBRGreen reactions, the conditions were 95°C for 15 min, and 40 cycles of 95°C for 15 sec, 55°C for 30 sec and 72°C for 30 sec. All determinations were performed in triplicate and in at least two independent experiments. Since the relative amplification efficiencies of target and reference samples were found to be approximately equal, the ΔΔCt method was applied to estimate relative transcript levels. Levels of ribosomal S14r amplification were used as an endogenous reference to normalize each sample value of Ct (threshold cycle) and normal tissues were used as calibrators for their tumoral counterparts in each case. The final results, expressed as n-fold differences in target gene expression were calculated as follows:

nTARGET = 2-[(Ct target - Ct reference)TUMORAL - (Ct target - Ct reference)NORMAL]

Laser capture microdissection

Prostate tissues were obtained by punch sections of radical prostatectomies and snap-frozen in isopentane at -50°C embedded in OCT-containing cryomolds. 8 μM cryosections were mounted onto plastic membrane-covered glass slides (PALM Mikrolaser Technology, Bernried, Germany), fixed for 3 minutes in 70% ethanol, stained with Mayer's hematoxilin, dehydrated, air-dried for 10 minutes and stored at -80°C until used. Laser catapulting microdissection was performed with a PALM MicroBeam Systems instrument. 2 to 5 × 104 normal or carcinomatous epithelial cells were collected and estimated to be >99% homogeneous by microscopic visualization.

Total RNA from microdissected samples was isolated using the PicoPure RNA Kit (Arcturus Engineering, Santa Clara, CA), with an additional DNase I digestion step (Qiagen, Valencia, CA).

Immunohistochemistry

Sixteen paraffin embedded prostate samples were evaluated for HER3 expression by immunohistochemistry on a tissue microarray. The cases were represented in duplicated 1-mm diameter cores and always included normal prostatic glands adjacent to neoplastic foci in at least one of the cores. Three μM sections of the microarray were deparaffinized, rehydrated and subjected to antigen retrieval in a pressure cooker with citrate buffer at pH 6.0 for 5 min. Slides were cooled for 15 min, washed in water and incubated overnight at 4°C with anti-HER3 mouse monoclonal antibody (Upstate Biotechnology, Lake Placid, New York). Endogenous peroxidase was quenched and slides were incubated for 30 minutes with secondary antibody (Envision, DAKO, Gostrup, Denmark). Reactions were detected after development with diaminobencidine and H2O2 for 3 min. Slides were counterstained with Harri's hematoxilin, dehydrated and mounted. As a negative control, the primary antibody was substituted for isotype-matched mouse IgG.

Access to the program

The complete procedure has been coded in a Fortran-77 program, called FADA. Remote access to the program has been enabled by setting up a web-server where the program can be executed [74].