Invasive ductal breast carcinoma (IDC) is the most common malignancy of the breast, accounting for ~80% of all invasive breast tumors [1]. Although an issue of much controversy over the years, there is now general agreement and overwhelming histopathological and genetic evidence indicating that most invasive breast carcinomas appear to develop gradually from defined precursor lesions [2]. However, it also became clear that progression toward more aggressive phenotypes is not obligatory [3]. It is further evident that many genetic abnormalities underlying tumor progression are probably phenotypically silent.

Numerous molecular genetic changes have been reported as relevant in human breast carcinogenesis, such as anomalies affecting cell proliferation, apoptosis and invasion [4]. Preinvasive breast lesions such as high-grade ductal carcinoma in situ (DCIS) are known to have acquired a myriad of genomic and transcriptomic changes, but as their name implies they are not yet invasive. The development of the ability to invade surrounding tissues is perhaps the most critical event in cancer progression. Among proposed invasion-related genes with reported altered expression in tumor cells are cell adhesion molecules, proteases and cytoskeletal molecules that may influence motility [5]. Identifying the key and most common gene expression abnormalities involved in the transition steps from preinvasion to a fully invasive phenotype is an extremely important topic of research and the main objective of the present report. Studies on this area may provide clues to better diagnose premalignant lesions at high risk of progression and may aid in achieving a better understanding of critical early molecular mechanisms involved in breast cancer evolution.

Serial analysis of gene expression (SAGE) is a comprehensive profiling method that allows for global, unbiased and quantitative characterization of transcriptomes [6, 7]. SAGE provides a statistical description of the mRNA population present in a cell without prior selection of the genes to be studied, and this constitutes a major advantage. In this sense, only open systems can identify expressed genes that have not yet been cloned or partially sequenced. A second major advantage is that the information generated is digital in format, and can be directly compared with data generated from any other laboratory or with data available in public databases such as the Cancer Genome Anatomy Project

To perform a comparative SAGE analysis of normal, preinvasive and invasive lesions, we used a modified t test that we have recently developed [8]. This method has the advantage of taking into account both the intra-sample and inter-sample variability, identifying 'common patterns' of gene changes systematically occurring across samples. Most of the tests developed for measuring differential expression in SAGE data focus on capturing the first type of variation correctly, but tend to neglect the second type [9, 10]. The aim of the present study was to provide a statistically robust global gene expression analysis on the progression of breast cancer using the described statistical approach comparing breast normal and tumor SAGE libraries obtained from public databases combined with additional SAGE libraries recently generated in our laboratory.

Materials and methods

SAGE libraries

To perform the comparative analysis of different stages of breast cancer progression, we combined SAGE libraries available in public databases with breast cancer libraries generated and sequenced at our own laboratory. To this end, 12 SAGE libraries of breast tissues (four normal breast tissues, six DCIS tissues and two IDC tissues) were downloaded from the Cancer Genome Anatomy Project – SAGE Genie database (libraries generated at the Polyak Laboratory, Dana-Farber Cancer Institute, Boston, MA, USA). We used 11 additional breast cancer SAGE libraries generated by ourselves, at an approximate resolution of 100,000 SAGE tags per library. All IDC SAGE libraries used in this study were from lymph node-negative, estrogen receptor-positive and progesterone receptor-positive tumor samples, with a tumor size classification of T1 or T2 (i.e. T1–T2 N0 M0). Table 1 summarizes all the SAGE libraries used in this comparative analysis.

Table 1 Breast-specific serial analysis of gene expression (SAGE) libraries

SAGE methodology

For the SAGE libraries generated in our laboratory we followed standard methods. Briefly, total RNA was extracted from snap-frozen tissues using TRIzol (Invitrogen, San Francisco, CA, USA). SAGE library construction was performed with the I-SAGE kit (Invitrogen) according to the manufacturer's protocol and introducing only minor modifications. The anchoring enzyme was NlaIII and the tagging enzyme used was BsmFI. Concatemerized ditags were cloned into pZERO-1 and sequenced with an ABI 3700 DNA Analyzer (Applied Biosystems, Foster City, CA, USA).

SAGE data processing

SAGE tags were extracted from sequencing files using the SAGE2000 software version 4.0 (a kind gift from Dr K. Kinzler, John Hopkins School of Medicine, Baltimore, MD, USA). Tag abundances for all libraries were normalized to a total of 100,000 tags (at which level a tag present 10 times has an abundance of 0.01%). Tag to gene assignments as well as additional annotations using public databases (e.g. Gene Ontology, Locus Link, Unigene cluster) were performed, using web-based SAGE library tools developed by ourselves In our comparison we used only tags with only one reliable assigned gene.

Statistical analysis of SAGE libraries

To compare the 23 SAGE libraries, we utilized a modified t test recently developed by us [8]. This analysis allowed us to identify SAGE tags with significantly different expression levels (P < 0.01) between normal tissue and DCIS and between DCIS and IDC. Tags with total counts of less than three in all libraries were filtered out before the analysis. In order to enable visualization and illustration of our analyses, we utilized the TIGR MultiExperiment Viewer (MeV 2.2) software (The Institute for Genomic Research, Rockville, MD, USA). This tool was employed for normalization and average clustering of the SAGE data.

The aim of the heat maps presented is simply to organize and illustrate the data by graphical means. Briefly, the normalization included logarithmic transformation followed by median centering by samples and genes. We used standard average hierarchical clustering techniques to classify and illustrate further the differences found by the modified t test, showing the clusters of differentially coexpressed genes between the normal tissue, DCIS and IDC groups.

Results and discussion

Generation and analysis of SAGE libraries

The primary goal of our study was to identify the most commonly occurring transcriptome changes in the transition from normal breast epithelium to DCIS and invasive carcinoma. To this end, SAGE data obtained from 11 breast cancer libraries generated in our laboratory (1,090,815 tags) were combined and compared with data available in the public domain (661,863 tags), thus generating a dataset of almost 1.7 million breast cancer and normal specific tags, representing approximately 25,157 transcripts from a total of 23 libraries (Table 1).

Our statistically stringent analysis revealed 52 transcripts commonly deregulated across the board when comparing normal tissue with DCIS (Fig. 1), and 149 transcripts when comparing DCIS with IDC (P < 0.01) (Fig. 2) (see additional data files 1 and 2 for additional information with statistical cutoff at P < 0.05). Selected genes based on relative abundance, highly statistical differences and high fold changes between compared groups are sorted and represented in Tables 2 and 3.

Table 2 Most frequent differentially expressed genes between normal breast epithelium and ductal carcinoma in situ (DCIS)
Table 3 Most frequent differentially expressed genes between ductal carcinoma in situ and invasive ductal carcinoma (IDC)
Figure 1
figure 1

Hierarchical clustering of the most commonly different expressed genes between normal breast tissue and ductal carcinoma in situ (DCIS) groups (P < 0.01). Color scale at bottom of picture is used to represent expression level: low expression is represented by green, and high expression is represented by red.

Figure 2
figure 2

Hierarchical clustering of the most commonly differentially expressed genes between ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC) groups (P < 0.01). Color scale at bottom of picture is used to represent expression level: low expression is represented by green, and high expression is represented by red.

As expected, we detected various ribosomal genes among the most abundant transcripts in all the breast SAGE libraries, and these genes were highly upregulated in the invasive carcinomas. This agrees with the previous global expression profiles and with the comparisons of cancers and the corresponding normal tissues in general [7, 11, 12]. To simplify illustration of the data, ribosomal genes are not included in the figures and tables.

Global comparison of normal tissues and DCIS

Among the 52 transcripts detected as differentially expressed in DCIS (P < 0.01), 36 were downregulated transcripts and 16 were upregulated transcripts in these lesions when compared with normal breast epithelial cells and mammary epithelial organoids (Fig. 1 and Table 2). We defined and classified the 52 genes differentially expressed into the nine functional categories [13] shown in Fig. 3a. Interestingly, we found that 38% of these transcripts are related to the cell cycle (15%), signal transduction (8%) and apoptosis (15%).

Figure 3
figure 3

Classification in functional categories of affected transcripts. (a) Differentially expressed between normal breast tissue and ductal carcinoma in situ (DCIS) (P < 0.01). (b) Transcripts differentially expressed between DCIS and invasive ductal carcinoma (IDC) (P < 0.01).

As expected, our analysis of DCIS versus normal breast epithelium revealed numerous similarities with SAGE data reported previously [12, 14], but more importantly it also provided novel information. The expression of numerous genes was significantly downregulated in DCIS, including: transmembrane 4 super family member 1 (TM4SF1), nuclear factor kappa light polypeptide (NFKB1A), pre-B-cell (PBEF), RAS dexamethasone-induced (RASD1), tumor necrosis factor receptor superfamily member 10b (TNFRSF10B), and tumor necrosis factor α-induced protein (TNFAIP). All these transcripts were also observed downregulated in previous reports [12, 14] (Table 2). On the contrary, our analysis revealed additional clusters of genes significantly downregulated in the DCIS group that were not previously reported by others: clusterin/apolipoprotein J (CLU), nuclease sensitive element binding protein 1 (NSEP1), lipopolysaccharide-induced TNF factor (LITAF/PIG7), basic leucine zipper/W2 domains 1 (BZW1), and cyclin L1 (CCNL1) (Table 2).

Clusterin was one of the most dramatically downregulated genes (-63.9-fold; P = 0.0036) in DCIS libraries. This gene encodes a heterodimeric, highly conserved, secreted glycoprotein. Alterations in Clusterin expression and/or protein maturation are linked to changes in tissue growth or regression, which may be related to specific proapoptotic or antiapoptotic protein isoforms [15]. Clusterin was reported as overexpressed during tissue and cell involution, and was downregulated in esophageal squamous cell carcinoma and prostate carcinoma, suggesting that this expression alteration could be a general phenomenon during tumor progression [16, 17]. On the contrary, and in contrast to these and our observations, Redondo and colleagues reported increased Clusterin expression in breast carcinoma samples [18]. The reason for this discrepancy is unclear at this point. The role of Clusterin in cell survival, cell death and neoplastic transformation remains controversial [15].

Another commonly observed downregulated gene in DCIS libraries was NSEP-1 (-36.2-fold; P = 0.0001). Also known as YB1, NSEP-1 is a member of the highly conserved Y-box family of proteins, which regulate the transcription of several genes associated with cell death including both fas, a cell death-associated receptor, and the tumor suppressor gene p53 [19]. The decrease in expression of NSEP-1 transcripts could play an important role in the early stages of breast carcinogenesis in order to overcome cell proliferation controls.

Interestingly, and as previously observed, we also detected significant downregulation of various cytokines and chemokines: interleukin enhancer binding factor 2 (ILF2), interleukin 13 receptor alpha 1 (IL13RA1), leukemia inhibitory factor (LIF), cardiotrophin-like cytokine (CLC), chemokine C–C ligand 2 (CCL2), and chemokine C–X–C ligand 1 (CXCL1). All these cytokines and chemokines are highly expressed in normal mammary epithelium and are dramatically downregulated in the DCIS samples. These differentially expressed genes were detected within a range of 0.02 <P < 0.05 by means of the modified t test analysis. These small secretory molecules, although usually linked to inflammatory processes, could also play important autocrine and/or paracrine roles in the physiology of normal mammary epithelial cells in particular because receptors for these cytokines are also normally found expressed in normal breast epithelial cells [20]. Some of these molecules (e.g. CXCL1, LIF) appear to play important roles in the normal periodic cycles of growth and involution of the mammary gland following pregnancy and lactation. They may thus be part of the physiologic mechanisms associated with the massive apoptosis observed during involution [21, 22]. Unfortunately we understand very little of the relevance of their intriguing de facto silencing in expression, both in in situ as well as in invasive breast cancer lesions.

Interestingly, we also detected a series of transcripts commonly overexpressed in the DCIS samples: polycyctic kidney disease 1-like (PKD1-like), START domain containing 10 (STARD10), EPS8-like2 (EPS8L2), and KIAA0545 protein (Fig. 1d). One of these genes, EPS8-like2, encodes a protein that is related to epidermal growth factor receptor pathway substrate 8 (EPS8), and was shown to be essential in Ras/PI3K to Rac signaling [23]. PKD1-like encodes a member of the polycystin protein family. Members of this protein family may function in cell development and morphology, and may modulate intracellular calcium homoeostasis and other signal transduction pathways [24, 25]. Although the PKD1 gene has been associated with cancer mechanisms, this homologous family member has not been implicated in carcinogenesis processes to the best of our knowledge. KIAA0545, also known as signal-induced proliferation-associated 1 like 3 (SIPA1L3), is a member of the Sipa1 family and encodes a protein bearing a domain highly homologous to the catalytic region of human Rap1 GTPase-activating protein (Rap1GAP). Sipal1 is involved in the regulation of the Ras-mediated signal transduction pathway for cell proliferation and cell cycle progression [26]. These genes could be involved in signaling pathways that lead to cell proliferation, but their potential role in malignant transformation remains unknown.

Differentially expressed genes associated with NF-κB and tumor necrosis factor pathways

One of the transcripts observed to be most differentially expressed when comparing normal tissue with DCIS was NFKBIA (better known as IκBα), demonstrating a 150-fold higher expression (P < 0.0001) in normal mammary epithelial cells (Table 2 and Fig. 1b). NFKBIA is a member of IκB family genes that play a critical role in regulating the activity of the NF-κB transcription factor [27, 28]. NF-κB plays a major role in diverse biological processes such as cell proliferation, differentiation, apoptosis and metastasis [29, 30]. NF-κB is also required to prevent cell death induced by tumor necrosis factor (TNF) [31].

Interestingly, and perhaps pointing to connected pathways and related outcomes, we also detected a strong decrease in the expression levels of TNFRSF10 (29-fold; P < 0.0003), LITAF/PIG7 (29-fold; P < 0.0003) and TNFAIP3 (eightfold; P < 0.0083) transcripts in the DCIS group. The protein encoded by TNFRSF10, also known as TRAIL/APO2, is a member of the TNF-receptor superfamily and contains an intracellular death domain. This receptor can be activated by TNF-related apoptosis inducing ligand and its role is to transduce apoptosis signals [32, 33]. LITAF/PIG7, a transcription factor, termed lipopolysaccharide-induced TNF-alpha factor, also found downregulated, was reported to regulate TNF-alpha gene expression playing a major role in TNF-alpha activation [34]. This gene, also known as P53-induced gene 7 (PIG7), has been shown to be induced by p53 when apoptosis is triggered, and therefore could also play a role in programmed cell death [35]. The concerted decline of these transcripts early in breast tumor progression appears conducive to a virtual silencing of apoptosis induction pathways and a consequential net increase in cell proliferation. In other words, the homeostasis of proliferation cell death normally operating in the breast epithelium is altered and inclined towards a net gain in cell numbers via multiple signaling pathways.

Global comparison of in situ andinvasive carcinomas

We found 149 transcripts differentially expressed between DCIS and IDC at P < 0.01. All of these genes were found overexpressed commonly at the invasive stage (Fig. 2). Table 2 summarizes the 52 most commonly overexpressed genes in invasive carcinoma lesions. We defined and classified the 149 genes differentially expressed in 10 functional categories [13] as shown in Fig. 3b. Interestingly, we found that 37% of these upregulated transcripts are related to the cell cycle (12%), extracellular matrix or secreted proteins (13%), cell adhesion and motility (6%), and signal transduction (6%).

We were also able to detect 31 underexpressed genes in invasive carcinomas when compared with DCIS, but only when the stringency of the statistical comparison was dropped to within the 95% confidence interval (i.e. P < 0.05), reflecting a lower level of consistency in these gene expression changes when comparing DCIS with IDC (Fig. 4). Examples of these genes include: transmembrane 4 superfamily member 1 (TM4SF1) (-26.7-fold; P = 0.04), tumor necrosis factor receptor-associated factor 4 (TRAF4) (-10.7-fold; P = 0.04), PPAR binding protein (PPARBP) (-8.2-fold; P = 0.04), aldo-keto reductase family 1 (AKR1A1) (-6.7-fold; P = 0.03), hypothetical protein dJ465N24.2.1 (-6.4-fold; P = 0.028), microtubule-associated protein 1 (MAP1LC3A) (-3.7-fold; P = 0.02) and retinoblastoma binding protein 6 (RBBP6) (-2.6-fold; P = 0.04).

Figure 4
figure 4

Hierarchical clustering of downregulated genes in invasive ductal carcinoma (IDC) (P < 0.05). Color scale at bottom of picture is used to represent expression level: low expression is represented by green, and high expression is represented by red. DCIS, ductal carcinoma in situ.

The first of these transcripts, TM4SF1, was also the most dramatically downregulated gene in DCIS when compared with normal breast libraries (-442.6-fold; P = 0.0083). The transmembrane proteins TM4SF1, also known as the tetraspanin superfamily, are implicated in diverse signal transduction events that play a role in the regulation of cell development, cell proliferation, differentiation and motility [36]. The tetraspanins are associated with adhesion receptors of the integrin family and regulate integrin-dependent cell migration [36]. In the present study, the loss in gene expression of TM4SF1, from normal breast tissue to invasive carcinomas, appears to be a common event in the progression of breast carcinomas. In addition, downregulated levels of the TRAF4 transcript could cooperate in the evolution from DCIS to invasive carcinomas. TRAF4 is a proapoptotic gene member of the TRAF family of adaptor proteins that mediate cellular signaling by binding to various members of the tumor necrosis family receptor superfamily and interleukin-1/Toll-like receptor superfamily [37]. Interestingly, a recent study showed that overexpression of TRAF4 can induce apoptosis, playing a role in p53-mediated proapoptotic signaling in response to cellular stress [38].

Differentially expressed genes related with extracellular matrix remodeling and invasion processes

During their metastatic conversion, epithelial carcinoma cells acquire the ability to invade the surrounding tissues and later disseminate to secondary organs mostly via lymphatic vessels. The metastatic process is not just a function of acquisition of novel migratory and invasive properties by the epithelial tumor cells; the surrounding stroma also plays a critical role in this process [2]. Dramatic changes take place in order to remodel the extracellular matrix environment in response to the infiltrating cancer cells (desmoplastic reaction) [3941]. In this sense, we identified high expression levels of several transcripts that could be a reflection of the host stromal response, such as collagen 1α1, collagen 1α2, collagen 3α1, collagen 6α1, fibronectin I, fibrilli, microfibrillar-associated protein 2, and Spondin 2.

It is known that the proteolytic degradation of the extracellular matrix is more than the simple removal of a physical barrier to invasion; such processes and the increased expression of the involved genes are known to also significantly influence mechanisms controlling cell proliferation [42]. Matrix metalloproteinases are zinc-dependent endopeptidases involved in matrix degradation and tissue remodeling [43]. These endopeptidases are capable of degrading both the extracellular matrix and basement membrane, physical barriers that play important roles in preventing against expanding growth and migration of cancer cells [44]. It is therefore widely accepted that overexpression of matrix metalloproteinases is associated with cancer-cell invasion and metastasis. A member of the matrix metalloproteinase family (MMP-2) was highly expressed (29.1-fold; P = 0.0008) in IDC libraries in comparison with in situ carcinomas. MMP-2 has been shown overexpressed in various human tumors, including breast cancer [45, 46].

To no surprise and as observed in other studies, we also detected significant increases in SPARC (286-fold; P = 0.0003) and a new related gene SPARC-like1 (21.7-fold; P = 0.005) among the groups of genes upregulated in invasive lesions. The SPARC gene encodes for a secreted protein acid rich in cysteines also known as osteonectin [47]. This protein is involved in a variety of diverse biological processes including tissue remodeling, cell adhesion, proliferation, differentiation, matrix synthesis/turnover, angiogenesis and tumor cell migration and invasion [47]. Overexpression of the SPARC gene has been reported associated with melanoma and metastatic carcinomas of the breast, and increased SPARC expression has been observed in conjunction with increased c-Jun and Fra-1 expression in a panel of invasive breast cancer cell lines [48].

Human SPARC-like1, also known as mast9 or hevin, is a member of the SPARC protein family. Interestingly, previous reports indicated downregulation of SPARC-like1 in prostate and colon carcinomas [49, 50]. Contrary to these observations, we observed consistent high expression of this transcript across all IDC libraries. Functional assays suggest that SPARC-like1 may serve as an antagonist to cell adhesion, playing a key role in the inhibition of attachment, and may facilitate spreading of endothelial cells on fibronectin substrates [51].

Taken together, these expression profiles suggest that MMP-2, SPARC and SPARC-like1 are probably critical mediators of extracellular matrix remodeling and are all important in facilitating breast cancer invasion and progression.

Other genes commonly expressed at high levels in invasive carcinomas and of much lower expression in DCIS and normal breast tissues include lumican (LUM/LDC) (56.7-fold; P = 0.0011), versican (CSPG2) (18.2-fold; P = 0.0017), vimentin (VIM) (17.7-fold; P = 0.0014), decorin (DCN/PG2) (13.9-fold; P = 0.0007) and adlican (DKFZp564I1922) (12-fold; P = 0.0086). Lumican and decorin are members of the small leucine-rich proteoglycan family of proteins [40]. Several studies have demonstrated that small leucine-rich proteoglycan proteins can modulate cellular behavior, including cell migration and proliferation during tumor growth. Furthermore, the high expression level of lumican was associated with high tumor grade and was expressed specifically in breast cancer tissues, but not in normal breast tissues, suggesting that lumican is differentially expressed during breast tumor progression [40, 52]. These findings suggest that lumican may play an important role in breast cancer growth.

Recent studies have suggested that expression of increased amounts of versican, a chondroitin sulphate proteoglycan, in neoplastic tissues may play a role in promoting tumor cell proliferation and migration [53]. Abnormal versican deposition has been observed in a number of tumor types, including breast cancer [54]. Furthermore, it has been suggested that the versican-rich extracellular matrices exert an anti-adhesive effect on cells, thus facilitating tumor-cell migration and invasion [55].

Vimentin is a type III intermediate filament normally expressed in cells of mesenchymal origin [56]. However, numerous studies have now demonstrated that vimentin can also be expressed in epithelial cells involved in physiological or pathological processes requiring epithelial cell migration [57]. Vimentin has indeed been described in migratory epithelial cells involved in embryological and organogenesis processes and tumor invasion [58]. Also, vimentin antisense transfection in vimentin-expressing breast cell lines was shown to reduce their in vitro invasiveness or migration, strongly emphasizing a functional contribution of vimentin to epithelial cell invasion/migration [59].


Using comprehensive gene expression profiling by means of SAGE combined with a recently developed statistical approach, we identified the most consistent and statistically significant changes occurring in breast cancer progression detected by this methodology. A comparison of the genes identified in our DCIS and IDC analysis with previous observations [11, 12, 14, 41] revealed expected similarities. More importantly, several genes were identified in our analysis that were not previously reported or detected in other SAGE studies. This suggests that the comparative analysis we performed of normal breast tissue, DCIS and invasive carcinomas by means of the modified t test appears statistically rigorous and applicable to SAGE studies in which multiple libraries are compared.

In the present study we observed that deregulation of genes involved in the control of cell proliferation, apoptosis and mammary gland development are frequently altered at the in situ stage (Fig. 5). Meanwhile, alterations in the expression of genes related to the cell cycle and extracellular matrix remodeling (proteinases, collagenases, cysteine proteinases), and several transcripts related to cell adhesion and motility, were abundantly deregulated at the invasive carcinoma stage (Fig. 5). Additional analysis and validation of the identified genes will be required to determine the clinical value, and to determine whether they may constitute novel targets for translational research.

Figure 5
figure 5

Schematic model portraying some of the most significant transcriptomic changes observed in breast cancer progression. DCIS, ductal carcinoma in situ; IDC, invasive ductal carcinoma.