Human Genetics, 126:605

Application of serial analysis of gene expression to the study of human genetic disease


Review Article

DOI: 10.1007/s00439-009-0719-5

Cite this article as:
Horan, M.P. Hum Genet (2009) 126: 605. doi:10.1007/s00439-009-0719-5


Sequence tag analysis using serial analysis of gene expression (SAGE) is a powerful strategy for the quantitative analysis of gene expression in human genetic disorders. SAGE facilitates the measurement of mRNA transcripts and generates a non-biased gene expression profile of normal and pathological disease tissue. In addition, the SAGE technique has the capacity of detecting the expression of novel transcripts allowing for the identification of previously uncharacterised genes, thus providing a unique advantage over the traditional microarray-based approach for expression profiling. The technique has been successful in providing pathological gene expression profiles in a number of common genetic disorders including diabetes, cardiovascular disease, Parkinson disease and Down syndrome. When combined with next generation sequencing platforms, SAGE has the potential to become a more powerful and sensitive technique making it more amenable for diagnostic use. This review will therefore discuss the application of SAGE to several common genetic disorders and will further evaluate its potential use in diagnosing human genetic disease.


The study of human genetic disease is a complex process due in part to the expression of multiple genes and is further complicated by the interaction of environmental factors. Previous studies have therefore tended to rely heavily on the expression changes observed in individual genes. However, sequencing of the mammalian genome has provided unparalleled resources for advancing our knowledge and understanding of the molecular mechanisms underlying genetic disease. With the advent of DNA microarray and serial analysis of gene expression (SAGE), focus has now turned to the elucidation of gene function (based on the expression profiling of multiple genes) in normal and disease conditions. SAGE is a powerful technique that combines the data generating capacity of DNA microarrays with the added advantage that SAGE data are produced as an unbiased numerical count of expressed sequence tags. SAGE can therefore be more generally viewed as an unbiased digital microarray assay. For the purpose of tissue cDNA analysis, such expression profiling potentially makes SAGE a much more powerful technique than DNA microarray. In SAGE, no prior knowledge of any gene sequence is required, whereas DNA microarray requires such knowledge and is therefore biased towards those genes represented on a DNA chip.

The basic principle of the SAGE strategy is the generation of a short stretch of unbiased DNA sequence (SAGE tag) that is of sufficient size and complexity to uniquely identify a specific gene transcript (Velculescu et al. 1995) (Fig. 1). Sequence tags are obtained from cDNA generated from isolated tissue mRNA and are then concatenated prior to cloning and sequencing. The sequenced tags are then analysed by specific SAGE software (e.g. SAGE2000) to tabulate all the generated tag data. Transcript identity can then be determined by BLAST (RefSeq) analysis of the identified sequence tag. Several modifications to the original SAGE strategy have been implemented and include the derivative techniques of LongSAGE and SuperSAGE (Saha et al. 2002; Matsumura et al. 2008). These derivative techniques generate larger tags for sequencing analysis, which improves data quality and makes expression analysis more specific. Furthermore, the validity of the SAGE technique has been confirmed by DNA microarrays, revealing good correlations between the two techniques (Nacht et al. 1999; Ishii et al. 2000; Anisimov et al. 2002a; Gnatenko et al. 2003; van Ruissen et al. 2008).
Fig. 1

General simplified outline of the SAGE technique as provided by
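The tag generation and tabulation steps outlined above can be illustrated with a minimal Python sketch. It is not the SAGE2000 software and makes several assumptions: the anchoring site is taken to be the NlaIII site (CATG) that this review mentions for the I-SAGE kit, the tag length of 10 bases is an illustrative parameter (LongSAGE and SuperSAGE would use larger values), and the function names and toy sequences are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring site (NlaIII), as used by the I-SAGE kit


def extract_tag(cdna, tag_len=10):
    """Return the tag directly downstream of the 3'-most anchoring site.

    Returns None when the cDNA has no anchoring site or when fewer than
    tag_len bases follow the site.
    """
    pos = cdna.rfind(ANCHOR)  # 3'-most occurrence of the site
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tabulate_tags(cdnas, tag_len=10):
    """Count how often each tag is observed across a collection of cDNAs."""
    counts = Counter()
    for cdna in cdnas:
        tag = extract_tag(cdna.upper(), tag_len)
        if tag is not None:
            counts[tag] += 1
    return counts


# Toy cDNA sequences (hypothetical, for illustration only).
library = [
    "GGGCATGAAACCCGGGTTTACGT",
    "GGGCATGAAACCCGGGTTTACGT",
    "CATGTTTTTTTTTTCATGACGTACGTACGG",
    "CATGAAA",  # too short to yield a full tag
]
tag_counts = tabulate_tags(library)
```

Each key of `tag_counts` would then be searched against a reference database to assign it to a transcript, and the count serves as the digital expression value for that transcript.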

The expression profiles of thousands of genes in various organs such as kidney (El-Meanawy et al. 2000), liver (Yamashita et al. 2000) and thyroid (Pauws et al. 2000) have been successfully determined by SAGE analysis. The characterisation of gene expression patterns in normal and disease organs provides a powerful genetic strategy to identify putative novel targets amenable for therapeutic intervention. SAGE transcript sequencing has also been used successfully to determine the cDNA expression profile of various genetic disorders including diabetes (Cras-Méneur et al. 2004; Misu et al. 2007; Takamura et al. 2008), cardiovascular disease (Gnatenko et al. 2003), Down syndrome (Sommer et al. 2008; Malagó et al. 2005), Gaucher disease (Myerowitz et al. 2004), Parkinson disease (Noureddine et al. 2005; Brochier et al. 2008), retinal disease (Bowes Rickman et al. 2006) and Alzheimer disease (Xu et al. 2007) (Table 1). When combined with the newly developed technique of next generation sequencing, SAGE has the potential to become an even more powerful and sensitive assay with high throughput capabilities, making it amenable for diagnostic application. This review will therefore compare the biological insights garnered from the application of SAGE to some common genetic disorders (diabetes, cardiovascular disease, Parkinson disease, Down syndrome) and will discuss its future application as a high throughput diagnostic tool.
Table 1

Examples of human disorders studied using SAGE (or SAGE derivative) transcript analysis

Pathology studied                                 Transcript analysis    Reference
Acute myeloid leukaemia                                                  Lee et al. (2006)
Adrenocortical hyperplasia                                               Horvath et al. (2006)
Alzheimer disease                                                        Xu et al. (2007)
                                                                         Patino et al. (2006)
Breast cancer                                                            Yao et al. (2006)
Cervical cancer                                                          Shadeo et al. (2007)
Diabetic retinopathy                                                     Lupien et al. (2007)
Down syndrome                                                            Sommer et al. (2008)
                                                                         Honda et al. (2008)
Gallbladder cancer                                                       Alvarez et al. (2008)
                                                                         Buimer et al. (2008)
Hepatocellular carcinoma                                                 Minagawa et al. (2008)
Human cytomegalovirus-infected dendritic cells                           Raftery et al. (2009)
Idiopathic pulmonary fibrosis                                            Boon et al. (2009)
Melanocytic metastatic melanoma                                          Egidy et al. (2008)
                                                                         Voth et al. (2007)
Parkinson disease                                                        Noureddine et al. (2005)
Retinal disease                                                          Bowes Rickman et al. (2006)
Type 2 diabetes                                                          Takamura et al. (2008)
Type 2 Gaucher disease                                                   Myerowitz et al. (2004)

Application of SAGE transcript profiling to human genetic disorders


Diabetes

Diabetes is a multifactorial disorder caused by a deficiency of insulin action due to a combination of altered gene expression and environmental factors (Michael et al. 2000). The diabetic phenotype includes peripheral neuropathy, gastropathy, cataract, retinopathy and nephropathy. However, the major cause of morbidity and mortality in diabetic individuals is the development of atherosclerosis of the cerebral arteries, coronary arteries and arteries of the lower extremities (Van Der Feen et al. 2002; Wehrwein et al. 2006; Shashkin et al. 2006). The multifactorial nature of diabetes leads to a disruption of important inter-organ connectivity (Uno et al. 2006), leading to insulin resistance (which is a pathological feature of type 2 diabetes). Thus, a comprehensive study of gene expression in critical tissues would be an important step in identifying the underlying molecular mechanisms involved with the development and pathological progression of diabetes.

A clinical characteristic of type 2 diabetes (insulin deficiency and insulin resistance) has been suggested to be correlated with mitochondrial dysfunction (Lowell and Shulman 2005). Previous reports have indicated that the mitochondrial oxidative phosphorylation (OXPHOS) pathway is disrupted in the skeletal muscles of type 2 diabetics (Patti et al. 2003; Mootha et al. 2003). OXPHOS disruption further leads to an increase in reactive oxygen species (ROS), which has been suggested to play a role in multiple forms of insulin resistance (Houstis et al. 2006; Imoto et al. 2006). In addition, regulation of mitochondrial gene expression in skeletal muscle appears to be under the control of the peroxisome proliferator-activated receptor-γ coactivator-1α (PGC-1α) gene (Puigserver et al. 1999), and previous microarray expression studies on type 2 diabetic skeletal muscle have established that the down-regulation of PGC-1α is associated with reduced expression of genes involved with the mitochondrial OXPHOS pathway (Patti et al. 2003; Mootha et al. 2003). These data therefore appear to suggest that reduced PGC-1α expression in skeletal muscle is associated with type 2 diabetes development. This is further evidenced in recent reports which established that a gain of function mutation (R225Q) in the AMP-activated protein kinase-γ3 (AMPK-γ3) subunit gene is associated with an increase in mitochondrial biogenesis and elevated levels of PGC-1α in glycolytic skeletal muscle in mice and appears to be associated with a protective mechanism against insulin resistance (Garcia-Roves et al. 2008). This suggests the AMPK-γ3 R225Q variant may be utilised as a putative therapeutic agent for the treatment of metabolic disorders including insulin resistance and diabetes (Barnes et al. 2004; Garcia-Roves et al. 2008).

However, a recent SAGE report on liver tissue from type 2 diabetics revealed that OXPHOS mitochondrial transcripts and PGC-1α were more abundant in the livers of type 2 diabetic patients (Misu et al. 2007). These data are in agreement with previous reports on diabetic animal models which demonstrated elevated expression of PGC-1α in the livers of these animals (Yoon et al. 2001; Saltiel and Kahn 2001; Herzig et al. 2001; Koo et al. 2004). The SAGE generated liver data appear contrary to the data generated from the microarray analysis (on type 2 diabetic skeletal muscle), but this observation may be explained by tissue-specific regulation of gene expression. Nonetheless, the SAGE data suggest that therapeutic activation of PGC-1α could be detrimental to the livers of type 2 diabetic patients.

The SAGE study by Misu et al. (2007) also identified the expression of novel transcripts, which demonstrates a further advantage of the SAGE technique. Of the identified novel transcripts, one in particular (putatively named “hepatokine X”) appeared to have an association with insulin resistance (Takamura et al. 2008). Further work is currently in progress to fully elucidate the role of this novel transcript in the pathology of diabetes (Takamura et al. 2008). Thus, the application of SAGE to the study of the diabetic liver not only identified a specific gene expression profile of such a critical organ (which helps provide an understanding of the molecular mechanisms involved with the diabetic phenotype) but also putatively identified a novel candidate gene that may be amenable for future therapeutic intervention for the treatment of type 2 diabetes.

Cardiovascular disease (CVD)

Cardiovascular disease (CVD) encompasses all disease conditions of the heart and blood vessels. The main underlying mechanism is atherosclerosis of the inner lining of the arteries. This causes a restriction of blood supply to critical organs, which can lead to the development of hypertension and stroke. Reports have shown that genetic factors play a role in the development and progression of hypertension and stroke (Tanira and Al Balushi 2005; Schulz et al. 2004; Flossmann and Rothwell 2005). Indeed, a previous study found that single nucleotide polymorphisms (SNPs) in the growth hormone promoter produced a combination of different haplotypes that were associated with both hypertension and stroke (Horan et al. 2006). Arterial hypertension is also a known risk factor for stroke due to its association with the formation of atheromatous deposits in cerebral arteries (Boon et al. 1994). In an attempt to elucidate gene expression in CVD, several studies have applied SAGE to a number of tissues to determine specific gene expression profiles of the cardiovascular system. Such reports have revealed an alteration in human cardiac cell gene expression induced by hypoxia (Jiang et al. 2002), differentiation of pluripotent cells into cardiomyocytes (Anisimov et al. 2002b), the identification of novel transcripts in atherogenic stimulated human umbilical vein endothelial cells (de Waard et al. 1999) and an up-regulation of transcriptional regulators in macrophages purified from atherosclerotic plaques (Patino et al. 2006).

The important data generating capacity of SAGE in the study of CVD was further demonstrated in a SAGE/microarray comparative study that determined the expression profile of purified human platelets (Gnatenko et al. 2003). Human platelets are known to play a critical role for normal cardiovascular function as well as being involved with the processes of inflammation and wound repair. These blood cells lack a nucleus but retain cytoplasmic mRNA (derived from megakaryocytes) and maintain protein biosynthesis capabilities (Kieffer et al. 1987; Newman et al. 1988). The study by Gnatenko et al. (2003) determined that the expression profile of human platelet derived SAGE tags could be directly compared with the relative expression data derived by microarray analysis. The report found that the majority of expressed SAGE transcripts were of mitochondrial origin (a finding that is consistent with the absence of a nucleus). Indeed, recent reports have further identified associations of sequence variants in mitochondrial genes with CVD (Alonso-Montes et al. 2008; Ingelsson et al. 2008; Lai et al. 2008; Matsunaga et al. 2009). The microarray assay used in the study by Gnatenko et al. (2003) lacked mitochondrial-specific genes (on the chip array) but the data nonetheless confirmed the expression profile of the observed non-mitochondrial SAGE tag profile. Quantitative real-time PCR (QPCR) also independently confirmed the mRNA expression data and further showed that there was a concordance of both the SAGE and microarray techniques for platelet profiling. Furthermore, of the identified non-mitochondrial SAGE transcripts, a significant fraction were determined to be putative novel genes that were absent on the DNA microarray gene chip suggesting that the expression of these transcripts may be of functional significance for the progression of cardiovascular dysfunction (Gnatenko et al. 2003).

The power of SAGE to identify novel transcripts is critical for the study of genetic disorders. By generating a sequence tag (from the expression of a previously unknown gene), the SAGE technique allows for a complete determination of a putative novel gene sequence when combined with the downstream technique of rapid amplification of cDNA ends (RACE). The identification of a SAGE tag sequence allows the design of an extending primer in RACE to determine both the 5′ and 3′ ends of any unknown gene sequence (Matsumura et al. 2008). The functionality and mutational analysis of any such gene can then be fully characterised. However, if a novel SAGE tag has a corresponding homologous sequence in another species, the full gene sequence can be derived by using a combination of PCR, sequencing and BLAST analysis. A study by Duka et al. (2006) used such a strategy to identify the sequence of a previously unknown gene found to be expressed in the cardiac tissue of hypertensive mouse models. These animals were rendered hypertensive by infusion of angiotensin II (Ang II), producing hypertensive/ischaemic cardiomyopathy. BLAST analysis of the novel transcript found that the mouse SAGE-derived tag was homologous to the human cardiac expressed XIRP2 (also known as CMYA3) gene. Thus, the SAGE data produced by Duka et al. (2006) revealed an up-regulation of a novel mouse ortholog of the CMYA3 (Cmya3) gene. XIRP2 (CMYA3) is a member of the Xin protein family, whose members have important roles in cardiac morphogenesis (Wang et al. 1999), and functions as a stabilising actin-binding protein (Pacholsky et al. 2004). However, the study by Duka et al. (2006) is currently the only report to suggest an association of XIRP2 with CVD.

These data appear to suggest that an increase in XIRP2 expression may be an important identifying factor leading to hypertensive/ischaemic cardiac necrosis in humans. Further work is required to identify the molecular pathways involving this candidate gene (and its putative role in the development of human cardiac tissue damage), which may allow for future therapeutic intervention for the treatment of cardiovascular dysfunction.

Parkinson disease (PD)

Parkinson disease (PD) is a common degenerative neurological disorder characterised by resting tremor, shuffling gait and muscle rigidity and weakness. The pathological condition of the disorder is a degeneration of the dopaminergic neurons in the zona compacta of the substantia nigra (Hornykiewicz 1979). PD affects 1–2% of individuals above the age of 60 and there are estimated to be more than 500,000 cases of PD in the EU alone (Gasser 2009). Several gene association studies have identified mutations in PD families, which include α-synuclein (Polymeropoulos et al. 1997), Parkin (Kitada et al. 1998), NR4A2 (Le et al. 2003), DJ-1 (Bonifati et al. 2003), PINK1 (Valente et al. 2004), LRRK2 (Zimprich et al. 2004), ATP13A2 (Ramirez et al. 2006) and GIGYF2 (Lautier et al. 2008), thus indicating that PD is a polygenic disorder.

A previous report further indicated the polygenic nature of PD by combining the data generated from the techniques of genomic convergence (the generation of a genomic linkage map for PD families) and SAGE to identify a number of transcripts that appear to be associated with PD (Noureddine et al. 2005). More than 8,000 genes were found to be linked to PD by genomic convergence studies alone (Scott et al. 2001; Li et al. 2002) but by combining genomic convergence with the data generating capacity of SAGE, this number was substantially reduced. A total number of 50 transcripts were found to be directly associated with genomic convergence peaks, thus establishing these genes as high priority candidates for PD risk (Noureddine et al. 2005). Several of these transcripts are associated with cellular metabolic pathways previously implicated in PD and include genes responsible for protein folding, ubiquitination, inflammation, and programmed cell death (Noureddine et al. 2005). However, of particular interest are the recent findings that several SAGE transcripts identified by Noureddine et al. (2005) have now been found to be associated with PD and include the ring finger protein-11 (RNF11) gene (Anderson et al. 2007), the glial fibrillary acidic protein (GFAP) gene (Aponso et al. 2008) and the chemokine (C–C motif) ligand-2 (CCL2) gene (Reale et al. 2009).
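The filtering step described above, in which SAGE-detected transcripts are retained only if they fall under a linkage peak, is in essence an interval intersection. The following Python sketch illustrates that idea only; it is not the procedure of Noureddine et al. (2005), and the `Region` type, the function names, the gene labels and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True when two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def convergent_candidates(transcripts, peaks):
    """Return names of expressed transcripts that lie within any linkage peak."""
    return sorted(
        name
        for name, location in transcripts.items()
        if any(overlaps(location, peak) for peak in peaks)
    )


# Hypothetical linkage peaks and SAGE-detected transcripts.
linkage_peaks = [Region("chr1", 1_000_000, 2_000_000), Region("chr5", 500_000, 900_000)]
sage_transcripts = {
    "geneA": Region("chr1", 1_500_000, 1_520_000),  # inside the chr1 peak
    "geneB": Region("chr1", 3_000_000, 3_010_000),  # expressed, but outside any peak
    "geneC": Region("chr5", 890_000, 910_000),      # straddles the edge of the chr5 peak
    "geneD": Region("chr7", 1_500_000, 1_520_000),  # no peak on this chromosome
}
candidates = convergent_candidates(sage_transcripts, linkage_peaks)
```

Requiring both lines of evidence (expression in the relevant tissue and location under a linkage peak) is what reduces a large list of positional candidates to a short list of high priority genes.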

The study by Noureddine et al. (2005) also found that the SAGE technique was capable of detecting all UNIGENE sets in linkage regions whereas only a fraction could be detected by microarray analysis (when using the commonly used human Affymetrix U133A or U133B gene chip). Thus, important mapping of transcripts to genomic convergence regions could not be complemented by microarray analysis. This finding demonstrates an increase in genome coverage of SAGE over that of microarray (at least) when the assay is combined with genomic convergence in the study of PD (Noureddine et al. 2005). The report also found that SAGE was capable of detecting a novel missense polymorphism (A5390G, Ile304Met) in the NADH dehydrogenase subunit 3 (ND3) gene (although the functional consequences of this polymorphism were not studied). ND3 is a component of mitochondrial complex I, and biochemical defects of this respiratory chain complex are found in the substantia nigra of Parkinsonian brains (Schapira et al. 1990; Mann et al. 1992). The A5390G polymorphism was detected because of the generation of an NlaIII site at the 3′ end of the ND3 gene, allowing for linker ligation and SAGE tag isolation. Interestingly, a previous report also found a highly significant association of another ND3 polymorphism (A10398G, Thr114Ala) with PD (van der Walt et al. 2003). Thus, SAGE also has the capacity to determine the expression profile of novel polymorphic gene variants (but only if the variant creates a 3′-end specific anchoring enzyme site used for a particular SAGE assay). The SAGE expression profile of the substantia nigra nonetheless revealed altered gene expression in major metabolic pathways affected in PD. The 50 identified transcripts all fall within previously published linkage regions and indicate that these genes may play a vital role in the development of the Parkinson disorder. The recent findings that RNF11, GFAP and CCL2 are associated with PD (Anderson et al. 2007; Aponso et al. 2008; Reale et al. 2009) clearly indicate the sensitivity of SAGE in generating candidate genes for downstream disease-association and functional studies.
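The reason a variant is only visible to SAGE when it creates (or destroys) the 3′-most anchoring site can be shown with a toy example. In the Python sketch below, an A to G substitution turns CATA into the NlaIII site CATG, which shifts the position from which the tag is taken. The sequence is invented for illustration and is not the ND3 sequence; the function names and the tag length of 10 bases are likewise assumptions.

```python
ANCHOR = "CATG"  # NlaIII recognition site


def three_prime_tag(seq, tag_len=10):
    """Return the tag downstream of the 3'-most anchoring site, or None."""
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def apply_snp(seq, index, ref, alt):
    """Substitute one base (0-based index), checking the reference allele first."""
    if seq[index] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:index] + alt + seq[index + 1:]


# Invented transcript: one CATG near the 5' end and a CATA further downstream.
reference = "TTCATGGGATCCGGTTAACCCATATTGACCTGAAGCT"
variant = apply_snp(reference, 23, "A", "G")  # CATA becomes CATG

reference_tag = three_prime_tag(reference)
variant_tag = three_prime_tag(variant)
```

Here the two alleles yield different tags, so each is counted separately in the SAGE library. A substitution elsewhere in the transcript would leave the 3′-most site, and therefore the tag, unchanged and would go undetected.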

Down syndrome (DS)

Down syndrome (DS) is the most commonly known congenital disorder with an estimated occurrence of approximately 1 in 700 live births. It is caused by the trisomy of human chromosome 21 and the phenotypic features include mental retardation, immunity and heart defects, muscle hypotonia, gastrointestinal malformations and an increased risk of leukaemia (Epstein et al. 1991). The molecular mechanisms underlying the development of DS are not well understood, but the primary cause is assumed to be a dosage imbalance of human chromosome 21 (HSA21) genes. Indeed, a previous study demonstrated that there was an up-regulation of HSA21 genes in DS foetal brain (a finding that may help in explaining some of the clinical variations observed in DS subjects) (Mao et al. 2003). Analysis of DS tissue using SAGE would therefore be of benefit to establish an expression profile of putative genes involved with the pathogenesis of DS.

A recent SAGE study on purified lymphocytes from trisomy 21 children found that a total of 242 SAGE tags were significantly differentially expressed in comparison to controls. Many of these transcripts corresponded to genes involved with RNA processing, signalling, gene transcription, immune response and lipid metabolism (Sommer et al. 2008). Interestingly, HSA21 genes were found not to be significantly over-expressed compared to controls (Sommer et al. 2008). A previous SAGE study by Malagó et al. (2005) similarly found that while some SAGE tags corresponding to HSA21 genes were indeed up-regulated, there was no overall significant difference in leukocyte expression of HSA21 genes between DS and control individuals. These data therefore appear to contradict the gene-dosage hypothesis. However, both studies used the I-SAGE kit from Invitrogen, which uses an NlaIII site (CATG) as its anchoring position for linker ligation and tag isolation. The use of other SAGE anchoring enzyme sites and/or a combination with next generation sequencing (see below) may reveal a significant difference in the expression profile of HSA21 genes between DS and controls. The study by Malagó et al. (2005) also demonstrated similar data to that of Sommer et al. (2008) but further identified the expression of transcripts of unknown origin, indicating the influence of uncharacterised genes in the development of DS. Both studies demonstrated a dysregulation of immune response genes, which may help in explaining some of the immunophenotypic abnormalities observed in DS individuals. The SAGE data generated from these reports clearly indicate candidate genes for further investigation into the molecular mechanisms involved with the pathology of DS.
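Because SAGE output is a set of tag counts, comparing a disease library with a control library reduces to comparing two proportions for each tag. The Python sketch below normalises counts to tags per million and applies a pooled two-proportion z-test. This test is chosen here purely for illustration and is not stated to be the method used by Sommer et al. (2008) or Malagó et al. (2005); the counts and library sizes are hypothetical.

```python
import math


def tags_per_million(count, library_size):
    """Scale a raw tag count so libraries of different sizes are comparable."""
    return count * 1_000_000 / library_size


def two_proportion_p_value(count_a, size_a, count_b, size_b):
    """Two-sided p-value for a difference in tag proportion between two libraries.

    Uses a pooled two-proportion z-test (normal approximation), an
    illustrative choice rather than a SAGE-specific statistic.
    """
    pooled = (count_a + count_b) / (size_a + size_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
    if std_err == 0:
        return 1.0  # tag absent from (or saturating) both libraries
    z = (count_a / size_a - count_b / size_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical tag observed 50 times in one library and 10 times in the other,
# each library containing 50,000 sequenced tags.
disease_tpm = tags_per_million(50, 50_000)
control_tpm = tags_per_million(10, 50_000)
p_value = two_proportion_p_value(50, 50_000, 10, 50_000)
```

In practice thousands of tags are tested at once, so a correction for multiple testing would also be needed before declaring a tag differentially expressed.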

SAGE and next generation transcript sequencing

Although SAGE is a powerful technique for generating tissue-specific gene expression data, it is nonetheless a laborious technique that is further limited by the cloning step (of the generated SAGE tags) and the subsequent cost of DNA sequencing (currently making the technique less applicable for diagnostic use). However, the development of rapid and inexpensive next generation sequencing (NGS) platforms (reviewed by Mardis 2008) provides an alternative strategy for the analysis of the isolated SAGE generated tag. NGS platforms have the capacity to directly sequence individual SAGE tags (thus eliminating the concatenation and cloning step) making SAGE a more cost effective and rapid approach for whole genome gene expression profiling.

Several studies have successfully incorporated SAGE tagging or cDNA based sequencing with NGS platforms (Bainbridge et al. 2006; Cheung et al. 2006; Torres et al. 2008; Weber et al. 2007). However, a recent whole genome gene expression study of transcriptional start sites in a colon cancer cell line (HT-29) demonstrated the power and increased sensitivity of combining SAGE with NGS (Hashimoto et al. 2009). The report found that approximately 70,000 SAGE tags were generated when traditional SAGE was applied to concatenated and cloned sequence tags. In comparison, the application of isolated SAGE tags directly to an NGS platform produced a count of more than 20 million sequence reads. Of these unique tags, 84% mapped to RefSeq cDNA sequences (73% to representative transcriptional start sites and 11% to gene-coding regions within the genome), representing a total of 14,000 expressed genes in a single cell type. The remaining 16% of expressed tags may be a representation of small RNAs and novel genes. These data correspond to a sensitivity that is up to 1,000-fold greater than the traditional SAGE cloning/sequencing-based approach alone (Hashimoto et al. 2009). These data clearly indicate a greater depth of transcriptome coverage when SAGE is used in combination with NGS, which potentially makes SAGE-NGS a more amenable approach as a diagnostic tool for gene expression profiling of disease tissues.
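One way to see why sequencing depth translates into sensitivity is to ask how likely it is that a rare transcript is sampled at least once. The Python sketch below assumes that tags are drawn independently and at random, which is a simplification, and uses a hypothetical transcript making up one in a million of all tags. Only the two depths (70,000 tags and 20 million reads) are taken from the figures quoted above.

```python
import math


def detection_probability(fraction, depth):
    """Probability of observing a transcript at least once.

    fraction: share of all tags contributed by the transcript (0 to 1)
    depth:    number of tags or reads sequenced
    Assumes independent random sampling: P = 1 - (1 - fraction) ** depth.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must lie between 0 and 1")
    if fraction == 1:
        return 1.0
    # log1p/expm1 keep the result accurate for very small fractions.
    return -math.expm1(depth * math.log1p(-fraction))


rare_fraction = 1e-6  # hypothetical abundance: one tag in a million

p_traditional = detection_probability(rare_fraction, 70_000)  # cloned SAGE library
p_ngs = detection_probability(rare_fraction, 20_000_000)      # SAGE tags on an NGS platform
```

Under these assumptions the rare transcript would usually be missed at the depth of a traditional SAGE library but would almost certainly be observed at NGS depth.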


SAGE has proven to be a powerful technique in evaluating gene expression and has been successfully applied to a number of human genetic diseases. Previous reports have shown that known genes are found to be up- or down-regulated in various pathological disease tissues. Validation of the expression levels of SAGE-identified genes has been provided by microarray and real-time QPCR assays, confirming data concordance between the techniques. However, the ability of SAGE to identify novel uncharacterised genes provides a powerful and unique advantage over the traditional microarray-based approach in identifying novel molecular mechanisms involved with disease progression. The advent of NGS provides a further development for SAGE and gene expression profiling. The current reports of SAGE and NGS now demonstrate that it is possible to produce a cost-effective, high throughput, genome-wide gene expression profiling strategy for disease tissues. The effect of novel therapeutic agents on disease progression can potentially be rapidly analysed by evaluating the genome-wide gene expression profile (as a consequence of drug treatment), which may help provide strategies for future therapeutic intervention. In addition, SAGE combined with NGS now has the capacity to be used as a diagnostic tool that can (at least) be used for the early detection of alterations in gene expression in many genetic disorders. SAGE-NGS may therefore become one of the future tools of choice for the investigation of gene expression profiling of human genetic disorders.

Copyright information

© Springer-Verlag 2009