Background

Identification of the complete set of transcripts expressed from a genome is one of the ultimate goals of transcriptome studies. Such information is essential for genome annotation and for further study of the function of each gene. It is well known that three classes of transcripts are expressed from a genome: high-abundance, intermediate-abundance and low-abundance transcripts [1]. Whereas most of the high- and intermediate-abundance transcripts have been identified, fully identifying the low-abundance transcripts remains a serious challenge [2-4].

Since the beginning of human genome studies, transcript identification has been performed mainly with EST (expressed sequence tag)-based methods [5]. To identify low-abundance transcripts, these EST efforts have relied on extensive subtraction and normalization [4, 6]. The number of novel transcripts identified in humans through EST-based approaches has nevertheless reached a plateau [2, 7]. Recently, the SAGE (serial analysis of gene expression) method has been applied to transcriptome analysis, with the collection of large numbers of 10-base SAGE tags from different species [8-10]. Although both the EST and the SAGE methods are applied to transcriptome studies, they use different approaches. The EST method follows a process of single transcript, single clone, single sequencing reaction; thus, each sequence represents a single transcript. In contrast, SAGE follows a process of multiple transcripts, multiple tags, single clone, single sequencing reaction; thus, each SAGE sequence represents multiple transcripts. At the same scale of sequence collection, SAGE should detect far more transcripts than the EST method and might therefore identify more low-abundance transcripts. Indeed, it is frequently observed that many SAGE tags have no match among the existing ESTs, and most of these SAGE tags have low copy numbers [11-13]. Our previous analyses indicated that the majority of these unmatched SAGE tags are derived from low-abundance transcripts [7]. To determine whether SAGE is indeed more sensitive than the EST method for the detection of low-abundance transcripts and, if so, to what extent, we analyzed existing EST and SAGE data. We report our observations here.

Results and Discussion

Because a SAGE tag is located in the 3' part of a transcript [8], we used 3' ESTs for the comparison. We collected 3' ESTs representing low-abundance transcripts by searching for UniGene clusters that contained only a single 3' EST (ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.seq.all.gz, UniGene Build #161). We identified 42,500 such UniGene clusters and obtained the same number of 3' ESTs. For comparison with SAGE tags, we extracted virtual tags from these ESTs. Of the 42,500 3' ESTs, 32,587 contained at least one CATG site, a precondition for the release of a SAGE tag from a transcript, and we extracted one virtual SAGE tag (the 10 bases downstream of the last CATG) from each of these 32,587 sequences. We then removed virtual tags that were shared by more than one 3' EST. This resulted in a final set of 22,243 virtual tags from 22,243 3' ESTs representing low-abundance transcripts.
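The extraction step described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the program used in this study: the class and method names are hypothetical, each EST is assumed to be given in the sense (5' to 3') orientation, and an EST whose last CATG is followed by fewer than 10 bases is assumed to yield no tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of virtual SAGE tag extraction; names are illustrative only.
final class VirtualTagSketch {
    static final String ANCHOR = "CATG"; // anchoring site that precedes the tag
    static final int TAG_LENGTH = 10;    // bases taken downstream of the last CATG

    /** Returns the 10 bases after the last CATG, or empty if there is no usable site. */
    static Optional<String> extractVirtualTag(String est) {
        String seq = est.toUpperCase();
        int pos = seq.lastIndexOf(ANCHOR);
        if (pos < 0) {
            return Optional.empty(); // CATG-negative sequence
        }
        int start = pos + ANCHOR.length();
        int end = start + TAG_LENGTH;
        if (end > seq.length()) {
            return Optional.empty(); // assumption: too few bases after the site, skip
        }
        return Optional.of(seq.substring(start, end));
    }

    /** Maps each tag that comes from exactly one EST to that EST's identifier. */
    static Map<String, String> uniqueTags(Map<String, String> estById) {
        Map<String, List<String>> idsByTag = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : estById.entrySet()) {
            extractVirtualTag(e.getValue()).ifPresent(tag ->
                idsByTag.computeIfAbsent(tag, k -> new ArrayList<>()).add(e.getKey()));
        }
        Map<String, String> unique = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : idsByTag.entrySet()) {
            if (e.getValue().size() == 1) { // drop tags shared by more than one EST
                unique.put(e.getKey(), e.getValue().get(0));
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        Map<String, String> ests = new LinkedHashMap<>();
        ests.put("est1", "CATGACGTACGTACAA");   // made-up sequences for illustration
        ests.put("est2", "TTCATGACGTACGTACGG"); // shares its tag with est1
        ests.put("est3", "CATGTTTTTTTTTTGG");
        System.out.println(uniqueTags(ests));
    }
}
```

In this toy input, est1 and est2 yield the same tag and are both discarded, which mirrors the reduction from 32,587 to 22,243 virtual tags.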

To obtain the experimental SAGE tags for the comparison, we downloaded 477,261 SAGE tags, comprising 6,847,555 copies, collected from 154 SAGE libraries (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL4). Comparison of the 22,243 virtual SAGE tags with the experimental SAGE tag set identified 20,575 tags that were present in both sets. By matching these 20,575 tags against the SAGEmap database (http://www.ncbi.nlm.nih.gov/SAGE/), we identified 2,278 tags that represented the same 3' ESTs detected by both the EST method and the SAGE method. We used these 2,278 tags as the final set for quantitative comparison. Whereas each of the 2,278 virtual tags represents a transcript detected only once by the EST method, the copy number of each of the 2,278 experimental SAGE tags represents the frequency with which that transcript was detected by SAGE. The 2,278 experimental SAGE tags had a total copy number of 59,754, and 1,424 (63%) of them appeared at least twice, some more than 100 times. On average, SAGE was therefore 26 times more sensitive than the EST method in detecting these transcripts (Table 1). The data clearly show that the SAGE method is much more sensitive than the EST method for the detection of low-abundance transcripts.
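The quantitative comparison above reduces to intersecting the virtual tags with the experimental tag counts and dividing the total SAGE copy number by the number of matched tags, since each matched tag was detected exactly once by the EST method. The following Java sketch illustrates this arithmetic. The class and method names are hypothetical, the toy tags are made up, and the SAGEmap verification step that reduced 20,575 tags to 2,278 is not reproduced.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the EST-versus-SAGE comparison; names are illustrative only.
final class SensitivitySketch {
    /** Number of virtual tags that also occur among the experimental SAGE tags. */
    static int matchedTags(Set<String> virtualTags, Map<String, Long> sageCounts) {
        int matched = 0;
        for (String tag : virtualTags) {
            if (sageCounts.containsKey(tag)) {
                matched++;
            }
        }
        return matched;
    }

    /** Sum of SAGE copy numbers over the virtual tags found in the SAGE set. */
    static long totalCopies(Set<String> virtualTags, Map<String, Long> sageCounts) {
        long total = 0;
        for (String tag : virtualTags) {
            Long copies = sageCounts.get(tag);
            if (copies != null) {
                total += copies;
            }
        }
        return total;
    }

    /** Average SAGE copies per matched tag; each such tag was seen once as an EST. */
    static double averageCopies(long totalCopies, long matchedTags) {
        if (matchedTags <= 0) {
            throw new IllegalArgumentException("no matched tags");
        }
        return (double) totalCopies / matchedTags;
    }

    public static void main(String[] args) {
        Set<String> virtual = new HashSet<>();
        virtual.add("ACGTACGTAC");
        virtual.add("TTTTTTTTTT");
        virtual.add("GGGGGGGGGG"); // not detected by SAGE in this toy example
        Map<String, Long> sage = new HashMap<>();
        sage.put("ACGTACGTAC", 5L);
        sage.put("TTTTTTTTTT", 1L);
        sage.put("CCCCCCCCCC", 40L); // SAGE tag with no matching EST
        System.out.println(matchedTags(virtual, sage) + " matched, "
                + totalCopies(virtual, sage) + " copies");
        // Figures reported in the text: 59,754 copies over 2,278 matched tags.
        System.out.println(String.format(Locale.ROOT, "%.1f", averageCopies(59_754, 2_278)));
    }
}
```

With the reported figures, the ratio is about 26.2, consistent with the 26-fold difference stated above.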

Table 1 Comparison between EST and SAGE methods for the detection of low-abundance transcripts

What could explain the difference between the EST and SAGE methods in detecting low-abundance transcripts?

It is unlikely that the difference is due to the depth of sequence collection. The current collection of human ESTs has reached about 4.5 million sequences, including 131,229 mRNAs and 1,470,982 3' ESTs, whereas the total number of human SAGE tags collected is about 8 million. Considering that over 20 tags can be detected in a single SAGE sequence, the number of sequences collected by SAGE is far smaller than that collected by the EST method. In our previous studies [2], we observed a "loss" effect in EST collection caused by non-specific poly-dA/dT hybridization during the subtraction and normalization steps widely used in EST library construction [6], as evidenced by the quantitative loss of a group of targeted transcripts, although it is difficult to give an absolute rate of loss at the whole-genome level because of the complexity of the transcriptome. This phenomenon can explain the difference in part, but other causes of loss may also exist, such as limited cloning efficiency when ligating cDNAs into the vector during cDNA library construction, and clonal loss during library transformation. In the SAGE process, there is no subtraction or normalization step, and the cDNA fragments at each step of SAGE library construction have nearly the same length and the same ends until they are cloned into the vector. Therefore, the repertoire of total transcripts is well preserved in SAGE libraries for detection.
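The depth argument above can be made explicit with a short calculation. It uses only the approximate figures quoted in the text (about 8 million SAGE tags, at least 20 tags per SAGE sequence, about 4.5 million ESTs); the symbols are introduced here for illustration, and the result is an order-of-magnitude estimate rather than an exact count.

```latex
\[
N_{\text{SAGE seq}} \;\lesssim\; \frac{8 \times 10^{6}\ \text{tags}}{20\ \text{tags per sequence}}
  \;=\; 4 \times 10^{5},
\qquad
\frac{N_{\text{EST}}}{N_{\text{SAGE seq}}} \;\gtrsim\; \frac{4.5 \times 10^{6}}{4 \times 10^{5}} \;\approx\; 11 .
\]
```

Under these assumptions, the SAGE data set required roughly one tenth as many sequencing reactions as the EST data set.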

It is true that the SAGE method has many limitations for transcript detection. For example, a 14-base SAGE tag (the CATG site plus 10 bases) contains less sequence information about the detected transcript than an EST of several hundred bases; the specificity with which a SAGE tag represents a unique transcript is also lower than that of an EST, particularly for SAGE tags with higher copy numbers [14-16]; and SAGE cannot detect CATG-negative transcripts, although their number is low, as only 151 (0.78%) of the 19,399 full-length human cDNAs in the RefSeq (NM) database are CATG-negative. Another issue concerns erroneous SAGE tags. Excluding the CATG site, a SAGE tag has 10 bases. In theory, any base within a single tag could be a sequencing error, leading to the generation of up to 4^10 = 1,048,576 mutated tags. However, this does not happen in practice [7]. We have experimentally converted thousands of SAGE tags into their 3' cDNAs using the GLGI method. From these studies, we clearly see that over 70% of the low-copy SAGE tags represent real transcripts expressed at a low level (these are experimentally confirmed; the real rate may be higher considering the limited sensitivity of the experiments). Although erroneous SAGE tags certainly exist, they cannot make up a significant portion of the total SAGE tag collection, particularly among the SAGE tags with low copy numbers. Regardless of these limitations, SAGE does have unique features for transcriptome studies. Among these is that the presence of a SAGE tag largely implies the presence of a transcript.

It is worth noting that we focused only on known low-abundance transcripts in this analysis. Many unknown low-abundance transcripts may not be present in EST libraries and are therefore not detectable as novel ESTs. However, these unknown low-abundance transcripts may be well preserved in SAGE libraries and are therefore readily detectable as novel SAGE tags.

Conclusions

The high sensitivity of the SAGE method for transcript detection makes it valuable for the isolation of low-abundance transcripts. Coupling SAGE with amplification-based high-throughput methods such as GLGI (generation of longer 3' cDNA from SAGE tag for gene identification) [17], which convert SAGE tags into their original transcripts, provides an efficient way to isolate low-abundance transcripts.

Methods

Sequences used for the analysis

The ESTs were downloaded from the UniGene database (Build #161) (ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.seq.all.gz). The UniGene clusters containing CATG-positive 3' ESTs were identified. Virtual SAGE tags were extracted from these 3' ESTs immediately downstream of their last CATG sites. The virtual SAGE tags were pooled, and tags with the same sequence were combined to generate the final list of virtual SAGE tags from the 3' ESTs, with quantitative information for each tag.

The experimental SAGE tags were downloaded from the GEO database, which contained 154 SAGE libraries (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL4). The SAGE tags from the different libraries were pooled. Identical SAGE tags in the pool were combined and their copy numbers summed to generate the final SAGE tag list, with quantitative information for each tag.
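The pooling step described above amounts to merging per-library tag counts into a single table in which identical tags have their copy numbers summed. A minimal Java sketch follows. The class and method names are hypothetical, each library is assumed to be already parsed into a map from tag sequence to copy number, and parsing of the GEO files is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of pooling SAGE libraries; names are illustrative only.
final class TagPoolSketch {
    /** Merges per-library counts; identical tags have their copy numbers added. */
    static Map<String, Long> pool(List<Map<String, Long>> libraries) {
        Map<String, Long> pooled = new HashMap<>();
        for (Map<String, Long> library : libraries) {
            for (Map.Entry<String, Long> e : library.entrySet()) {
                pooled.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return pooled;
    }

    /** Total number of tag copies in the pooled table. */
    static long totalCopies(Map<String, Long> pooled) {
        long total = 0;
        for (long copies : pooled.values()) {
            total += copies;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> libA = new HashMap<>(); // made-up counts for illustration
        libA.put("ACGTACGTAC", 3L);
        libA.put("TTTTTTTTTT", 1L);
        Map<String, Long> libB = new HashMap<>();
        libB.put("ACGTACGTAC", 2L);
        libB.put("GGGGGGGGGG", 7L);
        List<Map<String, Long>> libraries = new ArrayList<>();
        libraries.add(libA);
        libraries.add(libB);
        Map<String, Long> pooled = pool(libraries);
        System.out.println(pooled.size() + " distinct tags, "
                + totalCopies(pooled) + " copies");
    }
}
```

In the data set used here, the pooled table held 477,261 distinct tags with 6,847,555 copies in total.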

Computational process

Computational programs were written in Java for the extraction of virtual SAGE tags from the 3' ESTs and for the comparison between the experimental SAGE tags and the EST-derived virtual SAGE tags. The programs are available upon request.